
CHAPTER 1

INTRODUCTION

1.1 Introduction

Natural Language Generation (NLG) is a computational task aiming to automatically

generate natural language text (paragraphs, sections and even entire documents)

from non-linguistic input (Reiter and Dale, 2000). For example, an NLG system

working in the healthcare domain could start with a patient record and physiological

data (such as heart rate and blood pressure) collected over a shift period to auto-

generate a shift handover report from an outgoing nurse to an incoming nurse (Hunter et al., 2011).

NLG has come a long way from (in the 60s and 70s) being a research field with

limited objectives (e.g. to demonstrate language generation capabilities as part of an

application) into a major technological advancement backed by solid science and a

huge commercial potential as predicted by Gartner (Gartner predicts,

http://www.gartner.com/smarterwithgartner/gartner-predicts-our-digital-future/, that

NLG software will auto-generate 20% of business content by 2018, just the next

year). With mountains of electronic data available to organizations (public, private

and third sector) and users’ demand for information shooting up all the time, NLG

holds tremendous potential for delivering an unprecedented transformation in our

society by auto-generating human comprehensible information in a language of

choice (English, Hindi or Telugu).

At the state of the art, the complex task of NLG is divided into subtasks that either make text design decisions (decisions requiring subject matter expertise as well as linguistic decisions) or execute the decisions made earlier to produce the surface text. The

second set of subtasks that execute decisions are collectively grouped into a subtask

called surface realization. The research work presented in this thesis relates to the

surface realization subtask focusing particularly on Telugu, a south Indian

morphologically rich language belonging to the Dravidian family of languages

(https://en.wikipedia.org/wiki/Dravidian_languages).

The subtask of surface realization is challenging, despite the fact that all the text

design decisions would have been made prior to surface realization, because a

surface realizer is responsible for ensuring the grammatical (e.g. formation of well-

formed words, phrases and sentences) as well as orthographic validity (e.g.

punctuation) of the auto-generated text. Because a natural language (say Telugu)

would have evolved over millennia, its grammar and orthography would also have

evolved into a complex body of knowledge. The real challenge in building a surface realizer is first to acquire this complex body of knowledge, then to represent it for computational modelling, and finally to develop algorithms that exploit it to construct sentences.

For English and other similar European languages with well-developed grammatical

resources, building surface realizers has been undertaken with significant success in

the last three decades. Several off-the-shelf English surface realizers such as

FUF/SURGE, RealPro and SimpleNLG are available. However, general purpose

surface realizers for Indian languages are not commonplace. In the current socio-economic context of India, when technological advancements have been driving

societal transformation, language technologies have a big role to play. This is being

acknowledged by the government of India and several organizations (e.g. IIIT

Hyderabad and CIIL Mysore) have been set up to develop Indian language

technology. While machine translation has always been the primary focus of Indian

language technology, it is time that other language technologies such as NLG should

be encouraged in modern India. With the proliferation of smart phones in recent

times, there is a great potential to deliver finance, health and transport related

information personalized to individuals on their mobile phones in any Indian

language using NLG.

It is worth noting that across this diverse set of applications (from finance to transport) the surface realizer is the only NLG module that can be designed and built to be domain independent. The decision-making subtasks of NLG mentioned earlier

are domain-specific and therefore involve significant adaptation work to be portable

across multiple domains. This means realizers are reusable; a thorough scientific investigation into a realizer for a new language, such as Telugu, will therefore deliver a return on investment (ROI) over time.

1.2 Motivation

In the standard Reiter and Dale architecture (Reiter and Dale, 2000) of NLG a

surface realizer is the most language dependent module and also the one that could

be reused across several applications (domain independent). To the best of our

knowledge there have been no previous research studies that explored Telugu

realizers systematically. On the other hand in the recent past there has been a

resurgence of interest in general purpose realizers in the research field of NLG. This

resurgence is largely caused by the SimpleNLG (Gatt and Reiter, 2009)

realizer for English. This realizer is unique in comparison to earlier general purpose realizers for English such as PENMAN and FUF/SURGE: SimpleNLG is designed around commonly used grammatical concepts without the overhead of any particular linguistic

theory. SimpleNLG is the most widely used realizer for English both in academia

and in the industry.

Inspired by the success of the SimpleNLG approach of being linguistic theory

neutral, realizers for a wide range of languages have been reported in the literature; French, German, Brazilian Portuguese and Filipino are a few examples.

The use of software applications is increasing day by day in all parts of India,

especially in the rural areas of the states of Telangana and Andhra Pradesh where people do not understand English. The need for a general purpose surface realizer that generates sentences in their native language, and so helps them understand these applications better, is quite evident. The current work is motivated by the need for a

general purpose Telugu realizer and the availability of the popular SimpleNLG

design.

This thesis reports a realization engine for Telugu that automates the task of building

grammatically well-formed Telugu sentences from an input specification consisting

of lexicalized grammatical constituents and associated features. Our realization

engine adapts the design approach of the SimpleNLG family of surface realizers.

1.3 Objectives

In this thesis the design and development of a surface realization engine for Telugu

is investigated. The following objectives have been chosen to achieve the aim:

1. The objectives for the Telugu language are:

a. To identify the most commonly used Telugu sublanguage and its

associated features.

b. To identify the right sources of the required Telugu grammatical

(linguistic) knowledge and to acquire the required knowledge from these

sources.

2. The objectives for the realization engine are:

a. To design and develop a framework for modelling the acquired

knowledge for the purpose of Telugu sentence construction

algorithmically.

b. To design, develop and evaluate a surface realization engine for Telugu

using the knowledge acquired and using the developed framework.

3. The objectives for the noun and pronoun morphological engine are:

a. To design and develop the plural formation mechanism for Telugu nouns

and pronouns.

b. To design and develop the oblique stem formation mechanism for Telugu

nouns and pronouns.

c. To design and develop the case-marker agglutination mechanism for

Telugu nouns and pronouns.

4. The objectives for the verb morphological engine are:

a. To design and develop a mechanism for identification of verb class.

b. To design and develop a mechanism for the final verb formation with the

agglutination of the required constituents.

5. The objectives for the sentence formation are:

a. To design and develop a mechanism for identifying the input as

predefined grammatical constituents and constructing appropriate element

builders (software component in the implementation) for them.

b. To design and develop a mechanism for applying agreement rules, and to

facilitate free word order of the constituents of Telugu sentence.

1.4 Methodology

This thesis presents the research work carried out to design and develop a Telugu

realization engine adapting the SimpleNLG (Gatt and Reiter, 2009)

framework for Telugu. The input to the Telugu realization engine is an XML file

specifying lexicalized sentence constituents and their associated features.

The methodology followed in the thesis is as follows:

1. The input specification for the Telugu realization engine is an XML file

modelled on the SimpleNLG approach.

2. The primary meaning of Telugu sentences is mainly expressed using

inflected forms of content words and case markers or postpositions.

Therefore each part-of-speech has a separate morphology engine to

generate the inflected forms of words.

3. Telugu is not well modelled by a phrase structure grammar; it fits better into a dependency grammar. Therefore head-modifier dependency structures are used at the phrase level.

4. Sentence constituents in Telugu can be ordered freely without changing the primary meaning of the sentence. The constituents of a declarative sentence therefore default to a predefined sequence (Subject + Object + Verb).

5. Agreement among sentence constituents in Telugu operates at the sentence level. Therefore, in the current thesis, agreement of the verb with the subject is not performed in the morphology engine but at the sentence level.
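As an illustration of points 4 and 5 above, the following toy Java sketch assembles a sentence in the default Subject + Object + Verb order and applies a sentence-level agreement rule. All class names, feature names and suffix forms here are invented for illustration; this is not the thesis's actual engine, and the suffix table is a placeholder rather than real Telugu morphology.

```java
// Toy sketch: SOV assembly with sentence-level agreement (illustrative only).
import java.util.*;

public class ToySentenceAssembler {
    // A lexicalized constituent: a surface form plus grammatical features.
    public static final class Constituent {
        public final String form;
        public final Map<String, String> features;
        public Constituent(String form, Map<String, String> features) {
            this.form = form;
            this.features = features;
        }
    }

    // Sentence-level agreement: pick a verb suffix from the subject's
    // person/number features. The suffixes are illustrative placeholders.
    public static String agreeVerb(String verbStem, Constituent subject) {
        String key = subject.features.get("person") + ":" + subject.features.get("number");
        Map<String, String> suffixes = Map.of(
            "3:singular", "-Adu",
            "3:plural", "-Aru",
            "1:singular", "-Anu");
        return verbStem + suffixes.getOrDefault(key, "");
    }

    // Assemble in the default declarative order: Subject + Object + Verb.
    public static String realise(Constituent subject, Constituent object, String verbStem) {
        return String.join(" ", subject.form, object.form, agreeVerb(verbStem, subject));
    }

    public static void main(String[] args) {
        Constituent subj = new Constituent("rAmudu",
                Map.of("person", "3", "number", "singular"));
        Constituent obj = new Constituent("pustakamu", Map.of());
        System.out.println(realise(subj, obj, "caxiv"));
        // rAmudu pustakamu caxiv-Adu
    }
}
```

Because agreement is computed from the subject's features at assembly time rather than inside a per-word morphology routine, the same design accommodates reordering the constituents without recomputing the verb form.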

1.5 Organization of the Thesis

The thesis is organized into seven chapters. The following are the major

contributions in the thesis.

1. This thesis presents a systematic and thorough investigation into the first

Telugu surface realizer.

2. Telugu syntactic structures are very simple by nature and Telugu sentence

construction is dominated by morphological generation. Based on this

understanding Telugu sentence construction as a computational task has been

developed.

3. The SimpleNLG approach has been adapted for Telugu sentence construction.

4. The SimpleNLG approach, like any other approach to surface realization, requires grammar engineering. We found that the grammar available in Krishnamurti's book serves this purpose: there is a good match between Krishnamurti's Telugu grammar (linguistic knowledge) and SimpleNLG (computational approach).

The description of these contributions is spread across the seven chapters. A brief introduction to each of the chapters is as follows:

Chapter 1 deals with the introduction to surface realization, the topic of the thesis. It

provides an overview of the requirement of surface realization in India in the present

scenario. It also provides a very brief introduction to Telugu Language.

Chapter 2 deals with review of literature on foundational concepts and past work in

surface realization and computational morphology. It provides information about the

past work in a hierarchical manner starting from Natural language processing (NLP),

Natural Language Generation (NLG) to Surface Realization and computational

morphology in the context of languages all over the world and in India.

Chapter 3 deals with the surface realizer for Telugu, which is a Java-based application

that accepts a lexicalized XML file as input and generates well-formed Telugu

sentences. It provides software architecture for the construction of surface realization

engine. It also provides a useful standard surface realization input mechanism for

Telugu which can be extended to other Indian languages.

Chapter 4 deals with the construction of the morphological engine: the processes of plural formation, inflection based on number, person and gender, and agglutination with case markers for nouns and pronouns.

Chapter 5 deals with the detailed study of the computational process of

morphophonemic changes the verb undergoes when inflected with tense-mode

suffixes. The verbs are classified into six conjugations based on the

morphophonemic changes, of which five classes contain regular verbs and the last

one contains irregular verbs.

Chapter 6 deals with adjective morphology, adverb morphology, aspects of phrase

formation, sentence-level issues like subject-verb agreement, and the word order of

the constituents of the sentence. The chapter also provides a detailed explanation of

the process followed and the modules in the implementation of the software for the

Telugu realization engine.

Chapter 7 discusses the conclusion of the thesis which includes a critical review of

the current work and the possibilities of the future work for the extension of the

current research.

CHAPTER 2

REVIEW OF LITERATURE

2.1 Introduction

This chapter presents a literature review on the topics of this thesis, which are natural language generation (NLG), realization engines and morphology, particularly within the context of the Telugu language. In addition, this chapter also presents background

knowledge supporting the main topics presented in this thesis.

Because the current research is in the area of computational linguistics/natural

language processing, this chapter starts with a brief introduction of this area and how

it relates to artificial intelligence and computer science. Following this, a brief

introduction of natural language generation (NLG), which is the main topic of this

thesis, is presented. In this introduction, the task of NLG is first defined and then the

most commonly used subtasks and the different architectures for NLG are described.

Then, a survey of NLG literature focusing on major milestones is described.

Following this, NLG of Indian languages is discussed focusing mainly on

computational morphology and sentence formation in Telugu language.

[Figure 2.1 depicts a hierarchy: Computer Science → Artificial Intelligence → Computational Linguistics/Natural Language Processing → Natural Language Understanding and Natural Language Generation.]

Figure 2.1 Hierarchy showing NLG in the context of broader fields

The topic of this thesis is a subfield of Artificial Intelligence (AI) (Winograd, 1972;

Appelt, 1985). AI is a field of computing science where computational techniques to

automate mental processes such as language comprehension, language production

and general problem solving are studied. A subfield of AI is Natural Language

Processing, which focuses on language understanding and generation by computers, as shown in Figure 2.1.

2.2 Natural Language Processing

Natural Language Processing is a subfield of Artificial Intelligence concerned with

programming the computer to process natural language text. Natural language

processing has two subfields namely Natural Language Understanding (NLU) and

Natural Language Generation (NLG). Natural Language Understanding (NLU)

systems take strings of words (sentences) as their input and produce structured

representations capturing the meaning of those strings as their output. The nature of

this output depends heavily on the task at hand. For instance, a natural language

understanding system serving as an interface to a database might accept queries in

natural language which relate to the kind of data held by the database. In this case

the meaning of the input (the output of the system) might be expressed in terms of

structured SQL queries which can be directly submitted to the database. The now

discontinued English Query by Microsoft (https://technet.microsoft.com/en-

us/library/ms143754(v=sql.90).aspx) is a good example of NL interface to relational

databases. Natural Language Generation (NLG), which is the topic of my research, is discussed in detail in the next section.

2.3 Natural Language Generation

Natural Language Generation (NLG) is a subfield of artificial

intelligence/computational linguistics/Natural Language Processing (NLP) that

focuses on developing computational techniques that can automatically produce

understandable text in human languages. Starting from some non-linguistic representation of input, NLG systems use knowledge about language and the application domain to automatically produce documents, reports, explanations, help

messages and other kinds of text. Many applications have been developed over the

years which automatically generate text from non-linguistic data including but not

limited to the following:

• soccer reports (e.g., Theune et al., 2001; Chen & Mooney, 2008);
• virtual ‘newspapers’ from sensor data (Molina et al., 2011);
• textual descriptions of the day-to-day lives of birds based on satellite data (Siddharthan et al., 2013);
• weather and financial reports (Goldberg et al., 1994; Reiter et al., 2005; Turner et al., 2008; Ramos-Soto et al., 2015; Wanner et al., 2015; Plachouras et al., 2016);
• summaries of patient information in clinical contexts (Huske-Kraus, 2003; Harris, 2008; Portet et al., 2009; Gatt et al., 2009; Banaee et al., 2013);
• interactive information about cultural artefacts, for example in a museum context (e.g., O’Donnell, 2001; Stock et al., 2007); and
• text intended to persuade (Carenini & Moore, 2006) or motivate behaviour modification (Reiter et al., 2003).

NLG is both a fascinating area of research and an emerging technology with many

real-world applications. As a research area, NLG brings a unique perspective on

fundamental issues in artificial intelligence, cognitive science, and human computer

interaction. These include questions such as how linguistic and domain knowledge

should be represented and reasoned with, what it means for a text to be well written,

and how information is best communicated between machine and human. From a

practical perspective, NLG technology is capable of partially automating routine

document creation, removing much of the drudgery associated with such tasks. It is

an actively researched topic in the research laboratories around the world, and is also

deployed in real applications, to present and explain complex information to people

who do not have the background or time required to understand the raw data. In the

longer term, NLG is also likely to play an important role in human-computer

interfaces and will allow much richer interaction with machines than is possible

today.

In one sense, language generation is the oldest subfield of language processing: when computers were able to understand only the most unnatural of command languages, they were already spitting out natural texts. For example, the oldest and most famous C

program, the ‘hello, world’ program is a generation program. It produces useful

literate English in context. Unfortunately, whatever subtle or sublime communicative

force this text holds is produced not by the program itself but by the author of that

program. This canned text approach to generation is easy to implement but is unable

to adapt to new situations without the intervention of a programmer.

Language generation is also the most pervasive subfield of language processing.

Most of us have received a form letter with our name carefully inserted in just the right

places along with eloquent appeals for one thing or another. This sort of program is

easy to implement as well, but I doubt many are fooled into thinking that such a letter is handwritten English. This approach, called template filling, is more flexible than canned text and has been used in a variety of applications, but is still limited. For example, Weizenbaum's (1966) use of templates in Eliza worked well in some

situations but produced nonsense in others.

Both the canned text and template approaches to NLG do not really capture the rich

linguistic information that native language users associate with utterances – a

sentence in a natural language is not a sequence of disconnected words. Instead, a

sentence has a structure (syntax of phrases and clauses) and consists of well-formed

words (morphology) all of which communicate a specific message to its recipients.

As a field of research, NLG aims to develop computational models that capture the rich linguistic structures of natural languages for the purpose of language production.

2.3.1 History of NLG

John Bateman and Michael Zock created an exhaustive list of NLG systems which is

available in a wiki form at http://www.nlg-wiki.org/. From this Wiki, it can be

observed that NLG has been applied to a wide spectrum of applications, from

medicine to meteorology and from finance to engineering. Table 2.1 shows a list of

major NLG research before the nineties.

In the nineties, the field of NLG further consolidated the idea of NLG as a complex

task and moved towards the consensus architecture for Natural Language Generation

Systems consisting of three major components namely document planning,

microplanning, and surface realization described in detail in (Reiter, 1994;

http://homepages.abdn.ac.uk/e.reiter/pages/papers/nlgw94.pdf). Much of the applied

NLG work since then has been majorly influenced by this consensus architecture.

Year | Authors | Paper
1972 | Richard J Hanson, Robert F Simmons and J Slocum | Generating English Discourse from Semantic Networks
1977 | John H Clippinger | Meaning and Discourse: A Computer Model of Psychoanalytic Speech and Cognition
1975 | Goldman N | Conceptual Generation
1977 | Meehan J R | Tale-Spin, an Interactive Program that Writes Stories
1977 | Thompson H | Strategy and Tactics: A Model for Language Production
1979 | Philip R Cohen and C Raymond Perrault | Elements of a Plan-Based Theory of Speech Acts
1979 | James A Moore and William C Mann | A Snapshot of KDS, a Knowledge Delivery System
1987 | Meteer M, D McDonald, S Anderson, D Forster, L Gay, A Huettner, and P Sibun | "Mumble-86": Design and Implementation, Technical Report 87-87
1978 | Walther V Hahn, Wolfgang Hoeppner, Antony Jameson, Wolfgang Wahlster | The Anatomy of the Natural Language Dialogue System HAM-RPM
1985 | Appelt D | Planning English Sentences

Table 2.1 NLG Research before 1990

2.3.2 NLG Tasks

The NLG problem of converting input data into output text was addressed by

splitting it up into a number of sub problems. The following six are frequently found

in many NLG systems (Reiter & Dale, 1997, 2000):

1. Content determination: Deciding which information to include in the text under

construction,

2. Text structuring: Determining in which order information will be presented in

the text,

3. Sentence aggregation: Deciding which information to present in individual

sentences,

4. Lexicalisation: Finding the right words and phrases to express information,

5. Referring expression generation: Selecting the words and phrases to identify

domain objects,

6. Linguistic realisation: Combining all words and phrases into well-formed

sentences.
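To make this division of labour concrete, here is a deliberately tiny Java sketch that threads data through content determination, text structuring, and a combined aggregation/realisation step. The data shapes and the selection and ordering policies are invented placeholders, not those of any real NLG system.

```java
// Toy sketch of the classic NLG pipeline; every stage is a drastic simplification.
import java.util.*;

public class ToyNlgPipeline {
    // Content determination: decide which facts are worth reporting (toy rule:
    // keep only non-zero readings).
    static List<String> contentDetermination(Map<String, Integer> data) {
        List<String> messages = new ArrayList<>();
        for (Map.Entry<String, Integer> e : data.entrySet())
            if (e.getValue() > 0) messages.add(e.getKey() + "=" + e.getValue());
        return messages;
    }

    // Text structuring: decide the order of presentation (toy rule: alphabetical).
    static List<String> textStructuring(List<String> messages) {
        List<String> ordered = new ArrayList<>(messages);
        Collections.sort(ordered);
        return ordered;
    }

    // Aggregation + realisation collapsed into one step: merge the messages
    // into a single sentence.
    static String aggregationAndRealisation(List<String> messages) {
        return "The readings were " + String.join(", ", messages) + ".";
    }

    public static void main(String[] args) {
        Map<String, Integer> data = new LinkedHashMap<>();
        data.put("heartRate", 72);
        data.put("alarms", 0);     // filtered out by content determination
        System.out.println(aggregationAndRealisation(
                textStructuring(contentDetermination(data))));
        // The readings were heartRate=72.
    }
}
```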

The above-mentioned NLG subtasks are complex, and together they address the following two issues:

a) Firstly, given a communication goal and the context of communication, the subtasks set the research agenda for developing computational techniques for computing the required content (information) that should be verbalized in the

text. Logically, the organization of content (information) into a coherent

narrative has also been closely associated with content determination.

b) Secondly, the focus is on developing computational techniques for

expressing the content (or information) in well-formed words, phrases and

sentences.

2.3.2.1 Content Determination

In the generation process the NLG system needs to decide which information should

be included in the text under construction and which should not. In general the data

contains more information than required. It is the task of NLG systems to decide the

required content for the given application. Content determination involves choice.

Content determination needs to be closely tied to the domain of application (cf. Mellish et al., 2006).

2.3.2.2 Text Structuring

After the determination of what information is to be conveyed the NLG system

needs to decide the order of presentation to the user. This task is referred to as Text

Structuring. The ordering of the text in most applications is based on importance,

and grouping of information based on relatedness (Portet et al., 2009).

2.3.2.3 Sentence Aggregation

Each and every message in the text plan need not be represented as a separate

sentence. Some messages can be combined to form a single sentence. Text generated by combining messages in this way becomes more fluid and readable (Dalianis, 1999; Cheng & Mellish, 2000).

2.3.2.4 Lexicalization

Lexicalization is the important task of deciding which words or phrases to use

to express the messages effectively. The complexity involved in the lexicalization

process depends on the number of alternatives the NLG system can offer. Lexical

choice in applications may be informed by contextual constraints, stylistic

constraints, or attitude and affective stance towards the event in question

(Fleischman & Hovy, 2002).

2.3.2.5 Referring Expression Generation

Referring Expression Generation is defined (Reiter and Dale 1997) as the task of

selecting words or phrases to identify domain entities. This definition suggests a

close similarity to lexicalisation, but (Reiter and Dale 2000) point out that the

essential difference is that referring expression generation is a discrimination task,

where the system needs to communicate sufficient information to distinguish one

domain entity from other domain entities. Referring Expression Generation is among the most intensively studied tasks within the field of automated text generation (Mellish et al., 2006; Siddharthan et al., 2011).
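The discriminative character of the task can be illustrated with a toy attribute-selection routine in the spirit of incremental referring-expression algorithms. The attributes, entities and method names below are invented examples for illustration, not the cited systems.

```java
// Toy referring-expression sketch: pick attribute values of the target until
// all distractors are ruled out.
import java.util.*;

public class ToyReferringExpression {
    static List<String> distinguish(Map<String, String> target,
                                    List<Map<String, String>> distractors,
                                    List<String> preferredAttributes) {
        List<String> description = new ArrayList<>();
        List<Map<String, String>> remaining = new ArrayList<>(distractors);
        for (String attr : preferredAttributes) {
            if (remaining.isEmpty()) break;      // target already unambiguous
            String value = target.get(attr);
            // keep only distractors that share this attribute value
            List<Map<String, String>> stillConfusable = new ArrayList<>();
            for (Map<String, String> d : remaining)
                if (value.equals(d.get(attr))) stillConfusable.add(d);
            // include the attribute only if it rules out at least one distractor
            if (stillConfusable.size() < remaining.size()) {
                description.add(value);
                remaining = stillConfusable;
            }
        }
        return description;
    }

    public static void main(String[] args) {
        Map<String, String> target = Map.of("type", "dog", "colour", "black");
        List<Map<String, String>> distractors = List.of(
            Map.of("type", "dog", "colour", "white"),
            Map.of("type", "cat", "colour", "black"));
        System.out.println(distinguish(target, distractors,
            List.of("type", "colour")));
        // [dog, black]
    }
}
```

Note how "black dog" is needed here: "dog" alone leaves the white dog confusable, and "black" alone leaves the black cat, so the routine accumulates just enough attributes to single out the target.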

2.3.2.6 Surface Realization

This task referred to as Surface Realization or Linguistic Realization involves

ordering constituents of a sentence and gathering the right morphological forms. An

important complication in this task is that the output needs to have many linguistic

components which may not be present in the input. Thus this task is the projection of

non-isomorphic structures (cf. Ballesteros et al., 2015). Some of the approaches that

are proposed for linguistic realization are:

a) Human-crafted templates

b) Statistical approaches

c) Human-crafted grammar-based systems

2.3.2.6.1 Human-crafted templates

When the application is small and the variation in the output is minimal the outputs

can be specified in the form of templates. For example:

$AtagAdu $parugulu parugulu koVttAdu

This template has two variables which can be filled with the names of a player and

the number of runs scored by the player. It can generate sentences like:

virAt kohli vaMxa parugulu koVttAdu. (Virat Kohli hit hundred runs)

The advantage with the use of templates is that they avoid the generation of

ungrammatical structures and allow for full control over the quality of the output.

The template based methods have started including syntactic information and

sophisticated rules for filling the gaps (Theune et al., 2001) making it difficult to

distinguish template based methods from other methods (van Deemter et al., 2005).

The disadvantage of template-based systems is that they are labour intensive and do not scale to applications which require considerable linguistic variation.
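A minimal template filler of the kind described above can be sketched in a few lines of Java. The `$`-slot convention follows the example template in the text, while the class and method names are invented for illustration.

```java
// Toy template filler: substitute $-prefixed slots from a map of slot values.
import java.util.*;

public class ToyTemplateFiller {
    static String fill(String template, Map<String, String> slots) {
        String result = template;
        for (Map.Entry<String, String> e : slots.entrySet())
            result = result.replace("$" + e.getKey(), e.getValue());
        return result;
    }

    public static void main(String[] args) {
        // The template from the text, in the same WX transliteration.
        String template = "$AtagAdu $parugulu parugulu koVttAdu";
        Map<String, String> slots = Map.of(
            "AtagAdu", "virAt kohli",
            "parugulu", "vaMxa");
        System.out.println(fill(template, slots));
        // virAt kohli vaMxa parugulu koVttAdu
    }
}
```

The sketch also exposes the limitation discussed above: the filler performs no morphological or agreement checks, so ungrammatical output is avoided only to the extent that the template author anticipated every combination of fillers.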

2.3.2.6.2 Statistical Approaches

Some applications acquire probabilistic grammars from large corpora reducing the

manual labour required while increasing coverage. Two approaches have been taken

to include statistical information in the realization process. One approach, exemplified by the HALOGEN/NITROGEN systems (Langkilde-Geary, 2000; Langkilde-Geary & Knight, 2002), is a two-level approach in which a hand-crafted grammar is used to generate alternative realizations represented as a forest,

from which a stochastic re-ranker selects the optimal candidate. The system relies on

corpus based statistical knowledge in the form of n-grams. There are also other

sophisticated models to perform re-ranking (e.g., Bangalore & Rambow, 2000;

Ratnaparkhi, 2000; Cahill et al., 2007) and models trained on user ratings of

utterance quality (Walker et al., 2002).

A second line of research has focused on introducing statistics at the generation

decision level, by training models that find the set of generation parameters

maximising an objective function, e.g. producing a target linguistic style (Paiva and

Evans, 2005; Mairesse and Walker, 2010), generating the most likely context-free

derivations given a corpus (Belz, 2008), or maximising the expected reward using

reinforcement learning (Rieser and Lemon, 2009). While such methods do not suffer

from the computational cost of an overgeneration phase, they still require a

handcrafted generator to define the generation decision space within which statistics

can be used to find an optimal solution.
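The overgenerate-and-rank idea behind systems like HALOGEN/NITROGEN can be illustrated with a toy bigram scorer: count bigrams in a tiny corpus, then prefer the candidate realization whose bigrams are most frequent. This is a drastically simplified sketch under invented names, not the actual HALOGEN machinery.

```java
// Toy n-gram reranker: score candidate realizations by corpus bigram counts.
import java.util.*;

public class ToyNgramReranker {
    // Count bigrams ("w1 w2") over a tiny corpus of sentences.
    static Map<String, Integer> bigramCounts(List<String> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : corpus) {
            String[] words = sentence.split(" ");
            for (int i = 0; i + 1 < words.length; i++)
                counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    // Score each candidate by summing its bigram counts; return the best.
    static String rerank(List<String> candidates, Map<String, Integer> counts) {
        String best = candidates.get(0);
        int bestScore = -1;
        for (String candidate : candidates) {
            String[] words = candidate.split(" ");
            int score = 0;
            for (int i = 0; i + 1 < words.length; i++)
                score += counts.getOrDefault(words[i] + " " + words[i + 1], 0);
            if (score > bestScore) { bestScore = score; best = candidate; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = bigramCounts(List.of(
            "the cat sat on the mat",
            "the dog sat on the rug"));
        // Two alternative word orders for the same content:
        System.out.println(rerank(List.of(
            "sat the cat on mat the",
            "the cat sat on the mat"), counts));
        // the cat sat on the mat
    }
}
```

In a real two-level system the candidates would come from a hand-crafted grammar's realization forest and the language model would be trained on a large corpus; the toy shows only the ranking step.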

2.3.2.6.3 Human-crafted Grammar-based systems

The topic of the current thesis which constructs a surface realization engine for

Telugu is modelled on the human-crafted grammar-based systems. A brief

introduction to human-crafted grammar-based systems is provided in this section. As

template-based systems cannot scale to applications which require considerable linguistic variation, general purpose domain-independent systems are used as an

alternative. Most of these systems are grammar-based, that is, they make their

choices on the basis of a grammar of the language under consideration. A human-crafted grammar-based surface realizer, while generating the surface forms, must satisfy the following requirements:

a) The semantics of the Input Specification are to be preserved.

b) The generated surface forms are grammatical with respect to the language in

question.

In order to satisfy the above requirements, the Surface Realizer may select content

words, insert function words, perform morphological inflections, and take care of the order of the surface forms, using the grammar of the language under consideration as the basis. The difficulty with grammar-based systems is how to make choices

among the related options for generation of the surface forms.

The grammar for hand-crafted systems can be manually written, as in many off-the-shelf realizers such as FUF/SURGE (Elhadad & Robin, 1996), MUMBLE (Meteer et al., 1987), KPML (Bateman, 1997), Nigel (Mann & Matthiessen, 1983), and RealPro

(Lavoie & Rambow, 1997). Hand-crafted grammar-based realizers require very

detailed input as in KPML (Bateman, 1997) which is based on Systemic-Functional

Grammar (SFG; Halliday & Matthiessen, 2004). The level of detail required makes these realizers difficult to use as simple plug-and-play or off-the-shelf modules

(e.g., Kasper, 1989). The difficulty with systems like KPML has motivated the

development of simple realization engines which provide syntax and morphology

APIs, but leave choice-making up to the developer (Gatt et al., 2009; Vaudry & Lapalme, 2013; Bollmann, 2011; de Oliveira & Sripada, 2014).

2.3.2.6.3.1 SimpleNLG

The current thesis is modelled on a hand-crafted grammar-based surface realizer

SimpleNLG (Gatt and Reiter, 2009). SimpleNLG is a simple java API designed to

function as a realization engine which is the last stage in natural language generation.

It has been used successfully in a number of projects, both academic and commercial. SimpleNLG automates some of the mundane tasks that all natural language generation systems need to perform. These tasks include:

a) Orthography, which includes pouring, formatting lists, absorbing punctuation, and inserting appropriate white space in sentences and paragraphs.

b) Morphology, which includes handling inflected forms: modifying a word or a lexeme to reflect grammatical information such as number, person, tense, and gender.

c) Simple Grammar which includes providing appropriate syntactic structure,

creating well- formed word groups, and enforcing noun-verb agreement.

SimpleNLG is used by researchers having their own implementation of document

planning or micro planning so that the mundane tasks of realization need not bother

22
them. It is also used by anyone who wants to write programs to generate English

sentences.

2.3.2.6.3.2 SimpleNLG XML Realizer Framework

The input specification mechanism for the current thesis is modelled on the SimpleNLG XML Realizer framework. The input for the current thesis is an XML file that is similar to the SimpleNLG XML input design, but not the same. A brief description of the SimpleNLG XML Realizer framework is provided here.

The SimpleNLG XML Realizer walks through the Input Specification in a top-down

left-to-right fashion to produce appropriate output from the nodes encountered

during the traversal. The nodes in the Input Specification provide the information as

to what the realizer needs to do along with the linguistic units.

The XML realiser framework works as follows:

1. The XML realiser framework uses the code generation tools xjc (for Java) and xsd (for C#) to generate wrapper classes for the relevant elements in the schema. A client application can invoke SimpleNLG to obtain the realized text through the wrapper classes, which act as Data Transfer Objects (DTOs). Wrapper classes are contained in the package simplenlg.xmlrealiser.wrapper. These classes have the same names as the real SimpleNLG classes, with the prefix "Xml". Wrapper classes need to be generated only once, and only if changes are actually made to the XML schema.

2. The simplenlg.xmlrealiser.UnWrapper class uses the Java DTOs that are created by unmarshalling the XML specification of a DocumentElement conforming to the schema. A simplenlg.xmlrealiser.wrapper.XmlDocumentElement object representing the XML is created using the javax.xml.bind.Unmarshaller. The UnWrapper recursively processes the Data Transfer Objects to produce a simplenlg.framework.DocumentElement, which is then passed to the realiser and realized in the usual way.
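As an illustration, the recursive unwrapping step can be sketched as follows. All class names and the realisation logic here are simplified, hypothetical stand-ins, not the real SimpleNLG classes; the sketch only shows the DTO-to-framework-element conversion pattern:

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnWrapperSketch {
    // Hypothetical stand-in for a generated wrapper (DTO) class.
    public record XmlElement(String text, List<XmlElement> children) {}

    // Hypothetical stand-in for a simplenlg.framework element.
    public record FrameworkElement(String text, List<FrameworkElement> children) {
        // Naive left-to-right realisation: join the leaves with spaces.
        public String realise() {
            return children.isEmpty()
                    ? text
                    : children.stream()
                              .map(FrameworkElement::realise)
                              .collect(Collectors.joining(" "));
        }
    }

    // Recursively convert each DTO node into a framework element,
    // mirroring what an UnWrapper-style class does.
    public static FrameworkElement unwrap(XmlElement dto) {
        return new FrameworkElement(dto.text(),
                dto.children().stream().map(UnWrapperSketch::unwrap).toList());
    }
}
```

In the real framework the conversion additionally copies features and categories; the point here is only the top-down, recursive traversal of the DTO tree.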

2.3.3 A Short History of Surface Realization

Mann and Matthiessen (1985) carried out extensive research in text generation and created the Nigel software within the framework of systemic linguistics. Along with a specification of the functions and structures of English, Nigel also has a semantic stratum that specifies the situations in which grammatical features are to be used.

Coch pioneered the development of AlthGen, an automatic multi-paragraph text-generation toolbox. It was first developed for French during 1993-94; English and Spanish versions were developed later. The main characteristics of AlthGen are as follows:

1) The high quality of the multi-paragraph texts generated, in terms of flow of text and customizability.

2) Its ability to produce an extensive set of different text structures due to its

data-driven planning approach.

Knight and Hatzivassiloglou (1995) proposed NITROGEN, a "two-level, many-paths" hybrid generator that uses statistical methods to fill the gaps in the symbolic knowledge.

Elhadad and Robin (1996) proposed SURGE (Systemic Unification Realization Grammar of English), a reusable, comprehensive syntactic realization component.

Lavoie and Rambow (1997) developed REALPRO, a new off-the-shelf realizer derived from existing systems with a new design and a completely new implementation. REALPRO has the following characteristics:

1) It is fast and portable across platforms as it was implemented in C++.

2) It can run as a stand-alone server and has both C++ and Java APIs.

Bateman (1997) proposed KPML, a development environment providing enabling technology for multilingual natural language generation. KPML provides the following features to generation processes:

1) A set of standardized, continually improving linguistic resources useful for text generation.

2) A tactical generation engine for using such linguistic resources.

3) A number of highly focussed debugging aids to further support efficient

maintenance and development of such linguistic resources.

4) A number of customization tools.

5) Specialized techniques to support multilingual work, such as contrastive language development and the automatic merging of independently developed resources for distinct languages.

Nizar Habash (2000) proposed Oxygen, a language-independent linearization engine. The grammars that come as input to this linearization engine are written in oxyL, a powerful and flexible language comparable in expressiveness to conventional programming languages. The linearization engine compiles the grammars of the target language into programs that accept feature graphs as input and generate word lattices. The word lattices are passed as input to the statistical extraction module of the generation system Nitrogen.

Irene Langkilde (2000) proposed a forest-based statistical sentence generator, in which a phrase is chosen by statistically ranking a set of alternative phrases packed as trees or forests.

Susan W. McRoy, Songsak Channarukul, and Syed S. Ali (2000) proposed YAG (Yet Another Generator), a template-based generator for real-time systems. This generator works well in interactive applications, providing natural language output suited to the interactive context. It requires neither extensive knowledge of the grammar of the target language nor all possible output strings ahead of time. YAG provides support for underspecified inputs, robustness, speed, expressiveness, and coverage to applications and application designers.

Gatt and Reiter (2009) created SimpleNLG, a Java API realization engine for English that aims to provide simple and robust interfaces for generating syntactic structures and linearizing them. This realization engine is the main source of inspiration for the work reported in this thesis; SimpleNLG is therefore described in greater detail in Chapter 3.

Marcel Bollmann (2011) reports a surface realization engine for German, a Java framework modelled on SimpleNLG. The paper describes the characteristics of the German language and the changes made to the SimpleNLG framework to meet the requirements of its relatively free word order.

Pierre-Luc Vaudry and Guy Lapalme (2013) report a bilingual surface realization engine for English and French, named SimpleNLG-EnFr. The paper describes the general characteristics of the software and the adaptations made for French.

Rodrigo de Oliveira and Somayajulu Sripada (2014) report the ongoing implementation and current coverage of SimpleNLG-BP, an adaptation of SimpleNLG-EnFr for Brazilian Portuguese.

Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco (2016) report SimpleNLG-IT, a surface realization engine for Italian modelled on the principles of SimpleNLG. The paper gives some details about the grammar and the lexicon employed by the system and reports results of a first evaluation based on a dependency treebank for Italian.

2.3.4 Surface Realization in Indian Context

Smriti Singh, Mrugank Dalal, Vishal Vachhani, Pushpak Bhattacharyya, and Om P. Damani (2007) created HinD, a system that generates Hindi from the Universal Networking Language (UNL), an interlingua for knowledge representation in the context of machine translation. The generation process consists of three main stages:

1) Morphological generation of lexical words, function word insertion, and syntax planning.

2) Case marker insertion after the subject and the object.

3) Finally, all the words are arranged to form a valid sentence.

Uma Maheshwar Rao, G. and Christopher Mala (2011) presented the Telugu Word Synthesizer, a generic engine that can be used for any language by plugging in a language-specific database. Based on the word-and-paradigm method, the engine is language independent and synthesizes all and only the well-formed word forms.

Vishal Goyal and Gurpreet Singh (2011) developed a Hindi to Punjabi machine translation system. The key activities involved in the translation process are pre-processing, the translation engine, and post-processing. Lookup algorithms and pattern matching algorithms formed the basis for solving the translation issues. The system's accuracy has been evaluated using an intelligibility test, an accuracy test, and the BLEU score. The hybrid system was found to perform better than its constituent systems.

2.4 Morphological Theories

Johann Wolfgang von Goethe (1749-1832), a German poet, novelist, and philosopher, coined the term morphology in the early nineteenth century in a biological context. The term originates from the Greek, where morph- means 'shape or form'; morphology is thus the study of morphs or forms. In linguistics, morphology is the branch which deals with words, their internal structure, and the way in which they are formed.

2.4.1 Linguistic Theories of Morphology

A number of theoretical models of morphology have been developed over the years. Each model makes a specific set of claims about the nature of morphology and focuses on specific kinds of data covered by the theory. Considerable research has been done on various aspects of theoretical morphology. Classifications of morphological theories were proposed by Hockett (1954) and Stump (2001); a description of their classifications is provided in the following sections.

2.4.1.1 Two Models of Grammatical Description

Linguistic theories of morphology differ in whether they treat the morpheme as the basic building block of morphological analysis or generation. Hockett (1954) proposed "two models of grammatical description": the Item and Arrangement model (IA) and the Item and Process model (IP). He also mentioned the Word and Paradigm approach (WP), which is the oldest of the three approaches.

2.4.1.1.1 Item and Arrangement Model

Morpheme-based morphology is an approach in which word forms are analyzed as arrangements of morphemes, where a morpheme is treated as the minimal meaningful unit of a language. This way of analyzing word forms is called 'item-and-arrangement'; it treats words as concatenations of morphemes.

When Telugu verbs are classified using the item-and-arrangement approach, they fall into six conjugation classes, of which five are regular and one is irregular. Classes I, II and III of the regular verbs have ten subclasses. In a word such as ammu-wA-nu (I will sell), the morphemes are ammu-, -wA, and -nu, where ammu- is the root, -wA is the tense-mode suffix, and -nu is the personal suffix.
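Under this view, word formation is plain concatenation. The following minimal sketch (the class and method names are hypothetical, introduced only for illustration) assembles the example form from its three morphemes:

```java
public class ItemAndArrangement {
    // Item-and-arrangement: a word form is simply the arrangement
    // (concatenation) of its morphemes in order.
    public static String wordForm(String root, String tenseModeSuffix, String personalSuffix) {
        return root + tenseModeSuffix + personalSuffix;
    }

    public static void main(String[] args) {
        // ammu- (root) + -wA (tense-mode suffix) + -nu (personal suffix)
        System.out.println(wordForm("ammu", "wA", "nu")); // prints ammuwAnu
    }
}
```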

2.4.1.1.2 Item and Process Model

Lexeme-based morphology takes an approach called item-and-process. In lexeme-based morphology a word form is assumed to result from applying rules that alter a stem to produce a new one. Inflectional rules, derivational rules and compounding rules are applied to stems to obtain the required word form.

In Telugu, inflected word forms are derived by a set of sandhi rules operating on stems and suffixes. When the item-and-process approach is applied to verbs, they fall into two types: regular and irregular. The regular verb form pilus-wA-nu (I will call) results from a phonological substitution applied to the morphemes piluc- and -wA, producing the stem pilus-. Some irregular verb stem variants are derived by lexical substitution rules rather than phonological substitution; for example, the morpheme vacc- becomes rA- when followed by a suffix beginning with the vowel 'a'.
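The two kinds of stem alternation just described can be sketched as rules applied before concatenation. This is a deliberately simplified illustration (the rule conditions, class and method names are hypothetical and cover only the two examples above):

```java
import java.util.Map;

public class ItemAndProcess {
    // Lexical substitution table for irregular stems, e.g. vacc- "come"
    // is replaced by rA- before a suffix beginning with the vowel 'a'.
    private static final Map<String, String> LEXICAL_SUBSTITUTION = Map.of("vacc", "rA");

    // Item-and-process: apply stem-altering rules, then attach the suffix.
    public static String inflect(String stem, String suffix) {
        if (LEXICAL_SUBSTITUTION.containsKey(stem) && suffix.startsWith("a")) {
            stem = LEXICAL_SUBSTITUTION.get(stem);             // lexical substitution
        } else if (stem.endsWith("c") && suffix.startsWith("w")) {
            // Phonological substitution: final c becomes s before -wA.
            stem = stem.substring(0, stem.length() - 1) + "s";
        }
        return stem + suffix;
    }

    public static void main(String[] args) {
        System.out.println(inflect("piluc", "wAnu")); // prints piluswAnu
    }
}
```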

2.4.1.1.3 Word and Paradigm Approach

The approach used by word-based morphology is called the word-and-paradigm approach. Instead of stating rules that combine morphemes to form words, this theory states generalizations that hold between the different forms of inflectional paradigms.

When the word-and-paradigm approach is applied to Telugu, verbs are classified into twelve paradigmatic classes of regular verbs and ten irregular verbs. Each class is described by giving a verb paradigm for that class. The word 'pilucu' (to call) represents one of the twelve regular paradigmatic classes, and all words that have the same inflectional paradigm, such as 'kalcu' (to burn), come under this paradigmatic class. The irregular verbs do not have a paradigmatic class; instead each verb is treated separately. The word 'icc' (to give) is one of the ten irregular verbs in this classification.
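In a word-and-paradigm implementation, generation is essentially a table lookup: each lemma is mapped to its paradigm class, and each class stores the ending for every cell of the paradigm. The sketch below is hypothetical (one paradigm class, one cell) and only illustrates the lookup structure:

```java
import java.util.Map;

public class WordAndParadigm {
    // Each lemma is assigned to a paradigm class.
    private static final Map<String, String> PARADIGM_OF =
            Map.of("pilucu", "pilucu-class",
                   "kalcu",  "pilucu-class");

    // For each paradigm class, the ending realizing one cell of the
    // paradigm (here: future tense, first person singular).
    private static final Map<String, String> FUTURE_1SG_ENDING =
            Map.of("pilucu-class", "swAnu");

    // Generate the future 1sg form by exchanging the lemma's final -cu
    // for the ending stored in the paradigm table.
    public static String future1Sg(String lemma) {
        String paradigmClass = PARADIGM_OF.get(lemma);
        String stem = lemma.substring(0, lemma.length() - 2); // strip -cu
        return stem + FUTURE_1SG_ENDING.get(paradigmClass);
    }

    public static void main(String[] args) {
        System.out.println(future1Sg("pilucu")); // prints piluswAnu
    }
}
```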

2.4.1.2 A Two Dimensional Taxonomy of Morphology

A two-dimensional taxonomy of morphological theories was proposed by Stump (2001). He distinguished two axes along which theories of inflectional morphology may be situated relative to one another: the lexical/inferential axis and the incremental/realizational axis, which are orthogonal to each other.

In a lexical theory of inflectional morphology, inflectional morphemes are lexically

listed, and are therefore subject to the same principles of lexical insertion as ordinary

lexical morphemes. In a lexical theory, the Telugu verb form “pAduwAnu” arises

through the insertion of the lexically listed morphemes “pAdu”, “wA”, and “nu” into

a particular constituent structure. Inferential theories, by contrast, rely on rules to

infer inflectionally complex word forms from more basic stems or from other word

forms; inflectional morphemes are not listed in the lexicon, but are the mark of a

particular step in the inference of a complex word form. Such inferences may be

stem-based or word-based: for example, "pAduwAnu" might be deduced from more basic stems through a chain of inferences pAd- → pAdu → pAduwA → pAduwAnu; alternatively, "pAduwAnu" might be inferred from the contrasting word form "pAdu".

In an incremental theory, each inflectional morpheme is associated with a particular

morpho-syntactic content—in the lexicon, if the theory is lexical, and in a rule of

inference, if the theory is inferential—and each complex combination of morphemes

acquires its morpho-syntactic properties cumulatively, through the combination of

the morpho-syntactic properties of the individual inflectional morphemes of which it

is composed. In a theory of this sort, "pAduwAvu" acquires the properties 'second person' and 'singular' through the lexical insertion of the agreement suffix "vu" or by means of a rule inferring "pAduwAvu" from the perfective stem "pAdu". Thus,

in an incremental approach to inflection, a word form’s morpho-syntactic content is

supplied in steps. In a realizational theory, a word form’s association with a

particular set of morpho-syntactic properties logically precedes the expression of

those properties by particular inflectional markings: it is precisely this association

that determines the lexical insertion of its affixes (if the theory is lexical) or

determines the rules by which it is inferred from a stem or related word form (if the

theory is inferential). In such a theory, the association 〈pAdu, {2 sg present

perfective indicative active}⟩ licenses either the lexical insertion of the morphemes

pAd, u, wA and vu, or the stem based chain of inferences pAd → pAdu → pAduwA

→ pAduwAvu, or the word-based inference of pAduwAvu from pAdu. Thus, in a

realizational approach to inflection, a language’s grammar specifies the sets of

properties with which a lexeme L may be associated, and for each such property set

σ, the morphology of the language defines the word form realizing the pairing (L, σ).

2.4.2 Computational Morphology

Computational approaches to morphology are concerned with formal devices, such as grammars and stochastic models, and with algorithms. Finite state automata and transducers can be used as formal devices to implement morphological grammars or statistical part-of-speech taggers (Roark et al., 2007). Computational morphological generators subscribe mostly to the Item and Arrangement model discussed previously (Section 2.4.1.1.1). In the current thesis, the morphological generation of Telugu words uses finite automata to check for patterns modelled on the rules of the grammar.

2.4.2.1 Finite State Morphology

Ordered rewrite rules, however, are not particularly conducive to computational applications, as there is no appropriate formal framework in which they can be readily formulated and implemented. In a nutshell, Finite State Morphology (FSM) provides the tools for turning such rules into practical morphological analysers and generators. The theoretical groundwork was laid out by Johnson (1972) and, independently in the early 1990s, by Kaplan and Kay (1994). They show that rewrite rules are equivalent in power to Finite State

transducers, which are a variant of Finite State automata that linguists are more

familiar with. Instead of accepting or rejecting a single string, as in the case of Finite

State automata, a Finite State transducer accepts or rejects two strings whose letters

are pair-matched, while still retaining the Markovian property of Finite State

transitions. As a result, Finite State transducers are simple, well understood and easy

to implement computationally. Moreover, it is also found that an ordered cascade of

rewrite rules can in principle be automatically compiled into a single Finite State

transducer, thus capturing the mapping from the underlying form to the surface form

in terms of paired strings.
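A finite state transducer can be pictured as an automaton over letter pairs. The toy checker below accepts a (lexical, surface) pair of equal-length strings under a single obligatory correspondence, c:s before w, echoing the sandhi example from Section 2.4.1.1.2; everything about it (rule, names, scope) is a simplified assumption for illustration:

```java
public class PairTransducer {
    // Accept a lexical/surface pair if the strings match letter by letter,
    // except that a lexical 'c' immediately before 'w' must surface as 's'
    // (an obligatory two-level correspondence, toy version).
    public static boolean accepts(String lexical, String surface) {
        if (lexical.length() != surface.length()) return false;
        for (int i = 0; i < lexical.length(); i++) {
            char l = lexical.charAt(i);
            char s = surface.charAt(i);
            boolean beforeW = i + 1 < lexical.length() && lexical.charAt(i + 1) == 'w';
            if (l == 'c' && beforeW) {
                if (s != 's') return false;   // the c:s rule is obligatory here
            } else if (l != s) {
                return false;                 // otherwise letters must be identical
            }
        }
        return true;
    }
}
```

A real transducer also handles pairs of unequal length via epsilon transitions; the fixed-length loop above keeps the sketch minimal.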

Computational morphology saw the development of Two-Level Morphology

(Koskenniemi 1983), where contextual constraints are expressed in parallel directly

between lexical and surface levels, rather than as rules applied in serial order. Ever

since gaining prominence in the 1980s, Two-Level Morphology has become a staple

in computational linguistics. But it is not the easiest tool to use. The two-level

commitment forces one to directly manipulate input-output letter strings, and

represent serial rules as parallel constraints. This can be a highly unintuitive and

labour-intensive process, even for experienced programmers.

In the current thesis, finite state automata are used extensively in the morphology engine. They are used in pattern matching to categorize a given word into one of a number of classes specified by the grammar rules in Krishnamurti (1985). All the grammar rules implemented by the Telugu morphology engine are specified in the form of finite automata.
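As a small illustration of this pattern-matching use of finite automata, the following three-state DFA (hypothetical, not the thesis implementation) recognizes transliterated roots ending in "cu", the kind of pattern that could assign a verb to a paradigm class:

```java
public class SuffixAutomaton {
    // States: 0 = no progress, 1 = last letter was 'c',
    // 2 = last two letters were "cu" (accepting).
    public static boolean endsInCu(String word) {
        int state = 0;
        for (char ch : word.toCharArray()) {
            switch (state) {
                case 0 -> state = (ch == 'c') ? 1 : 0;
                case 1 -> state = (ch == 'u') ? 2 : (ch == 'c' ? 1 : 0);
                case 2 -> state = (ch == 'c') ? 1 : 0;
            }
        }
        return state == 2; // accept only in the final state
    }

    public static void main(String[] args) {
        System.out.println(endsInCu("pilucu")); // prints true
        System.out.println(endsInCu("ammu"));   // prints false
    }
}
```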

2.5 A Short History of Morphological Analyzers and Generators

Guido Minnen, John Carroll and Darren Pearce (2000) developed a robust applied morphological generator for English. The generator produces a word form from a given specification of a lemma, a part-of-speech label and an inflectional type. It was built using data from several large corpora and machine-readable dictionaries, and is packaged as a UNIX filter, making it easy to integrate into applications.

Akshar Bharati et al. (2001) developed an algorithm for unsupervised learning of morphological analysis and generation for inflectionally rich languages. The algorithm uses the frequencies of occurrence of word forms in a raw corpus. They introduce the concept of an 'observable paradigm' by forming equivalence classes of feature structures which are not otherwise obvious. The frequency of word forms for each equivalence class is collected from such data for known paradigms. When the morphological analyser cannot recognize an inflected form, the possible stem and paradigm are guessed using the corpus frequencies. The method assumes that the morphological package makes use of paradigms, and the package is able to guess a stem-paradigm pair for an unknown word. This method depends only on the frequencies of the word forms in raw corpora and does not require any linguistic rules or a tagger. The performance of the system depends on the size of the corpora.

Madhavi Ganapathiraju and Lori Levin (2006) presented a rule-based morphological generator for Telugu nouns and verbs. The implementation was a Perl program modelled on the grammar rules of C. P. Brown's and Krishnamurti's grammar books.

Sri Badri Narayanan R., Saravanan S., and Soman K. P. (2009) presented a data-driven suffix list and concatenation algorithm for a Telugu morphological generator that requires no orthographic or morphotactic rules, using automated extraction of the suffix list and an efficient algorithm for concatenating the lemma and the morphological features. The preliminary results obtained from this system are significant.

Ramakanth Kumar P., Shambhavi B. R., Srividya K., Jyothi B. J., Spoorti Kundargi, and Varsha Shastri G. (2011) developed a paradigm-based morphological generator and analyser using a trie-based data structure. The generator and analyser can handle up to 3700 root words and around 88,000 inflected words.

Parameswari K. (2011) developed a Tamil morphological analyzer and generator using the Apertium toolkit. This attempt involves a practical adoption of lttoolbox for modern standard written Tamil in order to develop an improvised open-source morphological analyzer and generator. The tool uses finite state transducers (FSTs) for one-pass analysis and generation, and the database is developed in the morphological model called word and paradigm.

Malin Ahlberg, Markus Forsberg, and Mans Hulden (2014) present a semi-supervised approach to the problem of paradigm induction from inflection tables. The system extracts generalizations from inflection tables, representing the resulting paradigms in an abstract form.

Ahlberg, Forsberg, and Hulden (2015) present a paradigm classification approach to supervised learning of morphology, which combines a non-probabilistic strategy of inflection table generalization with a discriminative classifier to permit the reconstruction of complete inflection tables for unseen words.

Girija V. R. and T. Anuradha (2017) report the design of a morphological analyser for Malayalam based on finite state techniques that can be used for text analysis, where the model recognizes and strips the morphemes in a string of text.

2.6 Summary

This chapter presented a literature review of the subtasks of Natural Language Generation, in particular surface realization, the topic of this thesis. Research in surface realization for Indian languages is still very young in comparison to that for European languages. Morphological generation is very complex in Indian languages; therefore a review of morphological generation in both the general context and the Indian context was also presented in this chapter.

-----------------------CHAPTER-3

TELUGU REALIZATION ENGINE OVERVIEW

3.1 Introduction

Telugu is a Dravidian language with nearly 100 million first language speakers. It is

a morphologically rich language (MRL) with a simple syntax where the sentence

constituents can be ordered freely without impacting the primary meaning of the

sentence. In this thesis we describe a surface realization engine for Telugu. Surface

realization is the final subtask of an NLG pipeline (Reiter and Dale, 2000) that is

responsible for mechanically applying all the linguistic choices made by upstream

subtasks (such as microplanning) to generate a grammatically valid surface form.

3.1.1 Architecture of Natural Language Generation Systems

Based on significant experience in building NLG systems, an appropriate architecture for Natural Language Generation systems became very important. In the early days of work in NLG, a distinction was made between 'strategy' and 'tactics', where strategy is concerned with determining 'what to say' and tactics with deciding 'how to say it'. This distinction results in a particular modularization in which NLG systems have two specific tasks, referred to as text planning and linguistic realization.

Later, as the understanding of NLG systems increased, it became common to construct NLG systems with an additional module between the two modules discussed above. This intermediate module is referred to as the microplanner (shown in Fig 3.1). The use of embedded graphics, formatting mark-up, and hypertext links motivated the use of the term document planner in place of text planner. Finally, the linguistic realizer is more generally termed the surface realizer, to acknowledge the fact that the surface forms are not always linguistic in nature.

Figure 3.1 NLG System Architecture

3.1.1.1 Document Planning

A Document Planner decides what information to communicate and determines how

this information is to be structured for presentation. Two tasks are performed in Document Planning: content determination, which decides what information is to be communicated, and document structuring, which structures the information for presentation. The input for the document planner is a

four-tuple (k,c,u,d) where k is the knowledge source, c is the communicative goal, u

is the user model and d is the discourse history. The Document Planner takes the

four-tuple input and performs the following activities:

a) Construction of messages from the information source;

b) Decision making as to which message is required to satisfy the

communicative goal;

c) Document structuring to present the messages satisfying the communicative

goal in a proper manner.

3.1.1.2 Microplanning

A document plan specifies the final structure and content of the text to be generated in very broad terms. The purpose of the Microplanner is to refine the document plan, producing a more fully specified text specification chosen from the many possible output texts that are compatible with the document plan. The Microplanner performs the following subtasks:

a) Expressive Lexicalization: It decides the lexical items to be used to realize

the conceptual elements specified by the Document Plan.

b) Linguistic Aggregation: It determines how messages should be composed

to generate specifications for linguistic units.

c) Referring Expression Generation: It determines how the entities in the messages are to be referred to.

3.1.1.3 Surface Realization

The surface realization component of a Natural Language Generation System

produces a well-formed sentence as constrained by the contents of a lexicon and

grammar. It takes as input an abstract specification of the text. There are three kinds

of processing involved in surface realization:

Syntactic Realization: Syntactic realization uses grammatical knowledge to choose

inflections based on grammatical features, add function words if required, and decide

the order of the components. For example, in Telugu the object of the sentence

usually precedes the verb and the syntactic realizer has to take care of the order.

Morphological Realization: Morphological realization computes the inflected

forms of the words depending on the grammatical features. For example the plural

form of “ceVyyi” (hand) is “cewulu” (hands).

Orthographic Realization: Orthographic realization deals with punctuation,

formatting and casing.
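The three kinds of processing can be pictured together in a toy pipeline. Everything below is a hypothetical simplification (one irregular plural, a default -lu plural, fixed SOV order, a final period for orthography), not the thesis engine:

```java
import java.util.Map;

public class SurfaceRealizerSketch {
    // Morphological realization: look up an inflected form
    // (e.g. the irregular plural of ceVyyi "hand").
    private static final Map<String, String> IRREGULAR_PLURAL = Map.of("ceVyyi", "cewulu");

    public static String pluralOf(String noun) {
        // Fall back to the regular -lu plural (simplified assumption).
        return IRREGULAR_PLURAL.getOrDefault(noun, noun + "lu");
    }

    // Syntactic realization: Telugu places the object before the verb,
    // so constituents are ordered subject-object-verb.
    // Orthographic realization: append the sentence-final punctuation.
    public static String realise(String subject, String object, String verb) {
        return subject + " " + object + " " + verb + ".";
    }

    public static void main(String[] args) {
        System.out.println(pluralOf("ceVyyi")); // prints cewulu
        System.out.println(realise("nenu", "pAta", "pAduwAnu"));
    }
}
```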

The abstract specification of the text that comes as input to the surface realizer is the text specification, which is constructed from the document plan by the microplanner. The text specification describes what text is to be generated and how the text is to be formatted. There are thus two distinct aspects to the processing of text specifications: one concerned with mapping logical constructs (specific formatting constructs in the text) to appropriate document formatting, and the other concerned with applying grammatical knowledge to the phrase specifications (grammatical objects such as phrases) so that the final surface forms are realized.

The process of generating the surface forms from the phrase specifications is the area with which this thesis is concerned. The generation process depends on the degree to which the phrase specification abstracts away from the actual surface forms. The phrase specifications are referred to as input specifications in the remainder of the thesis.

Our Telugu realization engine is designed following the SimpleNLG (Gatt and

Reiter, 2009) approach which recently has been used to build surface realizers for

German (Marcel Bollmann, 2011), Filipino (Ethel Ong et al., 2011), French (Vaudry

and Lapalme, 2013) and Brazilian Portuguese (Rodrigo de Oliveira and Sripada,

2014). Figure 3.2 shows an example input specification in XML corresponding to

the following Telugu sentence.

vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru. (They are walking

slowly in a beautiful garden.)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
<sentence type=" " predicatetype="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="human" number="plural" person="third"
casemarker=" " stem="basic">vAdu</head>
</nounphrase>
<nounphrase role="complement">
<modifier pos="adjective" type="descriptive" suffix="aEna">aMxamu</modifier>
<head pos="noun" gender="nonmasculine" number="singular" person="third"
casemarker="lo" stem="basic">wota</head>
</nounphrase>
<verbphrase type=" ">
<modifier pos="adverb" suffix="gA">neVmmaxi</modifier>
<head pos="verb" tensemode="presentparticiple">naducu</head>
</verbphrase>
</sentence>
</document>
Figure 3.2. Example XML Input Specification

Several realizers are available for English and other European languages (Gatt and

Reiter, 2009; Vaudry and Lapalme, 2013; Marcel Bollmann, 2011; Elhadad and

Robin, 1996). Some general purpose realizers (as opposed to realizers built as part of

an MT system) have started appearing for Indian languages as well. Smriti Singh et al. (2007) report a Hindi realizer that includes functionality for choosing post-position markers based on semantic information in the input. This is in contrast to the realization engine reported in the current chapter, which assumes that the choices of constituents, root words and grammatical features are all preselected before the realization engine is called. As per the review of the literature, there are no realization engines for Telugu. However, a rich body of work exists for Telugu language processing in the context of machine translation (MT). In this context, earlier work reported Telugu morphological processors that perform both analysis and generation (Badri et al., 2009; Rao and Mala, 2011; Ganapathiraju and Levin, 2006), but none of the authors reported a surface realization engine for Telugu. In this thesis we design a surface realization engine for Telugu.

3.2 The SimpleNLG Framework

A realization engine is an automaton that generates well-formed sentences according

to a grammar. Therefore, while building a realizer the grammatical knowledge

(syntactic and morphological) of the target language is an important resource.

Realizers are classified based on the source of grammatical knowledge. There are

realizers such as FUF/SURGE that employ grammatical knowledge grounded in a

linguistic theory (Elhadad and Robin, 1996). There have also been realizers that use

statistical language models such as Nitrogen (Knight and Hatzivassiloglou, 1995)

and Oxygen (Habash, 2000). While linguistic theory based grammars are attractive,

authoring these grammars can be a significant endeavor (Mann and Matthiessen,

1985). Besides, non-linguists (most application developers) may find working with

such theory heavy realizers difficult because of the initial steep learning curve.

Similarly, building wide-coverage statistical models of language is labour intensive, requiring the collection and analysis of large quantities of corpora. It is this

initial cost of building grammatical resources (formal or statistical) that becomes a

significant barrier in building realization engines for new languages. Therefore, it is

necessary to adopt grammar engineering strategies that have low initial costs. The

surface realizers belonging to the SimpleNLG family incorporate grammatical knowledge corresponding to only the most frequently used phrases and clauses and therefore involve low-cost grammar engineering. The main features of a realization engine following the SimpleNLG framework are:

1. A wide coverage morphology module independent of the syntax module.

2. A light syntax module that offers functionality to build frequently used phrases

and clauses without any commitment to a linguistic theory. The large uptake of the SimpleNLG realizer both in academia and in industry shows that the lightweight approach to syntax is not a limitation.

3. Support for ‘canned’ text elements that can be dropped directly into the generation process, achieving wider syntactic coverage without actually extending the syntactic knowledge in the realizer.

4. A rich set of lexical and grammatical features that guide the morphological and

syntactic operations locally in the morphology and syntax modules

respectively. In addition, features enforce agreement amongst sentence

constituents more globally at the sentence level.

3.3 Telugu Realization Engine

The current work follows the SimpleNLG framework. However, because of the known differences between Telugu and English, the SimpleNLG codebase could not be reused for building the Telugu realizer. Instead, our Telugu realizer was built from scratch, adapting several features of the SimpleNLG framework to the context of Telugu. There are significant variations in the spoken and written usage of Telugu. There are also significant dialectal variations, the most prominent of which correspond to the four regions of the state of Andhra Pradesh, India – Northern, Southern, Eastern and Central (Brown, 1991). In addition, Telugu has absorbed (and Teluguised) vocabulary from other Indian languages such as Urdu and Hindi. As a result, one design choice for the Telugu realization engine is to decide the specific variety of Telugu whose grammar and vocabulary needs to be represented in the system. In our work, we use the grammar of modern Telugu developed by Krishnamurti and Gwynn (1985). We

have decided to include only a small lexicon in our realization engine. This is

because host NLG systems that use our engine could use their own application

specific lexicons. Moreover, modern Telugu has been absorbing large amounts of English vocabulary, particularly in the fields of science and technology, whose morphology is unknown. Thus specialized lexicons could be required to model the

morphological behaviour of such vocabulary. In the rest of this section we present

the design of our Telugu realizer.

As stated in section 3.2, a critical step in building a realization engine for a new

language is to review its grammatical knowledge to understand the linguistic means

offered by the language to express meaning. We reviewed Telugu grammar as

presented in our chosen grammar reference (Krishnamurti and Gwynn 1985). From a

realizer design perspective the following observations proved useful:

1. Primary meaning in Telugu sentences is expressed mainly using inflected forms of content words and case markers or postpositions, rather than by the position of words/phrases in the sentence. This means morpho-phonology plays a bigger role in sentence creation than syntax.

2. Because sentence constituents in Telugu can be ordered freely without

impacting the primary meaning of a sentence, sophisticated grammar knowledge

is not required to order sentence level constituents. It is possible, for instance, to

order constituents of a declarative sentence using a standard predefined

sequence (e.g. Subject + Object + Verb).

3. Telugu, like many other Indian languages, is not governed by a phrase structure grammar; instead it fits better into the Paninian grammar formalism (Bharati et al., 1995), which uses dependency grammar. This means dependency trees represent the structure of phrases and sentences. At the sentence level the verb phrase is the head and all the other constituents have a dependency link to the head. At the phrase level too, head-modifier dependency structures are a better fit.

4. Agreement amongst sentence constituents can get quite complicated in Telugu.

Several grammatical and semantic features are used to define agreement rules.

Well-formed Telugu sentences are the result of applying agreement rules at the

sentence level on sentence constituents constructed at the lower level processes.

Based on the above observations we found that the SimpleNLG framework with its

features mentioned in section 3.2 is a good fit for guiding the design of our Telugu

realization engine. Thus our realization engine is designed with a wide coverage

morphology module and a light-weight syntax module where features play a major

role in performing sentence construction operations.

Having decided the SimpleNLG framework for representing and operationalizing the

grammatical knowledge, the following design decisions were made while building

our Telugu realizer (we believe that these decisions might drive the design of realizers for other Indian languages as well):

1. Use WX notation (shown in the Appendix) for representing Indian language orthography (described in detail in section 3.3.1).

2. Define the tag names and the feature names used in the input XML file

(shown in Figure 3.2) adapted from SimpleNLG and (Krishnamurti and

Gwynn, 1985) for specifying input to the realization engine. It is hoped that using English terminology for specifying input to our Telugu realizer simplifies input creation for application developers, who usually know English well and possess at least a basic knowledge of English grammar.

3. In order to offer flexibility to application developers our realization

engine orders sentence level constituents (except verb which is always

placed at the end) using the same order in which they are specified in the

input XML file. This allows application developers to control ordering

based on discourse level requirements such as focus.

4. The grammar terminology used in our engine does not directly

correspond to the Karaka relations (Bharati et al., 1995) from the

Paninian framework because we use the grammar terminology specified

by Krishnamurti and Gwynn (1985), which is much closer to the terminology used in SimpleNLG. We are currently investigating opportunities to align our design more closely with the Paninian framework. We expect such an approach to help us when extending our framework to generate other Indian languages as well.

3.3.1 WX-Notation

WX notation (shown in the Appendix) is a very popular transliteration scheme for representing Indian languages in the ASCII character set. This scheme is widely used in Natural Language Processing in India. In WX notation, lower-case letters are used for un-aspirated consonants and short vowels, while upper-case letters are used for aspirated consonants and long vowels. The retroflex voiced and voiceless consonants are mapped to ‘t, T, d and D’. The dentals are mapped to ‘w, W, x and X’. Hence the name of the scheme, “WX”, referring to this idiosyncratic mapping.
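The case-sensitivity of the scheme can be illustrated with a small mapping table. This is a hypothetical fragment for illustration only (the complete mapping is given in the Appendix):

```java
import java.util.Map;

public class WxDemo {
    // A small, illustrative fragment of the WX mapping: lower case for
    // dentals and short vowels, upper case for aspirates and long vowels.
    static final Map<String, String> WX = Map.of(
        "w", "త",  // dental ta (hence the "W" in the name "WX")
        "W", "థ",  // aspirated dental tha
        "x", "ద",  // dental da (hence the "X" in the name "WX")
        "X", "ధ",  // aspirated dental dha
        "t", "ట",  // retroflex ta
        "T", "ఠ",  // aspirated retroflex tha
        "a", "అ",  // short vowel a
        "A", "ఆ"   // long vowel A
    );

    public static void main(String[] args) {
        // Case alone distinguishes dental from aspirate, short from long.
        System.out.println(WX.get("w") + " vs " + WX.get("W"));
        System.out.println(WX.get("a") + " vs " + WX.get("A"));
    }
}
```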

3.3.2 The Input Specification Scheme

There is a wide range of approaches to specifying input to linguistic realization. The approaches differ in ways ranging from deep views of what should be involved in the process of realization to simple notational differences. A few input specifications, starting with a very abstract representation, are discussed in the following sections using the example sentence below.

rAmudu kowini karrawo kottAdu. (Ramu beat the monkey with a stick)

3.3.2.1 Skeletal Proposition

The Skeletal Proposition in Figure 3.3 is a very abstract representation of the

example sentence in section 3.3.2. This representation does not say anything about

the content of the individual noun phrases in the sentence. This representation only

indicates that an event of “kottu” (beat) happened and identifies three participants in

this event as c1, p1, and m respectively.

Figure 3.3 Skeletal Proposition

3.3.2.2 Meaning Specification

The representation of Figure 3.4 does not specify that the object of action “kottu”

(beat) was “kowi” (monkey). This and other information omitted by the Skeletal

Proposition representation is identified in the knowledge base by the Microplanner.

Figure 3.4 Meaning Specification

The Microplanner not only selects the required elements in the knowledge base for

inclusion in the text to be generated, but also takes certain decision about the

structure into which the information will be placed. The result of this is a

representation called the Meaning Specification.

3.3.2.3 Lexicalized Case Frames

The structure presented in the previous section is still abstract because many realizers expect the selection of the base lexemes used to express the semantic content to have been carried out by the previous stage. Once these decisions are made, the usefulness of the content of the index and features is exhausted. These features are omitted in the Lexicalized Case Frame representation shown in Figure 3.5.

Figure 3.5 Lexicalized Case Frame

The base lexemes used in this representation still need to go through the morphological process to become surface word forms.

3.3.2.4 Abstract Syntactic Structures

In certain cases, it may be appropriate for the processes carried out before the linguistic realizer is invoked to make certain decisions about the grammatical structure. For example, additional information can be added to Figure 3.5 about which argument of the representation should be placed in focus. If the representation specifies that the second argument is to be in focus then the realizer produces the following sentence:

kowini rAmudu karrawo kottAdu.

The role of a realizer in such a case becomes very simple: it only encodes the grammatical knowledge of the language in question and applies it to an input specification called an Abstract Syntactic Structure. Figure 3.6 shows such a representation.

Figure 3.6 Abstract Syntactic Structures

3.3.2.5 Canned Text and Templates

Sometimes certain constituents of a sentence which are sufficiently invariant can be predetermined and stored directly as text strings. For example, the closing salutation of a letter like “aMxariki suBAkAMkRalu” can be stored as a text string. Figure 3.7 shows an input with Canned Text.

Figure 3.7 Template for Canned Text

3.3.2.6 SimpleNLG XML Specification

A text specification, together with its children (for example, SPhraseSpecs) can be

expressed in XML, based on a predefined XML schema that mirrors the relevant

parts of the internal structure of a SimpleNLG specification. Figure 3.8 is an XML

input specification for the following example sentence.

The patient as a result of the procedure had an adverse contrast media reaction, had

a decreased platelet count and went into cardiogenic shock.

<?xml version="1.0" encoding="utf-8"?>


<Document xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" cat="PARAGRAPH"
xsi:schemaLocation="http://code.google.com/p/simplenlg/schemas/version1"
xmlns="http://code.google.com/p/simplenlg/schemas/version1">
<child xsi:type="SPhraseSpec">
<preMod xsi:type="PPPhraseSpec">
<head cat="PREPOSITION">as a result of</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">procedure</head>
<spec cat="DETERMINER">the</spec>
</compl>
</preMod>
<subj xsi:type="NPPhraseSpec">
<head cat="NOUN">patient</head>
<spec cat="DETERMINER">the</spec>
</subj>
<vp xsi:type="CoordinatedPhraseElement" conj="and">
<coord xsi:type="VPPhraseSpec" TENSE="PAST">
<head cat="VERB">have</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">adverse contrast media reaction</head>
<spec cat="DETERMINER">a</spec>
</compl>
</coord>
<coord xsi:type="VPPhraseSpec" TENSE="PAST">
<head cat="VERB">have</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">decreased platelet count</head>
<spec cat="DETERMINER">a</spec>
</compl>
</coord>
<coord xsi:type="VPPhraseSpec" TENSE="PAST">
<head cat="VERB">go</head>
<postMod xsi:type="PPPhraseSpec">
<head cat="PREPOSITION">into</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">cardiogenic shock</head>
</compl>
</postMod>
</coord>
</vp>
</child>
</Document>
Figure 3.8 SimpleNLG XML Specifications

The input to the Telugu surface realization engine is a tree structure specified in

XML, modelled on the SimpleNLG XML specification, an example is shown in

Figure 3.2. The root node is the sentence and the nodes at the next level are the

constituent phrases that have a role feature representing the grammatical functions

such as subject, verb and complement performed by the phrase. Each of the lower

level nodes could in turn have their own head and modifier children. Each node can also take attributes which represent grammatical or lexical features such as number and tense.

For example the subject node in Figure 3.2 can be understood as follows:

<nounphrase role="subject">
<head pos="pronoun" gender="human" number="plural" person="third" casemarker="" stem="basic">vAdu</head>
</nounphrase>

This node represents the noun phrase that plays the role of subject in the sentence. The subject node has a single child, the head, which is in the nominative (its case marker is null). The lexical features of the head “vAdu” are: part-of-speech (pos), which is pronoun; person, which is third; number, which is plural; gender, which is human; and case marker, which is null.
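A sketch of how such a node might be parsed into its lexical features, using the JDK’s standard DOM parser, is shown below. The class and method names here are illustrative assumptions, not the engine’s actual code:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class InputReaderSketch {
    // Parse a single nounphrase node and summarise the head's lexical features.
    public static String describeHead(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element head = (Element) doc.getDocumentElement()
                .getElementsByTagName("head").item(0);
        return head.getTextContent().trim() + ":" + head.getAttribute("pos")
                + "," + head.getAttribute("gender")
                + "," + head.getAttribute("number")
                + "," + head.getAttribute("person");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nounphrase role=\"subject\">"
                + "<head pos=\"pronoun\" gender=\"human\" number=\"plural\""
                + " person=\"third\" casemarker=\"\" stem=\"basic\">vAdu</head>"
                + "</nounphrase>";
        System.out.println(describeHead(xml)); // vAdu:pronoun,human,plural,third
    }
}
```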

3.4 System Architecture

The sentence construction for Telugu involves the following three steps:

1. Construct word forms by applying morpho-phonological rules selected based on features associated with a word (word-level morphology).

2. Combine word forms to construct phrases using ‘sandhi’ (a morpho-phonological fusion operation) if required (phrase building).

3. Apply sentence-level agreement rules selected based on the relevant features, and order the sentence constituents following a standard predefined sequence (sentence building).

Our system architecture, shown in Figure 3.9, involves a morphology engine, a phrase builder and a sentence builder corresponding to these three steps. The rest of the section presents how the example sentence is generated from the input specification in Figure 3.2.
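The data flow through the three steps can be sketched as a minimal pipeline. This is a toy illustration only; the real modules, described in sections 3.5 to 3.10, do far more work:

```java
import java.util.List;

public class PipelineSketch {
    // Step 1 (word-level morphology): here reduced to the simplest sandhi,
    // plain fusion of a stem and a marker.
    static String inflect(String root, String suffix) {
        return root + suffix;
    }

    // Step 2 (phrase building): combine inflected word forms into a phrase.
    static String buildPhrase(String... words) {
        return String.join(" ", words);
    }

    // Step 3 (sentence building): the engine preserves the input order of
    // constituents, with the verb phrase always last; this sketch assumes
    // the caller already passes the verb phrase last.
    static String buildSentence(List<String> phrases) {
        return String.join(" ", phrases) + ".";
    }

    public static void main(String[] args) {
        String complement = buildPhrase("aMxamEna", inflect("wota", "lo"));
        String verbPhrase = buildPhrase("neVmmaxigA", "naduswunnAru");
        System.out.println(buildSentence(List.of("vAlYlYu", complement, verbPhrase)));
        // vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru.
    }
}
```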

Figure 3.9. System Architecture

3.5 Input Reader

The Input Reader is the module which acts as an interface between the sentence

builder and the input. Currently the input reader accepts only our XML input

specification but in the future we would like to extend it to accept other input

specifications such as SSF (Bharati et al., 2007). This module ensures that the rest of

the engine receives input in the required form.

3.6 Sentence Builder

The Sentence Builder is the main module of the current system which has a

centralized control over all the other modules. It performs four subtasks:

1. Sentence Builder first checks for predefined grammatical functions such as

subject, object, complement, and verb which are defined as features of the

respective phrases in the input. It then calls the appropriate element builder for

each of these to create element objects which store all the information extracted

from the XML node.

2. These element objects are then passed to the appropriate phrase builder, which returns a string containing the phrase constructed according to the requirements of the input.

3. After receiving all the phrases from the appropriate phrase builders the Sentence

Builder applies the agreement rules. Since Telugu is a nominative-accusative language, the verb agrees with the argument in the nominative case. Therefore the predicate inflects based on the gender, person and number of the noun in the nominative case. There are three features at the sentence level, namely type, predicate-type, and respect. The feature type refers to the type of the sentence. The current work handles only simple sentences; therefore it is not set to any value. The feature predicate-type can have any one of three values, namely verbal, nominative, and abstract. The feature respect can have the value yes or no. The agreement also depends on the features predicate-type and respect.

4. Finally, the sentence builder orders the phrases in the same order they are

specified in the input.

In the case of the example in Figure 3.2 the sentence builder finds three grammatical functions: one finite verb, one locative complement, and one nominative subject. In the example input of Figure 3.2 the value of the feature predicate-type is “verbal” and that of respect is “no”. The Sentence Builder retrieves the appropriate rule from an externally stored agreement rule base. In this example, where predicate-type is set to verbal, the number of the subject is plural and the gender is human, the Sentence Builder retrieves the suffix “nnAru”. This suffix is then agglutinated to the verb “naduswu”, which is returned by the morphology engine, to generate the final verb form “naduswunnAru” with the required agreement with the subject.

“naduswu” + “nnAru” → “naduswunnAru”

After the construction of the sentence the Sentence Builder passes it to the Output

Generator which prints the output.

3.7 Element Builder

The element builder of each grammatical function checks for lower-level functions like head and modifier and calls the appropriate element builder for each. These element builders convert the lexicalized input into element objects with the grammatical constituents as their instance variables and return the element objects to the Sentence Builder. Our realizer creates four types of element objects, namely SOCElement, VAElement, AdjectiveElement, and AdverbElement. The SOCElement represents the grammatical functions subject, object and complement. The subject in the example of Figure 3.2 is “vAdu”, for which a SOCElement is created with the specified features. Similarly a SOCElement is created for the complement “wota” and its modifier “aMxamu”, which is an AdjectiveElement. Finally a VAElement is created for the verb “naducu” and the modifier “neVmmaxi”, which is an AdverbElement.
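The element-object hierarchy might be sketched as below. The four class names come from the text, but the instance variables are illustrative assumptions, since the thesis does not list the actual fields:

```java
// Hypothetical sketch of the element-object hierarchy.
abstract class Element {
    final String lemma;                     // the root word from the input
    Element(String lemma) { this.lemma = lemma; }
}

class SOCElement extends Element {          // subject, object, complement
    String pos, gender, number, person, caseMarker, stem;  // assumed fields
    SOCElement(String lemma) { super(lemma); }
}

class VAElement extends Element {           // verbs (and their auxiliaries)
    String tenseAspectMode;                 // assumed field
    VAElement(String lemma) { super(lemma); }
}

class AdjectiveElement extends Element {    // adjectival modifiers
    String suffix;                          // assumed field
    AdjectiveElement(String lemma) { super(lemma); }
}

class AdverbElement extends Element {       // adverbial modifiers
    String suffix;                          // assumed field
    AdverbElement(String lemma) { super(lemma); }
}

public class ElementDemo {
    public static void main(String[] args) {
        SOCElement subject = new SOCElement("vAdu");
        subject.pos = "pronoun";
        subject.number = "plural";
        System.out.println(subject.lemma + "/" + subject.number); // vAdu/plural
    }
}
```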

3.8 Phrase Builder

Telugu sentences express most of the primary meaning in terms of morphologically

well-formed phrases or word groups. In Telugu the main and auxiliary verbs occur

together as a single word. Therefore their generation is done by the morphology

engine. Telugu sentences are mainly made up of four types of phrases - Noun Phrase,

Verb Phrase, Adjective Phrase, and Adverb Phrase. Noun phrases and verb phrases

are the main constituents in a sentence while the Adjective Phrase and the Adverb

Phrase only play the role of a modifier in a noun or verb phrase. There is one feature at the noun phrase level, “role”, which specifies the role of the noun phrase in the sentence. The phrase builder passes the elements constructed by the element builder to the morphology engine and gets back the respective phrases with appropriately inflected words. In the example input of Figure 3.2, there are three constituent phrases, viz. two noun phrases for subject and complement and a verb phrase. One of the noun phrases also contains an adjective phrase, which is an optional modifying element of noun heads in head-modifier noun phrases. The adjective phrase may be a single element or sometimes composed of more than one element. The verb phrase also contains an adverb phrase, which is generally considered a modifier of the verb. The phrase builder passes five objects, i.e. two SOCElement objects, one AdjectiveElement object, one VAElement object, and one AdverbElement object, to the morphology engine and gets back five inflected words which finally become three phrases, viz. the two noun phrases “vAlYlYu” and “aMxamEna wotalo”, and one verb phrase “neVmmaxigA naduswu”.

3.9 Morphology Engine

The morphology engine is the most important module in the Telugu realization

engine. It is responsible for the inflection and agglutination of the words and phrases.

The morphology engine behaves differently for different words based on their part of

speech (pos). The morphology engine takes an element object as input and returns to the phrase builder the inflected or agglutinated word forms based on the rules of the language. In the current work the morphology engine is a rule-based engine with a lexicon to account for exceptions to the rules. The rules used by the morphology engine are stored in external files to allow changes to be made externally.
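The rule-plus-exception-lexicon design can be sketched as follows. This is a toy illustration: the lexicon entry reflects the irregular plural discussed in section 3.9.2, while the single default “+lu” rule is a drastic simplification of the full rule set stored in the external files:

```java
import java.util.Map;

public class MorphologySketch {
    // Irregular forms come from the lexicon and are checked first ...
    static final Map<String, String> LEXICON = Map.of("vAdu|plural", "vAlYlYu");

    // ... regular forms are produced by rules. The real engine loads its
    // rules from external files; here a single default rule stands in.
    static String pluralize(String noun) {
        String irregular = LEXICON.get(noun + "|plural");
        if (irregular != null) {
            return irregular;               // exception found in the lexicon
        }
        return noun + "lu";                 // common plural suffix (simplified)
    }

    public static void main(String[] args) {
        System.out.println(pluralize("vAdu"));  // vAlYlYu (lexicon exception)
        System.out.println(pluralize("wota"));  // wotalu  (rule-based)
    }
}
```

Keeping the rules and the exception lexicon outside the code means coverage can be widened without recompiling the engine.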

3.9.1 Noun

The noun is the head of the noun phrase. Telugu nouns are divided into three classes, namely (i) proper nouns and common nouns, (ii) pronouns, and (iii) special types of nouns (e.g. numerals) (Krishnamurti and Gwynn, 1985). All nouns except a few special-type nouns have gender, number, and person. Noun morphology mainly involves plural formation and case inflection. All the plural formation rules from sections 6.11 to 6.13 of our grammar reference have been implemented in our engine. The complement in the example of Figure 3.2 has the noun head “wotalo”. The word “wota” along with its feature values can be written as follows:

“wota”, noun, nonmasculine, singular, third, basic, “lo” → wotalo

The formation of this word is very simple because the word “wota” in its singular form and the case marker “lo” are agglutinated through a sandhi (a morpho-phonological fusion operation) formation as follows:

“wota” + “lo” → wotalo

3.9.2 Pronoun

Pronouns vary according to gender, number, and person. There are three persons in

Telugu, namely first, second, and third. The gender of nouns and pronouns in Telugu depends on the number. The relation between number and gender is shown in Table 3.1.

Number Gender
Singular masculine, non-masculine
Plural human, nonhuman
Table 3.1: Relationship between Number and Gender

Plural formation of pronouns is not rule based; therefore the plural forms are stored externally in the lexicon. The first person pronoun “nenu” has two plural forms: “memu”, which is the exclusive plural form, and “manamu”, which is the inclusive plural form. For the generation of the first person plural a feature called “exclusive” has to be specified with the value “yes” or “no”. Along with gender, number, and person there is one more feature, the stem. The stem can be either basic or oblique. The formation of the pronoun “vAlYlYu” in the example of Figure 3.2, which is the head of the subject, along with its feature values can be written as follows:

“vAdu”, pronoun, human, plural, third, basic, “” → vAlYlYu

In this case the stem is basic. The gender of the pronoun is human because the number is plural, as shown in Table 3.1. The word “vAlYlYu” is retrieved from the lexicon as the plural of “vAdu” matching these feature values.
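The lexicon lookup for pronoun plurals might be sketched as follows. The key format is an illustrative assumption; the forms themselves are those given above:

```java
import java.util.Map;

public class PronounLexiconSketch {
    // Pronoun plurals are irregular, so they are looked up, not derived.
    // The "exclusive" feature selects between the two first-person plurals.
    static final Map<String, String> PRONOUNS = Map.of(
        "nenu|exclusive=yes", "memu",     // we (not including the hearer)
        "nenu|exclusive=no",  "manamu",   // we (including the hearer)
        "vAdu|plural",        "vAlYlYu"   // they (human)
    );

    public static void main(String[] args) {
        System.out.println(PRONOUNS.get("nenu|exclusive=yes")); // memu
        System.out.println(PRONOUNS.get("vAdu|plural"));        // vAlYlYu
    }
}
```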

3.9.3 Adjective

Adjectives occur most often immediately before the noun they qualify. The basic adjectives or adjectival roots which occur only as adjectives are indeclinable (e.g. oka (one), ara (half)). Adjectives can also be derived from other parts of speech like verbs, adverbs, or nouns. The adjective “aMxamEna” in the example of Figure 3.2 is a derived adjective formed by adding the adjectival suffix “aEna” to the noun “aMxamu”. Its formation along with its feature values can be written as follows:

“aMxamu”, adjective, descriptive, “aEna” → aMxamEna

The current work does not take the type of an adjective into consideration; this will be included in a future version. The formation of this word is again through a sandhi formation as follows:

aMxamu + aEna → aMxamEna

Here the sandhi formation eliminates the “u” at the end of the first word and the “a” at the beginning of the second word, and the word “aMxamEna” is formed.
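The elision described above can be sketched as a simplified sandhi rule. This single vowel-elision rule reproduces the aMxamu + aEna example and the plain fusion of wota + lo; the engine’s actual sandhi rules are more varied:

```java
public class SandhiSketch {
    private static final String VOWELS = "aAiIuUeEoO";  // WX vowel letters

    // Simplified sandhi: when the first word ends in a vowel and the second
    // begins with one, drop both vowels and fuse; otherwise just concatenate.
    static String sandhi(String first, String second) {
        char last = first.charAt(first.length() - 1);
        char firstOfSecond = second.charAt(0);
        if (VOWELS.indexOf(last) >= 0 && VOWELS.indexOf(firstOfSecond) >= 0) {
            return first.substring(0, first.length() - 1) + second.substring(1);
        }
        return first + second;
    }

    public static void main(String[] args) {
        System.out.println(sandhi("aMxamu", "aEna")); // aMxamEna (both vowels elided)
        System.out.println(sandhi("wota", "lo"));     // wotalo (plain fusion)
    }
}
```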

3.9.4 Verb

Telugu verbs inflect to encode the gender, number and person of the subject along with tense-aspect-mode suffixes. As already mentioned, gender, number and person agreement is applied at the sentence level. At the word level, the verb is the most difficult word class to handle in Telugu because of the phonetic alterations applied to it before it is agglutinated with the tense-aspect-mode (TAM) suffix. Telugu verbs are classified into six classes (Krishnamurti, 1961). Our engine implements all these classes, and the phonetic alterations applicable to each of these classes are stored externally in a file. The verb in the example of Figure 3.2 is “naducu”. The formation of the verb form “naduswu” can be written as follows:

“naducu”, verb, present participle → naduswu

The word “naducu” belongs to class IIa, for which the phonetic alteration is to substitute “cu” with “s”, and therefore the word is altered as follows:

naducu → nadus

The TAM suffix for the present participle is “wu”, and the word becomes “naduswu”. The gender and number of the subject also play a role in the formation of the verb, as discussed in section 3.6.
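The class-based alteration followed by TAM suffixation can be sketched as follows. This is a hypothetical fragment: only class IIa and the present-participle suffix “wu” from the example are shown, and the real alterations are stored in an external file:

```java
import java.util.Map;

public class VerbMorphologySketch {
    // Per-class phonetic alterations applied to the verb root before the
    // TAM suffix is attached. Only class IIa is shown here; the engine
    // stores one such entry per verb class in an external file.
    static final Map<String, String[]> CLASS_ALTERATION = Map.of(
        "IIa", new String[] {"cu", "s"}     // e.g. naducu -> nadus
    );

    static String presentParticiple(String root, String verbClass) {
        String[] alt = CLASS_ALTERATION.get(verbClass);
        String stem = root;
        if (alt != null && root.endsWith(alt[0])) {
            // apply the class-specific substitution at the end of the root
            stem = root.substring(0, root.length() - alt[0].length()) + alt[1];
        }
        return stem + "wu";  // "wu" is the present-participle TAM suffix
    }

    public static void main(String[] args) {
        System.out.println(presentParticiple("naducu", "IIa")); // naduswu
    }
}
```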

3.9.5 Adverb

All adverbs fall into three semantic domains: those denoting time, place and manner (Krishnamurti and Gwynn, 1985). The adverb “neVmmaxigA” in the example of Figure 3.2 is a manner adverb as it describes the way they are walking: “neVmmaxigA naduswunnAru” (walking slowly). In Telugu, manner adverbs are generally formed by adding “gA” to adjectives and nouns. The formation of the adverb “neVmmaxigA” along with its feature values can be written as follows:

“neVmmaxi”, adverb, “gA” → neVmmaxigA

The formation of the above word is a simple sandhi formation.

3.10 Output Generator

The Output Generator is the module which actually generates text in Telugu script. It receives the constructed sentence in WX notation (shown in the Appendix) and outputs a sentence in Telugu rendered using Telugu Unicode characters. The output generated for the example input of Figure 3.2 is as follows:

వాళ్ళు అందమైన తోటలో నెమ్మదిగా నడుస్తున్నారు. (They are walking slowly in the garden.)

3.11 Evaluation

The current work addresses the problem of generating syntactically and

morphologically well-formed sentences in Telugu from an input specification

consisting of lexicalized grammatical constituents and associated features. In order

to test the robustness of the realization engine as the input to the realizer changes we

initially ran the engine in a batch mode to generate all possible sentence variations

given an input similar to the one shown in Figure 3.2. In the batch mode the engine

uses the same input root words in a single run of the engine, but uses different

combinations of values for the grammatical features such as tense, aspect, mode,

number and gender in each new run. Although the batch run was originally intended

for software quality testing before conducting evaluation studies, these tests showed

that certain grammatical feature combinations might make the realization engine

produce unacceptable output. This is an expected outcome because our engine in the

current state performs very limited consistency checks on the input. The purpose of

our evaluation is to measure our realizer’s coverage of the Telugu language. One

objective measure could be the proportion of sentences from a specific

text source (such as a Telugu newspaper) that our realizer could generate. As a first

step towards such an objective evaluation, we first evaluate our realizer using

example sentences from our grammar reference. Although not ideal this evaluation

helps us to measure our progress and prepares us for the objective evaluation. The

individual chapters and sections in the book by Krishnamurti and Gwynn (1985)

follow a standard structure where every new concept of grammar is introduced with

the help of a list of example sentences that illustrate the usage of that particular

concept. We used these sentences for our evaluation. Note that we collected sentences from all chapters; this means our realizer is required to generate, for example, verb forms used in example sentences from other chapters in addition to those from the chapter on verbs. A total of 738 sentences were collected from

chapter 6 to chapter 26, the main chapters which cover Telugu grammar. Because the

coverage of the current system is limited, we do not expect the system to generate all these 738 sentences. Among these, 419/738 (57%) sentences were found to be within the scope of our current realizer. Many of these sentences are simple and short. Our realizer was run on each of the 419 selected sentences to generate 419 output sentences. The output sentences matched the original sentences from the book

completely. This means at this stage we can quantify the coverage of our realizer as

57% (419/738) against our own grammar source. A more objective measure of

coverage will be estimated in the future.

Total No. of sentences | Sentences in the scope of our realizer | Sentences not in the scope of our realizer | Sentences generated | Coverage in percentage
738 | 419 | 319 | 419 | 57%

Table 3.2 Evaluation Results of Sentence Generation

Having built the functionality for the main sentence construction tasks, we are now

in a good position to widen the coverage. The majority of the remaining 319 sentences (= 738 - 419) involve verb forms such as participles and compound verbs, and medium to complex sentence types. As stated above, we intend to use this evaluation to drive

our development. This means every time we extend the coverage of the realizer we

will rerun the evaluation to quantify the extended coverage of our realizer. The idea

is not to achieve 100% coverage. Our strategy has always been to select each new

sentence or phrase type to be included in the realizer based on its utility to express

meanings in some of the popular NLG application domains such as medicine,

weather, sports and finance.

3.12 Summary

This chapter describes a surface realizer for Telugu which was designed by adapting

the SimpleNLG framework for free word order languages. This chapter mainly

focused on the architecture of the Telugu realization engine and the input

specification mechanism employed. Other aspects, such as the morphology of the different parts of speech, phrase formation and sentence formation, are only introduced in this chapter and will be discussed in detail in the chapters that follow.

-----------------------CHAPTER-4

MORPHOLOGY OF NOUNS AND


PRONOUNS
MORPHOLOGY OF NOUNS AND PRONOUNS

4.1 Introduction

Telugu is a free word order language in which various grammatical categories (case, gender, number, person, etc.) are morphologically encoded, making it a morphologically rich language. In this chapter we present a morphological generator for Telugu nouns and pronouns modelled on finite-state techniques. The

morphological generator generates the required word form for nouns and pronouns

from an input specification consisting of the lemma and its associated features. The

module discussed in this chapter is an independent module of a surface realization

engine that automates the task of building grammatically correct Telugu sentences.

The morphological generator also supports verb morphological generator in

generating the appropriate verb form by passing the appropriate features (person,

number and gender) required for the formation of the appropriate verb form.

Natural Language Processing (NLP) applications, which are growing in number, can be categorized into two broad areas: Natural Language Understanding (NLU), where Morphological Analysers (MA) play a very important role, and Natural Language Generation (NLG), where Morphological Generators play a very important role. A Morphological Analyser takes a word as input, processes it and gives as output its root along with its grammatical features. A Morphological Generator takes a root along with its grammatical features as input and generates the required word form. Morphological Generators (MG) have a very important role to play in applications like surface realization (Dokkara, Penumathsa, and Sripada, 2015; Gatt and Reiter, 2009) and machine translation (Toutanova, Suzuki, and Ruopp, 2008) of free word order languages like Telugu. For such languages it is advantageous to have the Morphological Generator as a component separate from the rest of the NLG system (Minnen et al., 2000). The current work is a separate module of a surface realization engine for Telugu (Dokkara, Penumathsa, and Sripada, 2015), a Java application which forms the final subtask of a Natural Language Generation (NLG) pipeline (Reiter and Dale, 2000). The sentence realization engine for Telugu is designed following the approach of SimpleNLG (Gatt and Reiter, 2009), a very popular surface realization engine for English.

Telugu is a morphologically rich free word order language spoken by people from the south Indian states of Andhra Pradesh and Telangana. In this chapter we describe a morphology engine which automatically generates the different forms of nouns and pronouns in Telugu. The current work is modelled on the morphology engine for English (Minnen et al., 2000).

Telugu nouns are divided into three classes, namely (i) proper nouns and common nouns, (ii) pronouns, and (iii) special types of nouns (e.g. numerals) (Krishnamurti and Gwynn, 1985). All nouns except a few special types of nouns have gender, number, and person. Noun morphology mainly involves plural formation, oblique stem formation and case inflection. The current work discusses the first two classes of nouns in detail.

Linguistic theories of morphology differ in whether they treat the morpheme as the basic building block of morphological analysis or generation. Hockett (1954) proposed “two models of grammatical description”: the Item and Arrangement (IA) model and the Item and Process (IP) model. Item and Arrangement is a concatenative approach in which morphemes are the lexical units and morphology is an agglutination of such units to form words. Item and Process is a derivational approach in which new word forms are produced by acting on morphemes and words. Hockett also mentioned an already existing model, Word and Paradigm (WP), a word-based morphological approach which states the generalizations that hold between the different forms of inflectional paradigms and which has been used for languages like Latin, Greek, and Sanskrit. A two-dimensional taxonomy of morphological theories was proposed by Stump (2001), who distinguished two orthogonal axes along which theories of inflectional morphology may be situated: the lexical/inferential axis and the incremental/realizational axis. From a computational perspective, lexical-incremental and inferential-realizational theories are equivalent and can both be implemented using finite state techniques.

Finite state techniques have been used to build morphological analysers (MA) and morphological generators (MG) for a diverse range of languages (Karttunen, 2003; Beesley and Karttunen, 2003; Karttunen and Beesley, 2005; Roark and Sproat, 2007). Therefore, in this chapter we apply finite state techniques to the morphology of Telugu nouns and pronouns.

Among the morphological tools for Indian languages, Goyal and Lehal (2011) report a Hindi-Punjabi machine translation system modelled on a database approach, where all word forms are stored in a relational database. A number of morphological tools for Tamil, ranging from corpus-based through suffix-stripping to finite state techniques, are reported by Antony and Soman (2012). For Telugu, Rao et al. (2011) describe a word-and-paradigm based morphological analyser and generator, Sribadrinarayan et al. (2009) describe an item-and-arrangement based morphological generator, and Ganapatiraju et al. (2006) describe a rule based (item-and-process based) morphological generator.

4.2 Inputs to the Morphology Engine

The noun and pronoun morphology engine is an independent module of a surface realization engine which produces grammatically correct Telugu sentences. The input specification for the surface realization engine consists of lexicalized grammatical constituents and associated features in the form of an XML file. The input XML file provides the required grammatical information not only at the sentence level but also at the word level, and it is this word-level information that acts as the input to the morphology engine. An example XML specification corresponding to the Telugu sentence in WX-notation given below is shown in Figure 4.1.

vAlYlYu iMtiki vaccAru. (They came home.)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<document>
<sentence type=" " predicatetype="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="human" number="plural" person="third"
casemarker=" " stem="basic">vAdu</head>
</nounphrase>
<nounphrase role="complement">
<head pos="noun" gender="nonmasculine" number="singular" person="third"
casemarker="ki" stem="oblique">illu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="pasttense">vaccu</head>
</verbphrase>
</sentence>
</document>
Figure 4.1. Example XML Input Specification
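The word-level feature-reading step can be sketched in Java, the language in which the realizer is implemented. The class and method names below are illustrative, not the engine's actual API; the sketch only extracts each head lemma and its part of speech from a specification shaped like Figure 4.1.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: reading the lemma and "pos" attribute of every <head> element
// from an XML input specification like the one in Figure 4.1.
public class InputReader {

    // Returns "lemma/pos" pairs for every <head> element, space separated.
    public static String describeHeads(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList heads = doc.getElementsByTagName("head");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < heads.getLength(); i++) {
                Element head = (Element) heads.item(i);
                if (out.length() > 0) out.append(' ');
                out.append(head.getTextContent().trim())
                   .append('/')
                   .append(head.getAttribute("pos"));
            }
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document><sentence>"
                + "<nounphrase role=\"subject\">"
                + "<head pos=\"pronoun\" number=\"plural\">vAdu</head>"
                + "</nounphrase>"
                + "<verbphrase>"
                + "<head pos=\"verb\" tensemode=\"pasttense\">vaccu</head>"
                + "</verbphrase></sentence></document>";
        System.out.println(describeHeads(xml)); // vAdu/pronoun vaccu/verb
    }
}
```

In the full engine the remaining attributes (gender, number, person, casemarker, stem) would be read the same way and passed on to the morphology modules.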

4.3 Noun Morphological Process

The head of the complement in the example of section 4.2 is the noun “iMtiki”. The word “illu”, given in the XML specification of Figure 4.1 as the head of the noun phrase playing the role of complement in the sentence, can be written along with its feature values as follows:

“illu”, noun, nonmasculine, singular, third, oblique, “ki” → iMtiki

First the oblique stem of the word “illu” is formed, as the word needs to be agglutinated with the case marker. The formation of the oblique stem is a two-step process. In the first step the class to which the root word belongs is identified; in the current work this identification is modelled on finite state techniques. The root word “illu” belongs to class III. A pictorial representation of the finite automaton for identifying class III stems is shown in Figure 4.2.

Figure 4.2: Finite Automata to identify class III stems for Oblique Stem Formation

In the second step the oblique stem of “illu”, which is “iMti”, is formed by replacing “llu” with “Mti”.

After the formation of the oblique stem, the case marker is agglutinated to the oblique stem to form the final word:

“iMti” + “ki” → “iMtiki”

The pronoun “vAlYlYu” in the example of section 4.2, which is the head of the subject, can be written along with its feature values as follows:

“vAdu”, pronoun, human, plural, third, basic, “” → “vAlYlYu”

Plural formation for pronouns does not follow any rules, and therefore the plurals are stored in a lexicon. The word “vAlYlYu” is retrieved as the plural of “vAdu” from the lexicon stored in the external file “pronounplural.xml”, using the given feature values.
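The lexicon lookup for pronoun plurals can be sketched as follows. In the actual engine the pairs are loaded from “pronounplural.xml”; as a simplifying assumption, a few entries from Table 4.6 are held here in an in-memory map, and the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the pronoun-plural lexicon lookup. A handful of entries from
// Table 4.6 stand in for the external file "pronounplural.xml".
public class PronounPluralLexicon {
    private static final Map<String, String> PLURALS = new HashMap<>();
    static {
        PLURALS.put("vAdu", "vAlYlYu");   // he -> those people
        PLURALS.put("nuvvu", "mIru");     // you -> you (plural)
        PLURALS.put("axi", "avi");        // that thing -> those things
        PLURALS.put("ixi", "ivi");        // this thing -> these things
    }

    // Returns the stored plural, or the root unchanged if no entry exists.
    public static String plural(String root) {
        return PLURALS.getOrDefault(root, root);
    }

    public static void main(String[] args) {
        System.out.println(plural("vAdu")); // vAlYlYu
    }
}
```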

4.4. Noun Morphology Engine

The steps a noun or pronoun root goes through to produce the required inflected form are as follows:

a) Formation of the plural of the root if required.

b) Formation of the oblique stem if required.

c) Agglutination of the case-marker to the oblique stem if required.
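The three steps above can be sketched as a small pipeline. This is a minimal sketch, not the engine's actual code: the rule methods are stubs covering only the regular “lu” plural, the class III oblique rule, and plain suffixation of the case marker, which is enough to realize the running example “illu” + “ki” → “iMtiki”; all names are illustrative.

```java
// Sketch of the pipeline (a) plural formation, (b) oblique stem formation,
// (c) case-marker agglutination, with minimal stand-in rules.
public class NounPipeline {

    // (a) plural formation: only the regular suffix "lu" is shown here
    static String plural(String stem, boolean wantPlural) {
        return wantPlural ? stem + "lu" : stem;
    }

    // (b) oblique stem formation: only the class III rule is shown here
    static String oblique(String stem, boolean wantOblique) {
        if (wantOblique && stem.matches(".*(llu|nnu)$"))
            return stem.replaceAll("(llu|nnu)$", "Mti");
        return stem;
    }

    // (c) case-marker agglutination
    static String addCaseMarker(String stem, String marker) {
        return stem + marker;
    }

    public static String realize(String root, boolean wantPlural,
                                 boolean wantOblique, String marker) {
        return addCaseMarker(oblique(plural(root, wantPlural), wantOblique), marker);
    }

    public static void main(String[] args) {
        System.out.println(realize("illu", false, true, "ki")); // iMtiki
        System.out.println(realize("Avu", true, false, ""));    // Avulu
    }
}
```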

4.4.1 Plural Formation

Common nouns can be divided into count and non-count nouns. Non-count nouns (mass nouns, indivisible objects and abstract nouns) cannot be distinguished for number; they are either singular or plural. Some count nouns do not conform to any rules of plural formation. The singular and plural forms of the non-count nouns, and of the count nouns which do not conform to any rules of plural formation, are stored externally in a lexicon as an XML file named “plural.xml”. Some mass nouns that exist only in the singular form are given in Table 4.1.

Word Meaning in English
uppu Salt
nUne Oil
inumu Iron
veVMdi Silver
biyyaM Rice
janaM People
Table 4.1: Example Mass Nouns in Singular

Some mass nouns that exist only in the plural form are given in Table 4.2.

Word Meaning in English


vadlu Paddy
peVsalu Green gram
kaMxulu Red gram
nIlYlYu Water
pAlu Milk
Table 4.2: Example Mass Nouns in Plural

Indivisible objects cannot have both singular and plural forms. Some indivisible

objects are shown in Table 4.3.

Word Meaning in English


AkAsaM Sky
samuxraM Sea
Table 4.3: Example Indivisible Objects

Some example abstract nouns which are non-count nouns are shown in Table 4.4.

Word Meaning in English


welupu Whiteness
welivi Intelligence
balaM Strength
saMwoRaM Happiness
nixra Sleep
Table 4.4: Example Abstract Nouns

Among the count nouns, some do not have any rules for the formation of plurals. Table 4.5 lists count nouns which do not conform to any rules for the formation of plurals:

Singular Word   Plural Word
rAyi            rAlYlYu
poyyi           poyyilu
peMdli          peMdliMdlu
vari            vadlu
gAru            gArlu
sAri            sArlu
kumArudu        kumArulu
eVxxu           eVdlu
veVyyi          velu
cenu            celu
penu            pelu
kAdi            kAMdlu
jIwagAdu        jIwagAlYlYu
alludu          allulYlYu
manamarAlu      manamarAlYlYu
ceVlleVlu       ceVlleVlYlYu
kUwuru          kUwulYlYu
koVdavali       koVdavalYlYu
rAwri           rAwrilYlYu
Table 4.5: Count Nouns with no rules for Plural Formation

Pronouns do not conform to any rules regarding the formation of plurals. All the

pronouns and their plurals listed in Table 4.6 are also stored externally in a lexicon

as an XML file named “pronounplural.xml”.

Person   Singular   Plural
First    nenu       memu, manamu
Second   nuvvu      mIru
Third    vAdu       vAlYlYu
         axi        avi
         vAru       vAlYlYu
         ixi        ivi
         awanu      vAlYlYu
         Ayana      vAlYlYu
         AmeV       vAlYlYu
         imeV       vIlYlYu
         Avida      vAlYlYu
         vIdu       vIlYlYu
         iwanu      vIlYlYu
         Iyana      vIlYlYu
         vIru       vIlYlYu
         Ivida      vIlYlYu
         exi        evi
         wAnu       wAmu
Table 4.6: Example Pronouns and their Plurals

The regular way of forming the nominative plural is by adding the plural suffix “lu” to the basic stem.

Example:

Avu (cow) → Avulu (cows)

A number of morphophonemic changes may occur, because of which the plural suffix “lu” sometimes becomes “lYlYu”. The morphophonemic changes depend on the class to which the singular stem belongs. The formation of the plural is a two-step process. First, the class to which the stem belongs is identified; in the current work this identification is implemented as a finite automaton (illustrated in Figure 4.3 for class VII). Second, sandhi (the morphophonemic change) applies and the final plural form is generated.

The stems can be categorized into the following classes for plural formation of nouns (Krishnamurti and Gwynn, 1985):

I) Stem-final “i/u” preceded by “t”, “Mt”, “Md” is lost before the plural suffix “lu”.

Example:

koti (one crore) → kotlu (crores)

II) In all stems ending in “di”, “du”, “lu” and “ru”, and in stems of more than two syllables ending in “li” and “ri”, the final syllable is replaced by “lYlYu”.

Example:

badi (school) → balYlYu (schools)

Exception 1:

Masculine nouns of Sanskrit origin ending in “du” replace “du” by “lu” to form the plural.

Example:

snehiwudu (friend) → snehiwulu (friends)

Exception 2:

Loanwords from foreign languages ending in “ru” form the plural by adding “lu” to the basic stem.

Example:

nOkaru (servant) → nOkarlu (servants)

III) Stem-final “tti”, “ttu”, “ddi”, “ddu” becomes “t”, “d” before “lu”.

Example:

ceVttu (tree) → ceVtlu (trees)

IV) Stem-final “llu”, “nnu” following a short vowel becomes “Md” before “lu”, or “lY” before “lYu”.

Example:

illu (house) → ilYlYu (houses)

Exception:

Some basic stems ending in “nnu” form the plural by adding “lu”.

Example:

pannu (tax) → pannulu (taxes)

V) Stem-final “aM”, “AM” is replaced by “A”, and stem-final “eVM” is replaced by “E”, before the plural suffix “lu”.

Example:

puswakaM (book) → puswakAlu (books)

VI) Stems ending in “Ayi” form the plural in the regular way by adding “lu”.

Example:

abbAyi (boy) → abbAyilu (boys)

VII) Stem-final “yi”, “yyi” is replaced by “wu” before “lu”; the vowel before “wu” is a long vowel.

Example:

ceVyyi (hand) → cewulu (hands)

The class identification for stems belonging to class VII can be done through the finite automaton shown in Figure 4.3.

Figure 4.3: Finite Automata to identify class VII stems for Plural Formation

VIII) Stems that do not conform to the above classes and end in “i”:

1) If the stem consists of two syllables, or of more than two syllables with a vowel other than “i” in the middle syllable, the final “i” changes to “u” before “lu”.

Example:

bAvi (water well) → bAvulu (water wells)

2) If the stem consists of two or more syllables and the vowel in the middle syllable is “i”, all the non-initial “i”s become “u”.

Example:

maniRi (human being) → manuRulu (human beings)

Exception:

In nouns of Sanskrit origin the “i” in the middle syllable does not change.

Example:

pariXi (boundary) → pariXulu (boundaries)

Proper nouns are not generally used in the plural, but when they are, the rules are similar to those for common nouns.
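The class-based plural rules above can be sketched as ordered pattern checks, in the spirit of the Java regular-expression implementation the thesis describes. This is only a sketch under simplifying assumptions: it covers classes II, III, V and the regular rule, omits the remaining classes, the exceptions and the lexicon of irregular plurals, and all names are illustrative.

```java
// Sketch of plural formation for a subset of the classes in section 4.4.1,
// using ordered regular-expression checks in place of finite automata.
public class PluralRules {

    public static String pluralize(String stem) {
        // class III: final "tti","ttu","ddi","ddu" -> "t"/"d" before "lu"
        if (stem.matches(".*(tti|ttu)$")) return stem.replaceAll("(tti|ttu)$", "tlu");
        if (stem.matches(".*(ddi|ddu)$")) return stem.replaceAll("(ddi|ddu)$", "dlu");
        // class V: final "aM"/"AM" -> "A" before "lu"
        if (stem.matches(".*(aM|AM)$")) return stem.replaceAll("(aM|AM)$", "Alu");
        // class II: final "di","du","lu","ru" replaced by "lYlYu"
        // (irregular members such as "gAru" would be caught by a lexicon first)
        if (stem.matches(".*(di|du|lu|ru)$")) return stem.replaceAll("(di|du|lu|ru)$", "lYlYu");
        // regular rule: add the plural suffix "lu"
        return stem + "lu";
    }

    public static void main(String[] args) {
        System.out.println(pluralize("ceVttu"));   // ceVtlu
        System.out.println(pluralize("puswakaM")); // puswakAlu
        System.out.println(pluralize("badi"));     // balYlYu
        System.out.println(pluralize("Avu"));      // Avulu
    }
}
```

The ordering of the checks matters: class-specific patterns are tried before the regular rule, mirroring the class-identification step of the two-step process.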

4.4.2 Oblique Stem Formation

Each noun in Telugu has an oblique stem alongside the basic stem, in both the singular and the plural. The oblique stem is used to indicate possession or an adjectival relationship. It corresponds in meaning to the English possessive forms ’s (singular) and s’ (plural).

4.4.2.1 Oblique Stem in Singular

The oblique stems of the personal pronouns and of a few demonstrative pronouns in the singular, like “axi”, “ixi”, “exi”, do not conform to any rules of oblique stem formation. They need to be memorized and are listed in Table 4.7.

Singular Nominative   Singular Oblique
nenu                  nA
nuvvu                 nI
axi                   xAni
ixi                   xIni
exi                   xeni
Table 4.7: Pronouns and their Oblique Stems in Singular

The oblique stems in the singular for common nouns and demonstrative pronouns

are formed based on some morphophonemic rules. Common nouns in Telugu are

divided into six classes based on the manner in which the oblique stem is formed

(Krishnamurti and Gwynn, 1985).

The six classes are as follows:

I) All nouns denoting human beings, demonstrative pronouns ending in “du”,

“ru”, “nu”, “lu” and a few non-human nouns ending in “ru”, “lu” preceded by

a long vowel fall into this category. These form the oblique stem by deleting

the final “u” and adding “i” to the basic stem. Some example nouns and

pronouns belonging to Class I are listed in Table 4.8.

Singular Nominative Singular Oblique


mogudu (husband) mogudi (of husband)
vAdu (he) vAdi (his)
kuwuru (daughter) kuwuri (of daughter)
ceVllelu (sister) ceVlleli (of sister)
awanu (he) awani (his)
vAru (they) vAri (theirs)
kAlu (leg) kAli (of leg)
Uru (village) Uri (of village)
Table 4.8: Examples for Class I Oblique Stems

II) Non-human nouns of two or more syllables ending in “du”,”di”,

”ru”,”ri”,”lu”,”li” replace the final syllable by “ti” in forming the oblique

stem. Some nouns belonging to Class II are listed in Table 4.9.

Singular Nominative   Singular Oblique
gUdu (house)          gUti (of a house)
eru (canal)           eti (of a canal)
wAbelu (tortoise)     wAbeti (of a tortoise)
nAgali (plough)       nAgati (of a plough)
kAvadi (balance)      kAvati (of a balance)
Table 4.9: Examples for Class II Oblique Stems

III) Six stems ending in nnu, llu, lYlYu replace them by Mti in forming the

oblique stem. All the six stems belonging to Class III are listed in Table 4.10.

Singular Nominative   Singular Oblique
illu (house)          iMti (of a house)
villu (bow)           viMti (of a bow)
pannu (tooth)         paMti (of a tooth)
kannu (eye)           kaMti (of an eye)
cannu (breast)        caMti (of a breast)
olYlYu (body)         oMti (of a body)
Table 4.10: Examples for Class III Oblique Stems

IV) Five stems of two syllables ending in “yi” and two stems ending in “rru”

replace the final syllable by “wi” in the formation of the oblique stem. All the

seven stems belonging to Class IV are listed in Table 4.11.

Singular Nominative    Singular Oblique
ceyi/ceVyyi (hand)     cewi (of hand)
neyi/neVyyi (ghee)     newi (of ghee)
nUyi/nuyyi (well)      nUwi (of well)
goyi/goVyyi (pit)      gowi (of pit)
rAyi (stone)           rAwi (of stone)
goVrru (harrow)        goVrwi (of harrow)
Table 4.11: Examples for Class IV Oblique Stems

V) All nouns ending in “M” have two oblique stems, one in the genitive with no

modification and the other before the accusative and dative case which is

formed by replacing “M” with “Ani”. Some example stems belonging to

Class V are listed in Table 4.12.

Singular Nominative   Singular Oblique   Singular Accusative/Dative
kalaM (pen)           kalaM              kalAni-
puswakaM (book)       puswakaM           puswakAni-
Table 4.12: Examples for Class V Oblique Stems

VI) Basic stems ending in “e”, “a” or a long vowel, those ending in “i” or “u” preceded by double consonants other than “ll” or “nn”, and all nouns not covered by classes I-V have identical basic and oblique stems. Some example stems belonging to Class VI are listed in Table 4.13.

Singular Nominative    Singular Oblique
anna (elder brother)   anna (of an elder brother)
peVtteV (box)          peVtteV (of a box)
poti (contest)         poti (of a contest)
ceVttu (tree)          ceVttu (of a tree)
Table 4.13: Examples for Class VI Oblique Stems

4.4.2.2 Oblique stem in plural

Telugu has two words for the English word “we”: the exclusive “memu”, which does not include the person addressed, and the inclusive “manamu”, which does. The personal pronouns and a few demonstrative pronouns like “avi”, “ivi”, “evi” in the plural do not conform to any rules of oblique stem formation. They need to be memorized and are listed in Table 4.14.

Plural Nominative   Plural Oblique
memu/manamu         mA/mana
mIru                mI
avi                 vAti
ivi                 vIti
evi                 veti
Table 4.14: Example Pronouns and their Oblique Stems in the Plural

The plural oblique stem of common nouns and demonstrative pronouns is formed by uniformly changing “lu” or “lYlYu” to “la” or “lYlYa”: the oblique suffix “a” is added to the plural stem, and in sandhi the final “u” of the plural stem is lost before “a”.
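The oblique-stem rules can be sketched as follows. This sketch covers only the singular rules for classes III and V and the plural oblique rule just described; classes that need lexical information (for example the human/non-human distinction of class I) are omitted, and all names are illustrative.

```java
// Sketch of oblique stem formation for classes III and V (singular) and the
// plural oblique rule of section 4.4.2.2.
public class ObliqueRules {

    // Singular oblique stem.
    public static String oblique(String stem) {
        // class III: the six stems in "nnu","llu","lYlYu" replace the ending by "Mti"
        if (stem.matches(".*(nnu|llu|lYlYu)$"))
            return stem.replaceAll("(nnu|llu|lYlYu)$", "Mti");
        // class V: stems in "aM"/"AM" take the variant "Ani"
        // (used before the accusative and dative markers; the genitive is unchanged)
        if (stem.endsWith("aM") || stem.endsWith("AM"))
            return stem.substring(0, stem.length() - 2) + "Ani";
        // class VI (default): oblique stem identical to the basic stem
        return stem;
    }

    // Plural oblique: final "lu"/"lYlYu" changes to "la"/"lYlYa".
    public static String pluralOblique(String pluralStem) {
        return pluralStem.replaceAll("u$", "a");
    }

    public static void main(String[] args) {
        System.out.println(oblique("illu"));         // iMti
        System.out.println(oblique("gurraM"));       // gurrAni
        System.out.println(pluralOblique("ilYlYu")); // ilYlYa
    }
}
```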

4.4.2.3 Oblique Stems of Proper Nouns

The oblique stem of proper nouns both singular and plural is formed in the same way

as those of common nouns. Table 4.15 lists some example proper nouns and their

oblique stems.

Proper Noun                    Oblique Stem
rAmudu (Rama)                  rAmudi (Rama’s)
subbArAvu (Subba Rao)          subbArAvu (Subba Rao’s)
AMXrulu (people from Andhra)   AMXrula (of the people from Andhra)
Table 4.15: Example Proper Noun Oblique Stems

4.4.3 Case Marker Agglutination

In Indian languages, postpositions (case markers) serve the purpose of prepositions in English. Postpositions, which express spatial or temporal relations or mark various semantic roles, establish grammatical relations between the nouns they follow and the verbs of the sentence. In Telugu, postpositions are added to the oblique stems in both singular and plural.

Postpositions in Telugu can be classified into two types namely Type-1 and Type-2.

Postpositions belonging to Type-1 only occur bound to oblique stems. They never

occur as separate words in a sentence or in combination with other postpositions.

Most commonly used postpositions of this type are listed in Table 4.16.

Postposition    Meaning
ni/nu           accusative
ki/ku           dative
kosaM           for the sake of
wo              with, along with
nuMci/niMci     from
a/na/ni         in/on/at
kaMteV          than, compared to
guMdA/xwArA     through
Table 4.16: List of Type-1 Postpositions

Among the Type-1 postpositions, the dative and accusative can be grouped into a subclass because the morphophonemic changes they exhibit differ from those of the other Type-1 postpositions.

The accusative and dative postpositions “nu” and “ku” take the forms “ni” and “ki” respectively if the preceding syllable ends in “i” or “I”, except after single-syllable personal pronoun stems, as in “nI-ku” (for you) and “mI-ku” (for you, plural).

Example:

rAmudu (nominative) + ki → rAmudi (oblique) + ki → rAmudiki

The use of the accusative suffix for nouns denoting inanimate objects is optional; such nouns may take the same form as the nominative in the accusative.

Example:

amma mAku kaXa ceppiMxi (Mother told us a story)

In the above sentence “kaXa” (story) is an inanimate noun which would have taken the form “kaXanu”, but since the accusative suffix is optional it takes the nominative form “kaXa”.

In the singular of nouns ending in “aM”, “AM”, and “eVM”, the dative suffix “ki” and the accusative suffix “ni” are added to variant forms of the oblique stems. Stems ending in “aM” or “AM” (shown in section 4.4.2.1, class V) have “Ani” as the variant, and stems ending in “eVM” have “eni” as the variant of the oblique stem.

Example:

gurraM + ki → gurrAni-ki (to the horse)

palYlYeVM + ki → palYlYeni-ki (to the plate)
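The dative/accusative alternation can be sketched as follows. As a simplification, only the final character of the oblique stem is inspected and the single-syllable pronoun exceptions are handled with a small exception list; the class and method names are illustrative.

```java
// Sketch of the "ku"/"ki" (and, analogously, "nu"/"ni") alternation of
// section 4.4.3: "ki" surfaces when the oblique stem ends in "i" or "I",
// except after the single-syllable pronoun stems "nI" and "mI".
public class CaseMarkers {

    public static String dative(String obliqueStem) {
        // exceptions: nI-ku (for you), mI-ku (for you, plural)
        if (obliqueStem.equals("nI") || obliqueStem.equals("mI"))
            return obliqueStem + "ku";
        char last = obliqueStem.charAt(obliqueStem.length() - 1);
        return obliqueStem + ((last == 'i' || last == 'I') ? "ki" : "ku");
    }

    public static void main(String[] args) {
        System.out.println(dative("rAmudi"));  // rAmudiki
        System.out.println(dative("gurrAni")); // gurrAniki
        System.out.println(dative("nI"));      // nIku
    }
}
```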

Example sentences of other Type-1 postpositions are as follows:

awanu maxrAsu nuMci vaccAdu (He came from Madras)

rAXa jyowikosaM vacciMxi (Radha came for Jyothi)

Postpositions belonging to Type-2 are separate words, generally denoting place and time. Although they sometimes occur as postpositions, they also occur as independent words, mostly as adverbial nouns. A feature of Type-2 postpositions is that Type-1 postpositions can be added to them; for example “lo”, a Type-2 postposition, combines with “nuMci” and “ki” to form “lonuMci” and “loki”. The most common postpositions of this type are listed in Table 4.17.

Postposition    Meaning
lo              in
lopala          inside
mIxa            on
kiMxa           under
bEta            outside
xaggara         near
veVnaka         behind
muMxu           in front of, before
lA/lAgu/lAgA    like
prakAraM        according to
warvAwa         after
varaku/xAkA     up to, until
eVxuta          opposite
maXyana         between
pakkana         by the side of
pAtu            for (of time)
vEpu            in the direction of, towards
Table 4.17: List of Type-2 Postpositions
Example sentences of Type-2 postpositions are as follows:

mA illu narasApuramlo uMxi (Our house is in Narsapuram)

rAju lopala unnAdu (Raju is inside)

puswakaM ballamixa uMxi (The book is on the table)

4.5 Genders in Telugu

In Telugu the gender of nouns and pronouns depends on the number. When the number is singular there are two genders, masculine and non-masculine: all nouns denoting male persons belong to the masculine gender, and all others to the non-masculine gender. There is no feminine gender; all nouns denoting female persons are treated as non-masculine in the singular. When the number is plural there are two genders, human and non-human: all nouns denoting male and female persons belong to the human gender, and all others to the non-human gender. The relationship between gender and number is shown in Table 3.1 (Dokkara, Penumathsa, and Sripada, 2015).

As a result, the two demonstrative pronouns “axi” (that thing, that lady) and “ixi” (this thing, this lady) are non-masculine when the number is singular, but in the plural they fall into two genders: human, “vAlYlYu” (those people) and “vIlYlYu” (these people), and non-human, “avi” (those things) and “ivi” (these things). In the current work both “axi” and “ixi” are treated as referring to things and not to female persons, because using these words for female persons happens only in casual talk.
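The gender-number interaction just described can be sketched as a small decision function: in the singular the opposition is masculine vs non-masculine, in the plural human vs non-human. The names below are illustrative, not the engine's actual API.

```java
// Sketch of the gender categories of section 4.5, which depend jointly on
// the number and on whether the referent is a (male) person.
public class GenderSystem {

    public static String genderCategory(boolean malePerson, boolean humanReferent,
                                        boolean plural) {
        if (plural) return humanReferent ? "human" : "nonhuman";
        return malePerson ? "masculine" : "nonmasculine";
    }

    public static void main(String[] args) {
        System.out.println(genderCategory(true,  true,  false)); // masculine
        System.out.println(genderCategory(false, true,  false)); // nonmasculine (female person)
        System.out.println(genderCategory(false, true,  true));  // human
        System.out.println(genderCategory(false, false, true));  // nonhuman
    }
}
```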

Nouns generally do not carry any marker of gender, but some words and suffixes are used to differentiate between the male and female sexes. The different ways in which nouns mark this distinction are as follows:

a) Some masculine nouns end with “rAlu” to indicate the female sex.

Example:

snehiwudu(male friend)

snehiwurAlu(female friend)

b) Some descriptive words use the suffixes “amma”, “kawwe” to denote female

persons and “ayya”, “kAdu” to denote male persons.

Example:

paMwulu (school master)

paMwulamma (school mistress)

musalayya (old man)

musalamma (old woman)

AtakAdu (male player)

Atakawwe (female player)

c) The words “moVga” (male) and “Ada” (female) are used to distinguish sex in both human beings and animals.

Example:

moVgapilla

Adapilla

d) Various words are used to distinguish male and female animals and birds.

Example:

kodipuMju (cock)

kodipeVtta (hen)

e) Among pronouns and numerals certain forms are used to distinguish male

and female persons.

Example:

vAdu (he)

AmeV (she)

okkadu (one man)

okawe (one woman)

4.6 Grammatical Persons in Telugu

There are three grammatical persons namely First Person, Second Person, and Third

Person in Telugu. They are used to distinguish between the speaker, the addressee,

and others. The personal pronouns of Telugu are defined by the grammatical person.

Verbs in Telugu take a form dependent on the person and number of the subject. Table 4.18 lists the verb endings for each combination of number and person of the subject noun or pronoun.

Number     Person          Verb Ending
Singular   First Person    -nu
           Second Person   -vu
           Third Person    -du/-xi
Plural     First Person    -mu
           Second Person   -ru
           Third Person    -ru/-yi
Table 4.18: Verb endings based on Number and Person

Example sentences of different grammatical persons with their verb endings are as

follows:

nenu annaM winnA-nu (First Person singular)

nuvvu annaM winnA-vu (Second Person singular)

vAdu annaM winnA-du (Third Person Masculine singular)

Ame annaM winna-xi (Third Person Non-Masculine singular)

memu annaM winnA-mu (First Person plural)

mIru annaM winnA-ru (Second Person plural)

vAlYlYu annaM winnA-ru (Third Person Human plural)

avi annaM winnA-yi (Third Person Non-Human plural)
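The selection of a verb ending from Table 4.18 and the example sentences above can be sketched as follows; the gender feature passed on by the noun/pronoun morphology engine distinguishes the two third-person endings. The class and method names, and the lowercase feature values (which follow the attribute values of Figure 4.1), are illustrative.

```java
// Sketch of Table 4.18: choosing a verb ending from the person, number and
// gender features of the subject.
public class VerbEndings {

    public static String ending(String person, String number, String gender) {
        boolean plural = number.equals("plural");
        switch (person) {
            case "first":  return plural ? "mu" : "nu";
            case "second": return plural ? "ru" : "vu";
            case "third":
                if (plural) return gender.equals("nonhuman") ? "yi" : "ru";
                return gender.equals("masculine") ? "du" : "xi";
            default:
                throw new IllegalArgumentException("unknown person: " + person);
        }
    }

    public static void main(String[] args) {
        // vAdu annaM winnA-du (Third Person Masculine singular)
        System.out.println("winnA" + ending("third", "singular", "masculine")); // winnAdu
        // vAlYlYu annaM winnA-ru (Third Person Human plural)
        System.out.println("winnA" + ending("third", "plural", "human"));       // winnAru
    }
}
```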

4.7 Evaluation

An evaluation of the performance of the morphology engine for nouns and pronouns is reported here. The evaluation is performed with respect to a Telugu noun database downloaded from the Telugu Wiktionary at https://en.wiktionary.org/wiki/Category:Telugu_nouns. A total of 524 nouns were downloaded. The nouns were tested for plural generation, oblique stem generation and case marker agglutination.

Evaluation was not done for pronouns, as both plural formation and oblique stem formation for pronouns are listed exhaustively in Tables 4.6 and 4.7 in sections 4.4.1 and 4.4.2.1 respectively.

Among the 524 nouns downloaded, some were repeated and some, like “kri . pU” (Before Christ), are not suitable for pluralization and oblique form generation; these were removed from the list. A total of 480 nouns were identified as suitable for the generation of plurals and oblique forms.

The evaluation was performed by giving the nouns as input to the surface realization engine, because this speeds up the evaluation: a large number of nouns can be tested at the same time. The evaluation results for plural formation of the nouns are given in Table 4.19.

The evaluation results in Table 4.19 show that no nouns are categorized under Class VI or Class VII. Class VI consists of stems that end in “Ayi”. The downloaded database, which consists mostly of proper nouns, does not contain any nouns ending in “Ayi”, which generally occurs at the end of common nouns like “abbAyi” (boy) and “ammAyi” (girl). Class VII consists of only three stems ending in “yyi”, namely “ceVyyi” (hand), “goVyyi” (pit), and “nuyyi” (water well). All three stems were tested thoroughly, and one example is shown in section 4.4.1.

Class     No. of nouns identified in each class
I         6
II        63
III       1
IV        2
V         30
VI        0
VII       0
VIII-a    28
VIII-b    1
Regular   349
Table 4.19: Evaluation Results of Plural Formation

The evaluation results for oblique stem formation of the nouns are given in Table 4.20.

Class   No. of nouns identified in each class
I       53
II      0
III     0
IV      0
V       35
VI      392
Table 4.20: Evaluation Results of Oblique Stem Formation

The evaluation results in Table 4.20 show that none of the nouns are categorized under Class II, Class III, or Class IV. Class II consists of non-human common nouns ending in “du”, “di”, “ru”, “ri”, “lu”, “li”, in which the final syllable is replaced by “ti”. The downloaded dataset, which consists mostly of proper nouns, does not have any noun belonging to this class; some example nouns of this class are listed in Table 4.9 of section 4.4.2.1. Class III consists of only six stems, ending in “nnu” and “llu”, which were tested thoroughly (shown in Table 4.10, section 4.4.2.1). The head of the complement in the example sentence of section 4.2, “iMtiki”, shown in its root form “illu” in Figure 4.1, also belongs to Class III. Class IV consists of only seven stems, ending in “yi” and “rru”, which were also tested thoroughly (shown in Table 4.11, section 4.4.2.1).

Case marker generation primarily involves generating the oblique form and then joining the case marker to it. As the results in Table 4.20 show, our engine generates the correct oblique forms, and our evaluation of case marker agglutination shows that for all the oblique forms the engine produces the correct surface noun forms.

The results show that all 480 nouns were identified as belonging to their respective classes, but in both plural formation and oblique stem formation a few classes contain no nouns. The list consists mostly of proper nouns and very few common nouns, because of which the nouns are not evenly distributed across the classes.

4.8 Summary

The morphology engine for nouns and pronouns described in this chapter is, along with the verb, adverb and adjective morphology engines, a separate module in a surface realization engine that generates well-formed sentences in Telugu. The noun and pronoun morphology engine plays a very important role in the Telugu realization engine.

-----------------------CHAPTER-5

MORPHOLOGY OF VERBS
MORPHOLOGY OF VERBS

5.1 Introduction

Telugu is a Dravidian language spoken by people from the south Indian states of Andhra Pradesh and Telangana. It is a morphologically rich free word order language with nearly 90 million first language speakers. In this chapter we describe a morphology engine which automatically generates the different forms of verbs in Telugu. The Morphological Analyser (MA) and the Morphological Generator (MG) are two very important parts of Natural Language Processing (NLP) applications like machine translation systems (Rao et al., 2006) and surface realization engines (Dokkara et al., 2015). A Morphological Analyser analyses a given word and processes it into its root along with its grammatical information, whereas a Morphological Generator, given a root along with its grammatical information, generates the corresponding word. Morphological Generators play a very important role in Natural Language Generation (NLG) for free word order languages like Telugu. In practice it is advantageous to have the Morphological Generator as a component separate from the rest of the NLG system (Minnen et al., 2000). The current work is a separate module of a surface realization engine for Telugu (Dokkara et al., 2015), a Java application which forms the final subtask of a Natural Language Generation (NLG) pipeline (Reiter and Dale, 2000). The sentence realization engine for Telugu is designed following the approach of SimpleNLG (Gatt and Reiter, 2009), a very popular surface realization engine for English.

The morphological engine described in this chapter is modelled on the morphological engine for English (Guido Minnen et.al 2000). Because Telugu is a morphologically rich language, the morphology of Telugu verbs and nouns is comparatively more complex. For example, the morphology of Telugu verbs involves defining morphology for six different classes of verbs. Similarly, the morphology of nouns

involves defining morphology for seven different classes of nouns with respect to the

grammatical feature number. We therefore have two separate morphological engines

one for verbs and one for nouns and pronouns. In the implementation instead of

using tools like Flex or JFlex we programmed our morphological engine in Java

using the regular expression package. The process of verb morphology depends on

the way in which the verbs are classified. Linguistic classification of verbs in

Telugu into a small number of conjugation types is done based on the

morphophonemic changes the verb stems undergo when inflected with tense-mode

suffixes. The model of the analysis decides the number of types into which the verbs

can be classified.

In the current work the verb morphological generator does not have an explicit

lexicon or word list but has a computational model based on finite state techniques to

classify all the verbs into a few regular classes and a very small list of words for the

irregular class. The suffixes to be added to the verbs are maintained in separate

XML files and concatenated to the variants of the verb roots to form the final

inflected form.

Morphology has been well studied both by theoretical (Hockett, 1954, Stump 2001)

and computational linguists (Beesley and Kartunnen, 2003, Roark and Sproat, 2007).

From a theoretical perspective, structure of words is explained by the following three

models:

Item and Arrangement model which is a morpheme based morphological approach

in which word forms are analysed as arrangements of morphemes. In this model a

morpheme is treated as the minimal meaningful unit of a language and words are

treated as concatenation of morphemes.

Item and Process model which is lexeme based morphology in which a word form

is assumed to be a result of applying rules that alter a stem to produce a new one. In

this model, inflectional rules, derivational rules, and compounding rules are applied

to a stem to obtain the required word form.

Word and Paradigm model which is a word based morphological approach which

states generalizations that hold between the different forms of inflectional

paradigms.

From a computational perspective though, the three theoretical models described

above have been shown to offer no significant computational advantage to the finite

state approaches that have been widely applied to building MAs and MGs for a

diverse range of languages (Guido Minnen et.al 2000, Karttunen, 2003, Beesley and

Karttunen, 2003, Karttunen and Beesley, 2005, Roark and Sproat, 2007). Therefore,

in the current work we apply finite state techniques to Telugu Morphology. Amongst

Indian languages (PJ Antony and Soman 2012) reported highest number of

morphology tools for Tamil. According to their survey a wide range of approaches,

from corpus based through suffix stripping to finite state exist. A database approach

is described (Goyal and Lehal, 2011) where they store all the word forms in a

relational database. For Telugu language, (Rao et.al 2011) describes a word and

paradigm based morphological analyser and generator. An item and arrangement

based morphological generator is described in (Sribadrinarayan et al 2009) for

Telugu. A rule based (item and process based) morphological generator for Telugu is

describe in (Ganapatiraju et al 2006).

5.2 Inputs to the Morphology Engine

The verb morphology engine in the current work is part of a surface realization

engine which is responsible for automatic generation of grammatically well-formed

Telugu sentences. The input for the surface realization engine is an XML file which

has all the grammatical information required both at the sentence level and word

level. Figure 5.1 shows an example XML specification corresponding to the Telugu

sentence given below in WX-notation.

sIwa rAmudini piliciMxi. (Sita called Rama.)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
<sentence type=" " predicatetype="verbal" respect="no">
<nounphrase role="subject">
<head pos="noun" gender="nonmasculine" number="singular" person="third" casemarker=" " stem="basic">sIwa</head>
</nounphrase>
<nounphrase role="complement">
<head pos="noun" gender="masculine" number="singular" person="third" casemarker="ni" stem="basic">rAmudu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="pasttense">piluc</head>
</verbphrase>
</sentence>
</document>
Figure 5.1. Example XML Input Specification

5.3 Morphological Generation Process

The current chapter describes a computational model based on finite state techniques

and XML files. The computational model is a java application which uses the

“java.util.regex” package. The input to the computational model is the verb lemma,

the tense mode of the verb, PNG (person, number and gender) of the subject and the

case marker of the subject. The “Pattern” class of the “regex” package has a method

"matches" which creates a finite state automaton for a given regular expression to

identify the class to which a given verb lemma belongs. The computational model

also computes the final constituent of the stem in the inflected verb and finally

concatenates it to the required suffixes extracted from the XML files.
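As an illustration of this classification step, the sketch below uses Pattern.matches in the way the text describes. This is not the engine's actual code: the class name, method name, and the restriction to two verb classes are invented for the example; "C" stands for a consonant and the vowel set uses the WX symbols.

```java
import java.util.regex.Pattern;

// Illustrative sketch: identifying a verb's inflectional class by matching
// its lemma against regular expressions, as done with Pattern.matches in
// the engine. Only two classes from Section 5.5.2 are shown.
public class VerbClassifier {
    private static final String V = "[aAiIuUeEoOM]";   // WX vowels
    private static final String C = "[^aAiIuUeEoOM]";  // consonants

    public static String classify(String verb) {
        // Class IIa1: disyllabic (C)VCuC base with final "c" and second vowel "u"
        if (Pattern.matches(C + "?" + V + C + "uc", verb)) return "classIIa1";
        // Class IIa2: same shape but with final "s"
        if (Pattern.matches(C + "?" + V + C + "us", verb)) return "classIIa2";
        return "unclassified";
    }

    public static void main(String[] args) {
        System.out.println(classify("piluc")); // piluc matches (C)VCuc
        System.out.println(classify("wadus")); // wadus matches (C)VCus
    }
}
```

Pattern.matches anchors the whole string, so the regular expression behaves exactly like a finite automaton accepting only lemmas of the given shape.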

5.4 Verb Forms

The input for the verb in the example XML specification of Figure 5.1 is as follows:

<head pos="verb" tensemode="pasttense">piluc</head>

The first attribute is “pos” which stands for part of speech and the second attribute is

“tensemode”.

Verbs in Indian languages inflect for tense, aspect, modality (mode), and PNG

(person, number and gender) endings. The verbs co-occur with tense, aspect, and

modality in most of the languages whereas aspect and modality are packed into a

single verbal inflection in Telugu and referred to as "tensemode" in the current

work. There are a total of 18 verb forms including both finite and non-finite forms

which are of importance in Telugu. Our morphological engine has the capacity to

generate all the verb forms but only the finite forms are used by the surface

realization engine and therefore we discuss the finite forms in detail.

5.4.1 The Imperative

The imperative verbs are used to express a command or a request. The meaning of

the imperative verb takes the form of a command in the singular and a request in the

plural. The imperative forms of the verb are only used when the first person in the

singular addresses the second person either in the singular or in the plural. Therefore,

the imperative forms carry two suffixes. In the case of negative imperative the

second person suffix is added to the verb root + “ak” (negative imperative suffix).

The imperative suffixes are as shown in Table 5.1.

Form | II Person Singular | II Person Plural
Affirmative | u (in some cases "i") | aMdi
Negative | ak-u | ak-aMdi
Table 5.1: Imperative Suffixes

Principles for the formation of the imperative verbs are as follows:

a) The verb stems to which the imperative suffixes and the negative imperative suffixes are added are the same basic stems.

b) The rules of stem final vowel loss and harmony (i.e. change of medial "u" to "a" when followed by "a") apply to imperative verbs.

Example:

pAdu (sing) + aMdi → pAdaMdi (request to sing)

c) Stems ending in “s” preceded by a long vowel change “s” to “y” in the

imperative mood. These stems optionally add the suffix “i” instead of “u” in

the singular. When the “i” suffix is added the stem vowel is optionally

shortened and “y” becomes “yy”.

Example:

ces (do) + u  ceVyyi (command to do)

ces (do) + aMdi  ceVyyaMdi (request to do)

Exception:

When the stem vowel is “A” it is not shortened.

Example:

rAs (write) + u  rAyi (command to write)

d) In the case of basic stems having two syllables ending in "c" or "s" the final

consonant is replaced by "v" before the imperative suffix.

Example:

piluc (call) + u  piluvu (command to call)

kalus (meet) + aMdi  kalavaMdi (request to meet)

e) When the stem variant ends in a long vowel, the initial vowel of the imperative

suffix is dropped.

Example:

rA (come) + u → rA (command to come)

f) One irregular verb in the imperative is “pax-a” (go). The last “a” here is

treated as the imperative suffix.
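Two of the principles above can be sketched as simple string rules. This is a simplified, hypothetical illustration, not the engine's code: it covers only the final-vowel-loss part of rule (b) and the consonant replacement of rule (d), and ignores medial vowel harmony and the exceptions listed above.

```java
// Illustrative sketch of two imperative-formation rules from this section.
public class Imperative {
    // rule (b), final vowel loss: pAdu + aMdi -> pAdaMdi
    static String plural(String stem) {
        if (stem.endsWith("u")) stem = stem.substring(0, stem.length() - 1);
        return stem + "aMdi";
    }

    // rule (d), disyllabic stems in final "c"/"s" replace it by "v": piluc + u -> piluvu
    static String singular(String stem) {
        if (stem.endsWith("c") || stem.endsWith("s"))
            stem = stem.substring(0, stem.length() - 1) + "v";
        return stem + "u";
    }

    public static void main(String[] args) {
        System.out.println(plural("pAdu"));    // pAdaMdi
        System.out.println(singular("piluc")); // piluvu
    }
}
```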

5.4.2 The Abusive

Many verbs cannot occur in this mood due to semantic restrictions. A few verbs like

“kAlu” (to burn), “kUlu” (to fall), “cAvu” (to die), “pagulu” (to break) etc., occur in

this mood. Some example sentences using abusive verb forms are as follows:

nI illu kUla (May your house fall)

nI kadupu kAla (May your womb (children) burn)

nI mokaM pagala (May your face break)

5.4.3 The Obligative

The obligative is formed by adding the finite or perfective form of a defective verb

“vAl” to the infinitive of a main verb. The finite form of this verb in the future

habitual tense is “vAli” (must). Some example sentences using Obligative verb

forms are as follows:

nenu iMtiki veVlYlYAli (I need to go home)

mIru mA Uru rAvAli (You should come to our town)

The perfective participle of "vAl" is "vAlsi", inflected only in the non-masculine

singular. The obligative verb does not agree with the subject in person, gender, or

number. It occurs always in the third person non-masculine singular or without any

personal suffix.

Example:

mIru gudiki rAvAlsiMxi (You must have come to the temple)

5.4.4 The Future Habitual

The future habitual tense in Telugu can express an action or a state that will take

place in the future or an action or state that is habitual. The sentence “nenu annaM

wiMtAnu” can mean either ‘I will eat food’ or ‘I eat food’. Principles for the

formation of Future Habitual Tense:

a) The basic tense suffix for future habitual tense is “wA/wun”

b) The verb stems like “ammu” (sell), “adugu” (ask) occur unchanged before

the tense suffix.

Example:

ammu (sell) + wA  ammuwA (will sell)

c) In the case of the basic stems ending in “s” or a long vowel the tense suffix is

added directly.

Example:

kalus (meet) + wA  kaluswA (will meet)

d) In the case of basic stems ending in “n” the tense suffix changes to “tA/tun”.

Example:

win + tA wiMtA

e) Single syllable stems ending in "tt" (koVttu, beat) or "pp" (ceVppu, tell)

change these to "da" (koVdawA, will beat) and "bu" (ceVbuwA, will tell)

respectively before the tense suffix "wA", and to "du" (koVduwuMtA, beats)

and "bu" (ceVbuwuMtA, tells) respectively before the tense suffix "wuM".

f) Stems ending in "c", "cc", "Mc" change those elements to "s" before the

tense suffix.

Example:

piluc (call) + wA  piluswA (will call)

5.4.5 The Past

In Telugu the past tense corresponds to two past tenses in English for example

“vaccAnu” in Telugu represents both ‘I came’ and ‘I have come’. The following are

the principles for the formation of past tense:

a) The tense suffix “e/iM”, and the personal suffix are added to the verb stem to

form the past tense

Example:

piluc +iM  piliciM

b) The stem final “u” before the tense suffix “e/iM” is dropped as a result of

sandhi formation.

Example:

woVdugu + e woVdigA

c) A non-initial “u” in the stem becomes “i” when the past tense suffix is added.

Example:

piluc + e  pilicA

d) Verb stems that end in a short vowel + "n" generally have "nA" as

the past tense suffix, but the 3rd person singular feminine has "na" as the past

tense suffix.

Example:

win + nA winnA

e) The past tense suffix for the verb stem "pad" (fall) is "dA" but in the case of

the third person feminine singular it is "da".

Example:

pad + dA  paddA

f) The verb stem ending in “s” becomes “S” in some cases when the past tense

suffix follows.

Example:

kalus + e kaliSA

5.4.6 The Hortative

An Imperative verb that includes the speaker is called the hortative verb. In Telugu

the hortative verb is formed by adding to the verb stem the hortative suffix “xA”

followed by the first person plural “mu/M”. The hortative form also conveys a future

meaning involving both the addresser and the addressed.

Principles for forming the hortative verb form are as follows:

a) The hortative form is obtained by adding the hortative suffix followed by

the first person plural suffix to the verb stem.

Example:

ammu (sell) + xA-M  ammuxAM ((we) will sell)

b) In the case of the future habitual tense forms ending in "c" and "s", these

change to "x" in the hortative.

Example:

piluc (call) + xA-M  pil-ux + xA-M  piluxxAM ((we)

will call) (Table 5.3 class IIa1)

5.4.7 The Negative Finite

In Telugu, negation is expressed through a verb paradigm of its own rather than

by a separate word of negation as in most languages. The negative verbs are

in the future habitual tense and negate the affirmative verb occurring in that tense.

Some example sentences using the negative finite verb forms are as follows:

nenu annaM winanu (I will not eat food)

vAdu iMtiki rAdu (He will not come home)

The negative suffix in Telugu is “a”. It occurs after the verb root and before the

personal suffix in the verb. The personal suffixes in the negative tense are the same

as in the other tenses, except that the third person singular non-masculine "xi" and

the third person plural non-human "yi" become "xu" and "vu" in the negative tense.

Principles for the formation of the negative tense are as follows:

a) The negative tense is formed by adding the negative tense suffix “a” to the

basic stem followed by the personal suffix.

Example:

win (eat) + a  win-a (will not eat)

b) The medial “u” of the basic stems having two or more syllables of the form

(C)VCuC(u) changes to “a” when followed by the negative suffix.

Example:

woVdugu (wear) + a  woVdag-a (will not wear)

c) A number of basic stems ending in "c", "s" replace these consonants by "v",

"y" in the negative tense, as in the case of the imperative.

Example:

ces (do) + a  ceVyya (will not do)

piluc (call) + a  piluva (will not call)
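The negative formation above can be sketched as a small string rule. This is an illustrative sketch only: it models rule (a), the consonant replacement of rule (c), and the "xi"/"yi" to "xu"/"vu" personal-suffix change, but not the extra sandhi seen in forms like "ceVyya"; the class and method names are invented.

```java
// Illustrative sketch of negative finite formation:
// stem (+ c/s replacement) + negative suffix "a" + personal suffix,
// with "xi" -> "xu" and "yi" -> "vu" in the negative.
public class NegativeFinite {
    static String negate(String stem, String personalSuffix) {
        if (stem.endsWith("c")) stem = stem.substring(0, stem.length() - 1) + "v";
        else if (stem.endsWith("s")) stem = stem.substring(0, stem.length() - 1) + "y";
        if (personalSuffix.equals("xi")) personalSuffix = "xu";
        if (personalSuffix.equals("yi")) personalSuffix = "vu";
        return stem + "a" + personalSuffix;
    }

    public static void main(String[] args) {
        System.out.println(negate("win", "nu"));   // winanu (I will not eat)
        System.out.println(negate("piluc", "nu")); // piluvanu (I will not call)
    }
}
```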

5.4.8 The Durative

The durative verb is not a regular finite verb as the other finite forms discussed

earlier. The durative verb is a compound verb as at least two verb roots are involved

in its construction (the main verb and “un”).

Telugu language does not distinguish present, past and perfect continuous tenses as

English does. It is shown by the use of adverb of time or only by the context of

discourse. In the absence of time specifying clues the durative verb carries the

present continuous meaning. The durative verb is formed by adding to the basic verb

stem the durative suffix “w/t” followed by “un” in its finite form.

The principles for the formation of the durative finite verb are:

a) In the case of verb stems ending in a short vowel followed by "n" the durative

suffix is "t". The durative verb is "basic stem + t + finite form of un"

Example:

vin + t + un vin-t-un (hearing)

b) In all the other cases the durative suffix is “w”. The durative verb is “basic

stem + w + finite form of un”

Example:

cus + w + un → cus-w-un (seeing)
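The choice between the two durative suffixes can be sketched as a one-line test (an illustrative sketch; the finite form of the auxiliary "un" is passed in ready-made rather than generated, and the names are invented):

```java
// Illustrative sketch: stems ending in a short vowel + "n" take the durative
// suffix "t"; all other stems take "w". Short vowels in WX are lowercase.
public class Durative {
    static String durative(String stem, String finiteUn) {
        String suffix = stem.matches(".*[aiueo]n") ? "t" : "w";
        return stem + suffix + finiteUn;
    }

    public static void main(String[] args) {
        System.out.println(durative("vin", "un")); // vintun (hearing)
        System.out.println(durative("cus", "un")); // cuswun (seeing)
    }
}
```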

5.5 Verb Morphology Engine

In the current work the verb root “piluc” of the example in Figure 5.1 becomes

“piliciMxi” after going through a few steps. The steps the verb root undergoes to get

the required inflection are as follows:

a) Identification of the morphophonemic group based on the tense mode.

b) Identification of the verb inflectional class of the given verb.

c) Extraction of the phonetic alternations based on the morphophonemic group

and the verb inflection class.

d) Extraction of the tense mode suffix.

e) Extraction of the personal suffix based on person, gender, number of the

subject.

f) Formation of the final inflected verb by concatenating the extracted

constituents to the verb root.
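The steps above can be traced end to end for the running example. The sketch below hard-codes the lookups that the engine performs against its XML files and the tables of this chapter (class and method names are invented for the illustration):

```java
// Illustrative end-to-end trace of steps (a)-(f) for "piluc" in the past
// tense with a third person non-masculine singular subject.
public class VerbPipeline {
    static String inflect() {
        String root = "piluc";
        // (a) tense mode "pasttense" -> morphophonemic group B (Table 5.2)
        // (b) shape (C)VCuc -> Class IIa1
        String stem = root.substring(0, root.length() - 2); // (c) piluc -> pil
        String alternation = "ic";    // (c) Class IIa1 under group B (Table 5.3)
        String tenseSuffix = "iM";    // (d) past tense, non-masculine subject
        String personalSuffix = "xi"; // (e) third person non-masculine singular
        return stem + alternation + tenseSuffix + personalSuffix; // (f)
    }

    public static void main(String[] args) {
        System.out.println(inflect()); // piliciMxi
    }
}
```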

5.5.1 Identification of Morphophonemic group

There are three morphophonemic groups, namely A, B and C, in Telugu. In the current work the morphophonemic group A is divided into three groups, namely A123, A4, and A5, because the phonetic alternations of certain verb classes are different for these subgroups of group A. Group C is also divided into two groups, namely C1-8 and C9, for the same reason. Each of the tense modes in

Telugu belongs to one morphophonemic group. Table 5.2 shows the list of tense

modes and the morphophonemic group they belong to. In the example of Figure 5.1 the

tense mode for the verb is specified as “pasttense”. The morphophonemic group for

past tense is identified as group B by looking at Table 5.2. In the current work Table

5.2 is an XML file named “tensemodeidentification.xml” and is used for identifying

the morphophonemic group.

Tense mode | Morphophonemic Group
Present Participle | A123
Durative | A123
Habitual Future | A123
Conditional | A4
Hortative | A5
Past Participle | B
Past Tense | B
Past Verbal Adjective | B
Concessive | B
Future Habitual Verbal Adjective | B
Conditional | B
Infinitive | C1-8
Abusive | C1-8
Negative Tense | C1-8
Negative Participle | C1-8
Negative Verbal Adjective | C1-8
Obligative | C1-8
Negative Imperative | C1-8
Imperative Plural | C1-8
Imperative Singular | C9
Table 5.2. Tense Modes and Morphophonemic Groups
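The lookup described above can be sketched as a map from tense mode to group. The engine reads this mapping from "tensemodeidentification.xml"; in the sketch below a plain Java map stands in for the XML file, and only a few tense modes are shown (the key spellings are illustrative assumptions):

```java
import java.util.Map;

// Illustrative stand-in for tensemodeidentification.xml: tense mode -> group.
public class GroupLookup {
    static final Map<String, String> GROUP = Map.of(
        "durative", "A123",
        "habitualfuture", "A123",
        "conditional", "A4",
        "hortative", "A5",
        "pasttense", "B",
        "negativetense", "C1-8",
        "imperativesingular", "C9"
    );

    public static void main(String[] args) {
        System.out.println(GROUP.get("pasttense")); // B
    }
}
```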

5.5.2 Identification of verb inflection class

Telugu verbs are divided into six classes (Krishnamurti 1961) of which classes I, II,

III, IV, and V are conjugations of weak (regular) verbs and Class VI consists of

strong (irregular) verbs.

Class I consists of four subclasses which are as follows:

a) Verb bases with three syllables of the form (C1)V1C2V2C3V3 (C stands for

consonant, V stands for vowel, and the occurrence of consonant inside ( ) is

optional) in which “u” occurs as V2 and V3, and C2 is not “c” or “s”.

Example:

woVdugu (to wear), kuduru (to be settled).

b) Disyllabic bases of the form (C1)V1C2V2 or (C1)V1C2C3V2.

Example:

padu (to fall), ekku (to climb)

c) Monosyllabic bases of the form (C1)V1C2 where “n” or “l” occur as the final

consonant.

Example:

nAn (to become wet), cAl (to be sufficient)

d) Disyllabic bases of (C1)V1C2V2C3 type where the final consonant is “l” and

the second vowel is “u”.

Example:

kadul (to move)

Class II consists of two subclasses which are as follows:

a) Disyllabic bases of the (C1)V1C2V2C3 type in which the final consonant is

“c” or “s” and the second vowel is “u”.

Example:

piluc- (to call), wadus- (to get wet)

b) Monosyllabic bases of the (C1)V1C2 type in which the final consonant is “c”

or “s”.

Example:

wis- (to take out), rac- (to smear)

In the implementation of the morphology engine the Class II verbs are further

divided into sub classes. The subclass ‘a’ is further divided into ClassIIa1 and

ClassIIa2 where ClassIIa1 has the final consonant as “c” and ClassIIa2 has the final

consonant as “s”. The subclass ‘b’ is also further divided into two subclasses namely

ClassIIb1, and ClassIIb2.

Class III consists of three sub classes which are defined as follows:

a) A few monosyllabic bases of the form (C1)V1C2 with final “c” belong to this

sub class.

Example:

cAc- (to stretch out)

b) A few stems in final “uc” or “c” belong to this sub class.

Example:

kAluc- (to set fire)

c) A few stems with final “inc” belong to this sub class.

Example:

wittiMc- (to cause to scold)

Class IV consists of two sub classes which are defined as follows:

a) Monosyllabic bases of the type (C1)V1C2C3- which end in final “tt” or in final

“pp” belong to this sub class.

Example:

kott- (to beat), ceVpp- (to tell or speak).

b) Two monosyllabic bases of the same type, one in final “nn” another in

“lYlY” belong to this sub class.

Example:

wann- (to kick), veVlYlY (to go)

In the current work the ClassIVa sub class is further subdivided into ClassIVa1, and

ClassIVa2.

Each monosyllabic base in ClassIVb is treated as a separate class and ClassIVb

becomes ClassIVb1 and ClassIVb2.

Class V consists of seven monosyllabic bases of type (C1)V1C2 with final "n". The bases are an- (to say), kan-1 (to see), kan-2 (to bring forth), kon- (to buy), win- (to eat), vin- (to hear).

Class VI consists of irregular bases. The irregular bases that belong to this class are

icc- (to give), cacc- (to die), weVcc- (to bring), vacc- (to come), av- (to become), po

(to go), cUc- (to see), leVc- (to rise), le (to be), pax- (to go, depart).

The verb "piluc" in the example of Figure 5.1 is of the form (C1)V1C2V2C3, a disyllabic base where the final consonant is "c" and the second vowel is "u". It belongs to Class IIa. Figure 5.2 is the diagrammatic representation of the finite automaton created by the computational model for Class IIa. State 0 is the start state of the automaton and state 4 is the final state. We can see that the first consonant C1 is optional, looping on the same state. In the example of Figure 5.1 the first consonant is "p", which the automaton takes as input and stays in state 0. The automaton then takes V1, which is "i", as input and goes to state 1; at state 1 it takes C2, which is "l", and goes to state 2; at state 2 it takes V2, which is "u", and goes to state 3; and finally at state 3 it takes C3, which is "c", as input and goes to the final state 4.

Figure 5.2: Finite Automaton for Class IIa

5.5.3 Extraction of Phonetic Alternations

The extraction of phonetic alternations is done based on the verb class and the

morphophonemic group of the specified tense mode. Table 5.3 clearly shows the

phonetic alternations each verb class goes through in the process of generating the

final inflected form of the verb. In the case of the verb “piluc” in the example of

Figure 5.1 it is clearly shown in Table 5.3 at class IIa1 under group B (to which the

tense mode “pasttense” belongs) the value is “pil-ic”.

In the current work Table 5.3 is implemented in two steps.

1) The required deletions and replacements are performed on the verb root

through the programming logic.

2) The required alterations to be added are extracted from the XML file

“palterations.xml”.

The first part is the java programming logic which along with the identification of

the verb class performs the required deletion to form the variant of the verb which is

the final constituent of the stem in the inflected verb. The fragment of the java code

which does the required process is presented in Figure 5.3.

Class | Basic alternant / Example word | Group A123 | Group A4 | Group A5 | Group B | Group C1-8 | Group C9
Ia | (C)VCuCu | uCu | uCu | uCu | iC | aC | uC
   | adugu | ad-ugu | ad-ugu | ad-ugu | ad-ig | ad-ag | ad-ug
Ib | (C)VC(C)u | u | u | u | - | - | -
   | pAdu | pAd-u | pAd-u | pAd-u | pAd | pAd | pAd
Ic | (C)Vn/l | - | - | - | - | - | -
   | nAn- | nAn | nAn | nAn | nAn | nAn | nAn
Id | (C)VCul | ul | il | ul | il | al | ul
   | kaxul- | kax-ul | kax-il | kax-ul | kax-il | kax-al | kax-ul
IIa1 | (C)VCuc | us | is | ux | ic | av | -
   | piluc | pil-us | pil-is | pil-ux | pil-ic | pil-av | pil-uc
IIa2 | (C)VCus | us | is | ux | is | av | -
   | wadus | wad-us | wad-is | wad-ux | wad-is | wad-av | wad-us
IIb1 | (C)Vs | s | s | x | s | (V)yy | (V)yy
   | wIs | wI-s | wI-s | wI-x | wI-s | wi-yy | wi-yy
IIb2 | (C)Vc | s | s | x | c | y | y
   | vAc | vA-s | vA-s | vA-x | vA-c | vA-y | vA-y
IIIa | (C)Vc | s | s | x | c | c | c
   | kAc | kA-s | kA-s | kA-x | kA-c | kA-c | kA-c
IIIb | (C)VCuc | us | is | ux | c | c | c
   | kAluc | kAl-us | kAl-is | kAl-ux | kAlu-c | kAlu-c | kAlu-c
IIIc | .*iMc | is | is | ix | iMc | iMc | iMc
   | wittiMc | witt-is | witt-is | witt-ix | witt-iMc | witt-iMc | witt-iMc
IVa1 | (C)Vtt | du | di | da | tt | tt | tt
   | koVtt | koV-du | koV-di | koV-da | koV-tt | koV-tt | koV-tt
IVa2 | (C)Vpp | bu | bi | ba | pp | pp | pp
   | ceVpp | ceV-bu | ceV-bi | ceV-ba | ceV-pp | ceV-pp | ceV-pp
IVb1 | (C)Vnn | M | M | M | nn | nn | nn
   | wann | wa-M | wa-M | wa-M | wa-nn | wa-nn | wa-nn
IVb2 | (C)VlYlY | lY | lY | lY | lYlY | lYlY | lYlY
   | veVlYlY | veV-lY | veV-lY | veV-lY | veV-lYlY | veV-lYlY | veV-lYlY
V | (C)Vn | n | n | n | n | n | n
   | vin | vi-n | vi-n | vi-n | vi-n | vi-n | vi-n
Table 5.3: Phonetic Alternations for Verb Classes

The fragment of code presented in Figure 5.3 deletes the last two letters in the

example word “piluc” as follows:

piluc → pil

In step 2 the alternation “ic” is extracted from the XML file “palterations.xml”.

if (Pattern.matches("[^aAiIuUeEoOM]?[aAiIuUeEoOM][^aAiIuUeEoOM][u][c]", verb)) {
    vclass = "classIIa1";
    verb = verb.substring(0, verb.length() - 2);
}
Figure 5.3: Fragment of Java Code for Class IIa1

5.5.4 Extraction of Tense mode Suffix

Tense mode suffixes are those suffixes which are agglutinated to the verb based on

the tense mode.

Morphophonemic Criteria

The tense-mode suffixes in Telugu language are based on the morphophonemic

groups namely A, B, and C.

Group A: Suffixes beginning with consonants (w, t, x, d)

Group A123

Grammatical Name | Suffix Alternant
Durative Participle | wu/tu
Durative | w/t
Habitual Future | wA/tA/wun/tun
Table 5.4: Suffix Alternant Table for Group A123

Group A4

Grammatical Name | Suffix Alternant
Conditional | we/te
Table 5.5: Suffix Alternant Table for Group A4

Group A5

Grammatical Name | Suffix Alternant
Hortative | xA
Table 5.6: Suffix Alternant Table for Group A5

Group B: Suffixes which begin with a front vowel (i, eV, e)

Grammatical Name | Suffix Alternant
Past Participle | i
Past Tense | e/iM/nA/dA
Past Verbal Adjective | ina/na
Concessive | inA/nA
Future Habitual Verbal Adjective | e
Conditional | iwe
Table 5.7: Suffix Alternant Table for Group B

Group C: Suffixes which begin with a back vowel (a, A, u)

Group C1-8

Grammatical Name | Suffix Alternant
Infinitive | a/an/-
Abusive | a/nu
Negative Tense | a/-
Negative Participle | aka/ka, akunda/kunda
Negative Verbal Adjective | ani/ni
Obligative | Ali
Negative Imperative | aku/ku
Imperative Plural | aMdi/ndi
Table 5.8: Suffix Alternant Table for Group C1-8

Group C9

Grammatical Name | Suffix Alternant
Imperative Singular | u/i/-
Table 5.9: Suffix Alternant Table for Group C9

In the case of the verb "piluc" in the example of Figure 5.1, the tense mode being "pasttense", which belongs to Group B, and the subject being "nonmasculine", the tense mode suffix is "iM".

5.5.5 Extraction of Personal Suffix

Telugu verbs inflect to encode gender, number, and person suffixes of the subject. In

the current work the morphology engine gets the information about attributes of the

subject and uses that information to agglutinate the gender, number, person suffixes

and tense mode suffix to the verb. The eight personal suffixes of the finite verb for

different persons and numbers can be listed as follows:

Person | Singular | Plural
1st person | -nu | -mu
2nd person | -vu | -ru
3rd person | -du (masculine) | -ru (human)
3rd person | -xi (non-masculine) | -yi (non-human)
Table 5.10: Personal Suffixes
In the case of the verb “piluc” in the example of Figure 5.1 the subject “sIwa” is non-

masculine, singular, 3rd person which means the personal suffix is “xi” from the

above Table 5.10.

5.5.6 Formation of the Final Inflected Verb

The final inflected verb is formed by concatenation of all the strings formed from

section 5.5.3 to section 5.5.5.

Final verb → verb + phonetic alternation + tense mode suffix + personal suffix

In the case of the verb “piluc” in the example of Figure 5.1:

Final verb → pil + ic + iM + xi, which is piliciMxi.

5.6 Evaluation

We report an evaluation of the accuracy of our Telugu Verb Morphology engine

with respect to the Telugu verb database downloaded from the Telugu Wiktionary at

https://en.wiktionary.org/wiki/Category:Telugu_verbs. We ran the verbs in the database through the surface realizer rather than through the morphology engine separately, because we wanted to test them with various alternatives of the subject with respect to person, number and gender.

A total of 503 verbs were downloaded and the evaluation was performed. The verbs

were tested for habitual future, durative and past tense. The most important part of

the evaluation was categorizing the verbs into different verb classes based on our

reference grammar book (Bh Krishnamurti 1985). The results of the evaluation are

given in Table 5.11.

The results show that 418 verbs were identified as belonging to different classes, and the engine was able to generate the different verb forms without any errors. The results also show that 85 verbs were not recognized as belonging to any verb class according to the current work. Among the 85 words not identified as belonging to the verb classes are words like "pilupu" which are not considered verbs according to our grammar reference. Some of the words end with "agu", which means "to become", but our grammar reference considers only "avu" as the verb meaning "to become", and we did not treat the two as the same; otherwise the number of failed verbs would have been reduced by 20. We intend to use these evaluation results to drive the development of the morphology engine to extend its coverage.

Class | No. of verbs identified in each class
Ia | 33
Ib | 118
Ic | 2
Id | 1
IIa1 | 10
IIa2 | 0
IIb1 | 24
IIb2 | 0
IIIa | 4
IIIb | 6
IIIc | 141
IVa1 | 26
IVa2 | 3
IVb1 | 4
IVb2 | 0
V | 38
VI | 8
Table 5.11: Evaluation Results for Verb Formation

5.7 Summary

This chapter describes a morphology engine for Telugu verbs based on finite state techniques. The different verb forms and verb classes are described, and the process of forming the inflected verb is presented in detail. The verb morphology engine plays a very important role in the surface realization engine.

-----------------------CHAPTER-6

SENTENCE FORMATION

6.1 Introduction

The grammatical operations of a language like Telugu are based primarily on the

category of words rather than on the structure of their constituents. The category of a

word can be pronoun, noun, verb, adjective, adverb, etc. The words are grouped to

form more syntactically relevant categories called phrases or word groups. The

phrases or word groups form sentences in the language. Sentence formation in

Telugu is the major purpose of the current thesis. The responsibility of the sentence

formation in the current work lies with the SentenceBuilder module. The

SentenceBuilder module is the one which has centralized control over all the other

modules in the surface realization engine. The SentenceBuilder performs the

following tasks in the process of sentence generation.

1) Identifies the constituents of the input with predefined grammatical functions

such as subject, object, complement and verb.

2) Appropriate element builders for the different grammatical functions are

called to create element objects to store all the information received from the

input XML file.

3) The element objects are passed to the appropriate phrase builders to generate

phrases according to the requirements specified in the input. The phrases are

received back from the phrase builders in the form of strings.

4) Agreement rules are applied to the grammatical functions to form the

complete sentence.

5) The word order in which the sentence constituents were given in the

XML input file is preserved.

6) The sentence is sent to the output generator to produce the sentence in the

Telugu script from WX-notation, a transliteration scheme for representing

Indian languages in ASCII notation.

6.2 Identification of Grammatical Functions

Grammatical functions are syntactic roles played by words or phrases in the context of a particular sentence. The current work uses four grammatical functions, namely subject, object, complement and verb. The grammatical function is represented through a sentence-level feature called role. It is used to identify the noun phrase that plays the role of subject, which simplifies verb agreement. A further advantage of using grammatical functions in the current work is that they facilitate the free word order of sentence constituents that Telugu requires. The SentenceBuilder module identifies the grammatical functions by looking for the sentence-level feature role.
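The role lookup can be sketched as follows. This is an illustrative sketch, not the thesis code: the class name RoleFinder is assumed, and for brevity only noun phrases are scanned for the role attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RoleFinder {

    // Collect the "role" attribute of every <nounphrase> element in the
    // input specification, in document order.
    public static List<String> roles(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList phrases = doc.getElementsByTagName("nounphrase");
            for (int i = 0; i < phrases.getLength(); i++) {
                found.add(((Element) phrases.item(i)).getAttribute("role"));
            }
        } catch (Exception e) {
            // For this sketch, malformed input simply yields no roles.
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<document><sentence>"
                + "<nounphrase role=\"subject\"><head>sIwa</head></nounphrase>"
                + "<nounphrase role=\"complement\"><head>pAta</head></nounphrase>"
                + "</sentence></document>";
        System.out.println(roles(xml)); // [subject, complement]
    }
}
```

A full implementation would scan verb phrases as well and report constituents whose role attribute is missing or unknown.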

6.3 Element Builder

The ElementBuilder module in the current work structures the application and facilitates communication among its modules. Each grammatical function has a specific element builder. There are four element builders, namely SOCElementBuilder, VAElementBuilder, AdjectiveElementBuilder, and AdverbElementBuilder. The purpose of the element builders is to convert the lexicalized input provided in the input XML file into element objects. Element objects represent grammatical functions with their grammatical features stored as instance variables. There are four element objects, namely SOCElement, VAElement, AdjectiveElement, and AdverbElement. The SentenceBuilder calls the appropriate element builder to construct the required element object and return it to the SentenceBuilder. The element objects are then passed to the phrase builders, which return the corresponding phrases as strings.
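As an illustration of what an element object holds, here is a minimal sketch of an SOCElement-style class. The field set mirrors the attributes of a head element in the input XML; the exact fields and accessors of the thesis implementation may differ.

```java
// Sketch of an element object: the grammatical features of a subject,
// object or complement constituent stored as instance variables.
public class SOCElement {
    private final String lexeme;
    private final String pos;
    private final String gender;
    private final String number;
    private final String person;
    private final String caseMarker;
    private final String stem;

    public SOCElement(String lexeme, String pos, String gender, String number,
                      String person, String caseMarker, String stem) {
        this.lexeme = lexeme;
        this.pos = pos;
        this.gender = gender;
        this.number = number;
        this.person = person;
        this.caseMarker = caseMarker;
        this.stem = stem;
    }

    public String getLexeme()     { return lexeme; }
    public String getPos()        { return pos; }
    public String getGender()     { return gender; }
    public String getNumber()     { return number; }
    public String getPerson()     { return person; }
    public String getCaseMarker() { return caseMarker; }
    public String getStem()       { return stem; }

    public static void main(String[] args) {
        // Feature values taken from the subject phrase of Figure 6.1
        SOCElement subject = new SOCElement("sIwa", "noun", "nonmasculine",
                "singular", "third", "", "basic");
        System.out.println(subject.getLexeme() + "/" + subject.getGender());
    }
}
```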

6.4 Phrase Formation

In the Telugu language the primary meaning of a sentence is expressed in the form of word groups or phrases. The order in which the phrases are arranged in a sentence is relatively free in Telugu compared with languages like English. In the current work the module that takes care of the formation of a phrase is the PhraseBuilder module, to which the SentenceBuilder module sends the element objects to be realized as phrases. The phrase builder can be NounPhraseBuilder, AdjectivePhraseBuilder, VerbPhraseBuilder, or AdverbPhraseBuilder, because noun phrase, adjective phrase, verb phrase, and adverb phrase are the four main types of phrases in Telugu. Telugu phrases generally have a head-modifier structure, and therefore the phrases in the current work are head-modifier phrases. The noun phrase and the verb phrase play a very important role in the formation of a sentence as the heads of head-modifier phrases. The adjective phrase and the adverb phrase play the secondary role of modifier of the phrase head. The following sections give a detailed description of the construction of all the phrases, along with the morphology of adjectives and adverbs.

6.4.1 Noun Phrase

Noun phrases are composed of one or more nouns or pronouns, or of a noun modified by one or more adjectives. Every noun phrase has an identifiable head, a noun or a pronoun. In the current work the noun phrase is a very simple phrase with a head and an optional adjectival modifier. The construction of the noun phrase is the responsibility of the NounPhraseBuilder.

sIwa aMxamEna pAta gattigA pAdiMxi. (Sita sang a beautiful song loudly)

The XML file that generates the above sentence is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
<sentence type=" " predicatetype="verbal" respect="no">
<nounphrase role="subject">
<head pos="noun" gender="nonmasculine" number="singular" person="third" casemarker=" " stem="basic">sIwa</head>
</nounphrase>
<nounphrase role="complement">
<modifier pos="adjective" type="descriptive" suffix="aEna">aMxamu</modifier>
<head pos="noun" gender="nonmasculine" number="singular" person="third" casemarker="" stem="basic">pAta</head>
</nounphrase>
<verbphrase type=" ">
<modifier pos="adverb" suffix="gA">gatti</modifier>
<head pos="verb" tensemode="pasttense">pAdu</head>
</verbphrase>
</sentence>
</document>
Figure 6.1. Example Input XML Specification

In the above sentence there are two noun phrases, which are passed to the NounPhraseBuilder as two SOCElements (see Section 6.3) by the SentenceBuilder module. The first SOCElement is passed to the morphology engine by the NounPhraseBuilder to get back the noun phrase “sIwa”, which has only a head and no modifier. The second SOCElement contains a noun and another element, an AdjectiveElement. The noun is passed to the morphology engine and the AdjectiveElement is passed to the AdjectivePhraseBuilder. A noun group or noun phrase in the nominative (explicitly unmarked) is the subject of the clause or sentence. In the above sentence the noun phrase “sIwa” plays the role of subject and the noun phrase “aMxamEna pAta” plays the role of complement. Telugu follows the nominative-accusative pattern, not the ergative-absolutive one, so the verb agrees with the argument in the nominative case.

In the current work there can be one noun phrase which plays the role of a subject

and one or more noun phrases which play the role of complements. The noun phrase

in the subject role can be in the nominative (unmarked) or the dative (marked with

“ki” or “ku”).

rAmudiki Akali veswuMxi (Rama is hungry)

The above sentence is an example where the noun phrase in the subject role is

marked with the dative case.
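The assembly performed by the noun phrase builder (modifier, if any, placed before the head) can be sketched as follows. This is an illustrative sketch, not the thesis code: the method name build is assumed, and the modifier and head strings are taken as already inflected by the morphology engine.

```java
// Minimal sketch of head-modifier noun phrase assembly: the (optional)
// adjective modifier precedes the head noun, as in "aMxamEna pAta".
public class NounPhraseBuilder {

    public static String build(String modifier, String head) {
        if (modifier == null || modifier.isEmpty()) {
            return head;                  // head-only phrase, e.g. "sIwa"
        }
        return modifier + " " + head;     // modifier before head
    }

    public static void main(String[] args) {
        System.out.println(build(null, "sIwa"));        // sIwa
        System.out.println(build("aMxamEna", "pAta"));  // aMxamEna pAta
    }
}
```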

6.4.2 Adjective Phrase

An adjective phrase is an optional modifying element of a noun phrase in a head-modifier noun phrase. The adjective phrase can be a single word or a group of words which act as the modifier of a noun. In the current work the generation of the adjective phrase is the responsibility of the AdjectivePhraseBuilder. The adjective phrase is currently restricted to a single word; this restriction is expected to be lifted in future versions. In the example XML file of Figure 6.1 the AdjectivePhraseBuilder passes the AdjectiveElement to the morphology engine to get back the adjective phrase “aMxamEna”. The AdjectivePhraseBuilder passes “aMxamEna” back to the NounPhraseBuilder, which completes the construction of the noun phrase “aMxamEna pAta”.

6.4.2.1 Adjective Morphology

Telugu adjectives are generally indeclinable and most often occur before the noun

they qualify. Adjectives are divided into four classes.

Class I. The first class of adjectives are called basic adjectives. The adjective roots

that occur only as adjectives belong to this class. These roots always appear before

the nouns they qualify.

Class II. The second class of adjectives are called derived adjectives. These adjectives are derived from nouns, verbs, or adverbs.

Class III. The third class of adjectives are called positional adjectives. These

words can either be used as nouns or as adjectives depending on the position in the

sentence.

Class IV. The fourth class of adjectives are called bound adjectives. They occur bound in a limited number of attributive compounds formed with particular classes of adjectives, nouns and adverbs.

6.4.2.1.1 Basic Adjectives

There are very few words that can be used as adjectives only and not as anything

else. Some examples of sentences that use basic adjectives are as follows:

A illu mAxi (That house is ours)

I puswakaM nAxi (This book is mine)

nAku oka kalaM kAvAli (I want a pen)

Ame aragaMta sepu mAtlAdiMxi (She talked for half an hour)

Ayana prawiroju nannu widawAdu (He scolds me every day )

The italicized words in the above sentences are basic adjectives.

6.4.2.1.2 Derived Adjectives

Adjectives derived from other parts of speech like nouns, verbs, adverbs are called

derived adjectives. Some example sentences consisting of derived adjectives are as

follows:

iMti kappu kuruswuMxi (The roof of the house leaks)

Ayana wammudu nAku welusu (I know his younger brother)

6.4.2.1.3 Positional Adjectives

Almost all the nouns in the nominative singular can function as adjectives when

followed by a noun. Here the noun which does not take the oblique form acts as an

adjective based on its position and therefore it is called a positional adjective. All

the cardinal numbers used adjectively belong to this class of adjectives. Some

example sentences are as follows:

seru pappu. (a seer full of dal)

reMdu puswakAlu. (Two books)

ixxaru manuRulu (Two people)

6.4.2.1.4 Bound Adjectives

All words of colour, taste, and density belong to this class of adjectives.

Noun       Bound Adjective
civara     cittacivara
moVxalu    mottamoVxalu
koVsa      kottakoVsa
bayalu     battabayalu
naduma     nattanaduma
Table 6.1. Example Bound Adjectives

6.4.2.1.5 Pronominalized Adjectives

Adjectives with pronominal suffixes (vAdu (singular masculine), xi (singular non-

masculine), vAru (plural human), vi (plural non-human)) are called

pronominalized adjectives. Pronominalized adjectives are used to express

sentences like ‘a good one’, ‘a big one’ in English.

Some examples of sentences using pronominalized adjectives are as follows:

I gaxi pexxaxi (This room is big)

I puswakAlu koVwwavi (These books are new)

vAdu goVppavAdu (He is great)

Ame piccixi (She is mad)

vAlYlYu maMci vAlYlYu (They are good people)

In sentences where an adjective is used as a predicate, it takes a pronominal suffix that agrees with the subject phrase in number and gender. Pronominal adjectives like nA, mA, mana, nI, and mI, when used as predicates, take the appropriate pronominal suffix in the same way as other adjectives. A few example sentences are as follows:

I illu mAxi (This house is ours)

A puswakaM nAxi (That book is mine)

I kalaM nIxi (This pen is yours)

I Uru manaxi (This village is ours)

Pronominalized adjectives can also occur in the subject position because any noun

can be pronominalized.

6.4.2.2 Adjective Morphological Process

Adjectives play the role of a modifier in a noun phrase. All four categories of adjectives discussed above can be used as modifiers in a noun phrase. In the current work the XML input file contains the word and the associated feature “suffix” for an adjective used as a modifier in a noun phrase. Figure 6.2 shows an example noun phrase along with the adjective as modifier.

<nounphrase role="complement">
<modifier pos="adjective" type="descriptive" suffix="ti">pulupu</modifier>
<head pos="noun" gender="nonmasculine" number="plural" person="third" casemarker="" stem="basic">paMdu</head>
</nounphrase>
Figure 6.2. Adjective as a Modifier in a Noun Phrase

The noun phrase specified by the input in Figure 6.2 is “pullati palYlYu” (sour fruits). The modifier “pullati” is a derived adjective formed by adding the suffix “ti” to the noun “pulupu”. The formation of the modifier in Figure 6.2 can be written along with its features as follows:

“pulupu”, adjective, descriptive, “ti” → pullati

The current work does not take the type attribute into consideration; this will be included in future work. The formation of the adjective “pullati” is as follows:

pulupu pul

pulpulla

pulla + ti  pullati

After the final sandhi the adjective “pullati” is formed from the noun “pulupu” as the

modifier of the noun phrase.
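Since the stem change pulupu → pul → pulla is lexeme-specific, a generation sketch can keep such alternations in a small exception table and then attach the suffix. The class name is hypothetical, and the table holds only the one example from the text.

```java
import java.util.Map;

public class AdjectiveMorphology {

    // Lexeme-specific stem alternations (only the example from the text;
    // a real lexicon would list many more entries).
    private static final Map<String, String> DERIVED_STEMS = Map.of(
            "pulupu", "pulla");

    // Derive an adjective from a noun by replacing the noun with its
    // derived stem (if any) and attaching the suffix from the XML input.
    public static String derive(String noun, String suffix) {
        return DERIVED_STEMS.getOrDefault(noun, noun) + suffix;
    }

    public static void main(String[] args) {
        System.out.println(derive("pulupu", "ti")); // pullati
    }
}
```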

6.4.3 Verb Phrase

Simple verbs in their finite forms are inflected for tense, followed by person, number, and gender endings. Various auxiliaries are employed to indicate aspectual, modal and voice distinctions in the actions or states denoted by the verbs. In Telugu, the simple past, future/habitual and progressive (present) tense forms of verbs are derived by affixing “A”, “wA”, and “wunnA” respectively, directly to the root/stem, as illustrated below for a masculine subject:

rAmArAvu Ata AdAdu (Ramarao played a game)

rAmArAvu Ata AduwAdu (Ramarao will play a game)

rAmArAvu Ata AduwunnAdu (Ramarao is playing a game)

In the current work the responsibility for generating the verb phrase lies with the VerbPhraseBuilder. The verb phrase is also a head-modifier phrase, with a verb as the head and an adverb as the modifier. In the example XML file of Figure 6.1 there is only one verb phrase. The VerbPhraseBuilder gets a VAElement which has a verb along with its features and an AdverbElement. The verb is passed on to the morphology engine to get back the head of the verb phrase, “pAdiMxi”. The AdverbElement is passed on to the AdverbPhraseBuilder.

bAgA pAdAdu ((He) sang well)

6.4.4 Adverb Phrase

The word adverb is derived from Latin, where ad- means “to” or “attached to”, indicating that an adverb is a modifier of a verb. In the current work the construction of the adverb phrase is the responsibility of the AdverbPhraseBuilder. In the example XML file of Figure 6.1 the AdverbPhraseBuilder passes the AdverbElement to the morphology engine to get back the adverb phrase “gattigA”. The AdverbPhraseBuilder then passes the adverb phrase back to the VerbPhraseBuilder, which completes the verb phrase “gattigA pAdiMxi”.

6.4.4.1 Adverb Morphology

Adverbs generally occur as modifiers of the verb in a sentence. All adverbs fall into three semantic domains, denoting time, place and manner. Adverbs in the time and place domains are morphologically nouns, since they form oblique stems and inflect with case suffixes. Some words referring to directions, like “muMxu” (before), “venaka” (after), “kiMxa” (below), and “pEna” (above), occur both as independent time-place adverbs and as postpositions of complements within the predicate phrase.

6.4.4.1.1 Adverbs of Time

Some adverbial nouns of time occur uninflected in a sentence. Some examples of

adverbial nouns that occur uninflected are illustrated in the following sentences:

nenu repu mAtlAdawAnu. (I will talk tomorrow)

vAdu rAwrulu wiruguwuMtAdu.(He roams around in the night)

Ame iMkA iMtiki rAlexu. (She didn’t come home yet)

In the above sentences the italicised words “repu”, “rAwrulu”, and “iMkA” occur uninflected. Some adverbial nouns of time take bound particles or suffixes like “e”, “ke”, and “lo”. Some sentences that include these suffixes are as follows:

awanu iMtinuMci veMtane vaccAdu. (He came immediately from the house)

rAmu nAku ixivarake weVlusu. (I know Ramu since long time)

In the above sentences the italicized words veVMtan + e and ixivara + ke include the suffixes “e” and “ke”. Some adverbs are derived from nouns by the addition of “gA”, the infinitive of “av” (to become). Some example sentences illustrating the use of “gA” are as follows:

nuvvu AlasyaMgA vaccAvu. (You came late)

vAdu nemmaxigA naduswunnAdu. (He is walking slowly)

Some postpositions like “warvAwa” and “maxya”, when used independently after demonstrative adjectives like “A” and “I”, act as adverbs in the sentence. Some example sentences illustrating the use of these postpositions are as follows:

I maxya mA Uru veVlYlYAnu. (I went to my home town recently)

A warvAwa prasAxu vaccAdu. (Prasad came later)

6.4.4.1.2 Adverbs of Place

Some adverbs of place like “akkada”, “ikkada” are used in uninflected form. Some

example sentences containing such adverbs are as follows:

akkada kurci uMxi. (The chair is there)

nuvvu ikkada kurcoV (You sit here)

Some nouns of place or direction become adverbs by the addition of “gA”. Some

examples are as follows:

awanu nAku xUraMgA kurcunnAdu (He sat far away from me)

nuvvu nAku xaggaragA uMdu (You be closer to me)

The verb “niMdu” can be used as “niMdA”, a postposition that forms an adverb in a sentence. Some examples are as follows:

vAdi saMciniMdA atukulunnAyi. (His bag is full of flattened rice)

Ame bIruvAniMdA puswakAlunnAyi. (Her cupboard is full of books)

6.4.4.1.3 Adverbs of Manner

One way of forming manner adverbs is by adding “gA” to adjectives and nouns (those which do not specify time or place). The following is a list of adjectives and nouns from which adverbs are formed by adding “gA”.

Adjective/Noun    Adverb
peVxxa            peVxxagA
bAgu              bAgugA
ceVdda            ceVddagA
cinna             cinnagA
meVwwa            meVwwagA
Table 6.2. Example Manner Adverbs

Some example sentences are as follows:

awanu pexxagA navvadu. (He doesn’t laugh much)

Ame bAgA caxuvuwuMxi. (She studies well)

rAxa cAla aMxaMgA vuMxi. (Radha is very beautiful)

Words like “alA”, “ilA”, and “eVlA” are shorter forms of “alAgA”, “ilAgA”, and “eVlAgA”, used as manner adverbs. Some example sentences are as follows:

okasAri ilA raMdi. (Come here once)

mIru kudA alA ceyyaMdi. (You also do like that)

The suffixes “gA” and “lAgA” convert nominal predicates to adverbs when used

before verbs like “un”, “kanabadu/kanipiMc”, and “natiMc”. Some example

sentences are as follows:

awanu pexxamaniRilAgA unnAdu. (He looks like an elderly man)

I illu cinnaxigA uMxi. (This house seems to be small)

The suffix “gA” when added to nouns referring to physical or psychological states

converts them to manner adverbs. The subject in such sentences occurs in the dative

case and the finite verb is always “un”.

nAku caligA uMxi. (I am feeling cold)

vAdiki I Uru kowwagA uMxi. (This town is new to him)

The words “niMdA”, and “ArA” are added to nouns to form adverbs. Some example

sentences are as follows:

nenu kadupuniMdA annaM winnAnu. (I ate my fill)

annamayya xevuNNi kalYlYArA cusAdu. (Annamayya saw God with his own eyes)

6.4.4.2 Adverb Morphological Process

Adverbs play the role of a modifier in the verb phrase. All the categories of adverbs discussed above can be used as a modifier in a verb phrase. In the current work the adverb and its suffix are specified as the modifier of a verb phrase in the XML file. An example verb phrase with the adverb as modifier is given in Figure 6.3. The verb phrase specified in Figure 6.3 is “cinnagA pAduwu” (singing slowly). The modifier “cinnagA” is a manner adverb formed by adding the suffix “gA” to the noun “cinna”.

<verbphrase type=" ">
<modifier pos="adverb" suffix="gA">cinna</modifier>
<head pos="verb" tensemode="presentparticiple">pAdu</head>
</verbphrase>
Figure 6.3. Adverb as a Modifier in a Verb Phrase

The formation of the modifier in Figure 6.3, together with its features, can be written as follows:

“cinna”, adverb, “gA” → cinnagA

The formation of the adverb “cinnagA”, which is a simple sandhi, is as follows:

cinna + gA → cinnagA

6.5 Agreement in Verb

Grammatical agreement is the morphological phenomenon in which word forms co-occurring in a clause become sensitive to each other. Predicate agreement describes the morphological changes in the predicate of a sentence with respect to the presence of some other word (subject or object) in the sentence. Predicates in Telugu can be divided into three categories, namely verbal, nominative and abstract. In the case of a verbal predicate in Telugu, the finite verb exhibits agreement in number, gender, and person with its subject nominal, which is always in the nominative. The use of the accusative case for inanimate objects is optional: for inanimate objects the accusative marker can be omitted and the word can appear in nominative form in a sentence.

sIwa puswakA(nni) konnaxi (Sita bought a book)

sIwa puswaka(M) konnaxi (Sita bought a book)

In the first sentence above the accusative suffix is added to the complement. In the second sentence it is not, because the use of the accusative case for inanimate objects (puswakaM, paMdu) is optional in Telugu. Both sentences mean the same, but the second is the one in regular use.

In the current thesis the predicate type can only be verbal; the other predicate types, abstract and nominative, will be included in future versions of the work. Some example input XML files and the generated surface forms are discussed in the following sections to explain the implementation of agreement. In the current work the responsibility for agreement between the nominative subject and the verb lies with the SentenceBuilder, the module which has centralized control over all the other processing modules. The following examples discuss person, number, and gender (PNG) agreement between the subject and the verb.

6.5.1 Agreement with the First Person

In Telugu language “nenu” is the pronoun for the first person. Here we look at an

example XML file which illustrates the agreement between the first person and the

verb.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="masculine" number="singular" person="first"
casemarker=" " stem="basic">nenu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.4 Example XML Specification for Agreement with First Person

The XML file in Fig 6.4 generates the following sentence:

nenu vaswAnu (I will come)

The sentence has a subject and a verb. The subject “nenu” along with its features can

be written as follows:

“nenu”, pronoun, masculine, singular, first, basic → nenu

The verb “vaswAnu” along with its features can be written as follows:

“vaccu”, verb, futuretense → vas

The future tense mode suffix “wA” is agglutinated to the word and it becomes:

vas + wA → vaswA

The sentence builder then agglutinates the PNG suffix for the first person “nu” to the

verb in the future tense as follows:

vaswA + nu → vaswAnu.

If the number feature in the XML file is “plural” rather than “singular”, the sentence is generated as follows:

manamu vaswAmu (We (including the listener) will come)

or

memu vaswAmu (We (excluding the listener) will come)

Whether the subject is “manamu” or “memu” depends on a feature called “exclusive”, which is used only for first person plural pronouns.

The subject “manamu” along with its features can be written as follows:

“nenu”, pronoun, masculine, plural, first, basic, no → manamu

The verb will be inflected as usual except for the PNG suffix. The PNG suffix “mu”

for the first person plural is added to the verb as follows:

vaswA + mu → vaswAmu.

6.5.2 Agreement with the Second Person

The pronoun used to represent the second person in Telugu language is “nuvvu”. An

example XML file which illustrates the agreement between the second person and

the verb is provided here.

The XML file in Fig 6.5 generates the following sentence:

nuvvu vaswAvu (You (singular) will come)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="masculine" number="singular" person="second"
casemarker=" " stem="basic">nuvvu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.5 Example XML Specification for Agreement with Second Person

The subject “nuvvu” along with its features can be written as follows:

“nuvvu”, pronoun, masculine, singular, second, basic → nuvvu

The verb “vaswAvu” along with its features can be written as follows:

“vaccu”, verb, futuretense → vas

The future tense mode suffix “wA” is agglutinated to the word and it becomes:

vas + wA → vaswA

The sentence builder then agglutinates the PNG suffix for the second person “vu” to

the verb in the future tense as follows:

vaswA + vu → vaswAvu.

If the number feature in the XML file is “plural” rather than “singular”, the sentence is generated as follows:

mIru vaswAru (You (plural) will come)

The subject “mIru” along with its features can be written as follows:

“nuvvu”, pronoun, masculine, plural, second, basic → mIru

The verb will be inflected as usual except for the PNG suffix. The PNG suffix “ru”

for the second person plural is added to the verb as follows:

vaswA + ru → vaswAru.

As with the first person, gender makes no distinction in the second person.

6.5.3 Agreement with the Third Person Masculine

An example XML file to illustrate the agreement between the third person and the

verb is provided here.

The XML file in Fig 6.6 generates the following sentence:

vAdu vaswAdu (He will come)

The subject “vAdu” along with its features can be written as follows:

“vAdu”, pronoun, masculine, singular, third, basic → vAdu

The verb “vaswAdu” along with its features can be written as follows:

“vaccu”, verb, futuretense → vas

The future tense mode suffix “wA” is agglutinated to the word and it becomes:

vas + wA → vaswA

The sentence builder then agglutinates the PNG suffix for the third person masculine, “du”, to the verb in the future tense as follows:

vaswA + du → vaswAdu

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="masculine" number="singular" person="third"
casemarker=" " stem="basic">vAdu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.6 XML Specification for Agreement with Third Person Masculine

If the number feature in the XML file is “plural” rather than “singular”, the sentence is generated as follows:

vAlYlYu vaswAru (They will come)

The subject “vAlYlYu” along with its features can be written as follows:

“vAdu”, pronoun, human, plural, third, basic → vAlYlYu

As shown above, when the number is plural the gender for the third person is human (denoting human beings) or non-human (denoting animals and non-living things).

The verb will be inflected as usual except for the PNG suffix. The PNG suffix “ru”

for the third person, plural and human is added to the verb as follows:

vaswA + ru → vaswAru.

6.5.4 Agreement with the Third Person Non-Masculine

As mentioned earlier, in Telugu nouns and pronouns denoting female persons are treated as non-masculine in the singular, but in the plural they are treated as human, along with nouns denoting male persons. The following example XML file illustrates the agreement between non-masculine nouns and the verb.

The XML file in Fig 6.7 generates the following sentence:

axi vaswAxi (She (It) will come)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="nonmasculine" number="singular" person="third"
casemarker=" " stem="basic">axi</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.7 Specification for Agreement with Third Person Non-Masculine

The sentence has a subject and a verb. The subject “axi” along with its features can

be written as follows:

“axi”, pronoun, nonmasculine, singular, third, basic → axi

The verb “vaswAxi” along with its features can be written as follows:

“vaccu”, verb, futuretense → vas

The future tense mode suffix “wA” is agglutinated to the word and it becomes:

vas + wA → vaswA

The sentence builder then agglutinates the PNG suffix for the third person non-

masculine “xi” to the verb in the future tense as follows:

vaswA + xi → vaswAxi.

If the number feature in the XML file is “plural” rather than “singular”, the sentence is generated as follows:

vAlYlYu vaswAru (They (human) will come)

or

avi vaswAyi (Those (non-human) will come)

Which of the above two sentences is generated is decided by the gender feature, which can be “human” or “nonhuman”.

In case the gender is human the subject “vAlYlYu” is generated as described in the

previous section.

In case the gender is non-human the subject “avi” along with its features can be

written as follows:

“axi”, pronoun, nonhuman, plural, third, basic → avi

The verb will be inflected as usual except for the PNG suffix. The PNG suffix “ru”

for the third person, plural and human is added to the verb as follows:

vaswA + ru → vaswAru.

The PNG suffix “yi” for the third person, plural and non-human is added to the verb

as follows:

vaswA + yi → vaswAyi
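The PNG suffix selection described in sections 6.5.1 to 6.5.4 can be collected into one table and sketched as follows. The class name and key scheme are assumptions for illustration; the gender value is consulted only in the third person, where it distinguishes the suffixes.

```java
import java.util.Map;

public class VerbAgreement {

    // Future tense PNG suffixes from sections 6.5.1 to 6.5.4.
    private static final Map<String, String> PNG_SUFFIXES = Map.of(
            "first-singular", "nu",
            "first-plural", "mu",
            "second-singular", "vu",
            "second-plural", "ru",
            "third-singular-masculine", "du",
            "third-singular-nonmasculine", "xi",
            "third-plural-human", "ru",
            "third-plural-nonhuman", "yi");

    // Agglutinate the tense suffix and then the PNG suffix to the verb
    // stem, e.g. vas + wA + nu -> vaswAnu.
    public static String conjugate(String stem, String tenseSuffix,
                                   String person, String number, String gender) {
        String key = person + "-" + number;
        if (person.equals("third")) {
            key += "-" + gender; // gender matters only in the third person
        }
        return stem + tenseSuffix + PNG_SUFFIXES.get(key);
    }

    public static void main(String[] args) {
        System.out.println(conjugate("vas", "wA", "first", "singular", "masculine"));    // vaswAnu
        System.out.println(conjugate("vas", "wA", "third", "singular", "masculine"));    // vaswAdu
        System.out.println(conjugate("vas", "wA", "third", "singular", "nonmasculine")); // vaswAxi
        System.out.println(conjugate("vas", "wA", "third", "plural", "nonhuman"));       // vaswAyi
    }
}
```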

6.6 Word Order

Languages like English, which have a relatively fixed word order, are called positional languages. Telugu, unlike English, is a free word order language, like most other South Asian languages (Dravidian and Indo-Aryan). The order of grammatical functions such as subject, complements and objects is largely free; reordering the phrases of a sentence does not affect the grammatical functions of the nominals.

In Telugu the position or order of occurrence of a noun group does not carry the karaka (theta) role information, which specifies the number and type of noun phrases required syntactically by a particular verb. That information is carried by the postpositions or surface case endings of the nouns (Akshara Bharati et al., 1995). Therefore the relatively free order of the words does not affect the meaning of the sentence.

rAmudu rAvanudini BANaMwo campAdu (Rama killed Ravana with an arrow)

rAmudu BANaMwo rAvanudini campAdu (Rama killed Ravana with an arrow)

BANaMwo rAmudu rAvanudini campAdu (Rama killed Ravana with an arrow)

rAmudu campAdu rAvanudini BANaMwo (Rama killed Ravana with an arrow)

In the current work also the word order is free: the word order of the output is the same as the order in which the input is given. The SentenceBuilder module ensures that the words of a sentence are sent to the output module in the same order in which they were given as input.

6.7 Output Generator

The output generator is the module which actually produces the output as Telugu sentences in the Unicode character set. A sample Telugu output for a sentence in WX-notation is as follows:

rAmudu iMtiki vaccAdu (Sentence in WX-notation)

రాముడు ఇంటికి వచ్చాడు (Sentence in Telugu)

Telugu Unicode Chart

[Code chart of the Telugu Unicode block, U+0C00 to U+0C7F: independent vowels (అ to ఔ), consonants (క to హ), combining vowel signs, digits (౦ to ౯) and fraction signs. The combining signs do not reproduce legibly in plain text and are omitted here.]

Figure 6.8 Unicode Block set for Telugu

Figure 6.8 is the Unicode block for Telugu as of Unicode version 10.0 which

contains characters for Telugu, Gondi and Lambadi languages of Andhra Pradesh

and Telangana. In its original incarnation the code points U+0C01…U+0C4D were a

direct copy of the Telugu characters A1-ED from the 1988 ISCII standard. The grey

areas indicate non-assigned code points.
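The final WX-to-Unicode step can be illustrated with a toy transliterator. This sketch covers only the handful of WX letters needed for the example above (the real scheme covers the full alphabet), and its vowel handling (inherent a, vowel signs, virama between consonants) is the standard Brahmi-script logic rather than the thesis's actual output module.

```java
import java.util.Map;

public class WxTransliterator {
    // Tiny WX -> Telugu subset, enough for "rAmudu iMtiki vaccAdu".
    private static final Map<Character, Character> CONSONANTS = Map.of(
            'r', 'ర', 'm', 'మ', 'd', 'డ', 'k', 'క',
            't', 'ట', 'v', 'వ', 'c', 'చ');
    private static final Map<Character, Character> VOWEL_SIGNS = Map.of(
            'A', 'ా', 'i', 'ి', 'u', 'ు');          // 'a' is inherent: no sign
    private static final Map<Character, Character> VOWELS = Map.of(
            'a', 'అ', 'A', 'ఆ', 'i', 'ఇ', 'u', 'ఉ');
    private static final char VIRAMA = '్';
    private static final char ANUSVARA = 'ం';       // WX 'M'

    public static String toTelugu(String wx) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < wx.length(); i++) {
            char ch = wx.charAt(i);
            if (ch == ' ') { out.append(' '); continue; }
            if (ch == 'M') { out.append(ANUSVARA); continue; }
            if (CONSONANTS.containsKey(ch)) {
                out.append(CONSONANTS.get(ch));
                char next = (i + 1 < wx.length()) ? wx.charAt(i + 1) : ' ';
                if (next == 'a') {
                    i++;                             // inherent vowel: no sign
                } else if (VOWEL_SIGNS.containsKey(next)) {
                    out.append(VOWEL_SIGNS.get(next));
                    i++;
                } else {
                    out.append(VIRAMA);              // bare consonant, e.g. in "cc"
                }
            } else if (VOWELS.containsKey(ch)) {
                out.append(VOWELS.get(ch));          // word-initial independent vowel
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTelugu("rAmudu iMtiki vaccAdu")); // రాముడు ఇంటికి వచ్చాడు
    }
}
```

Running it on the WX sentence above yields రాముడు ఇంటికి వచ్చాడు, matching the sample output of the output generator.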

6.8 Summary

This chapter presented the details of the sentence formation mechanism used in the current thesis and gave a detailed description of the different modules used in the construction of the Java application for Telugu sentence formation.

-------------------------CHAPTER-7

CONCLUSION

CONCLUSION

7.1 Introduction

This thesis presented research on designing and developing a surface realization engine (surface realizer, or realizer for short) for Telugu, an Indian language belonging to the Dravidian family. A surface realizer is the module in a natural language generation (NLG) system that accepts an input specification consisting of lexicalized sentence constituents and their associated features and auto-generates grammatically valid sentences by applying correct morphology, syntax and orthography.

Although surface realization engines for European languages such as English and

French have been available since the 90s, to the best of our knowledge there is no

general-purpose realization engine for any Indian language. In this context, the

current research work assumes special significance in developing a surface

realization framework that is suitable for Indian languages, demonstrated to work

with Telugu language.

The developed framework is an adaptation of the now popular SimpleNLG

framework that has recently been applied to a wide spectrum of languages including

German (Marcel Bollmann, 2011), Filipino (Ethel Ong et al., 2011), French (Vaudry

and Lapalme, 2013), and Brazilian Portuguese (Rodrigo de Oliveira and Sripada,

2014).

A major effort while building a realization engine for a new language relates to acquiring the required linguistic knowledge. Not only can finding the required knowledge sources be difficult, but even once the right sources are found, it can be quite challenging to acquire all the required knowledge from them. The SimpleNLG framework provided the right guidance both in identifying the correct knowledge sources and in acquiring the required knowledge.

Because Telugu is a morphologically rich language (MRL), the realization framework required for it should give morphology a central role. The SimpleNLG framework separates morphology from syntax, thereby offering a realization framework that supports the independent development of the rich morphological processing required by the Telugu language. The current research work developed a finite-state-technology-based framework for Telugu morphology.
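The flavour of this finite-state treatment can be conveyed with a deliberately tiny Java sketch; the transliterated stems and the single alternation rule below are illustrative assumptions, not the rule set actually implemented in the thesis:

```java
public class NounPlural {
    // Finite-state style generation of a plural noun form: append the
    // plural suffix "lu", applying one hypothetical stem-final alternation.
    static String pluralize(String stem) {
        if (stem.endsWith("i")) {
            // Illustrative alternation: stem-final "i" becomes "u"
            // before "lu", e.g. pilli ("cat") -> pillulu ("cats").
            return stem.substring(0, stem.length() - 1) + "ulu";
        }
        return stem + "lu"; // plain concatenation, e.g. amma -> ammalu
    }
}
```

A real finite-state morphology composes many such alternations over the full suffix inventory, which is why a dedicated framework, rather than ad-hoc string handling, pays off for a morphologically rich language.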

Another important feature of the SimpleNLG framework is the use of features to represent grammatical and, at times, semantic information. Our adapted framework too makes extensive use of features to represent grammatical/semantic information in the realization engine. These features help in performing a wide range of operations spanning morphology, syntax and agreement among sentence constituents (e.g. subject-verb agreement).
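The mechanism can be pictured with a small sketch (again illustrative, not the adapted framework's actual API): before morphological realization, the verb copies the agreement-relevant features from its subject.

```java
import java.util.EnumMap;
import java.util.Map;

public class AgreementDemo {
    enum Feature { NUMBER, PERSON, GENDER }

    // Copy the agreement-relevant features of the subject onto the verb,
    // so the morphology stage can select the agreeing verb suffix.
    static Map<Feature, String> agreeWithSubject(Map<Feature, String> subject) {
        Map<Feature, String> verb = new EnumMap<>(Feature.class);
        verb.put(Feature.NUMBER, subject.get(Feature.NUMBER));
        verb.put(Feature.PERSON, subject.get(Feature.PERSON));
        verb.put(Feature.GENDER, subject.get(Feature.GENDER));
        return verb;
    }
}
```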

The evaluation studies carried out during the research work show that the developed realizer generates grammatical sentences and that the morphological modules generate grammatically correct word forms.

7.2 Critical Review

An important property of a realizer is its coverage. The SimpleNLG framework is

based on the principle that realizers should offer complete morphological coverage

while supporting only the most frequently used syntactic forms (which is one of the

reasons why the framework is called Simple). It should be noted that the SimpleNLG framework was originally developed for realizing English, which is known to

involve more syntactic complexity than morphological complexity. While adapting

the SimpleNLG framework for the morphologically rich Telugu, it has been accepted

that achieving an exhaustive coverage of Telugu morphology is only possible at the

theoretical level (the finite state morphological framework) but not necessarily in the

software. As a result, while the finite-state morphological framework developed in the current research work provides a wide-coverage theoretical basis, our evaluation studies showed that the realizer (the software) built using the framework needs to broaden its coverage further. This is particularly true of the coverage of noun and

verb forms, the open class words. Our approach while building the software has been

to focus on all the major types of nouns and verbs as described in our knowledge

source.

In addition to the limited morphological coverage, the current version of the realization engine software uses the WX notation to deal with Telugu orthography, which is not the only notation used by Telugu language technology software. Again,

this is a limitation of the software developed in the research work, but it should be

emphasized that the realization engine framework developed in the current research

is generic and extends beyond what was implemented in the software.
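For context, WX is an ASCII transliteration scheme widely used in Indian language processing; a fragment of the Telugu correspondence might look as follows (an illustrative excerpt only — the notation used by the software is listed in the Appendix, and the full scheme also covers the combining vowel signs):

```java
import java.util.Map;

public class WxFragment {
    // A few WX-to-Telugu-script correspondences (illustrative excerpt;
    // long vowels are written with capital letters in WX).
    static final Map<String, String> WX_TO_TELUGU = Map.of(
        "a", "\u0C05",  // అ TELUGU LETTER A
        "A", "\u0C06",  // ఆ TELUGU LETTER AA
        "ka", "\u0C15", // క TELUGU LETTER KA
        "ga", "\u0C17"  // గ TELUGU LETTER GA
    );
}
```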

An ideal evaluation of the current work should aim to show that the realization

framework and the linguistic knowledge represented by the framework ensure auto-

generation of well-formed words, phrases and sentences. It is difficult to directly

perform evaluation studies on the theoretical framework. Instead, the evaluation studies in the current work focus on showing that the realization engine software, which implements the theoretical framework, generates the required well-formed surface forms. It is worth emphasizing that further evaluation studies are required to quantify the quality of the software output, which in turn would bring greater clarity on the quality of the theoretical framework.

7.3 Future Work

The most important task for future work is to apply the developed framework to other Indian languages, so that it can be claimed to be suitable for the realization of Indian languages in general. This may involve making further refinements to the framework to reflect the differences among the Indian languages. A deeper

understanding of these individual differences would be interesting both from a

theoretical and an applied perspective.

As argued in the previous section, the current version of the software does not

provide complete coverage of the Telugu grammar. Building on the current version, extensions can be made to cover grammatical features currently not covered. An important extension, in this context, is to enable the realization engine to generate more than one grammatically valid form for a given input. This over-generation feature should apply at all levels, including words, phrases and sentences. From a software development perspective this is a significant extension, but one that would make the realization software truly attractive for real-world usage.

Another important direction for future work is to actually use the realization engine

as part of an NLG application. (Several small-scale efforts have been carried out

during the current research work to apply the realization engine, none of which are

significant enough to be reported here.) This will be an important step in the

evaluation of the engine. In addition, this will also help in identifying the relative importance of the different modules and where extensions are really required for a given use case.

Yet another important piece of future work is to rebuild the morphology modules using finite-state tools such as JFlex. Although this would not change the quality of the output, it would significantly improve the portability of the framework to other Indian languages. Because the current version already uses Java's regular expression library, the grammatical knowledge required to write the JFlex input files already exists, and it should not be hard to incorporate a JFlex-based lexer into the Telugu realization engine.
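To illustrate the kind of pattern that would move into a JFlex specification, here is a sketch in the style of the current regular-expression approach; the suffix inventory is a made-up fragment, not the engine's actual list:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixSplit {
    // Match a stem followed by one of a few hypothetical noun endings;
    // the reluctant quantifier keeps the stem as short as possible.
    static final Pattern NOUN = Pattern.compile("(\\w+?)(lu|ni|ki|lo)$");

    static String[] split(String word) {
        Matcher m = NOUN.matcher(word);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return new String[] { word, "" }; // no recognized ending
    }
}
```

In JFlex the same inventory would be declared once as lexer rules, so porting to another language would largely mean swapping the rule file rather than rewriting Java code.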

REFERENCES


1. Akshara Bharati, Rajeev Sangal, S.M. Bendre, Pavan Kumar, Aishwarya 2001 "Unsupervised Improvement of Morphological Analyser for Inflectionally Rich Languages", Proceedings of NLPRS-2001, Tokyo, 27-30 November 2001, Report No: IIIT/TR/2001/4, pp 685-692.

2. Akshara Bharati, Rajeev Sangal, Dipti M Sharma 2007 “SSF: Shakti


Standard Format Guide” LTRC, IIIT, Hyderabad, Report No: TR-LTRC-33.

3. Akshara Bharati, Vineet Chaitanya, Rajeev Sangal 1995 “Natural Language


Processing A Paninian Perspective” Prentice-Hall of India, New Delhi,.

4. Albert Gatt and Ehud Reiter 2009 “SimpleNLG: A realization engine for
practical applications”, Proceedings of ENLG 2009, pp 90-93.

5. Alessandro Mazzei, Cristina Battaglino and Cristina Bosco 2016 "SimpleNLG-IT: adapting SimpleNLG to Italian", Proceedings of the 9th International Natural Language Generation Conference, Edinburgh, UK, September 5-8 2016, pp 184–192.

6. Appelt, D. 1985. “Planning English Sentences”. Cambridge University Press,


Cambridge, UK.

7. Ballesteros, M., Bohnet, B., Mille, S., & Wanner, L. 2015. “Data-driven
sentence generation with non-isomorphic trees”. In Proc. NAACL-HTL’15,
pp. 387– 397.

8. Banaee, H., Ahmed, M. U., & Loutfi, A. 2013. “Towards NLG for
Physiological Data Monitoring with Body Area Networks”. In Proc.
ENLG’13, pp. 193– 197.

9. Bangalore, S., & Rambow, O. 2000. Corpus-based lexical choice in Natural


Language Generation. In Proc. ACL’00, pp. 464–471.

10. Bateman, J. A. 1997. Enabling technology for multilingual natural language


generation: the KPML development environment. Natural Language
Engineering, Vol 3, Issue1, pp. 15–55.

11. Beesley, Kenneth R, Lauri Karttunen. 2003 “Finite State Morphology”. Palo
Alto, CA: CSLI Publications.

12. Belz, A. 2008. “Automatic generation of weather forecast texts using


comprehensive probabilistic generation-space models”. Natural Language
Engineering, Vol 14, Issue 04 pp 431-455.

13. Benoit Lavoie and Owen Rambow, 1997 “A Fast and Portable Realizer for
Text Generation Systems” Proceedings of the Fifth Conference on Applied
Natural Language Processing (ANLP97), Washington pp. 265–268.
14. Brown, C.P 1991. “The Grammar of the Telugu Language”. New Delhi:
Laurier Books Ltd.

15. Cahill, A., Forst, M., & Rohrer, C. 2007. Stochastic realisation ranking for a
free word order language. In Proc. ENLG’07, pp. 17–24.

16. Carenini, G., & Moore, J. D. 2006. Generating and evaluating evaluative
arguments. Artificial Intelligence, Vol 170 Issue 11, pp. 925–952.

17. Chen, D. L., & Mooney, R. J. 2008. Learning to sportscast: a test of


grounded language acquisition. In Proc. ICML’08, pp. 128–135.

18. Cheng, H., & Mellish, C. 2000. Capturing the interaction between
aggregation and text planning in two generation systems. In Proc. INLG ’00,
Vol. 14, pp. 186–193.

19. Dalianis, H. 1999. Aggregation in Natural Language Generation.


Computational Intelligence, Vol 15 Issue 4, pp. 384–414.

20. Dokkara S R S, Penumathsa S V, Sripada S G. 2015 “A Simple Surface


Realization Engine for Telugu” Proceedings of the 15th European Workshop
on Natural Language Generation (ENLG), Brighton Sep, pp. 1–8.

21. Ramakanth Kumar P, Shambhavi B. R, Srividya K, Jyothi B J, Spoorti Kundargi, Varsha Shastri G 2011 "Kannada Morphological Analyzer and Generator Using Trie", IJCSNS International Journal of Computer Science and Network Security, Vol. 11 Issue 1, January, pp 112-116.

22. Elhadad, M., & Robin, J. 1996. An overview of SURGE: A reusable


comprehensive syntactic realization component. In Procedings of the 8th
International Natural Language Generation Workshop (IWNLG’98), pp. 1–4.

23. Ethel Ong, Stephanie Abella, Lawrence Santos, and Dennis Tiu 2011 “A
Simple Surface Realizer for Filipino” 25th Pacific Asia Conference on
Language, Information and Computation, pp. 51–59, 2011.

24. Fleischman, M., & Hovy, E. H. 2002. Emotional Variation in speech-based


Natural Language Generation. In Proc. INLG’02, pp. 57–64.

25. Ganapathiraju M; Lori Levin. 2006: “TelMore: Morphological Generator for


Telugu Nouns and Verbs”, in Proceedings of Second International
Conference on Universal Digital Library, Alexandria, Egypt, pp. 17-19.

26. Gatt, A., Portet, F., Reiter, E., Hunter, J. R., Mahamood, S., Moncur, W., &
Sripada, S. 2009. From data to text in the neonatal intensive care Unit: Using
NLG technology for decision support and information management. AI
Communications, Vol 22 Issue 3, pp.153–186.

27. Girija V. R. and T. Anuradha 2017 Application of Finite State Methods in
Malayalam Text Analysis International Journal of Computer Applications
(0975 – 8887) Volume 168 Issue.12, June 2017, pp. 43-47.

28. Goldberg, E., Driedger, N., & Kittredge, R. I. 1994. Using Natural Language
Processing to Produce Weather Forecasts. IEEE Expert, 2, pp. 45–53.

29. Goldman N 1975 “Conceptual Generation”, in Schank,R. Conceptual


Information Processing. North-Holland/Elsevier, pp. 289-372.

30. Gregory T. Stump 2001: “Inflectional Morphology: A Theory of Paradigm


Structure” Cambridge University Press.

31. Halliday, M., & Matthiessen, C. M. 2004. Introduction to Functional


Grammar (3rd Edition edition). Hodder Arnold, London.

32. Harris, M. D. 2008. Building a large-scale commercial NLG system for an


EMR. In Proc. INLG ’08, pp. 157–160.

33. Hockett 1954: "Two models of grammatical description", in : Word, 10,


pp. 210–234. [= Readings in Linguistics, vol. I, pp. 386–399].

34. Huske-Kraus, D. 2003. Text generation in clinical medicine: A review.


Methods of information in medicine, Vol 42 Issue 1, pp. 51–60.

35. James A Moore and William C Mann 1979 “A Snapshot of KDS A


knowledge Delivery System” Proceedings of the Seventh Annual Meeting,
Association of Computational Linguistics, pp. 51-52.

36. James Hunter, Yvonne Freer, Albert Gatt, Ehud Reiter, Somayajulu Sripada,
Cindy Sykes, and Dave Westwater 2011 “BT-Nurse Computer Generation of
Natural Language Shift Summaries from Complex Heterogeneous Medical
Data”. Journal of the American Medical Informatics Association Sep-Oct Vol
18 Issue 5 pp. 621-624.

37. John Henry Clippinger, Jr. 1977 “Meaning and Discourse - A Computer
Model of Psychoanalytic Speech and Cognition”. The Johns Hopkins Univ.
Press, Baltimore, ISBN 0-8018-1943-1.

38. Johnson, C. D. 1972. Formal Aspects of Phonological Description. Mouton,


The Hague.

39. Kaplan, R. M. and Kay, M. 1994. Regular models of phonological rule


systems. Computational Linguistics, Vol 20 Issue 3:pp. 331–378.

40. Kasper, R. T. 1989. A Flexible Interface for Linking Applications to


Penman’s Sentence Generator. In Proc. Workshop on Speech and Natural
Langauge, pp. 153–158.

41. Knight, K., Hatzivassiloglou, V 1995 “NITROGEN: Two-Level, Many-
Paths Generation”. Proceedings of the ACL-95 conference. Cambridge, MA
pp. 252-260.

42. Koskenniemi, K. 1983. “Two-level morphology: A general computational


model for word-form recognition and production”. Publication 11, University
of Helsinki, Department of General Linguistics, Helsinki.

43. Krishnamurti B H, Gwynn J P L. 1985 “A Grammar of Modern Telugu”


Oxford University Press.

44. Krishnamurti B H. 1961 “Telugu Verbal Bases a comparative and


Descriptive Study” University of California Press Berkley & Los Angeles.

45. Kristina Toutanova, Hisami Suzuki, Achim Ruopp 2008: Applying


Morphology Generation Models to Machine Translation. Proceedings of
ACL-08: HLT, Columbus, Ohio, USA, pp 514–522.

46. Langkilde-Geary, I. 2000. Forest-based statistical sentence generation. In


Proc. ANLP-NAACL’00, pp. 170–177.

47. Langkilde-Geary, I., & Knight, K. 2002. HALogen Statistical Sentence


Generator. In Proc. ACL’02 (Demos), pp. 102–103.

48. Lauri Karttunen 2003 “Computing with Realizational Morphology”.

49. Lauri Karttunen and Kenneth R. Beesley 2005 “Twenty-Five Years of Finite
State Morphology”. Inquiries into Words, Constraints and Contexts.

50. Mairesse, F., & Walker, M. A. 2010. Towards personality-based user


adaptation: Psychologically informed stylistic language generation. User
Modelling and User-Adapted Interaction, Vol 20 Issue 3, pp 227–278.

51. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2014. “Semi-supervised
learning of morphological paradigms and lexicons” in Proceedings of the
14th Conference of the European Chapter of the Association for
Computational Linguistics, Gothenburg, Sweden, April 26-30 2014 pp 569–
578.

52. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm
classification in supervised learning of morphology. In Human Language
Technologies: The 2015 Annual Conference of the North American Chapter
of the ACL, Denver, Colorado, May 31 – June 5, 2015 pp 1024–1029.

53. Mann, W. C., & Matthiessen, C. M. 1983. Nigel: A systemic grammar for
text generation. Tech. rep., ISI, University of Southern California, Marina del
Rey, CA (Technical Report RR) pp.83-105.

54. Marcel Bollmann, 2011 “Adapting SimpleNLG to German” Proceedings of
the 13th European Workshop on Natural Language Generation (ENLG),
Nancy, France, September pp 133– 138,.

55. Meehan J R, 1977 “Tale-Spin an Interactive Program that writes Stories” In


proceedings of the 5th International Joint Conference on Artificial
Intelligence pp 91-98.

56. Mellish, C., Scott, D., Cahill, L., Paiva, D. S., Evans, R., & Reape, M.
(2006). A Reference Architecture for Natural Language Generation Systems.
Natural Language Engineering, Vol 12 Issue 01, pp.1–34.

57. Meteer, M. W., McDonald, D. D., Anderson, S., Forster, D., Gay, L.,
Iluettner, A., & Sibun, P. 1987. “Mumble-86: Design and Implementation”.
Tech. rep., University of Massachusetts at Amherst, Amherst, MA (Technical
Report COINS 87-87).

58. Minnen G, Carroll J, Pearce D. 2000 "Robust Applied Morphological Generation". Proceedings of the 1st International Natural Language Generation Conference, Mitzpe Ramon, Israel, pp. 201-208.

59. Molina, M., Stent, A., & Parodi, E. 2011. “Generating Automated News to
Explain the Meaning of Sensor Data”. In Gama, J., Bradley, E., & Hollm´en,
J. (Eds.), Proc. IDA 2011Springer, Berlin and Heidelberg, pp. 282–293.

60. Nizar Habash 2000 “OxyGen: A Language Independent Linearization


Engine”. JS White (Ed): Amta 2000 LNAI 1934 2000 Springer Verlag
Berlin Heidelberg 2000 pp 68-74.

61. O’Donnell, M. 2001. ILEX: An Architecture for a dynamic hypertext


generation system. Natural Language Engineering, Vol 7 Issue 3, pp. 225–
250.

62. Paiva, D. S., & Evans, R. 2005. Empirically-based control of natural


language generation. In Proc. ACL’05, pp. 58–65.

63. Parameswari K 2011. An Implementation of APERTIUM Morphological


Analyzer and Generator for Tamil. Language in India
www.languageinindia.com 11:5 May 2011 Special Volume: Problems of
Parsing in Indian Languages pp. 41-44.

64. Philip R Cohen and C Raymond Perrault 1979 “Elements of a Plan-Based


Theory of Speech Acts” Technical Report No: 141, Center for the Study of
Reading September pp. 1-67.

65. Pierre-Luc Vaudry and Guy Lapalme 2013 “Adapting SimpleNLG for
bilingual English-French realisation” Proceedings of the 14th European
Workshop on Natural Language Generation, Sofia, Bulgaria, August 8-9 pp
183–187.

66. PJ Antony, KP Soman 2012 "Computational morphology and natural language parsing for Indian languages: a literature survey". International Journal of Scientific & Engineering Research, Volume 3, Issue 3, March 2012, ISSN 2229-5518, pp 1-11.

67. Plachouras, V., Smiley, C., Bretz, H., Taylor, O., Leidner, J. L., Song, D., &
Schilder, F. 2016. Interacting with financial data using natural language. In
Proc. SIGIR’16, pp. 1121–1124.

68. Portet, F., Reiter, E., Gatt, A., Hunter, J. R., Sripada, S., Freer, Y., & Sykes,
C. 2009. Automatic generation of textual summaries from neonatal intensive
care data. Artificial Intelligence, Vol 173 Issue 7-8, pp. 789–816.

69. Richard J Hanson, Robert F. Simmons and J Slocum 1972 “Generating


English Discourse from Semantic Networks” Communications of the ACM,
October Volume 15 Issue 10, pp. 891-905

70. Ramos-Soto, A., Bugarin, A. J., Barro, S., & Taboada, J. 2015. Linguistic
Descriptions for Automatic Generation of Textual Short-Term Weather
Forecasts on Real Prediction Data. IEEE Transactions on Fuzzy Systems,
Vol 23 Issue 1, pp. 44–57.

71. Rao G U M, Kulkarni P A. Computer Applications in Indian Languages,


Hyderabad 2006: The Centre for Distance Education, University of
Hyderabad,.

72. Ratnaparkhi, A. 2000. Trainable methods for surface natural language


generation. In Proc. NAACL’00, pp. 194–201.

73. Reiter E, Dale R. 2000 “Building natural language generation systems”,


Cambridge University Press, New York.

74. Reiter, E., Robertson, R., & Osman, L. M. 2003. Lessons from a failure:
Generating tailored smoking cessation letters. Artificial Intelligence, Vol 144
Issue 1-2, pp. 41–58.

75. Reiter, E., Sripada, S., Hunter, J. R., Yu, J., & Davy, I. 2005. Choosing
words in computer-generated weather forecasts. Artificial Intelligence, Vol
167 Issue 1-2, pp. 137–169.

76. Rieser, V., & Lemon, O. 2009. Natural Language Generation as Planning
Under Uncertainty for Spoken Dialogue Systems. In Eacl 2009, pp. 683–
691.

77. Roark, Brain, Sproat R. 2007 “Computational approaches to Morphology and


Syntax” Oxford.

78. Rodrigo de Oliveira, Somayajulu Sripada 2014 “Adapting SimpleNLG for


Brazilian Portuguese realisation”. Proceedings of the 8th International

Natural Language Generation Conference, Philadelphia, Pennsylvania, 19-21
June 2014 pp 93–94.

79. Siddharthan, A., Green, M., van Deemter, K., Mellish, C., & van der Wal, R.
2013. Blogging birds: Generating narratives about reintroduced species to
promote public engagement. In Proc. INLG’13, pp. 120–124.

80. Siddharthan, A., Nenkova, A., & McKeown, K. R. 2011. Information Status
Distinctions and Referring Expressions: An Empirical Study of References to
People in News Summaries. Computational Linguistics, Vol 37 Issue 4, pp.
811–842.

81. Smriti Singh, MrugankDalal, Vishal Vachhani, Pushpak Bhattacharyya, Om


P. Damani 2007 “Hindi Generation from Interlingua (UNL)” in Proceedings
of MT summit.

82. Sri Badri Narayanan R, Saravanan S, Soman KP. 2009 “Data Driven Suffix
List and Concatenation Algorithm for Telugu Morphological Generator”.
International Journal of Engineering Science and Technology (IJEST) ISSN :
0975-5462 Vol. 3 Issue 8 pp. 6712-6717.

83. Stock, O., Zancanaro, M., Busetta, P., Callaway, C., Kru¨ger, A., Kruppa, M.,
Kuflik, T., Not, E., & Rocchi, C. 2007. “Adaptive, intelligent presentation of
information for the museum visitor in PEACH”. User Modeling and User-
Adapted Interaction, Vol 17 Issue 3, pp. 257–304.

84. Theune, M., Klabbers, E., de Pijper, J.-R., Krahmer, E., & Odijk, J. 2001.
From data to speech: a general approach. Natural Language Engineering, Vol
7 Issue 1, pp. 47–86.

85. Thompson H 1977 “Strategy and Tactics: A model for Language Production”
Papers from the Thirteenth Regional Meetings, Chicago Linguistic Society
pp. 651-668.

86. Turner, R., Sripada, S., Reiter, E., & Davy, I. 2008. Selecting the Content of
Textual Descriptions of Geographically Located Events in SpatioTemporal
Weather Data. In Applications and Innovations in Intelligent Systems XV,
pp. 75–88.

87. Uma Maheshwar Rao, G. and Christopher Mala 2011 “TELUGU WORD
SYNTHESIZER” International Telugu Internet Conference Proceedings,
Milpitas, California, USA 28th-30th September pp 1-8.

88. Van Deemter, K., Krahmer, E., & Theune, M. 2005. Real versus
templatebased natural language generation: A false opposition?.
Computational Linguistics, Vol 31 Issue 1, pp. 15–24.

89. Vaudry, P.-L., & Lapalme, G. 2013. Adapting SimpleNLG for bilingual
French English realisation. In Proc. ENLG’13, pp. 183–187.

90. Vishal Goyal, Gurpreet Singh Lehal 2011 "Hindi to Punjabi Machine Translation System", Proceedings of the ACL-HLT 2011 System Demonstrations, Portland, Oregon, USA, 21 June 2011, pp 1–6.

91. Walker, M. A., Rambow, O., & Rogati, M. 2002. Training a sentence planner
for spoken dialogue using boosting. Computer Speech and Language, Vol 16
Issue 34, pp. 409–433.

92. Walther V Hahn, Wolfgang Hoeppner, Antony Jameson, Wolfgang Wahlster,


1978 “The Anatomy of the Natural Language Dialogue System HAM-RPM”
In AJCL Microfiche 77, pp 53-67.

93. Wanner, L., Bosch, H., Bouayad-Agha, N., & Casamayor, G. 2015. Getting
the environmental information across: from the Web to the user. Expert
Systems, Vol 32 Issue 3, pp. 405–432.

94. Weizenbaum, Joseph 1966. "ELIZA—a computer program for the study of
natural language communication between man and
machine". Communications of the ACM. Vol 9: pp. 36–45.

95. Winograd T. 1972. “Understanding Natural Language”. Cognitive


Psychology, Vol 3 Issue 1, pp. 1–191.

APPENDIX


WX-Notation for Telugu

PAPERS EMANATED FROM THE THESIS

1) "A Simple Surface Realization Engine for Telugu". Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), Brighton, September 2015, pp 1-8.

2) “Verb Morphological Generator for Telugu”. Indian Journal of Science and


Technology, Vol 10 Issue 13, DOI: 10.17485/ijst/2017/v10i13/110448, April
2017 pp 1-11.

3) “Morphological Generator for Telugu Nouns and Pronouns”. International


Journal of Computer Applications (0975 – 8887) Volume 165 Issue.5, May
2017 pp 6-14.

