Contributions To English To Hindi Machine Translation Using Example-Based Approach
DEEPA GUPTA
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY DELHI
HAUZ KHAS, NEW DELHI-110016, INDIA
JANUARY, 2005
CONTRIBUTIONS TO ENGLISH TO HINDI
MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH
by
DEEPA GUPTA
Department of Mathematics
Submitted
for the award of the degree of
Doctor of Philosophy
to the
Indian Institute of Technology Delhi
My Parents,
My Thesis Supervisor...
Certificate
The thesis has reached the standards fulfilling the requirements of the regulations
relating to the degree. The work contained in this thesis has not been submitted to
any other university or institute for the award of any degree or diploma.
Department of Mathematics
Indian Institute of Technology Delhi
Delhi (INDIA)
Acknowledgement
If I said that this thesis is mine alone, it would be totally untrue. It is like a dream
come true. There are people in this world, some of them so wonderful, who helped
in making this dream a product that you are holding in your hand. I would like to
thank all of them.
Dr. Niladri Chatterjee, my mentor, guru and friend, taught me the basics of research
and stayed with me right till the end. His efforts, comments, advice and ideas
developed my thinking and improved my way of presentation. Without his constant
encouragement, keen interest, inspiring criticism and invaluable guidance, I would
not have accomplished my work. I admit that his efforts deserve much more than
this acknowledgement can convey.
I acknowledge and thank the Indian Institute of Technology Delhi and Tata Infotech
Research Lab, who funded this research. I sincerely thank all the faculty members of
the Department of Mathematics; especially, I express my gratitude to Prof. B. Chandra
and Dr. R. K. Sharma for providing me continuous moral support and help. I
thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time
and efforts. I also thank the department administrative staff for their assistance. I
extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and
Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,
and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening
discussions on the basics of languages.
I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for always
cheering me.
Shailly and Geeta, amazing friends, read the manuscript and gave honest comments.
Both of them also stayed with me through the process and handled me, and
sometimes my out-of-control emotions, so well. Especially, I wish to extend my
thanks to Geeta for providing me a stay in her hostel room, and also for her wonderful
help when my leg got fractured, when we had known each other for only a month. I wish
to acknowledge Krishna for his constant help, both academic and nonacademic, and
his continuous encouragement.
I convey my sincere regards to my parents and brothers for the sacrifices they have
made, for the patience they have shown, and for the love and blessings they have
showered. I thank Arun for his moral support. Most important of all, I would like
to express my profound sense of gratitude and appreciation to my sister Neetu. Her
irrational and unbreakable belief in me bordered on craziness at times.
I cannot fail to mention my friend Sharad, who deserves more than a little
acknowledgement. His constant inspiration and untiring support have sustained me
throughout.
Deepa Gupta
Abstract
b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in
the Indian subcontinent. Demand for developing MT systems from English to
these languages is increasing rapidly. But at the same time, development of
The major contributions of this research may be described as follows:
We feel that the overall scheme proposed in this research will pave the way for
Contents
1 Introduction 1
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4 Cost Due to Different Functional Slots and Kind of Sentences . . . . 185
5.9 Splitting Rules for Converting Complex Sentence into Simple Sentences . . 229
Appendices 280
A 281
B 291
C 299
C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures 299
D 303
E 305
Bibliography 308
List of Figures
2.1 The five possible scenarios in the SL → SL’ → TL’ interface of partial
case matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
List of Tables
2.5 Different Functional Tags Under the Functional Slot <S> or <O> . . 56
2.6 Different Possible Morpho Tags for Each of the Functional Tag under
the Functional Slot <S> or <O> . . . . . . . . . . . . . . . . . . . . 58
Indefinite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.6 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence “I work.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.7 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence “Sita sings ghazals.” . . . . . . . . . . . . . . . . . . . . . . . . 201
5.8 Weighting Scheme for Different POS and Syntactic Role . . . . . . . 202
5.9 Best Five Matches by Syntactic Similarity for the Input Sentence “I
work.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.10 Best Five Matches by Syntactic Similarity for the Input Sentence “Sita
sings ghazals.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.11 Functional-morpho Tags for the Input English Sentence (IE) and the
Retrieved English Sentence (RE) . . . . . . . . . . . . . . . . . . . . 204
5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the
Input Sentence “I work.” . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the
5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence “I work.” by Using Semantic and Syntactic Based Similarity
Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence “Sita sings ghazals” by Using Semantic and Syntactic based
5.27 Five Most Similar Sentences for RC "You go to India." Using Cost of
Adaptation Based Scheme
5.28 Five Most Similar Sentences for MC "You should speak Hindi." Using
Cost of Adaptation Based Scheme
5.29 Five Most Similar Sentences for RC "He wants to learn Hindi." Using
Cost of Adaptation Based Scheme
5.30 Five Most Similar Sentences for MC "The student should study this
book." Using Cost of Adaptation Based Scheme . . . . . . . . . . . . 263
Chapter 1
Introduction
Machine Translation (MT) is the process of translating text units of one language
(source language) into a second language (target language) by using computers. The
need for MT is greatly felt in the modern age due to the globalization of information,
where a global information base needs to be accessed from different parts of the world.
Although most of this information is available online, the major difficulty in dealing
with it is that its language is primarily English. From science, technology and
education to gadget manuals and commercial advertisements, the predominant
presence of English as the medium of communication is easily observed. This world,
however, is multilingual, and different languages are spoken in its different parts. In
the Indian context in particular, the need for developing MT systems for translating
from English into native Indian languages is very acute. In this work we looked into
different aspects of designing an English to Hindi MT system using the Example-Based
(Nagao, 1984) technique. Two fundamental questions that we feel we should
answer at this point are:
Development of MT systems has taken a big leap in the last two decades. Typically,
machine translation requires handcrafted and complicated large-scale knowledge
(Sumita and Iida, 1991). Various MT paradigms have so far evolved depending
upon how the translation knowledge is acquired and used. For example,
1. Rule-Based Machine Translation (RBMT): Here rules are used for the analysis
and representation of the "meaning" of the source language texts, and the
generation of equivalent target language texts (Grishman and Kosaka, 1992).

2. Statistical-Based Machine Translation (SBMT): Here translation knowledge is
acquired automatically from large bilingual corpora, as proposed by IBM in the
early 1990s (Brown, 1990), (Brown et al., 1992), (Brown et al., 1993),
(Germann, 2001).
However, these techniques have their own drawbacks. The main drawback of
RBMT systems is that sentences in any natural language may assume a large variety
of structures. Also, machine translation often suffers from ambiguities of various
types (Dorr et al., 1998). As a consequence, translation from one natural language
into another requires enormous knowledge about the syntax and semantics of
both the source and target languages. Capturing all this knowledge in rule form is a
daunting task, if not an impossible one. On the other hand, SBMT techniques depend
on the availability of large volumes of parallel corpora.
Example-Based Machine Translation (EBMT) (Nagao, 1984), (Carl and Way, 2003)
makes use of past translation examples to generate the translation of a given input.
An EBMT system stores in its example base translation examples between two
languages, the source language (SL) and the target language (TL). These examples
are retrieved and adapted to produce new translations.
Other researchers too (e.g. (Somers, 1999), (Kit et al., 2002)) have considered
EBMT to be a major and effective approach among the different MT paradigms.
Moreover, a large volume of translated text already exists between English
and different Indian languages in the form of government notices, translation books,
1. Sometimes more than one sentence is also retrieved.
advertisement material etc. Although this data is generally not available in
electronic form yet, converting it into machine-readable form is much easier than
formulating explicit translation rules as required by an RBMT system. In fact, some
parallel data in electronic form has been made available through some projects (e.g.
Of the different Indian languages, Hindi has some major advantages over the others
as far as working on MT is concerned. Not only is Hindi the national language of
India, it is also the most popular among all Indian languages. With respect to Indian
languages, all the major works that have been reported so far (e.g. ANGLAHINDI
(Sinha et al., 2002), SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004),
MaTra (human-aided MT)) are primarily concerned with English and Hindi as their
preferred languages. In 2003, Hindi was chosen as the "surprise language" in the
DARPA TIDES surprise language exercise.
This world-wide popularity of the language makes the study of English to Hindi
machine translation more meaningful in today’s context.
However, the outputs produced by the above systems are not always the correct
translations of the inputs. Table 1.1 illustrates this with respect to the systems
"AnglaHindi" and "Shakti": it shows the translations produced by these two
systems for different inputs, along with the correct translations of the same sentences.
We have found many such instances where the outputs produced by the systems
may not be considered to be correct Hindi translations of the respective inputs. This
observation prompted us to study different aspects of English to Hindi translation in
order to understand the difficulty of machine translation, particularly with respect
to English to Hindi, and how these shortcomings can be dealt with under an EBMT
framework. This research is concerned with the above studies.
1.1 Description of the Work Done and Summary of the Chapters
The success of an EBMT system rests on two modules: (i) similarity measurement
and retrieval, and (ii) adaptation. Retrieval is the procedure by which a
suitable translation example is retrieved from the system's example base. Adaptation
is the procedure by which the retrieved example is modified to generate the required
translation. Retrieval aims at finding a past example that is similar to the input
sentence. This is due to the fact that the fundamental intuition behind EBMT is
that translations of similar sentences of the source language will be similar in the
target language as well. Thus the concept of retrieval is intricately related to the
concept of "similarity measurement" between sentences.
But the main difficulty with this assumption is that there is no straightforward
way to measure similarity between sentences. Different works have defined different
approaches for measuring similarity between sentences, for example: word-based
metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), character-based metrics (e.g. (Sato,
1992)), syntactic/semantic matching (e.g. (Manning and Schutze, 1999)), DP-matching
between word sequences (e.g. (Sumita, 2001)), and hybrid retrieval schemes
(e.g. (Collins, 1998)).
In this work we propose to look at similarity from the point of view of adaptation.
We suggest that a past example be considered the most similar to an input sentence
if its adaptation towards generating the desired translation is the simplest. The work
carried out in this research is aimed at achieving this goal. Our studies therefore start
in the following way. We first look at adaptation in detail. An efficient adaptation
scheme is very important for an EBMT system because even a very large example
base cannot, in general, guarantee an exact match for a given input sentence. As
a consequence, the need for an efficient and systematic adaptation scheme arises
for modifying a retrieved example, and thereby generating the required translation.
Various adaptation schemes have been proposed in the literature, e.g. (Veale and
Way, 1997), (Shiri et al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of
these schemes suggests that there are primarily four basic adaptation operations:
word addition, word deletion, word replacement and copy.
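The interplay of these four operations can be sketched as a word-level diff between a retrieved source sentence and the input. The sketch below is illustrative only: it assumes whitespace tokenisation and uses Python's `difflib` rather than any retrieval machinery of the thesis.

```python
from difflib import SequenceMatcher

def adaptation_script(retrieved, inp):
    """Derive the basic adaptation operations (copy, word replacement,
    word deletion, word addition) needed to turn a retrieved source
    sentence (token list) into the input sentence (token list)."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=retrieved, b=inp).get_opcodes():
        if tag == "equal":
            ops += [("copy", w) for w in retrieved[i1:i2]]
        elif tag == "replace":
            # pair up words position-wise; leftovers become deletions/additions
            ops += [("replace", old, new)
                    for old, new in zip(retrieved[i1:i2], inp[j1:j2])]
            ops += [("delete", w) for w in retrieved[i1 + (j2 - j1):i2]]
            ops += [("add", w) for w in inp[j1 + (i2 - i1):j2]]
        elif tag == "delete":
            ops += [("delete", w) for w in retrieved[i1:i2]]
        elif tag == "insert":
            ops += [("add", w) for w in inp[j1:j2]]
    return ops
```

For instance, adapting the retrieved example "The boy eats rice" to the input "The boy eats rice everyday" yields four copies and one word addition.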
In our approach we started with these basic operations: word addition, word
deletion, word replacement and copy. However, in this respect we notice the following:
1. Both English and Hindi rely heavily on suffixes for morphological changes.
There are a number of suffixes for achieving declension of verbs and nouns.
Further, in Hindi there are situations when morphological changes in the
adjectives are also required, depending upon the number and gender of the
corresponding noun/pronoun. Since the number of suffixes is limited, we feel that,
instead of purely word-based operations, if adaptation operations are focused
on the suffixes, then in many situations a significant amount of computational
effort may be saved.
2. A further observation with respect to Hindi is that there are situations when,
instead of suffixes, whole words are used for bringing in morphological variations.
For example, the present continuous form of Hindi verbs is: <Root form of the
verb> + <rahaa/rahii/rahe> + <hai/hain/ho>. Here the words "rahaa",
"rahii" and "rahe" act as morpho-words rather than suffixes.
A major fallout of the above observation is that in some situations adaptation
may be carried out by dealing with the morpho-words instead of whole words, which
is computationally much less expensive than dealing with constituent words as a
whole.
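As a sketch of how small this morpho-word inventory is, the present continuous pattern above can be composed directly from the root. The agreement rules below are simplified illustrative assumptions, not the thesis's actual generation rules.

```python
def present_continuous(root, gender="m", number="sg", person=3):
    """Compose the Hindi present continuous form described in the text:
    <root> + <rahaa/rahii/rahe> + <hai/hain/ho>.
    Agreement handling here is a deliberately simplified sketch."""
    if gender == "f":
        aux1 = "rahii"          # feminine
    elif number == "pl":
        aux1 = "rahe"           # masculine plural
    else:
        aux1 = "rahaa"          # masculine singular
    if person == 2:
        aux2 = "ho"             # (informal) second person
    elif number == "pl":
        aux2 = "hain"
    else:
        aux2 = "hai"
    return f"{root} {aux1} {aux2}"
```

For example, the root "khaa" (eat) gives "khaa rahaa hai" (is eating); deleting or inserting one of these whole words is all that an adaptation step needs.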
One point, however, that we notice with respect to the above operations concerns
translation divergence, i.e. the phenomenon whereby sentences that are similar in the
source language "do not translate into sentences that are similar in structures in the
target language" (Dorr, 1993). We therefore felt that the study of divergence is an
important aspect of any MT system. With respect to an EBMT system the need
arises for two reasons:
• The past example that is retrieved for carrying out the task of adaptation
has a normal translation, but translation of the input sentence should involve
divergence.
• The translation of the retrieved example involves divergence, whereas the input
sentence should have a normal translation.
Although a good amount of work on divergence has been reported with respect to
translation between European languages (e.g. (Dorr et al., 2002), (Watanabe et al.,
2000)), very few studies on divergence may be found regarding translations in Indian
languages. The only work that came to our notice is (Dave et al., 2002). In that
work the authors followed the classification given in (Dorr, 1993) and tried to find
examples of each divergence type with respect to English to Hindi translation. In
this regard it may be noted that Dorr described seven different divergence types:
structural, categorical, conflational, promotional, demotional, thematic and lexical,
with respect to translations between European languages.
However, we find that not all the divergence types explained in Dorr's work apply
to Indian languages. In fact, we found very few (if any) examples of "thematic" and
"promotional" divergence with respect to English to Hindi translation. On the other
hand, we identified three new types of divergence
that have not so far been cited in any other works on divergence. We named these
divergences as “nominal”, “pronominal” and “possessional”, respectively. We have
further observed that all the different divergence types (barring “structural”) for
which we found instances in English to Hindi translation may be further divided into
several sub-categories. Chapter 3 explains in detail different divergence types and
their sub-types that we have observed with respect to English to Hindi translation,
and illustrates them with suitable examples. Some of these results have already been
presented in (Gupta and Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).
To deal with this difficulty we suggest that the example base be partitioned into two
parts: one containing examples of normal translation, the other containing the
examples of divergence, so that given an input sentence an EBMT system may retrieve
an example from the appropriate part of the example base. However, implementation
of this scheme requires algorithms for identifying translation divergence, i.e. if an
English sentence and its Hindi translation are given as input, these algorithms will
detect whether the translation involves any of the said divergence types.
The algorithms compare the functional tags (FTs) and the SPAC structures of the
SL and TL sentences. When these two do not match for a source language sentence
and its translation in the TL, a divergence can be identified. With respect to each
divergence category and its sub-categories we have identified the appropriate FTs
and SPACs whose presence/absence indicates the possibility of a certain divergence.
By systematically analyzing the FTs and SPACs of the English sentence and its
Hindi translation, the algorithms arrive at a decision on whether the translation
involves any divergence. Thus the algorithms partition the example base into two
parts: a Normal Example Base and a Divergence Example Base. Some of these
algorithms have already been presented in (Gupta and Chatterjee, 2003b).
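The partitioning step itself is straightforward once a divergence detector is available. In the sketch below, `detect_divergence` is a hypothetical stand-in for the FT/SPAC-comparison algorithms described above, and the toy sentence pair is an invented illustration.

```python
def partition_example_base(examples, detect_divergence):
    """Split (english, hindi) example pairs into a Normal Example Base
    and a Divergence Example Base, mirroring the partition described
    in the text.  `detect_divergence(pair) -> bool` is assumed."""
    normal, divergent = [], []
    for pair in examples:
        (divergent if detect_divergence(pair) else normal).append(pair)
    return normal, divergent
```

A retrieval module can then search only the part of the example base appropriate to the input at hand.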
To answer the second question, we feel that given an input sentence, if it can be
decided a priori whether its translation will involve divergence, then the retrieval can
be made accordingly. To handle the situation when the translation of the input
sentence does not involve any divergence, we devise a cost of adaptation based two-level
filtration scheme that enables quick retrieval from the normal example base. Chapter 4
describes our scheme of retrieval from the divergence example base in situations
involving divergence. Here our primary attempt is to develop a procedure so that,
given an input English sentence, it can be decided whether its Hindi translation will
involve any type of divergence. Obviously, this decision has to be made before resorting to
actual translation. For this, the algorithm seeks evidence from the example base and
WordNet. In this work we have used WordNet 2.0 to measure the semantic similarity
of the constituent words of the input sentence with various words present in the
example base sentences, to arrive at a decision in this regard. The scheme works in
the following way. We first
identified the roles of different Functional Tags (FT) in causing divergence. We
observe, with respect to the different divergence types and sub-types, that each FT
may have one of the following three roles:
This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given
an input sentence, the scheme first determines its constituent FTs. We have used
the ENGCG parser for parsing an input sentence and obtaining its FTs. This finding
is then compared with the above-mentioned knowledge base (Table 4.2) to identify
the set (D) of divergence types that may possibly occur in the translation of the
sentence. Further investigation is carried out to discard elements from the set D, so
that the divergence that may actually occur can be pinpointed. In this respect we
proceed in the following way. Corresponding to each divergence type we identify the
functional tag that is at the root of causing the divergence. We call it the
"problematic FT" corresponding to that particular divergence. Table 4.3 presents our
findings in this regard. Corresponding to each possible divergence (as found in D) the scheme
13. http://www.cogsci.princeton.edu/cgi-bin/webwn
14. http://www.lingsoft.fi/cgi-bin/engcg
works as follows. It first retrieves from the input sentence the constituent word
corresponding to the problematic FT of the divergence type under consideration. Then
the semantic similarity of this word to other words is measured, and proximity in
this semantic distance is used as a yardstick for similarity measurement. Chapter 5
discusses two similarity schemes: "syntactic similarity" and "semantic similarity".
been borrowed from the domain of Information Technology (Manning and Schutze,
1999). According to the definition given therein, semantic similarity is measured on
the basis of the commonality of words: the more words two sentences have in
common, the more similar the two sentences are said to be. However, it has been
shown in (Chatterjee, 2001) that this measurement of similarity is not always helpful
from the EBMT point of view. For example, it has been shown there that although
the sentences "The horse had a good run." and "The horse is good to run on." have
most of the key words in common, their translations are quite different. In this case,
adaptation may require a large number of constituent word replacement (WR)
operations. Each of these WR operations involves reference to some dictionary for
picking up the appropriate words in the target language. Typically the
dictionary access will involve accessing external storage, and thereby will incur
significant computational cost. Thus a purely syntax-based similarity measurement
scheme may not be suitable for an EBMT system.
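The word-commonality measure just described can be sketched as a simple overlap ratio; the horse example then shows why high overlap need not mean easy adaptation. The Jaccard ratio below is our own illustrative choice, not the exact formula used in the cited works.

```python
def word_overlap(s1, s2):
    """Similarity as commonality of words: the Jaccard ratio of the
    two sentences' word sets (an illustrative stand-in)."""
    w1 = set(s1.lower().rstrip(".").split())
    w2 = set(s2.lower().rstrip(".").split())
    return len(w1 & w2) / len(w1 | w2)

# High word overlap despite very different structures (and very
# different Hindi translations):
a = "The horse had a good run."
b = "The horse is good to run on."
overlap = word_overlap(a, b)   # 4 shared words out of 9 distinct
```

The pair shares "the", "horse", "good" and "run", scoring far above an unrelated sentence, yet adapting one translation into the other would still need many word replacement operations.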
In this work we therefore propose that from the EBMT perspective "retrieval" and
"adaptation" should be looked at in a unified way. The discrepancy between a
retrieved example and the input sentence may be in the actual words, or in the
overall structure of the sentences. For illustration,
suppose the input sentence is "The boy eats rice everyday.", whose Hindi translation
"ladkaa har roz chaawal khaataa hai" has to be generated. The nature of the
adaptation varies depending upon which example is retrieved from the example base. For
illustration:
a) If the retrieved example is “The boy eats rice”, the adaptation procedure needs
to apply a constituent word addition operation (WA) to take care of the adverb
“everyday”.
b) However, if the retrieved sentence is "The boy plays cricket everyday." ~ "ladkaa
roz cricket kheltaa hai", then the adaptation procedure needs to invoke two
constituent word replacement (WR) operations.
c) In case the retrieved example is "The boy is eating rice.", one adaptation
operation, constituent word addition (WA), is required for the adverb "everyday".
Further, the verb form has to be changed: the translation of "The boy
eats rice everyday." should be "ladkaa har roz chaawal khaataa hai". Thus the
morpho-word "rahaa", which is required for the present continuous tense of
the retrieved sentence, needs to be deleted. Further, the suffix "taa" is to be
added to the root main verb to get the required present indefinite verb form
of the input.
d) However, if the retrieved example is "Does the boy eat rice?", then the adaptation
procedure needs to take care of the structural variation between the
"interrogative" form of the retrieved example and the affirmative form of the input
sentence.
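The four cases can be mimicked by scoring each candidate with a word-level edit cost. The per-operation costs below are invented for illustration (replacement and addition cost more because each needs a dictionary access); they are not the calibrated costs the thesis develops later.

```python
from difflib import SequenceMatcher

# Hypothetical per-operation costs -- illustrative values only.
COST = {"equal": 0.0, "delete": 1.0, "insert": 2.0, "replace": 2.5}

def adaptation_cost(retrieved, inp):
    """Approximate adaptation effort as the summed cost of the
    word-level edit operations turning `retrieved` into `inp`."""
    total = 0.0
    for tag, i1, i2, j1, j2 in SequenceMatcher(
            a=retrieved.split(), b=inp.split()).get_opcodes():
        total += COST[tag] * max(i2 - i1, j2 - j1)
    return total

inp = "The boy eats rice everyday"
candidates = ["The boy eats rice",
              "The boy plays cricket everyday",
              "Does the boy eat rice"]
# The candidate needing only one word addition wins.
best = min(candidates, key=lambda c: adaptation_cost(c, inp))
```

Under these toy costs, "The boy eats rice" (one addition) beats the candidate needing two replacements and the interrogative candidate needing structural change.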
Obviously, the greater the discrepancy between the retrieved example and the
input sentence, the greater the number of adaptation operations needed to generate
the desired translation. The above illustrations make certain points evident:
a) Adaptation operations are required for performing two general tasks: dealing
with constituent words (along with their suffixes and morpho-words), and dealing
with the overall structure of the sentences.
The above observations help us proceed towards the intended goal of using cost
of adaptation as a measure of similarity. As a first step, we suggest dividing the
dictionary into several parts based on the part-of-speech (POS) of the words; this
reduces the search time for each dictionary access. The cost of adaptation based
similarity measurement approach then proceeds along the following lines:
a) We first estimate the average cost for each of the ten adaptation operations.
We observe that these costs depend on two major types of parameters. On
one hand, they depend on certain linguistic aspects, such as the average length
of the sentences in both source and target languages, the number of suffixes
(used with different POS), the number of morpho-words etc. On the other
hand, these costs are related to the machine on which the EBMT system is
implemented.
b) At the second step, we estimate the costs incurred in adapting various
functional tags. In particular, we have considered the cost of adaptation due to
variations in active and passive verb morphology, subject/object, pre-modifying
adjective, genitive case and wh-family words. These costs are stored in various
tables in Section 5.4.
Once these basic costs are modelled, we are in a position to experiment with cost
of adaptation as a similarity measure vis-à-vis the semantics and syntax based
similarity measurement schemes discussed above. Our experiments have clearly
established the efficiency of the proposed scheme over the others. Part of this work is also
presented in Gupta and Chatterjee (2003c). Two apparent drawbacks of this scheme are:
1) It may end up comparing a given input with all the example-base sentences
to ascertain the least cost of adaptation.
2) Another major question that may arise is whether the cost of adaptation
scheme is efficient enough to handle sentences that are structurally more complex.
17. In fact we worked on "Functional Slots", which are more general than "Functional
Tags". This is discussed in detail in Section 2.2.
In order to deal with the first difficulty we have proposed a two-level filtration scheme.
This scheme helps in selecting a smaller number of examples from the example base,
which may subsequently be subjected to the rigorous treatment for determining their
costs of adaptation with respect to the given input. We have also justified that this
scheme does not leave out the sentences whose translations are easier to adapt for
the given input.
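The two levels can be sketched generically: a cheap coarse filter followed by a finer (still cheap) score that keeps only the k best candidates for the full cost-of-adaptation comparison. Both `coarse_key` and `fine_score` below are hypothetical stand-ins, not the thesis's actual filters.

```python
def two_level_filter(inp, example_base, coarse_key, fine_score, k=5):
    """Level 1: keep examples sharing a coarse signature with the input
    (e.g. a comparable sentence-length class or functional-slot pattern).
    Level 2: keep the k best under a finer score; only these survivors
    face the full cost-of-adaptation computation."""
    pool = [e for e in example_base if coarse_key(e) == coarse_key(inp)]
    return sorted(pool, key=lambda e: fine_score(inp, e))[:k]
```

With, say, a length-class key and a word-set difference score, a large example base shrinks to a handful of candidates before any expensive comparison is done.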
In this work we have given a solution for the second problem too. We have
given rules for splitting a complex sentence into more than one simple sentence.
Translations of these simple sentences may then be generated by the EBMT system.
These individual translations may then be combined to obtain the translation of the
given complex sentence input. If the cost of adaptation based similarity measurement
scheme is applied for translating the simple sentences, then the cost of adaptation
of the complex sentence too can be estimated, by adding the individual costs with
the cost of combining the individual translations. Since the last operation is purely
algorithmic, its computational complexity can be computed easily, and hence the
overall cost of adaptation can be estimated. With respect to dealing with complex
sentences, however, we have imposed certain restrictions. We considered sentences with
only one subordinate clause. Further, the presence of a connecting word is also
mandatory.
mandatory. Evidently, more complicated complex sentence structures are available,
and further investigations are required for developing techniques for handling them
in an EBMT framework.
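Under those restrictions, the splitting step can be sketched as cutting at the connecting word. The connective list and the capitalisation fix-up below are illustrative assumptions, not the thesis's actual splitting rules (Table 5.9).

```python
# Hypothetical connecting words; the restriction above requires exactly
# one subordinate clause introduced by an explicit connective.
CONNECTIVES = ("because", "when", "although", "if", "since")

def split_complex(sentence):
    """Split a complex sentence with one subordinate clause into two
    simple sentences at the connecting word (illustrative sketch)."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in CONNECTIVES and 0 < i < len(words) - 1:
            main = " ".join(words[:i]) + "."
            sub = " ".join(words[i + 1:]).capitalize() + "."
            return main, sub
    return (sentence,)   # no connective found: treat as simple
```

Each resulting simple sentence can then be translated by the EBMT machinery and the outputs recombined, with the recombination cost added to the individual adaptation costs.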
In this connection we would like to mention that we have explained the cost of
adaptation with respect to a selected set of sentence structures, and for a selected set of
Functional Slots. Certainly, many more variations exist with respect to these
parameters. Consequently, more work has to be done to form rules for handling
these variations. However, we feel that the work described in this research provides
the necessary foundation.
1) The aim of this research is not to construct a complete English to Hindi EBMT
system. Rather, our intention is to analyze the requirements that help in building
an effective EBMT system. The motivation behind this research came from two
major observations:
2) In order to design our scheme we have studied about 30,000 English to Hindi
translation examples available off-line. Although now large volumes of English
1.2. Some Critical Points
English sentence: The horses have been running for one hour.
Tagged form: @DN> ART “the”, @SUBJ N PL “horse” %ghodaa%,
@+FAUXV V PRES “have”, @-FAUXV V PCP2 “be”, @-FMAINV V PCP1
“run” %daudaa%, @ADVL PREP “for”, @QN> NUM CARD “one” %ek %, @<P
N SG “hour” %ghantaa%.
Hindi sentence: ghode ek ghante se daudaa rahen hain
For each example, the example base stores the functional and morpho tag
information for each of the constituent words of the source language (English)
sentence, along with the sentence, its Hindi translation, and the root word
correspondence. Figure 1.1 provides an example of the records stored in our example base.
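A record like the one in Figure 1.1 can be modelled with a small data structure. The field names below are our own illustrative choices, and only a few of the tagged words of the example sentence are shown.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedWord:
    """One constituent word with its functional tag, morpho tags and,
    where available, the corresponding Hindi root word."""
    functional_tag: str           # e.g. "@SUBJ"
    morpho_tags: str              # e.g. "N PL"
    english: str                  # English root, e.g. "horse"
    hindi_root: str = ""          # e.g. "ghodaa" ("" when none)

@dataclass
class ExampleRecord:
    """One record of the example base, patterned after Figure 1.1."""
    english: str
    hindi: str
    words: list = field(default_factory=list)

record = ExampleRecord(
    english="The horses have been running for one hour.",
    hindi="ghode ek ghante se daudaa rahen hain",
    words=[TaggedWord("@DN>", "ART", "the"),
           TaggedWord("@SUBJ", "N PL", "horse", "ghodaa"),
           TaggedWord("@-FMAINV", "V PCP1", "run", "daudaa")],
)
```

Keeping the tag information alongside each word lets both the retrieval filters and the adaptation operations work without re-parsing stored examples.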
In this work we have studied the two major pillars of EBMT: Retrieval and
Adaptation. We feel that the studies made as well as the techniques developed by
18. Appendix B provides the different morpho tags and functional tags that have been used
in this work. These tags are obtained by editing the sentence tagging given by the ENGCG
parser (http://www.lingsoft.fi/cgi-bin/engcg) for English sentences.
this research will be helpful for developing MT systems not only for Hindi but also for
other Indian languages (e.g. Bangla, Gujarati, Punjabi). All these languages suffer
from the same drawback: unavailability of linguistic resources. However, the demand
for developing MT systems from English to these languages is increasing with time,
not only because these are prominent regional languages of India, but also because
they are important minority languages in other countries such as the U.K. (Somers,
1997). The studies made in this research should pave the way for developing EBMT
systems involving these languages as well.
Chapter 2
Adaptation in English to Hindi Translation: A Systematic Approach
2.1 Introduction
The need for an efficient and systematic adaptation scheme arises for modifying a
retrieved example, and thereby generating the required translation. This chapter is
devoted to the study of a systematic adaptation approach. Various approaches have
been pursued in dealing with the adaptation aspect of an EBMT system. Some of the
important ones are discussed below.
1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories:
high-level grafting and keyhole surgery. High-level grafting deals with phrases.
Here an entire phrasal segment of the target sentence is replaced with another
phrasal segment from a different example. On the other hand, keyhole surgery
deals with individual words in an existing target segment of an example. Under
this operation words are replaced or morphologically fine-tuned to suit the
current translation task. For instance, suppose the input sentence is “The girl
is playing in the park.”, and in the example base we have the following examples:
For the high level grafting the sentences (a) and (d) will be used. Then keyhole
surgery will be applied for putting in the translations of the words “park” and
2. Shiri et al. (1997) have proposed another adaptation procedure. It is based on
three steps: finding the difference, replacing the difference, and smoothing the
output. The differing segments of the input sentence and the source template
are identified. Translations of these different segments in the input sentence
are produced by rule-based methods, and these translated segments are fitted
into a translation template. The resulting sentence is then smoothed over by
checking for person and number agreement, and inflection mismatches. For
example, assume the input sentence and selected template are:
into Tt giving the following: Tt : ek bahut yogya mahilaa chikitsak vyasta hai.
3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme. Here
two different cases are considered: full-case adaptation and partial-case adap-
tation. Full-case adaptation is employed when a problem is fully covered by the
Partial-case adaptation is used when a single unifying example does not exist.
Here three more operations are required on top of the above five. These
three operations are ADD, DELETE and DELETEZERO.
Adaptation in English to Hindi Translation: A Systematic Approach
Figure 2.1: The five possible scenarios in the SL → SL’ → TL’ interface of partial
case matching
For ADAPT as well as for ADAPTZERO, both SL and SL′ have the same links
but different chunks. If TL′ has words corresponding to the chunk which is
different in SL and SL′, then the words in TL′ should be modified; this is
the case of ADAPT. On the other hand, if no corresponding chunk is present
in TL′ then it is the case of ADAPTZERO. Therefore, in that case no work is
Suppose the input sentence (I) and the retrieved example (R) are:
I That old woman has died.
R That old man has died. ∼ wah boodhaa aadmii mar gayaa
To generate the desired translation, the translation of the word “man”, i.e.
“aadmii”, is first replaced with the translation of “woman”, i.e. “aurat”, in R. This
operation is called reinstantiation. At this stage an intermediate translation “wah
boodhaa aurat mar gayaa” is obtained. To obtain the final translation “wah boodhii
aurat mar gayii”, the system must also change the adjective “boodhaa” to “boodhii”
and the word “gayaa” to “gayii”. This is called parameter adjustment.
5. The adaptation scheme proposed by McTait (2001) works in the following way.
Translation patterns that share lexical items with the input and partially cover
it are retrieved in a pattern matching procedure. From these, the patterns
whose SL side covers the SL input to the greatest extent (longest cover) are
selected. They are termed base patterns, as they provide sentential context in
the translation process. Intuitively, the greater the extent of the cover provided
by the base patterns, the more context is available, and the less the ambiguity
and complexity in the translation process. If the SL side of the base
pattern does not fully cover the SL input, any unmatched segments are bound
to the variables on the SL side of the base pattern. The translations of the SL
segments bound to the SL variables of the base pattern are retrieved from the
remaining set of translation patterns, and together with the text fragments and
variables on the TL side of the base pattern they form the translation string.
The following is a simple example: given the source language input I: “AIDS
control programme for Ethiopia”, suppose the longest covering base pattern is:
D1: AIDS control programme for (....)∼ ke liye AIDS contral smahaaroo (...).
To complete the match between I and the source language side of D1, a trans-
lation pattern containing the text fragment “Ethiopia” is required i.e.
produce T.
6. In HEBMT (Jain, 1995) examples are stored in an abstracted form for deter-
mining the structural similarity between the input sentence and the example
sentences. The target language sentence is generated using the target pat-
tern of the sentence that has the least distance from the input sentence. The
system translates from Hindi to English; therefore, we explain its adaptation
process with an example of Hindi to English translation.
For example, suppose the input sentence is “merii somavara ko jaa rahii hai”
and it matches with the example sentence R: “meraa dosta itavaar ko aayegaa”.
Steps (a) to (f) below show the process of translation.
Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Güvenir and Cicekli,
1998). But overall, in our view, the adaptation procedures employed in different
EBMT systems primarily consist of four operations:
• Copy, where the same chunk of the retrieved translation example is used in
the generated translation;
• Add, where a new chunk is inserted into the retrieved translation to meet the
requirements of the current input;
• Delete, where some chunk of the retrieved translation is dropped in the
generated translation;
• Replace, where some chunk of the retrieved example is replaced with a new
one to meet the requirements of the current input.
The operations prescribed in different systems vary in the chunks they deal with.
Depending upon the case it may be a phrase, a word or a sub-word (e.g. declensional
suffix).
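Viewed abstractly, these four chunk-level operations can be sketched as follows. This is a hypothetical illustration: the function names and the list-of-chunks representation are our own, not taken from any of the systems surveyed above.

```python
# Hypothetical sketch: the four generic adaptation operations, with a
# target translation viewed simply as a list of chunks (phrases, words
# or sub-words, depending on the system).

def copy_chunk(translation, i):
    """Copy: chunk i of the retrieved translation is retained as it is."""
    return translation[i]

def replace_chunk(translation, i, new_chunk):
    """Replace: chunk i is substituted with a new chunk."""
    return translation[:i] + [new_chunk] + translation[i + 1:]

def delete_chunk(translation, i):
    """Delete: chunk i is dropped from the retrieved translation."""
    return translation[:i] + translation[i + 1:]

def add_chunk(translation, i, new_chunk):
    """Add: a new chunk is inserted at position i."""
    return translation[:i] + [new_chunk] + translation[i:]
```

Concrete systems differ mainly in what a "chunk" is and in how the positions and new chunks are chosen.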
1. snp: noun, adj+noun, noun+kaa+noun
2. npk2: noun+ko
3. mv: verb-part
With respect to English and Hindi, we find that both the languages depend
heavily on suffixes for verb morphology, changing numbers from singular to plural
and vice versa, case endings, etc. Appendix A provides detailed descriptions
of various Hindi suffixes. Keeping the above in view, we differentiated the adap-
tation operations into two groups: word based and suffix based. The word based
operations are further subdivided into two categories: constituent word based and
morpho-word based. Thus the adaptation scheme proposed here consists of ten
operations: Copy (CP), Constituent word deletion (WD), Constituent word addition
(WA), Constituent word replacement (WR), Morpho-word deletion (MD), Morpho-word
addition (MA), Morpho-word replacement (MR), Suffix deletion (SD), Suffix addition
(SA) and Suffix replacement (SR). This classification helps in two ways.
Firstly, it helps in identifying the specific task that has to be carried out in the step-
by-step adaptation for a given input. Secondly, it helps in measuring the average
cost of each of the above operations in a meaningful way, which in turn helps in
estimating the total adaptation cost for a given sentence. This estimate can be used
as a tool for similarity measurement between an input and the stored examples.
These issues are discussed in Chapter 5.
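The ten operations, and the idea of estimating a total adaptation cost from them, can be sketched as follows. The unit costs shown are placeholders of our own; the actual average costs are measured empirically, as discussed in Chapter 5.

```python
# Sketch only: the ten adaptation operations with illustrative
# (placeholder) unit costs, and a total-cost estimate for a sequence
# of operations.

OPERATION_COST = {
    "CP": 0.0,   # Copy
    "WD": 1.0,   # Constituent word deletion
    "WA": 1.0,   # Constituent word addition
    "WR": 1.0,   # Constituent word replacement
    "MD": 0.5,   # Morpho-word deletion
    "MA": 0.5,   # Morpho-word addition
    "MR": 0.5,   # Morpho-word replacement
    "SD": 0.25,  # Suffix deletion
    "SA": 0.25,  # Suffix addition
    "SR": 0.25,  # Suffix replacement
}

def adaptation_cost(operations):
    """Estimated total cost of adapting a retrieved example."""
    return sum(OPERATION_COST[op] for op in operations)

# e.g. one suffix replacement plus one morpho-word replacement:
cost = adaptation_cost(["SR", "MR"])
```

Such an estimate can then serve as the similarity measure used to pick, among several adaptable examples, the one with the least cost.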
1. Constituent Word Replacement (WR): One may get the translation of the
input sentence by replacing some words in the retrieved translation example.
2.2. Description of the Adaptation Operations
Suppose the input sentence is: “The squirrel was eating groundnuts.”, and the
most similar example retrieved by the system (along with its Hindi translation)
is: “The elephant was eating fruits.” ∼ “haathii phal khaa rahaa thaa”. The
desired translation may be generated by replacing “haathii ” with the Hindi of
“squirrel”, i.e. “gilharii ” and replacing “phal ” with the Hindi of “groundnuts”,
i.e. “moongphalii ”. These are examples of the operation of constituent word
replacement.
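These replacements can be sketched as follows; the tiny English-to-Hindi lexicon is our own illustration, with the retrieved Hindi translation viewed as a list of words.

```python
# Sketch of constituent word replacement (WR) on the retrieved example
# "The elephant was eating fruits." ~ "haathii phal khaa rahaa thaa".

lexicon = {"squirrel": "gilharii", "groundnuts": "moongphalii"}

def word_replace(translation, old_word, english_new_word):
    """Replace one constituent word with the Hindi of the new input word."""
    return [lexicon[english_new_word] if w == old_word else w
            for w in translation]

retrieved = ["haathii", "phal", "khaa", "rahaa", "thaa"]
step1 = word_replace(retrieved, "haathii", "squirrel")      # first WR
step2 = word_replace(step1, "phal", "groundnuts")           # second WR
# step2 -> ["gilharii", "moongphalii", "khaa", "rahaa", "thaa"]
```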
2. Constituent Word Deletion (WD): In some cases one may have to delete some
words from the translation example to generate the required translation. For
example, suppose the input sentence is: “Animals were dying of thirst”. If the
retrieved translation example is: “Birds and animals were dying of thirst.” ∼
“pakshii aur pashu pyaas se mar rahe the”, then the desired translation can
be obtained by deleting “pakshii aur” (i.e. the Hindi of “birds and”) from the
retrieved translation. Thus the adaptation here requires two constituent word
deletions.
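This deletion can be sketched as follows, again viewing the retrieved Hindi translation as a list of words (a simplification of our own):

```python
# Sketch of constituent word deletion (WD): "pakshii aur" (the Hindi of
# "birds and") is deleted from the retrieved translation.

def delete_words(translation, words_to_delete):
    """Drop every word that appears in the given set (one WD per word)."""
    return [w for w in translation if w not in words_to_delete]

retrieved = ["pakshii", "aur", "pashu", "pyaas", "se", "mar", "rahe", "the"]
adapted = delete_words(retrieved, {"pakshii", "aur"})
# adapted -> ["pashu", "pyaas", "se", "mar", "rahe", "the"]
```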
3. Constituent Word Addition (WA): Here the addition of words to the trans-
lation example is required for generating the translation. For illustration, one
may consider the example given above with the roles of input and retrieved
sentences being reversed.
khaa rahii hai ”. In order to take care of the variation in tense the morpho-
word “hai ” is to be replaced with “thaa”. This is an example of Morpho-word
replacement.
5. Morpho-word Deletion (MD): Here some morpho-word(s) are deleted from the
7. Suffix Replacement (SR): Here the suffix attached to some constituent word
of the retrieved sentence is replaced with a different suffix to meet the current
The suffix “aa” is replaced with “e” in order to get its plural form in
Hindi.
The suffix “aa” is replaced with “ii ” to get the adjective “burii ”.
4. Of course the final translation will be obtained by adding the suffix “taa” to the word
“khaa”.
The suffix “taa” is replaced with “tii ” to get the verb “padtii ”, which is
required to indicate that the subject is feminine.
The suffix “aa” is replaced with “e” to get the nouns “ladke” and “kamre”.
8. Suffix Deletion (SD): By this operation the suffix attached to some constituent
word may be removed, and thereby the root word may be obtained. This
operation is illustrated in the following examples:
The suffix “en” is deleted from “aauraten” to get the Hindi translation
of “woman”.
The suffix “taa” is deleted from “padtaa” to get the root form “pad ” of
the English verb “read”.
The suffix “on” is deleted from “gharon” and “shabdon” to get the Hindi
translation of nouns “houses” and “words”, respectively.
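A minimal sketch of suffix deletion, assuming the suffix to strip is already known from the rule tables (the function name is our own):

```python
# Sketch of suffix deletion (SD): stripping a known suffix from a word
# to recover its root form.

def delete_suffix(word, suffix):
    """Remove the suffix from the word if present, giving the root form."""
    return word[:-len(suffix)] if word.endswith(suffix) else word

# The examples above:
root1 = delete_suffix("padtaa", "taa")   # -> "pad"
root2 = delete_suffix("gharon", "on")    # -> "ghar"
root3 = delete_suffix("shabdon", "on")   # -> "shabd"
```

Suffix addition (operation 9 below) is simply the inverse: the suffix is concatenated to the root form.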
9. Suffix Addition (SA): Here a suffix is added to some constituent word in the
retrieved example. Note that here the word concerned is in its root form in
the retrieved example. One may consider the examples given above with the
roles of input and retrieved sentences reversed as suitable examples for suffix
addition operation.
10. Copy (CP): When some word (with or without suffix) of the retrieved example
is retained in toto in the required translation then it is called a copy operation.
Figure 2.2 provides an example of adaptation using the above operations. In this
example the input sentence is “He plays football daily.”, and the retrieved translation
example is:
The translation to be generated is: “wah roz football kheltaa hai”. When adaptation
is carried out using both word and suffix operations, the adaptation steps look as
given in Figure 2.2. In this respect one may note that Hindi is a free word order
language, and consequently the position of the adverb is not fixed. Hence the above
input sentence may have different Hindi translations:
While implementing an EBMT system one has to stick to some specific format.
The adverb will be added according to the format adopted by the system.
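Since the retrieved example of Figure 2.2 is not reproduced here, the following sketch assumes a hypothetical retrieved pair “She plays chess.” ∼ “wah shatranj kheltii hai” and adapts it to the required translation using word replacement, word addition and suffix replacement:

```python
# Hypothetical sketch of step-by-step adaptation towards
# "wah roz football kheltaa hai".

def apply_ops(words, ops):
    """Apply (operation, arguments) steps to a list of Hindi words."""
    for op, args in ops:
        if op == "WR":                      # constituent word replacement
            i, new = args
            words[i] = new
        elif op == "WA":                    # constituent word addition
            i, new = args
            words.insert(i, new)
        elif op == "SR":                    # suffix replacement
            i, old, new = args
            assert words[i].endswith(old)
            words[i] = words[i][:-len(old)] + new
    return words

retrieved = ["wah", "shatranj", "kheltii", "hai"]
steps = [("WR", (1, "football")),    # shatranj -> football
         ("WA", (1, "roz")),         # add the adverb "roz"
         ("SR", (3, "tii", "taa"))]  # kheltii -> kheltaa (masculine subject)
result = apply_ops(retrieved, steps)
# result -> ["wah", "roz", "football", "kheltaa", "hai"]
```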
The choice of adaptation operations depends upon the translation example retrieved from the example base. A variety
of examples may be adapted to generate the desired translation, but obviously with
varying computational costs. For efficient performance, an EBMT system, therefore,
needs to retrieve an example that can be adapted to the desired translation with
least cost. This brings in the notion of “similarity” among sentences. The proposed
adaptation procedure has the advantage that it provides a systematic way of evalu-
ating the overall adaptation cost. This estimated cost may then be used as a good
measure of similarity for appropriate retrieval from the example base. How the cost of
adaptation is estimated is discussed in the sections that follow.
Here our aim is to count the number of adaptation operations required in adapt-
ing a retrieved example to generate the translation of a given input. Obviously, de-
pending upon the situation one has to apply some adaptation operations for changing
different functional slots5 (Singh, 2003), such as subject (<S>), object (<O>) and
verb (<V>). Also certain operations are required for changing the kind of sentence, e.g.
5. The following example illustrates the difference between functional slots and functional tags.
Consider the sentence “The old man is weak.”. The subject of this sentence is the noun phrase “The
old man”. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that “the”
is a determiner, “old” is an adjective, and “man” is the subject. But, as mentioned above, the entire
noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is
<S>, i.e. the subject slot. Note that a particular functional slot may have a variable number of words.
The sequence of functional slots in a sentence provides the sentence pattern. The difference between
the various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.
affirmative to negative, negative to interrogative, etc. Table 2.2 contains the nota-
tions for the roles of the different functional slots and the operators, which are
required for the subsequent discussion.
2.3. Study of Adaptation Procedure for Morphological Variation of Active Verbs
The following sections describe how many such operations are required in dif-
ferent cases. In particular we consider the following functional slots and sentence
kinds:
1. Tense and Form of the Verb. Since there are three tenses (viz. Present,
Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect
Continuous), in all one can have 12 different verb structures, and passive form
verb structures as well.
Hindi verb morphological variations depend on four aspects: the gender, number and
person of the subject, and the tense (and form) of the sentence. All these variations
affect the adaptation procedure.

6. The -ing verb form other than the main verb.
7. The -ed or -en verb forms other than the main verb.

In Hindi, these conjugations are realized by using suffixes
attached to the root verbs, and/or by adding some auxiliary verbs (see Table A.3 of
Appendix A). Since there are 12 different structures (depending upon the tense and
form), the adaptation scheme should have the capability to adapt any one of them
for any input type. Hence altogether 12×12, i.e. 144 different combinations are
possible. However, Table A.3 (Appendix A) shows that in Hindi, the perfect continuous
form of any tense has the same verb structure as the continuous form of the same
tense. Therefore we exclude the perfect continuous form from our discussion. Thus our
work concentrates on 9×9, i.e. 81 possibilities.
These 9×9 possible combinations of verb morphology variations are divided into
four groups. The variations depend not only on the tense and form of the verb, but
also on the gender, number and person of the subject of the sentence. However,
since Hindi grammar does not support the neuter gender, every noun is considered
as masculine or feminine.
Therefore, adaptation rules have been developed keeping in view the above. In
general, these rules have been represented in the form of tables where the column and
row headers specify the nature of the subject of the input sentence and the retrieved
example, respectively. The row and column headers are of the form gender, person
and number of the subject, where the gender can be one of M or F, the person can be
one of 1, 2 or 3, specifying first, second or third person, and the number is either
S or P, suggesting singular or plural. Note that here the gender of the English
sentence subject is assigned according to Hindi grammar rules. The content of
the (i, j)th cell suggests the adaptation operations that need to be carried out when
the subject of the input sentence matches the specification of the j th column header,
and that of the retrieved example matches the specification of the ith row header.
Here the input sentence and the retrieved example both have the same tense and
form. Yet, verb morphological variations may occur in the translation depending
upon the gender, number and person of the subjects.
For illustration, we consider the case when both the input and the retrieved sen-
tences have the main verb in present indefinite form. Table 2.3 lists the adaptation
operations involved for verb morphological variations. In general, in this situation
the verb adaptation requires at most one suffix replacement and one morpho-word
replacement. Suffix replacement will be confined to the set {taa, te, tii } (call it S1 ),
while the morpho-word replacement is associated with the set {hain, hai, ho, hoon}
(call it M1 ) (refer to Table A.3). Note that if the person, the number and the gender
of the subject in both the input and retrieved sentences are the same, then only copy
operations will be performed.
We illustrate with an example how Table 2.3 is to be used for adaptation of verb
morphological variations. Suppose the input sentence is “She eats rice.”, and
[Table 2.3: adaptation operations for verb morphological variation; the column headers (Input→) are M1S, F1S, M1P, F1P, M2S, F2S, M3S, M3P, F3S, F3P, and the row headers (Ret’d ↓) range over the same subject specifications.]
the retrieved example is “We eat rice.” ∼ “ham chaawal khaate hain”. In the input
sentence the subject is 3rd person, feminine and singular, whereas in the retrieved
sentence the subject is 3rd person, masculine and plural.
The cell (3, 9), i.e. corresponding to (M1P, F3S), suggests that two adapta-
tion operations are required: suffix replacement (SR) and morpho-word replacement
(MR). The suffix “te” is replaced with “tii” in the main verb “khaate” as a suffix
replacement operation, and the morpho-word “hain” is replaced with “hai” in the re-
trieved Hindi sentence to get the Hindi translation of the input sentence. Although
the subject “ham” also needs to be replaced with “wah” to get the appropriate
Hindi translation of the input sentence, this is not considered in the discussion
on verbs. The translation of the input sentence is: “wah chaawal khaatii hai”.
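The table-driven adaptation just illustrated can be sketched as follows; only the single cell used in the example is filled in, and the function is a simplification of our own:

```python
# Sketch of using an adaptation rule table such as Table 2.3 (present
# indefinite to present indefinite). A full table covers every pair of
# subject specifications.

S1 = {"taa", "te", "tii"}           # present indefinite suffix set
M1 = {"hain", "hai", "ho", "hoon"}  # present indefinite morpho-word set

# (retrieved subject, input subject) -> required operations
RULES = {
    ("M1P", "F3S"): [("SR", "te", "tii"), ("MR", "hain", "hai")],
}

def adapt_verb(words, retrieved_subj, input_subj):
    for op, old, new in RULES.get((retrieved_subj, input_subj), []):
        if op == "SR":    # suffix replacement on the main verb
            words = [w[:-len(old)] + new if w.endswith(old) else w
                     for w in words]
        elif op == "MR":  # morpho-word replacement
            words = [new if w == old else w for w in words]
    return words

# "ham chaawal khaate hain" adapted for the input subject F3S ("she"):
adapted = adapt_verb(["ham", "chaawal", "khaate", "hain"], "M1P", "F3S")
# adapted -> ["ham", "chaawal", "khaatii", "hai"]; the subject replacement
# "ham" -> "wah" is handled outside the verb discussion.
```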
Under this group nine combinations are possible, taking into account three
tenses and three forms for each of them. Adaptation rule tables for the other
eight possibilities have been developed in a similar way. Salient features of these
verb morpho variations are discussed below:
1. Past indefinite to past indefinite: Here the verb morpho variation is done in
a way similar to the present indefinite case discussed just above. However, for
morpho-word replacement, the set to be considered is {thaa, the, thii } (call it
M2 ) instead of the set M1 .
2. Future indefinite to future indefinite: Here, for the suffix replacement, the
set {oongaa, oongii, oge, ogii, egaa, egii, enge, engii } (call it S2 ) is to be
considered (see Table A.3 of Appendix A) instead of the set S1 , i.e. {taa, te,
tii } used for present indefinite.
3. Present continuous to present continuous: Here either a copy operation or
one/two morpho-word replacements are required to deal with the verb
morphological variations, depending upon the variations in the gender, number
and person of the subjects concerned (see Section A.2 of Appendix A). Thus
the rule table for handling this case may be obtained by modifying Table 2.3
suitably.
4. Past continuous to past continuous: Here the verb morpho variation is done
in a way similar to the present continuous case discussed above. Hence, here
too, one may have two morpho-word replacements. For one of them the set
M2 , i.e. {thaa, thii, the} is to be considered instead of the set M1 , i.e. {hai,
hain, ho, hoon}. The set required for the other morpho-word replacement is M3 ,
i.e. {rahaa, rahii, rahe}.
5. Future continuous to future continuous: In this case too either a copy op-
eration or one/two morpho-word replacements are required. If morpho-word
replacement operations are carried out then the relevant sets are as follows:
6. Present perfect, Past perfect, Future perfect: If the input and the retrieved
example both have any one of these three then the verb morphology and adap-
tation operations imitate the rules of continuous form of the respective tense.
The only relevant change is that instead of the set M3 in all the three cases
the set {chukaa, chukii, chuke} (call it M5 ) is to be considered.
The morpho-words and suffixes for the adaptation operations in all the above
discussed cases may be found in Table A.3.
In the case of present perfect, past indefinite and past perfect, sometimes there is a
case ending “ne” with the subject (see Section A.1). In that case, the verb mor-
phology variation will change according to the gender and number of the object,
instead of the gender, number and person of the subject. For past indefinite to past
indefinite transformation, the adaptation operation will either be copy operation or
suffix replacement, whereas in the other two cases the adaptation operations can
be either copy operation or suffix replacement and morpho-word replacement. All
possible suffix variations and morpho-word variations are listed in Section A.2.
In this group the verb morphological variation depends on gender, number and per-
son of the subject, and also on the variation in the tenses of the input and the
retrieved example. This group comprises eighteen combinations of verb morphol-
ogy variations. These 18 possibilities occur due to three different tenses (present,
past, future), and three verb forms (indefinite, continuous, perfect). Some members
of this group are present indefinite to past indefinite, present indefinite to future
indefinite, present continuous to past continuous, etc.
Example 1 : Suppose the input sentence is “She drinks water.”, and the retrieved sen-
tence is “She drank water.” with the Hindi translation as “wah paanii piitii thii ”. The
subjects of both the input and the retrieved example are feminine, 3rd person and
singular. In this situation, only one adaptation operation is required, i.e. morpho-
word replacement. The morpho-word “thii ” is to be replaced with the morpho-word
“hai” to convey the sense of the present indefinite form of the input sentence. Therefore,
the required translation is “wah paanii piitii hai”.
Example 2 : Here the input sentence is “She reads books”, and the retrieved sentence
is “He read books.” with the Hindi translation as “wah kitaabe padhtaa thaa”. The
subject of the input is feminine, 3rd person and singular whereas in the retrieved
sentence the subject is masculine, 3rd person and singular. In this situation two
adaptation operations are required:
1. One suffix replacement: the suffix “taa” is to be replaced with “tii ”; and
2. One morpho-word replacement: the morpho-word “thaa” is to be replaced with “hai”.
The required translation is thus “wah kitaabe padhtii hai”.
[Table 2.4: adaptation operations for verb morphological variation from past indefinite to present indefinite; the column headers (Input→) are M1S, F1S, M1P, F1P, M2S, F2S, M3S, M3P, F3S, F3P, and the row headers (Ret’d ↓) range over the same subject specifications.]
The above two examples summarize the relevant adaptation operations to deal
with verb morphology variations while adapting from past indefinite sentence to
present indefinite sentence. Adaptation may be carried out by either one MR
(morpho-word replacement) operation, or one SR (suffix replacement) and one MR
operation.
Table 2.4 provides all possible adaptation operations which occur due to the
variation in the gender, number and person of the subject.
Some important points regarding the adaptation rules for the other remaining 17
combinations are given below:
1. If the input sentence is in future indefinite form, and the retrieved example is
in present indefinite or past indefinite form, the adaptation operations for this
set are suffix replacement and morpho-word deletion. The morpho-word that
occurs with present indefinite (i.e. {hoon, hai, ho, hain}) or with past indefinite
(i.e. {thaa, the, thii }) has to be deleted.
As an illustration, suppose the input sentence is “I will eat rice.”, and the
retrieved sentence is “I eat rice.” with Hindi translation “main chaawal khaataa
hoon”. Here the suffix “taa” is replaced with the suffix “oongaa”, and the morpho-
word “hoon” is deleted from the retrieved Hindi translation. Therefore, the
required translation is “main chaawal khaaoongaa”.
3. If the verb form is continuous or perfect then, regardless of the tense of the
sentence, the same Table 2.4 will work. This verb form and tense combination
may occur in the input sentence or in the retrieved sentence. The only change
to be incorporated is that a morpho-word replacement (MR) has to be carried
out instead of suffix replacement (SR) operation in the adaptation rule Table
2.4.
For these tenses and verb forms, the suffixes for the suffix replacement and the
morpho-words for the morpho-word replacement may be found in Table A.3.
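The illustration above (one suffix replacement and one morpho-word deletion, adapting “main chaawal khaataa hoon” for the future indefinite input “I will eat rice.”) can be sketched as follows; the helper functions are our own:

```python
# Sketch: SR (taa -> oongaa) plus MD (delete "hoon") on the retrieved
# Hindi translation "main chaawal khaataa hoon".

def suffix_replace(words, old, new):
    """Replace the suffix on whichever word carries it."""
    return [w[:-len(old)] + new if w.endswith(old) else w for w in words]

def morpho_delete(words, morpho_word):
    """Delete the given morpho-word from the translation."""
    return [w for w in words if w != morpho_word]

retrieved = ["main", "chaawal", "khaataa", "hoon"]
adapted = morpho_delete(suffix_replace(retrieved, "taa", "oongaa"), "hoon")
# adapted -> ["main", "chaawal", "khaaoongaa"]
```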
Here the input sentence and the retrieved example have the same tense but they
have different verb forms. Here too eighteen combinations of verb morphological
variation are possible. For example: present indefinite to present continuous, past
indefinite to past perfect, future perfect to future indefinite, etc. Different cases are
discussed below.
Suppose the verb of the input sentence is in future indefinite form, and the verb in
the retrieved example is in future continuous form. Three adaptation operations are
required to take care of all the possible variations in gender, number and person of the
subject. These operations are one suffix addition and two morpho-word deletions.
For the suffix addition, one item from the suffix set S2 , i.e. {oongaa, oongii, oge,
ogii, egaa, egii, enge, engii }, is to be added to the root form of the main verb of the
retrieved Hindi translation.
translation. Note that, since the retrieved example is in future continuous form,
the main verb will be in its root form only, and, therefore, no suffix deletion or
replacement is required. The two morpho-word deletions will be restricted to the sets
{rahaa, rahii, rahe} and {hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively.
The following example illustrates this adaptation procedure:
Let the input sentence be “She will eat rice.”, and the retrieved example “She
will be eating rice.” ∼ “wah chaawal khaa rahii hogii”. In this case, the suffix
“oongii” will be added to “khaa”, and the last two words of the retrieved Hindi
sentence, “rahii” and “hogii”, will be deleted. The addition of the suffix “oongii”
takes place because the subject of the input sentence is feminine, 3rd person and
singular.
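Following the text's example, this adaptation (one suffix addition and two morpho-word deletions) can be sketched as follows; the assumption that the main verb is the last remaining word after the deletions is a simplification of our own:

```python
# Sketch: adapt the future continuous "wah chaawal khaa rahii hogii"
# for the future indefinite input "She will eat rice.".

RAHA_SET = {"rahaa", "rahii", "rahe"}                             # MD set 1
HOGA_SET = {"hoongaa", "hoongii", "honge", "hogaa", "hogii", "hoge"}  # MD set 2

def adapt(words, suffix):
    words = [w for w in words if w not in RAHA_SET | HOGA_SET]  # two MDs
    words[-1] = words[-1] + suffix                              # one SA on the root verb
    return words

adapted = adapt(["wah", "chaawal", "khaa", "rahii", "hogii"], "oongii")
# adapted -> ["wah", "chaawal", "khaaoongii"]
```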
For illustration, suppose for the same input as given above, the retrieved example
is “She will have eaten rice.” ∼ “wah chaawal khaa chukii hogii”. In order to adapt
the verb morphology, the morpho-words “chukii” and “hogii” are to be deleted, and
the suffix “oongii” is to be added to the verb “khaa”. Thus one gets the required
verb morphology.
If the roles of the input and the retrieved sentence are reversed in the above
cases, then, in place of suffix addition, suffix deletion has to be carried out. Further,
the two morpho-word deletions are to be replaced with corresponding morpho-word
additions.
Adaptation rules for dealing with other verb morphological variations belonging
to this group have been developed in a similar way. One may refer to Section A.2
of Appendix A to figure out the appropriate suffixes and morpho-words that will be
involved in the necessary addition/deletion/replacement operation.
The remaining thirty-six possibilities out of the total eighty-one combinations of verb
morphological variations belong to this group. Since it is not possible to discuss all
of them in this report, some typical ones are considered for the present discussion.
In particular, we discuss the case where the input sentence is in present indefinite
form. For illustration, we consider the retrieved examples of the following types: (i)
past continuous (ii) past perfect (iii) future continuous (iv) future perfect. It will
be shown that a single set of adaptation operations is sufficient for all the four cases
mentioned above. These adaptation operations are one suffix addition (SA), one
morpho-word replacement (MR) and one morpho-word deletion (MD). The purposes
of these three operations are as follows:
• For the present indefinite tense the relevant suffix for the main verb is one of
{taa, tii, te} depending upon the gender, person and number of the subject.
However, if the retrieved sentence is one of the four types mentioned above
then the main verb in the Hindi translation is in its root form. Consequently, the
suffix addition operation is required to attach the appropriate suffix.
• In the present indefinite form, one of the following morpho-words { hoon, hai,
ho, hain} has to be used depending upon the number, gender and person of
the subject. However, if the retrieved example is in past tense (irrespective of
continuous or perfect verb form), then the relevant morpho-word set is {thaa,
thii, the}. On the other hand, if the retrieved sentence is in future tense,
whether continuous or perfect verb form, then the relevant morpho-word set is
{hoongaa, honge, hogii, hoge, hogaa, hongii }. The morpho-word replacement
is required to have the right morpho-word in the generated translation.
For illustration, suppose the input sentence is “She eats rice.”, and the retrieved
example is one of the following:
(A) She was eating rice. ∼ wah chaawal khaa rahii thii
(B) She had eaten rice. ∼ wah chaawal khaa chukii thii
(C) She will be eating rice. ∼ wah chaawal khaa rahii hogii
(D) She will have eaten rice. ∼ wah chaawal khaa chukii hogii
Evidently, the sentences (A), (B), (C) and (D) are in past continuous, past
perfect, future continuous and future perfect form, respectively. The modifications
in the retrieved Hindi translations are as follows:
• In the case of all the retrieved examples, the suffix “tii” will be added to the
main verb “khaa”.
• The morpho-word “rahii ” or “chukii ” (depending upon the case) will be deleted.
• The morpho-word “thii” will be replaced with “hai” if the retrieved example
is either (A) or (B), and the morpho-word “hogii” will be replaced with “hai”
if it is (C) or (D).
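For all four retrieved examples (A)-(D), the same three operations yield the required translation “wah chaawal khaatii hai”. A sketch, handling the future morpho-word “hogii” analogously to “thii” (the function is our own simplification):

```python
# Sketch of group-four adaptation for the input "She eats rice."
# (present indefinite): one MD, one MR and one SA.

def adapt_to_present_indefinite(words):
    words = [w for w in words if w not in {"rahii", "chukii"}]       # MD
    words = ["hai" if w in {"thii", "hogii"} else w for w in words]  # MR
    idx = words.index("khaa")
    words[idx] = words[idx] + "tii"                                  # SA
    return words

for retrieved in (["wah", "chaawal", "khaa", "rahii", "thii"],    # (A)
                  ["wah", "chaawal", "khaa", "chukii", "thii"],   # (B)
                  ["wah", "chaawal", "khaa", "rahii", "hogii"],   # (C)
                  ["wah", "chaawal", "khaa", "chukii", "hogii"]): # (D)
    assert adapt_to_present_indefinite(retrieved) == \
        ["wah", "chaawal", "khaatii", "hai"]
```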
In a similar way one can identify that the same set of three adaptation operations
will be required if the input is in past indefinite form, and the retrieved example is
one of
present continuous, present perfect, future continuous or future perfect. However,
in order to carry out the morpho-word replacement one has to confine the selection
to the set {thaa, the, thii }. It will replace the relevant morpho-word, which is one
of {hoon, hai, ho, hain} for present tense, or one of {hoongaa, honge, hogii, hoge,
hogaa, hongii } for future tense, of the retrieved Hindi example. The suffix addition
and morpho-word deletion operations are restricted to the same set as mentioned
above.
Similarly, one can identify adaptation operations when the roles of the input
and the retrieved sentence are reversed in the cases discussed above. Evidently, in
these cases one suffix deletion, one morpho-word replacement and one morpho-word
addition will be required for adapting the verb morphology variations. One can
easily figure out the relevant sets of morpho-words and suffixes keeping in view the
above discussions.
Adaptation operations for the remaining cases of this group have been identified.
However, due to the stereotyped nature of the discussion, we do not present all the
other cases in this report. One may identify the relevant suffixes and the
morpho-words by referring to Section A.2 of Appendix A.
2.4 Adaptation Procedure for Morphological Variation of Passive Verbs
The above discussion of adaptation procedures for verb morphological variation has
been limited to the active form of the verb. Similar adaptation procedures have also
been studied when the verb is in the passive form. Ideally, the passive form should
exist for all the three tenses and all the four verb forms. However, the passive forms
of verbs for present perfect continuous, past perfect continuous, future continuous,
and future perfect continuous tenses are cumbersome, and are rarely used (Ansell,
2000). We, therefore, restrict our discussion to the other eight more commonly used
forms of the passive voice only. Since adaptation may take place from an active voice
sentence to a passive one, and vice versa, we classified these adaptation procedures
into three groups: adaptation between two passive forms, from an active form to a
passive one, and from a passive form to an active one.
For each of the above mentioned three groups, we discuss a few cases in detail:
If the input sentence is in past indefinite passive verb form, and the retrieved
example is in present continuous passive verb form or past continuous passive verb
form, then a single set of adaptation operations is sufficient. These adaptation
operations are one morpho-word replacement, two morpho-word deletions and one
suffix replacement. The suffix replacement depends upon the particular Hindi verb
under consideration, and also upon the gender and number of the subject. So this
operation is required only in some examples of such cases. However, the other three
adaptation operations are mandatory. The purpose of these four operations is as
follows:
• In the past indefinite passive verb form, one of the following morpho-words
{gayaa, gayii, gaye} has to be used depending upon the number and gender
of the subject. However, if the retrieved example is in continuous passive form, it
contains one of the continuous morpho-words {rahaa, rahii, rahe}, and a tense
morpho-word that
comes from the set {hoon, hai, ho, hain} in case of present tense, or from the set
{thaa, the, thii} in case of past tense. For adaptation of
verb morphology these two morpho-words are to be deleted from the retrieved
example.
The relevant suffix replacement rules are given in Section A.2 of Appendix A.
Example 1 : The input sentence is in past indefinite passive verb form, and the
retrieved example is in present continuous passive verb form.
To adapt the retrieved translation, the morpho-words "rahii" and "hain" are to be
deleted, and the morpho-word "jaa" is to be replaced with the morpho-word "gayii".
Note that there is no change to the main verb "khaayii".
Example 2 : Here we consider the same input sentence but the retrieved example is
"The apple was being eaten by the squirrel." ∼ "gilharii ke dwaaraa seb khaayaa jaa
rahaa thaa”. Evidently, to generate the required translation “gilharii ke dwaaraa
moongphalii khaayii gayii ” all the operations given in Example 1 are to be carried
out. Further, due to the change in gender of the subject8, the suffix "yaa" is
replaced with the suffix "yii". To generate the final translation of the input sentence
one more adaptation operation is needed, the constituent word “seb” is replaced with
“moongphalii ”, but that is not a part of the set of adaptation operations mentioned
above.
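The mandatory operations of this case can be sketched as token-level edits. This is a hedged illustration; the function and variable names are ours, and the separate constituent-word and suffix replacements are assumed to have been applied beforehand for brevity.

```python
CONTINUOUS_MW = {"rahaa", "rahii", "rahe"}
TENSE_MW = {"thaa", "the", "thii", "hai", "hain", "ho", "hoon"}
GAYAA = {("m", "sg"): "gayaa", ("m", "pl"): "gaye",
         ("f", "sg"): "gayii", ("f", "pl"): "gayii"}

def to_past_indefinite_passive(tokens, gender, number):
    """Adapt a retrieved continuous-passive Hindi sentence to the past
    indefinite passive: delete the two surplus morpho-words and replace
    "jaa" with the appropriate form of "gayaa"."""
    out = []
    for tok in tokens:
        if tok in CONTINUOUS_MW or tok in TENSE_MW:
            continue                                  # two morpho-word deletions
        out.append(GAYAA[(gender, number)] if tok == "jaa" else tok)
    return out

# Example 2, after "seb" -> "moongphalii" and "khaayaa" -> "khaayii":
retrieved = "gilharii ke dwaaraa moongphalii khaayii jaa rahaa thaa".split()
print(" ".join(to_past_indefinite_passive(retrieved, "f", "sg")))
# gilharii ke dwaaraa moongphalii khaayii gayii
```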
Here too we illustrate the verb morphology adaptation with the help of a specific
case: the input sentence is in the present indefinite passive verb form, and the
retrieved sentence is in active verb form in the same tense and form. Here one can
identify that one suffix replacement, one morpho-word addition and one morpho-
word replacement (depending upon the situation) are required to carry out the verb
morphology adaptation task. The significance of these three operations is as follows:
• The suffix {taa, te, tii} in the main verb of the active retrieved sentence is
replaced with an appropriate suffix according to the rules of PCP form of verb
given in Section A.2 of Appendix A.
• A morpho-word from the set {jaataa, jaatii, jaate}, whose elements are es-
sentially declensions of the verb "jaa", is to be added after the main verb.
8 "seb" ("apple") is masculine but "moongphalii" ("nut") is feminine.
• Since the retrieved example is in present tense, it must contain one of the
morpho-words {hai, hain, ho, hoon}. Again, since the input sentence is also
in present tense its Hindi translation will also have one morpho-word from
the same set. Hence, depending upon the gender, number and person of the
respective subjects, the same morpho-word may be retained, or it may have
to be replaced with another morpho-word from the same set.
The Hindi translation of the input sentence is "yah khaanaa siitaa ke dwaaraa
banaayaa jaataa hai". Evidently, to deal with the verb morphology in the generated
translation, two adaptation operations have to be performed. The suffix “tii ” of the
main verb of the retrieved sentence is to be replaced with “yaa”, and the morpho-
word “jaataa” is to be added after the main verb.
As the input sentence is the passive form of the retrieved sentence, “ke dwaaraa”
is added before the subject “siitaa”. This is necessary to generate the appropriate
translation of the input sentence but is not a part of the set of adaptation operations
mentioned above.
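The three operations of this active-to-passive case can be sketched on the verb group alone. The names below are ours, and the PCP suffix is supplied by the caller, since the full suffix rules live in Appendix A.

```python
JAA_FORMS = {("m", "sg"): "jaataa", ("m", "pl"): "jaate",
             ("f", "sg"): "jaatii", ("f", "pl"): "jaatii"}
ACTIVE_SUFFIXES = ("taa", "tii", "te")

def passivize_present_indefinite(main_verb, aux, gender, number, pcp_suffix):
    """Adapt an active present-indefinite Hindi verb group to the passive:
    suffix replacement on the main verb, addition of a declension of "jaa",
    and (if needed) replacement of the auxiliary within {hai, hain, ho, hoon}."""
    for suf in ACTIVE_SUFFIXES:
        if main_verb.endswith(suf):
            main_verb = main_verb[: -len(suf)] + pcp_suffix   # suffix replacement
            break
    # The auxiliary is assumed to agree already; otherwise it would be
    # replaced within the same set.
    return [main_verb, JAA_FORMS[(gender, number)], aux]      # morpho-word addition

# "banaatii hai" (active) -> "banaayaa jaataa hai" (passive, masc. sg. theme)
print(" ".join(passivize_present_indefinite("banaatii", "hai", "m", "sg", "yaa")))
# banaayaa jaataa hai
```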
If the roles of the above mentioned input and the retrieved example are reversed,
one suffix replacement, one morpho-word replacement and one morpho-word deletion
will be required for adapting the verb morphology. One can easily figure out the
relevant sets of morpho-words and suffixes keeping in view the above discussions.
The adaptation rules for all other possible variations mentioned earlier have been
formulated in a similar way. However, the similar nature of the discussions prevents
us from describing all of them in detail.
2.5 Study of Adaptation Procedures for Subject/Object Functional Slot

Subject (<S>) and Object (<O>) functional slots can be sub-divided into a number
of functional tags. These tags act as pre-modifier and post-modifier of the subject
(@SUBJ) and/or object (@OBJ) functional tag. The maximum possible structure
of the <S> or <O> functional slot using different tags is:
Table 2.5 lists only those structures which are present in our example base, and
are studied in the course of present research work. Here, {} is used for showing non-
obligatory (see Table 2.2) functional tags/slots. The definitions of the functional tags
are given in detail in Appendix B. The parts of speech and their transformations under
the morpho tags for the <S> or <O> functional slots are noun (N), pronoun (PRON),
proper noun (<Proper>), adjective (A) with transformations ABS, PCP1 ("-ing"
participle form) and PCP2 ("-ed" participle form), adverb (ADV) and gerund (PCP1
form). All possible variations in the morpho tags of the functional tags under the
<S> and <O> functional slot are listed in Table 2.6.
We explain Table 2.5 and Table 2.6 with an example. Consider the sentence “This
old man is sitting in Ram’s office”. Its parsed version, obtained using the ENGCG
parser, is:
• @SUBJ N SG “man”,
• @OBJ N SG “office”
• < $. >.
Here, the tags that start with '@' are called functional tags, e.g. @DN> - determiner,
@GN> - genitive case, @AN> - pre-modifier adjective, etc. In Table 2.6 these tags
are succeeded by morpho tags, such as SG - singular, PERS - personal pronoun, etc.
In the following discussion, the adaptation rules for functional tags due to the
variation of morpho tags are given.
Adaptation Rules of @DN>
The morpho tags ART and DEM are associated with the functional tag @DN>
(see Table 2.6). The morpho tag ART is associated with the English words “the”,
"an" and "a", and DEM is associated with "this", "these", "that", etc. The word "the"
does not have any Hindi equivalent, hence it is absent in all Hindi translations.
Corresponding to the articles "a" and "an", often no Hindi word is used in the translation.
However, in some cases the word “ek ” (meaning “one”) is used depending upon
the context. For adaptation of these words no morphological changes take place.
Therefore, if "@DN> ART" is present in the parsed version of either the input
or the retrieved sentence, and it corresponds to the word "the", then no
adaptation operation will be performed. With respect to determiners (words having
the DEM morpho tag, such as this ("yah"), these ("ye"), that ("wah"), etc.), the
adaptation procedure is straightforward.
For illustration, consider the input sentence “This man is kind.”, and the retrieved
example is “The man is kind.” ∼ “aadmii dayaalu hai ”. Note that no Hindi word
exists in the retrieved Hindi sentence corresponding to the word “the”. But the input
sentence contains the determiner “this”. Therefore, its Hindi translation “yah” is
required to be added before the subject “aadmii ” in the generated translation. Hence
the translation of the input sentence is “yah aadmii dayaalu hai ”.
Adaptation Rules of @GN>
The functional tag @GN> is used for a genitive (i.e. possessive) case. Eight possible
morpho tag variations are listed in Table 2.6. These variations occur due to the
variations in gender, number and person of three different POS, which are N, PRON
and <Proper>. When the part of speech of the genitive word is N or <Proper>,
then the genitive case in Hindi is indicated with one of the case endings from the set
{kaa, ke, kii } as a morpho-word. Its usage depends upon the gender and number
of the noun following the word corresponding to the tag @GN>. When the genitive
word is a pronoun (PRON), the case endings are transformed into suffixes. The following
examples illustrate different genitive case structures in Hindi.
• “kaa” is used when the noun following it is masculine singular. For example:
• “ke” is used when the noun following it is masculine plural. For example:
• “ke” is also used when the noun following it is masculine singular with a case
ending. For example:
There are occasions when morpho changes occur to the genitive word (when it
is a noun) due to the case endings "kaa", "ke" and "kii". These rules are listed in
Appendix A. For example: "the boy's horse" ∼ "ladke kaa ghodha". Although the
Hindi for "boy" is "ladkaa", its oblique form "ladke" has been used in the above
example. This happens because of the case ending "kaa".
If the POS of the genitive word is proper noun, then too, the same case endings {kaa,
ke, kii} are used as morpho-words according to the gender and number of the noun
following it. In this case no morpho changes occur in the genitive word due to the
case ending. For example: “Parul’s home - paarul kaa ghar ”, “Ram’s book - raam kii
kitaab”, “in Ram’s home - raam ke ghar mein”.
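The case-ending rules above can be summarized in a small helper. This is a sketch: the oblique-form rule below covers only nouns ending in "-aa", while the complete rules are in Appendix A.

```python
def genitive_case_ending(gender, number, has_case_ending=False):
    """Choose kaa/ke/kii from the gender and number of the noun that
    follows the genitive word; "ke" is also used for a masculine
    singular noun carrying a case ending."""
    if gender == "f":
        return "kii"
    if number == "pl" or has_case_ending:
        return "ke"
    return "kaa"

def oblique_form(noun):
    """Morpho change to a noun genitive word before kaa/ke/kii,
    e.g. "ladkaa" -> "ladke" (proper nouns remain unchanged)."""
    return noun[:-2] + "e" if noun.endswith("aa") else noun

print(oblique_form("ladkaa"), genitive_case_ending("m", "sg"), "ghodha")
# ladke kaa ghodha
```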
As mentioned above, when the POS of the genitive word is pronoun, the case
ending is attached to it in the form of a suffix. In case of a first or second person
pronoun the suffix comes from the set {aa, e, ii}. However, in case of a third person
pronoun the entire morpho-word is used as a suffix. The following examples illustrate
the genitive case with respect to pronoun.
Once these structures are known, adaptation rules for different variations of
genitive case may be formulated by referring to Table 2.6. Table 2.8 has been
designed to indicate the adaptation procedures for the different genitive cases. The headers
of the rows and columns of this table correspond to three POS: <Proper>, N and
PRON.
We explain these adaptation rules with the help of the following example. Sup-
pose the input sentence is “The boy’s uniform is new.”, and the retrieved example
is “Parul’s toy is new.” ∼ “paarul kaa khiloonaa nayaa hai ”. The translation of the
input sentence is “ladke kii wardii naii hai ”. In order to generate this translation
from the retrieved example the following adaptation operations need to be carried
out.
The word "boy" corresponds to the genitive case in the input sentence, and its part
of speech is noun (N), while in the retrieved sentence the part of speech of "paarul" is
proper noun (<Proper>). Hence, according to cell (1, 2), i.e. (<Proper>, N), the
set of adaptation operations is WR + {MR} + {SR or SA}. This indicates that one
word replacement is mandatory and other two operations are carried out depending
upon the particular example under consideration.
Here, the nouns that follow the genitive cases are “uniform” and “toy”, respec-
tively. Their Hindi translations are “wardii ” (which is feminine and singular) and
“khiloonaa” (which is masculine and singular), respectively. The possessive case
ending, therefore, will not be the same. One morpho-word replacement is needed
to adapt genitive case ending: the morpho-word “kaa” is to be replaced with “kii ”.
Therefore, the optional morpho-word replacement is required in this example. The
genitive word "paarul" in the retrieved Hindi sentence is to be replaced with "ladkaa".
Further, a suffix replacement is necessary in this genitive word, viz. "ladkaa" becomes
"ladke".
Thus, in this example all the three adaptation operations are needed to adapt
the genitive case. In some situations all these operations may not be required. For
illustration, to adapt the genitive case "Parul's uniform" ∼ "paarul kii wardii" to
"the boy's uniform" ∼ "ladke kii wardii", the morpho-word replacement is not
required, since the case ending "kii" remains unchanged.
Adaptation Rules of @QN
The functional tag @QN is a quantifier tag. It is of two types: numeral (NUM)
(e.g. "two") and non-numeral quantifier (e.g. "some"). As far as the Hindi
translation is concerned, these seven variations do not play any role in it. Therefore,
no suffix operations or morpho-word operations are relevant in this case. Only a
single word operation, i.e. deletion/addition/replacement/copy, is required depending
upon the tags in the input and the retrieved sentences.
For illustration, to adapt the translation of the retrieved example: “Two men
are coming here.” ∼ “do aadmii yahaan aa rahe hain” to generate the translation of
the input sentence “Some men are coming here.”, the adaptation procedure should
replace “kuchh” (i.e. “some”) with “do” (i.e. “two”). The Hindi translation of the
input sentence is, therefore, “kuchh aadmii yahaan aa rahe hain”.
Adjectives fall into two classes, viz., uninflected and inflected (Kellogg and Bailey,
1965). Uninflected adjectives, as the term implies, remain unchanged before all
nouns and under all circumstances. English adjectives are necessarily uninflected:
they undergo no morphological changes with the variation in the nouns they qualify.
But Hindi adjectives may fall under both the categories. For example, "achchhaa"
("good") is an inflected adjective, while "iimaandaar" ("honest") is uninflected. For
illustration:
Adjectives are of two types: basic adjectives, and participle forms, i.e. those that
are derived from verbs (Kachru, 1980). The inflection rules of these two types are
discussed below.
Basic adjectives: These are words which are adjectives in their own right, such as
"sundar ∼ beautiful" and "achchhaa ∼ good". The ENGCG parser denotes them as "ABS".
The rules of inflection for these adjectives are as follows.
1. If an adjective in Hindi ends with "aa", then it changes into "e" for plural.
For example, "buraa ladkaa" (bad boy) and "bure ladke" (bad boys).
2. An adjective ending with "aa" changes into "ii" for feminine, e.g. "burii
ladkii" (bad girl) and "burii ladkiyaan" (bad girls).
3. If an adjective in Hindi ends with any other vowel, it does not change in any
case.
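The three inflection rules can be captured directly in a small function. This is a sketch; the gender/number labels are our own encoding.

```python
def inflect_basic_adjective(adj, gender, number):
    """Inflect a basic (ABS) Hindi adjective by the rules above:
    "-aa" -> "-e" for masculine plural, "-aa" -> "-ii" for feminine;
    adjectives not ending in "aa" remain unchanged."""
    if adj.endswith("aa"):
        if gender == "f":
            return adj[:-2] + "ii"
        if number == "pl":
            return adj[:-2] + "e"
    return adj

print(inflect_basic_adjective("buraa", "m", "pl"))       # bure
print(inflect_basic_adjective("buraa", "f", "sg"))       # burii
print(inflect_basic_adjective("iimaandaar", "f", "pl"))  # iimaandaar
```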
In Hindi the following rules govern the structures of these adjective forms.
1. In order to attain the A(PCP1) form of adjective, a suffix from the set {taa, te,
tii} is added to the root form of the verb. But in case of the past participle form
A(PCP2), an appropriate suffix is attached according to the rules of the PCP2
form given in Section A.2 of Appendix A.
2. Further, in most cases a morpho-word (from the set {huaa, huye, huii}) also
needs to be added after the modified verb.
3. Participle forms of adjectives are also inflected according to the gender and
number of the noun they qualify.
The pre-modifier adjective tag @AN>, therefore, has three possible morpho tag
variations (see Table 2.6): “@AN> A ABS”, “@AN> A PCP1”, and “@AN> A
PCP2”. Adaptation rules for adjectives have been formulated keeping in view all
the morpho-transformations discussed above. The following Table 2.10 presents
these rules.
The following examples illustrate the usage of the rule Table 2.10. Suppose
the input sentence is "Faded flower does not look good.". We consider two different
retrieved examples to describe the adaptation procedure.
The adaptation operations for this example should follow the cell(1, 3), i.e. (ABS,
A(PCP2)) of the above table as the pre-modifier adjective of the subject is of the
form “ABS” in the retrieved sentence, and of the form “A(PCP2)” in the input
sentence.
The pre-modifier adjective "sundar" is replaced with "murjhaa", and the suffix "yaa"
is added to that word. The morpho-word "huaa" will be added after this modified
verb in the retrieved Hindi sentence as the subject (“phool ”) is singular masculine.
Hence three adaptation operations (viz. one constituent word replacement, one
morpho-word addition and one suffix addition) are required for carrying out the
adaptation task. However, there may be situations when the suffix replacement may
have to be carried out in place of suffix addition.
As "not" is present in the input sentence, its Hindi translation "nahiin" is
added to the retrieved Hindi sentence to generate the appropriate translation of the
input sentence. But this modification is not a part of the adaptation operations listed
in Table 2.10. Hence, the Hindi translation of the input sentence is “murjhaayaa
huaa phool acchaa nahiin dikhtaa hai ”.
In this example only two suffix replacements are required, i.e. the first set of
operations. The suffix "te" is replaced with "yaa" in the pre-modifying adjective
“murjhaate”, and the suffix “ye” is replaced with “aa” in the morpho-word “huye”.
In some situations, the second suffix operation may not be needed if the gender and
person of the qualified nouns are same in both the input and the retrieved exam-
ple. If the input and the retrieved example have different verbs in the participle
form then, accordingly, one word replacement operation has to be invoked. This
operation will take care of the variation in the participle verb. One can realize this
if “blooming flowers” ∼ “khilte huye phool ” has to be adapted to translate “fading
flowers” (“murjhaate huye phool ”).
Note that the present discussion is limited to the adaptation procedure for the pre-
modifier adjective. Therefore, in order to generate the Hindi translation of the input
sentence, it has been assumed that the other modifications in the sentence have
already been incorporated; the Hindi translation is "murjhaayaa huaa phool
achchhaa nahiin dikhtaa hai".
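The participle-adjective structure discussed above can be sketched as follows. The PCP2 suffixes shown assume a vowel-ending root such as "murjhaa"; the complete rules are in Appendix A, and the encoding of gender/number is our own.

```python
PCP1_SUF = {("m", "sg"): "taa", ("m", "pl"): "te",
            ("f", "sg"): "tii", ("f", "pl"): "tii"}
PCP2_SUF = {("m", "sg"): "yaa", ("m", "pl"): "ye",
            ("f", "sg"): "yii", ("f", "pl"): "yii"}
HUAA = {("m", "sg"): "huaa", ("m", "pl"): "huye",
        ("f", "sg"): "huii", ("f", "pl"): "huii"}

def participle_adjective(root, form, gender, number):
    """Form a Hindi participle adjective: verb root + agreeing suffix,
    followed by the agreeing morpho-word huaa/huye/huii."""
    suffix = (PCP1_SUF if form == "PCP1" else PCP2_SUF)[(gender, number)]
    return root + suffix + " " + HUAA[(gender, number)]

print(participle_adjective("murjhaa", "PCP2", "m", "sg"))  # murjhaayaa huaa
print(participle_adjective("murjhaa", "PCP1", "m", "pl"))  # murjhaate huye
print(participle_adjective("khil", "PCP1", "m", "pl"))     # khilte huye
```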
The discussion above has dealt with the pre-modifier adjective form. The adaptation
rule table developed therein corresponds to nouns belonging to subject and object
functional slots, for which the adjective works as an attributive one. The same rule
Table 2.10 works for an attributive adjective corresponding to a noun belonging to
any functional slot/tag other than subject and object. Another usage of the adjective
may be noticed, in both English and Hindi: the predicative one. In Hindi, a
predicative adjective (subjective complement) agrees with its subject in number and
gender. For example, "He is good ∼ wah achchhaa hai" and "She is good ∼ wah
achchhii hai". The rules given in Table 2.10 work for predicative adjectives as well.
Adaptation Rules of @SUBJ
The subject tag “@SUBJ” is the main and obligatory tag under the subject slot.
As listed in Table 2.6, nine possible morpho tag variations have been observed for
the subject functional tag. Within these nine possible morpho tags, there are in total
four parts of speech: noun (N), proper noun (<Proper>), pronoun (PRON) and
gerund (PCP1). The variations in these parts of speech may occur due to either a
case ending or number. In this respect the following may be observed.
• The only case ending that may occur with respect to subject is “ne”. If the
POS of the subject is noun or pronoun then morphological changes may occur
due to this case ending. For example,
"ladkaa + ne" - "ladke ne" (boy); "bachchaa + ne" - "bachche ne" (child)
More details of this case ending are given in Appendix A. It may be noted that
no morphological changes occur to the subject due to this case ending if the
POS of the subject is proper noun or PCP1.
Singular Plural
boy - ladkaa boys - ladke
house - ghar houses - ghar (No change)
cloth - kapadaa clothes - kapade
girl - ladkii girls - ladkiyaan
• In the PCP1 form, the suffix "naa" is always added to the root form of the verb.
The rule Table 2.11 presents the relevant adaptation operations for different
variations in the subject discussed above. The following examples illustrate some of
these rules.
Example 1 : Suppose the input sentence is “The boy is playing.”, and the retrieved
example is “Boys are playing.” ∼ “ladke khel rahe hain”. Since the subject of the
input sentence is "boy", to generate its Hindi translation "ladkaa", only the suffix "e"
of the subject "ladke" in the retrieved Hindi sentence is replaced with "aa". This
is because the root word of the subject in both the input and the retrieved sentence
are same, that is "boy". However, if the root words of the subjects differ, then a
word replacement is also required. For example, if the subject of the input sentence
were "sisters" (i.e. a plural form), then a word replacement ("ladkaa" → "bahan")
followed by a suffix addition ("bahan" → "bahanen") will be required.
Therefore, cell (1, 1), i.e. (N, N), corresponds to the above discussed operation set,
i.e. CP or ({WR} + {SR or SA or SD}).
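Cell (1, 1), i.e. (N, N), can be sketched as follows, representing each Hindi subject as a (root, suffix) pair. This is an illustration with our own encoding, not the thesis's implementation.

```python
def subject_noun_ops(ret_root, ret_suffix, inp_root, inp_suffix):
    """Return the operations of cell (1, 1): CP, or {WR} followed by
    {SR or SA or SD} as needed."""
    ops = []
    if ret_root != inp_root:
        ops.append("WR")      # replace with the input root form, e.g. "bahan"
        ret_suffix = ""       # the freshly replaced word carries no suffix yet
    if ret_suffix != inp_suffix:
        if ret_suffix and inp_suffix:
            ops.append("SR")  # e.g. "ladke" -> "ladkaa"
        elif inp_suffix:
            ops.append("SA")  # e.g. "bahan" -> "bahanen"
        else:
            ops.append("SD")
    return ops or ["CP"]

print(subject_noun_ops("ladk", "e", "ladk", "aa"))   # ['SR']
print(subject_noun_ops("ladk", "e", "bahan", "en"))  # ['WR', 'SA']
```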
Example 2 : Consider the input sentence “He is a good man.”. Let the correspond-
ing retrieved example be “Walking is a good exercise.” ∼ “sair karnaa ek achchhaa
vyaayaam hai ”. The subject of the input sentence is “he”, and its POS is pronoun
(PRON), while the subject of the retrieved example is “walking”, and its POS is “-
ing" verb form, i.e. gerund (PCP1). In this case, the adaptation operation mentioned
in cell (4, 3), i.e. (PCP1, PRON), is required for making the changes in the retrieved
translation. The adaptation operation here is a word replacement: the word
"sair karnaa" in the retrieved Hindi translation is replaced with "wah", the Hindi of "he".
For the functional tags @OBJ and @>P, the same adaptation rule table can be
used, because the morpho variations for these functional tags are the same as those
for @SUBJ, as given in Table 2.6.
For the last two functional tags (@<NOM-OF, @<NOM) there is only one pos-
sible morpho tag variation where the POS is preposition in both the cases. The
functional tag @<NOM-OF corresponds to the preposition "of", and its translation
in Hindi is either "kaa", "ke" or "kii", based on the gender, number and person of
the word corresponding to the @<P tag. In case of @<NOM there is no particular
postposition as in @<NOM-OF; a fixed Hindi translation cannot be specified, and
the translation takes place according to the preposition in the input sentence.
2.6 Adaptation of Interrogative Words

This section discusses sentences that start with interrogative words, which are of
two types: interrogative pronoun (such as, who, what, whom, which, whose) and
interrogative adverbs (such as, when, where, how, why). This study has been done on
a selected set of representative sentences from the example base. This study focuses
on finding the usages of different interrogative words and corresponding translation
patterns. The major findings of this study are as follows.
• The same interrogative word may have different Hindi translations in different
contexts, and consequently, the structures of the corresponding Hindi transla-
tions may also vary.
• Different interrogative words may lead to the same Hindi translation structure.
The above findings are important from the EBMT point of view because commonality
of the interrogative words may not lead to the most useful retrieval. In order to
retrieve the most similar translation example, one may have to look into sentences
involving some other interrogative words. Table 2.12 shows the examples and their
patterns. The interrogative sentence patterns are denoted as INi, i = 1, 2, ..., 26. These
examples have been taken from the example base. The patterns of the sentences are
decided from the parsed versions of various examples given by the ENGCG parser.
The functional tags assigned by the parser to the interrogative words of the above
sentences are given in Table 2.13.
Note that, Table 2.12 by no means provides an exhaustive list of English sentence
patterns involving interrogative words. However, these are the sentence patterns that
have been observed in our example base.
Variation in translation of the word "who": Table 2.13 shows four different functional tags
for this word. The observed translation patterns for these are as follows:
2. @SUBJ: When "who" is used as a subject, as in patterns IN1 and IN2, its trans-
lation into Hindi may have two possibilities, depending upon the tense of the
sentence in case of the IN2 sentence pattern. If the tense and verb form are present
perfect, past indefinite or past perfect, the translation of "who" in Hindi is
"kisne". In all other tenses and verb forms "who" is translated as "kaun".
3. @OBJ: If the functional tag assigned to "who" is @OBJ (as in pattern IN3),
then its Hindi translation is "kisko".
4. @<P: This tag implies that here "who" is used as a complement of a preposition.
Variation in translation of the word "what":
1. In translation patterns IN5 and IN6 the interrogative word "what" is used as
subject (@SUB) and object (@OBJ), respectively. In both the cases “what” is
translated as “kyaa”.
2. In case of sentence patterns IN7 and IN8 the word “what” is used as a de-
terminer and its functional tag is @DN>. However, due to the variations in
overall sentence patterns in these two cases the different translations for the
word "what" have been observed. In both the cases, the Hindi translation is
of the form "kaun" followed by one of {saa, se, sii}, depending upon the number
and gender of the noun following the word "what". However, in both the cases
of IN7 and IN8 one more translation of "what" has been observed, i.e. "kis" or
“kin” according to the number of the noun following the word “what”. Fur-
ther, the morpho-word “ki ” is added after the noun in case of IN7 sentence
pattern. While in case of IN8 , the morpho-word “ko” is added after the noun.
Variation in translation of the word “which”: As shown in Table 2.12, five different
sentence patterns, viz. IN9 to IN13 , have been observed corresponding to the word
“which”. In all these cases, although the functional tag for the word “which” varies,
its translation to Hindi is done in the same way: using the word "kaun" followed by
one of the morpho-words from the set {saa, sii, se}, depending upon the number and
gender of the noun following the word “which”.
However, in both the cases IN12 and IN13, one more translation of "which" has
been observed, i.e. "kis" or "kin" according to the number of the noun following the
word “which”. Further, the morpho-word “ko” is added after the noun.
Variation in translation of the word “whose”: Although four different sentence pat-
terns (i.e. IN14 , IN15 , IN16 and IN17 ) have been observed for English sentences
involving the word “whose”, in all of them the functional tag of this word has been
found to be @GN>. Consequently, its translation into Hindi is also found to be the
same, i.e. one from the set {kiskaa, kiske, kiskii, kinkii, kinkaa, kinke}. The actual
usage depends upon the gender and number of the noun following the word “whose”.
Variation in translation of the word “whom”: Two possibilities have been observed
in this case:
1. @<P: Under this functional tag the word “whom” is used as a complement of
the preposition, as in sentence pattern IN18 . In this case the Hindi translation
of this word is “kis”.
Variation in translation of interrogative adverbs: Under this case, four words
have been studied: “why”, “where”, “when” and “how”. Their Hindi translations
are as follows:
• Irrespective of the sentence patterns (i.e. IN20 , IN21 , IN22 , IN23 and IN24 ) the
first three of the above four interrogative adverbs have unique translations in
Hindi. The Hindi translation of “why” is “kyon”, that of “where” is “kahaan”,
while “when” is translated as “kab”.
• In both the sentence patterns IN25 and IN26 , the translation of the word “how”
into Hindi is one from the set {kaisaa, kaisii, kaise}. This variation in the
translation is governed by the gender and number of the subject of the under-
lying sentence pattern.
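The translations of the interrogative adverbs summarized above can be sketched as follows; the encoding of gender and number is our own.

```python
# "why", "where" and "when" have unique Hindi translations.
INTERROGATIVE_ADVERBS = {"why": "kyon", "where": "kahaan", "when": "kab"}

def translate_how(subject_gender, subject_number):
    """"how" agrees with the subject in gender and number:
    one of {kaisaa, kaisii, kaise}."""
    if subject_gender == "f":
        return "kaisii"
    return "kaise" if subject_number == "pl" else "kaisaa"

print(INTERROGATIVE_ADVERBS["why"], translate_how("m", "pl"))  # kyon kaise
```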
The above study of interrogative words suggests that sentences having different
interrogative words may have the same translation patterns. The following examples
illustrate this point.
Suppose one has to generate the translation of the English sentence "How are you
going today?”. Its Hindi translation is “tum aaj kaise jaa rahe ho? ”. Obviously, this
translation can be generated easily if one of the above three examples is considered
as a retrieved sentence.
Based on the above observations we cluster the above sentence patterns into
groups. Adaptation within a group is easy, and typically can be done using simple
operations. Table 2.14 shows the operations required for the adaptation within the
group G5. However, the adaptation between two different groups may not be so
simple, because the remaining part of the sentences also needs to be taken into
consideration, and therefore more structural transformation of the retrieved examples
will be needed.
2.7 Adaptation Rules for Variation in Kind of Sentences
Ram does not eat rice. ∼ ram chaawal nahiin khaataa hai
Does Ram not eat rice? ∼ kyaa ram chaawal nahiin khaataa hai?
One may notice that in Hindi the negative and interrogative structures are ob-
tained by addition of the words “nahiin” and “kyaa”, respectively. Also note that
the position of “kyaa” is always at the beginning of the sentence - hence its ad-
dition or deletion needs no traversing through the sentence. Typically, “nahiin”
occurs before the main verb of the Hindi sentence. However, since Hindi is a relatively
free word order language, it may occur at some other position also. Adaptation
operations are,
therefore, as follows:
Table 2.15 gives the required operations for all types of variation in the kind of
sentences. The expressions are obtained by deciding upon which of the words are
being added and/or deleted for the adaptation.
Ret'd ↓ \ Input →   AFF       NEG       INT       NINT
AFF                 CP        WA        MA        WA + MA
NEG                 WD        CP        MA + WD   MA
INT                 MD        WA + MD   CP        WA
NINT                WD + MD   MD        WD        CP
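Table 2.15 and the placement rules for "kyaa" and "nahiin" can be sketched as a lookup plus a small applier. This is a hedged illustration: in real sentences the position of "nahiin" may need deeper analysis, since Hindi has relatively free word order.

```python
KINDS = ["AFF", "NEG", "INT", "NINT"]
# Rows: kind of the retrieved sentence; columns: kind of the input (Table 2.15).
TABLE = [
    ["CP",      "WA",      "MA",      "WA + MA"],
    ["WD",      "CP",      "MA + WD", "MA"],
    ["MD",      "WA + MD", "CP",      "WA"],
    ["WD + MD", "MD",      "WD",      "CP"],
]

def kind_adaptation_ops(retrieved_kind, input_kind):
    return TABLE[KINDS.index(retrieved_kind)][KINDS.index(input_kind)]

def adapt_kind(tokens, retrieved_kind, input_kind, verb_index):
    """Apply the additions/deletions of "nahiin" (word operations) and
    "kyaa" (morpho-word operations). `verb_index` is the position of the
    main verb after any deletions."""
    ops = kind_adaptation_ops(retrieved_kind, input_kind)
    out = [t for t in tokens if not
           (("WD" in ops and t == "nahiin") or ("MD" in ops and t == "kyaa"))]
    if "WA" in ops:
        out.insert(verb_index, "nahiin")   # typically just before the main verb
    if "MA" in ops:
        out.insert(0, "kyaa")              # always sentence-initial
    return out

neg = "ram chaawal nahiin khaataa hai".split()
print(" ".join(adapt_kind(neg, "NEG", "INT", 3)))
# kyaa ram chaawal khaataa hai
```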
2.8 Concluding Remarks

In this chapter we have described the different adaptation operations that may be
used for generating the translation of an input sentence from a retrieved example,
looking into the process of adaptation itself. The adaptation opera-
tions described in this chapter are to be used in succession in order to generate the
required translation. The overall adaptation scheme will first have to look into the
discrepancies in the input sentence and the retrieved example. Discrepancies may
occur in different functional slots of the sentences, and also in the kind of sentences.
In this chapter we have considered variations in the different tense and verb forms,
both active and passive, variations in the subject/object functional slot, and variations
in the wh-family words (e.g. "what", "who", "where", "when") and their sentence
patterns. We have also worked on modal verbs (e.g. "should", "might", "can",
"could", "may") and their respective sentence patterns. However, due to the similar
nature of the discussion, we do not elaborate on them in this report.
Of the different sentence kinds, we have discussed four (viz. affirmative, negative,
interrogative and negative interrogative) in this chapter. Evidently one may find
many other kinds of sentences (e.g. Imperative, Exclamatory). We have not dealt
2.8. Concluding Remarks
with them in this work, however, we feel that they can be treated in similar fashion.
With respect to each of the variations we have identified the minimum number
of operations that are required for the overall adaptation of the retrieved example.
We presented these required operations in the form of various tables. The advantage of these tables is that they can be used as yardsticks for measuring the total adaptation effort required for a retrieved example.
The above-mentioned scheme of adaptation works well under the implicit as-
sumption that translations of similar source language sentences are similar in the
target language as well. However, in reality one may find examples where the above assumption does not hold. For example, consider the two English sentences
“It is running.” and “It is raining.”. Although these two sentences are structurally
very similar, their Hindi translations are structurally very different. The first sen-
tence is translated as “wah (it) bhaag (run) rahaa (..ing) hai (is)”. But the second
one is translated as “baarish (rain) ho (be) rahii (..ing) hai (is)”. Hence in order
to translate the first sentence if the second one is retrieved from the example base,
then the translation generated through the above-mentioned adaptation procedure
will not be able to produce the correct translation of the said input. Such instances
are primarily due to some inherent characteristics of the source and target languages; this phenomenon is termed translation “divergence” (Dorr, 1993). The existence of translation divergences makes the straightforward transfer from source structures into target structures difficult. Adaptation therefore calls for a careful study of translation divergences.
Chapter 3
3.1 Introduction
For illustration, consider the following English sentences and their Hindi trans-
lations:
(A) : She is in a shock. ∼ wah sadme mein hai
(she) (shock) (in) (is)
syntax. However, in example (C) one may notice huge structural variation. Here,
the sense of the prepositional phrase “in panic” is realized by the verb “ghabraa rahii
hai ” (“is panicking”). Hence this is an instance of a translation divergence.
Assuming that the English sentence in (A) is given as the input to an English to
Hindi EBMT system, two scenarios may be considered:
1. The retrieved example is (B), i.e. “She is in trouble”. In this case, the correct Hindi translation may be generated in a straightforward way by using word substitution.
3.1. Introduction
Thus the output of the system will depend entirely on the sentence ((B) or (C))
which will be retrieved to generate the translation of the input (A). Given a very
similar structure of the three sentences, the retrieval may eventually depend on the
semantic similarity of the prepositional phrase (PP) of the input with the PPs of the
stored examples. With respect to the above illustration, this implies that similarity
between the sentences may be measured by the semantic similarity between “shock”
and “trouble” in case (1), and the semantic similarity between “shock” and “panic”
in case (2). Table 3.1 gives this similarity value under different schemes given in
WordNet::Similarity web interface (http://www.d.umn.edu/~mich0212/cgi-bin/similarity/similarity.cgi), considering the words as nouns, and taking their sense
An FT and SPAC Based Divergence Identification Technique
The above values show that under all the above measures “panic” is more similar
to “shock”. But we observe from the translation point of view that example (B)
proves to be more useful in producing the appropriate translation. This happens
because of the presence of divergence in the translation of example (C).
system. Such an algorithm may be used in partitioning the example base into
different classes: divergence and normal. This, in turn, helps in efficient retrieval of
past examples, which enhances the performance of an EBMT system. The present work aims at designing algorithms for the identification of divergence examples in a given example base.
This chapter is organized as follows. Section 3.2 discusses some related past
work on divergence and its identification. Section 3.3 presents a detailed study of
divergence categories for English to Hindi translation along with their identification
algorithms.
Various approaches have been pursued in dealing with translation divergence. These include the following:
1. Transfer approach. Here transfer rules are used for transforming a source
language (SL) sentence into target language (TL) by performing lexical and
structural manipulations. These rules may be formed in several ways: by
3.2. Divergence and Its Identification: Some Relevant Past Work
manual encoding (Han et al., 2000), by analysis of parsed aligned bilingual corpora (Watanabe et al., 2000), etc.
(Habash, 2003), a system for translation between Spanish and English uses
this approach.
4. Universal Networking Language (UNL) based approach. UNL has been developed to play the role of an Interlingua to access, transfer and process information
on the internet (Uchida and Zhu, 1998). In UNL, sentences are represented
using hypergraphs with concepts as nodes and relations as directed arcs. A
dictionary of concepts (termed Universal Word or UW) is maintained. A divergence is said to occur if the UNL expressions generated by the source and target language analyzers differ in structure. This approach has been proposed for English to Hindi machine translation in (Dave et al., 2002).
Each of the above schemes, however, has its own shortcomings when applied in
English to Hindi context. For example, the Generation-Heavy approach requires rich resources for the target language. Creation of such resources requires a significant amount of effort, and they are not currently available for Hindi. The Interlingua
approach requires deep semantic analysis of the sentences, but it has been observed
elsewhere that an MT system can work even without such semantic details (Dorr et al., 1998). Similarly, creating an exhaustive set of rules to capture all the lexical and structural variations that may be witnessed in English to Hindi translation is a formidable task. Even in the case of the UNL-based approach, each UW of the dictionary
contains deep syntactic, semantic and morphological knowledge about the word.
Creation of such a dictionary even for a restricted domain is difficult, and needs
deep semantic analysis of each word.
With respect to Hindi the major problem in applying the above techniques is
that such linguistic resources are not freely available. As a consequence, applica-
tion of the above techniques in English-Hindi context is severely constrained, at
least presently, due to scarcity of linguistic resources for Hindi. Although Hindi
is one of the major languages of the present world, research in NLP on Hindi
(and other Indian languages too) is still in its infancy. Even though research in
NLP involving Indian languages has been enthusiastically pursued, and is spon-
sored by the government and several educational institutes over the last few years
(http://tdil.mit.gov.in/tdilsept2001.pdf), it will take some time before various linguistic resources are easily available. This motivates us to develop a simpler algorithm that requires as few linguistic resources as possible. The usefulness of such
techniques will be twofold:
2. The methods can be used for other languages too where linguistic resources
are scarce.
The proposed approach uses only the functional tags (FT) and the syntactic phrasal annotated chunk (SPAC) structures of the source language (SL) sentence and the target language (TL) sentence. Thus the proposed approach aims at designing an algorithm that uses as few linguistic resources as possible.
The most fundamental work before developing any such algorithm is to determine
the different types of divergence that may be found in English to Hindi translation.
Since divergence is a language-dependent phenomenon, it is not expected that the same set of divergences will occur across all languages. In this respect one may refer to (Dorr, 1993), which provides the most detailed categorization of lexical-semantic divergences for translation among the European languages. There, divergence has
been put in seven broad types: structural, conflational, categorial, promotional, de-
motional, thematic and lexical. Section 3.3 discusses these divergence types in detail.
In more recent work, (Dorr et al., 2002) and (Habash and Dorr, 2002), the divergence categories have been redefined in the following way. Under the new scheme six
different types of divergence have been considered: light verb construction, manner
conflation, head swapping, thematic, categorial, and structural. The differences in
the two categorizations may be summarized as follows:
1. A light verb construction involves a single verb in one language being translated
using a combination of a semantically “light” verb and another meaning unit
(a noun, generally) to convey the appropriate meaning. In English to Hindi
(and perhaps in many other Indian languages) context such constructions are very common. Hence this is not considered a divergence for English to Hindi
translation. Later, this point will be discussed in detail under the conflational
divergence.
3. Lexical divergence, which is a mixture of more than one divergence, has not
been considered.
4. All other divergence categories remain as they are under the new scheme.
obtained from different bilingual sources (such as storybooks, translation books, recipe books). This analysis suggests that English to Hindi translation divergence is in many cases somewhat different in its characteristics, and therefore needs to be redefined. In the following subsections we describe the various types of divergence
that may be found in the context of English to Hindi translation, and their sub-
types. We also discuss the algorithm to identify each type of divergence, and its
characteristics in more detail.
It may be noted that Dave et al. (2002) also studied English to Hindi divergence
in detail. However, they have restricted their discussions to the above-mentioned
seven categories only. Our studies of English to Hindi translation divergences reveal
the following:
1. Not all of the above-mentioned seven categories apply with respect to English to Hindi translation.
2. Instances of thematic and promotional divergence have not been found in English to Hindi translation.
3. Structural divergence, in the English to Hindi context, occurs in the same way
as in European languages.
4. Some variations from the definitions given in (Dorr, 1993) may be noticed in
the occurrence of categorial, conflational, demotional divergences.
5. Three new types of divergence may be found with respect to English to Hindi
translation. These are named as nominal, pronominal and possessional.
6. Most of the divergence types may be further subdivided into several sub-types.
In Section 3.3 we discuss all the relevant divergence types and their sub-types
that we have observed in English to Hindi translation, and provide algorithms for
their identification. As mentioned earlier, the identification technique uses functional
tags (FT) and syntactic phrase annotated chunk (SPAC) of both the source language
sentence and its translation. For each divergence type we identify the FTs that are
instrumental in causing the divergence. Each divergence type is defined on the basis
of which FTs of the English sentence it is concerned with, and also to which FTs it
is mapped upon translation.
The proposed algorithm requires the following FTs and SPAC of both the lan-
guages:
• FTs: subject (S), object (O), verb (V), subjective complement (SC), adjectival complement by preposition (SC C), subjective predicative adjunct (PA), verb complement (VC) and adjunct (A).
• POS tags: noun (N), adjective (Adj), verb (V), auxiliary verb (AuxV), preposition (P), adverb (Adv), determiner (DT), personal pronoun (PRP), possessive case of personal pronoun (PRP$) and cardinal number (CD).
• Phrases: N, Adj, V, Adv and P are called the “lexical heads” of the phrases. For each category a suffix “P” is used to denote a phrase (e.g. NP, VP, PP).
In Appendix B and Appendix C, definitions of these FTs and SPACs are discussed in detail. With this background we proceed to define the divergence types/sub-types and their identification scheme.
3.3. Divergences and Their Identification in English to Hindi Translation
We order the different divergence types on the basis of the FTs of the source language
sentence with which they are concerned. Accordingly, we observe the following:
lation.
may be identified by studying how the main verb of the English sentence is
realized upon translation.
In the following subsections we provide the different divergence types and their
identification schemes. In the description of all the algorithms the following conven-
tion for representation will be followed:
a) The input to the algorithms will be an English sentence and its Hindi translation. These two will be denoted as E and H, respectively.
A structural divergence is said to have occurred if the object of the English sentence
is realized as a noun phrase (NP) but upon translation in Hindi is realized as a
prepositional phrase (PP). The following examples illustrate this. One may note
that different Hindi prepositions (e.g. se, par, ko, kaa) have been used in different
contexts leading to structural divergence.
Analysis of various translation examples reveals the following points with respect
to structural divergence, which we use to design the algorithm for identification of
structural divergence:
• If the main verb of an English sentence is a declension of the “be” verb, then structural divergence cannot occur.
• Structural divergence concerns the objects of both the English sentence and its Hindi translation. Therefore, if either of the two sentences has no object, then structural divergence cannot occur.
• If both the sentences have objects, and their SPAC structures are the same, then structural divergence does not occur either.
• In other situations structural divergence may occur only if the SPAC of the
object of the English sentence is an NP, and the SPAC of the object of the
Hindi sentence is a PP.
For structural divergence, as discussed above, there is only one possible sub-type. Thus, depending upon the case, the algorithm given in Figure 3.1 returns either 0 or 1.
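The observations above can be turned directly into a small checking routine. The sketch below is illustrative only: it assumes each sentence is supplied as a dictionary of functional tags, with the object entry holding a pair (head words, SPAC category); these keys and their layout are our assumption, not the thesis’ exact data structures.

```python
def structural_divergence(E, H):
    """Return 1 if the E/H pair shows structural divergence, else 0.

    E and H are assumed FT dictionaries, e.g.
    E = {"main_verb_root": "love", "O": ("Steffi", "NP")}.
    """
    # A "be" main verb in English rules the divergence out.
    if E.get("main_verb_root") == "be":
        return 0
    # Both sentences must have objects.
    if "O" not in E or "O" not in H:
        return 0
    e_spac, h_spac = E["O"][1], H["O"][1]
    # Identical SPAC structures mean no divergence.
    if e_spac == h_spac:
        return 0
    # Divergence: an English NP object realized as a Hindi PP.
    return 1 if (e_spac == "NP" and h_spac == "PP") else 0
```

With a pair in which the English object “Steffi” is an NP and the Hindi object “steffi se” a PP, as in the illustration that follows, the routine returns 1.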
Illustration
The SPACs of these two sentences and their correspondences are given in Fig-
ure 3.2. Here bold arrows represent correspondence, and dotted lines indicate no
correspondence. Note that, the objects of E and H are not null; in E the object
is “Steffi”, whereas in H the object is “steffi se”. But their SPACs are [NP [Steffi /
N]] and [PP [NP [ steffi / N]] [ se / P]], respectively, which are not equal. Therefore,
structural divergence is identified.
cerned with adjectival SCs which upon translation map into noun, verb or PP. This
subtle difference allows us to redefine categorial divergence in English to Hindi con-
text. In particular, depending upon the nature of the SC or PA , four sub-types
have been identified.
The definitions and characteristics of the four sub-types are given below.
The adjective of the English sentence “afraid” is realized in Hindi by the verb
“darnaa” meaning “to fear”, and “dartaa hai ” is its conjugation for present
indefinite tense, when the subject is third person, singular and masculine.
Here the focus is on the word “user” which is a noun, and has been used as
an SC in the above English sentence. This provides the main verb “istemaal
karnaa” (meaning “to use”) of the Hindi sentence. Its conjugation for present
indefinite tense is istemaal kartaa hai, when the subject is third person, sin-
gular and masculine. The adjective “regular” of the noun “user” is realized as
the adverb “baraabar ”.
The main verb of the Hindi sentence is “chalnaa”, i.e. “to move”. Its sense
comes from the adverbial PA “on” of the English sentence. The present con-
tinuous form of this verb is “chal rahaa hai ”, when the subject is third person,
singular and masculine. It may be noted that in Hindi grammar neuter gender
does not exist. Inanimate objects are treated as masculine or feminine, and
this categorization follows some systematic rules but occasionally with some
exception (See Appendix A).
are realized in English as PP, but are realized in Hindi as the main verb. For
example, one may consider the following pair:
the verb “chalnaa”. One may notice that here in the Hindi translation the
auxiliary verb is “rahii ” in order to convey that the subject of the sentence is
feminine and singular.
• We further notice that for categorial divergence to occur, the Hindi translation
should not have any subjective complement or predicative adjunct.
The identification algorithm has been designed taking care of the above observations.
The algorithm returns 0 if the translation does not involve any categorial divergence.
Otherwise, depending upon the case it returns 1, 2, 3, or 4. Figure 3.3 provides the
Illustration
Let E be the sentence “She is in tears.”. Its Hindi translation H be “wah (she) ro
rahii hai (is crying)”. As the sentences are parsed, and their SPACs are obtained
the algorithm proceeds as follows.
In Step 1, it finds that the English sentence root main verb is “be”, hence it
proceeds to Step 2. In Step 2 the root main verb of the Hindi sentence is determined.
In this case it is ronaa (i.e. “to cry”), which is not the “be” verb. The algorithm,
therefore, proceeds to Step 3 where it detects that Hindi sentence does not have a
PA. Thus this is a case of categorial divergence.
The algorithm now checks the SPAC of the PA “in tears” which is a prepositional
phrase comprising a preposition and a noun. The algorithm, therefore, detects
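The steps walked through in this illustration can be sketched as a small routine. The FT-dictionary representation and, in particular, the numbering that maps the nature of the English SC/PA to sub-types 1–4 are assumptions reconstructed from the four examples discussed above.

```python
def categorial_divergence(E, H):
    """Return the categorial sub-type (1-4), or 0 for no divergence.

    E and H are assumed FT dictionaries; "sc_pa_spac" holds the SPAC
    category of the English SC or PA. The sub-type numbering below is
    an assumption reconstructed from the examples in the text.
    """
    # Step 1: the English main verb must be a declension of "be".
    if E.get("main_verb_root") != "be":
        return 0
    # Step 2: the Hindi main verb must not be the "ho" ("be") verb.
    if H.get("main_verb_root") == "ho":
        return 0
    # Step 3: the Hindi sentence must have no SC or PA of its own.
    if "SC" in H or "PA" in H:
        return 0
    # Sub-type from the nature of the English SC/PA.
    return {"Adj": 1, "N": 2, "Adv": 3, "P+N": 4}.get(E.get("sc_pa_spac"), 0)
```

For “She is in tears” ∼ “wah ro rahii hai ”, the PA “in tears” comprises a preposition and a noun, so the routine reports sub-type 4 under this assumed numbering.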
Nominal divergence is concerned with the subject of the English sentence. In the
event of nominal divergence, upon translation the subject of the English sentence
becomes the object or verb complement. In this respect this divergence is somewhat
similar to the thematic divergence as defined in (Dorr, 1993). However, in case of
thematic divergence the object of the source language sentence becomes the subject
upon translation, whereas, in case of nominal divergence the subject of the Hindi
translation is derived from the adjectival complement of the English sentence. Thus,
characteristically nominal divergence differs from thematic divergence.
The subject of the English sentence is realized in Hindi with the help of a prepositional phrase. In particular, with respect to nominal divergence the use of two prepositions, “ko” and “se”, can be observed; these typically mark the object and the ablative case, respectively (Kachru, 1980). Hence the latter is called a “verb complement”.
1. Nominal sub-type 1: Here the subject of the English sentence becomes object
upon translation. For illustration the following example may be considered:
Here, the adjective “hungry” is an SC. Its sense is realized in Hindi by the
word bhukh, meaning “hunger” that acts as the subject of the Hindi sentence.
The subject “Ram” of the English sentence becomes the object “ram ko” of the Hindi sentence.
2. Nominal sub-type 2: In this case the subject of the English sentence provides the verb complement of the Hindi sentence upon translation.
Note that the subject of the English sentence “This gutter” is realized as the modifier “iss naale se” of the verb “aatii hai ”.
Analysis of various translation examples reveals the following points:
1. Nominal divergence cannot occur if the main verb of the English sentence is a declension of the “be” verb. This is because in that case the English sentence does not have an SC, which is essential for a nominal divergence to occur.
2. Otherwise, even if the root word of the main verb of the English sentence is
not “be”, nominal divergence cannot occur if the English sentence does not
have an SC.
3. Otherwise, if the SC is null in H, and the object is not null, then it is an instance of nominal divergence of sub-type 1. In place of the object, if the verb complement is not null, then it is an instance of sub-type 2.
The algorithm has been designed by taking care of the above observations. Figure
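These observations can be sketched as follows, under the same assumed FT-dictionary representation used earlier (the dictionary keys and layout are illustrative, not the thesis’ exact data structures).

```python
def nominal_divergence(E, H):
    """Return the nominal sub-type (1 or 2), or 0 for no divergence."""
    # Points 1 and 2: English must have a non-"be" main verb and an SC.
    if E.get("main_verb_root") == "be" or "SC" not in E:
        return 0
    # Point 3: the Hindi SC must be null; the English subject then shows
    # up either as the Hindi object ("ko") or as a verb complement ("se").
    if "SC" not in H:
        if "O" in H:
            return 1
        if "VC" in H:
            return 2
    return 0
```

For “I am feeling sleepy” ∼ “mujhe niind aa rahii hai ”, E has the SC “sleepy” and a non-“be” main verb, while H has no SC but the object “mujhe”, so sub-type 1 is reported.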
Illustration
Let E be the sentence “I am feeling sleepy”, and H be its translation “mujhe (to me)
niind (sleep) aa rahii hai (is coming)”. The root form of the main verb of E is not
“be”. Therefore, it satisfies the else condition of step 1, and we can proceed to the further steps. Here, the SC of E is “sleepy”, which is an adjective (Adj).
Hence steps 2 and 3 do not apply. In step 4 the SC of H is checked and found to be null; the conditions of step 4 are therefore not satisfied. In step 5 the object of H is identified. This implies that the given example pair has a nominal divergence of sub-type 1. Figure 3.6 gives the correspondence of the SPACs of the example discussed above.
Pronominal divergence is concerned with English sentences in which the pronoun “it” is used as the subject. The Hindi equivalent of “it” is “wah” or “yah”. Thus, typically the Hindi translation of such a sentence should have one of these two words as the subject of the sentence. For example, the following translations may be considered:
In example (a) the word “morning”, a noun, acts as an SC. Upon translation
it provides the subject “subaha” of the Hindi sentence. In example (b) the SC
is still a noun but it is preceded by an adjective. Upon translation the whole
noun phase “andherii raat” becomes the subject of the corresponding Hindi
sentence.
In this example, the adjectival complement “humid”, and its adverb “very” of
the English sentence are together realized with the help of the noun phrase
“bahut umas”, which acts as the subject of the Hindi sentence. As a conse-
quence pronominal divergence occurs.
(in) daudhnaa (to run) kathin (difficult) hai (is)”. The subject of the Hindi
translation has become “daudhnaa”, which means “to run”. One may note
that the adjunct “in the Sun” of the infinitive form “to run” translates to
“dhoop mein” that becomes a post modifier for the subject “daudhnaa”.
from the main verb of the source language sentence. Consider, for example,
the following translation:
The main verb “to rain” of the English sentence provides the subject “barsaat”
of the Hindi translation. One may notice the difference between this trans-
lation, and the translation of the sentence “It is crying” given earlier in this
section to appreciate the divergence.
Thus we find four different sub-types of the pronominal divergence each having
its own characteristics. If the subject of the English sentence is not “it”, then the
possibility of pronominal divergence can be ruled out. Further, even if the English
sentence has “it” at the subject position, if the subject of the Hindi sentence is
one of “wah” or “yah”, then too pronominal divergence cannot occur. Otherwise,
depending upon the SC or main verb of the English sentence the sub-type of the
pronominal divergence is identified. Figure 3.7 gives the corresponding algorithm.
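The decision procedure just described can be sketched as a short routine. As before, the FT dictionaries are an assumed representation, and the mapping of the “be” cases to sub-types 1–3 via the SC’s SPAC category is an assumption reconstructed from the examples.

```python
def pronominal_divergence(E, H):
    """Return the pronominal sub-type (1-4), or 0 for no divergence."""
    # Step 1: the English subject must be "it".
    if E.get("subject", "").lower() != "it":
        return 0
    # Step 2: a Hindi subject "wah" or "yah" rules the divergence out.
    if H.get("subject") in ("wah", "yah"):
        return 0
    # Steps 3-4: a non-"be" English main verb means the Hindi subject
    # was derived from that verb (sub-type 4, e.g. "It is raining").
    if E.get("main_verb_root") != "be":
        return 4
    # Otherwise the sub-type follows the nature of the English SC
    # (assumed numbering: noun phrase 1, adjectival 2, infinitival 3).
    return {"NP": 1, "AdjP": 2, "VP": 3}.get(E.get("sc_spac"), 0)
```

For the pair in the illustration below, E = “It is raining” and H = “barsaat ho rahii hai ”, the routine returns 4.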
Illustration
Let E be the English sentence “It is raining”, and let its Hindi translation H be “barsaat ho rahii hai ”. The syntactic phrase annotated chunk (SPAC) structures of the
example pair, and their correspondences are given in Figure 3.8. Here the subject
of E is “it” and the subject of H is “barsaat”, which is neither “yah” nor “wah”. Hence the condition of step 2 is not satisfied. In step 3, the algorithm finds that the root form of the main verb of the English sentence is “rain”, which is not “be”. Therefore, the condition of step 3 is also not satisfied. Hence step 4 detects a pronominal divergence of sub-type 4.
The characteristic feature of demotional divergence is that here the role of the main
verb of the source language sentence is demoted upon translation. In case of Eu-
ropean languages this implies that the main verb of the target language is realized
from the object of the source language, and the main verb of the source language
upon translation becomes the adverbial modifier. However, with respect to English
to Hindi translations a subtle variation may be noticed. We observed several ex-
amples where the main verb of the English sentence upon translation is demoted
to the subjective complement or predicative adjunct of the Hindi sentence, but not
to adverbial modifier, which we call an adjunct. Hence in the event of demotional
divergence, the main verb of the Hindi translation is realized as a “be” verb. The following sub-types may be identified.
1. Demotional sub-type 1: This divergence occurs when the main verb and the
object of the English sentence are realized as predicative adjunct in the Hindi
sentence. However, the subject of the English sentence remains the subject
after translation to Hindi. For illustration, we consider the following example:
In this example the main verb “feeds” and the object “four people” of the
English sentence together give the predicative adjunct, which is the PP, “chaar
logon ke liye” (in English “for four people”) of the Hindi sentence. The subject
“this dish” remains subject after translation.
2. Demotional sub-type 2: Unlike the above sub-type, here the main verb and its complement (instead of the object) of the English sentence are realized as the predicative adjunct of the Hindi sentence.
In this example, “belong to” and “a doctor” are the main verb and its comple-
ment of the English sentence, respectively. They jointly provide the predicative
adjunct (daaktaar kaa) of the Hindi sentence.
3. Demotional sub-type 3 : Under this sub-type the main verb and the object
of the English sentence are realized as the adjectival SC and the adjectival
complementation by preposition (SC C), respectively, in the Hindi translation.
Here also, the subject of the English sentence remains the subject of the Hindi
sentence. The following example explains this sub-type:
In this example, the main verb of the English sentence “face” is realized as
the SC “saamne” in the Hindi sentence. Also, the object “each other” of the
English sentence becomes an SC C, i.e. “ek dusre ke”. Thus, this translation
belongs to demotional divergence of sub-type 3. The literal meaning of this
translation is “These two sofas are opposite to each other”.
4. Demotional sub-type 4: This sub-type is illustrated by the following example:
This soup lacks salt. ∼ iss soop mein namak kam hai
(this) (soup) (in) (salt) (less) (is)
In the above example, the main verb “lack” of the English sentence is realized as “kam”, the SC of the Hindi sentence. The object “salt” (“namak ”) becomes the subject of the target language sentence, and the sense of “the soup” is realized by the phrase “iss soop mein”.
1. In all the instances of demotional divergence we find that the main verb of the English sentence is different from “be” or “have”. Thus if the main verb of the English sentence is a declension of “be” or “have”, demotional divergence cannot occur.
2. On the other hand, if the main verb of the Hindi translation is not the “ho”
verb (i.e. in English “be”), then demotional divergence cannot occur.
3. If the Hindi equivalent of the subject of the English sentence E is same as the
subject of the Hindi sentence H, then occurrence of demotional divergence is
decided as follows. Since the English main verb is realized upon translation as
SC or PA, if the Hindi translation has no SC or PA, then here also demotional
divergence cannot occur.
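These observations can be sketched as a routine of the following shape. The branch order and the dictionary keys are assumptions reconstructed from the two illustrations that follow; “subject_gloss” is a hypothetical key holding the Hindi rendering of E’s subject.

```python
def demotional_divergence(E, H):
    """Return the demotional sub-type (1-4), or 0 for no divergence."""
    # Step 1: the English main verb must not be "be" or "have".
    if E.get("main_verb_root") in ("be", "have"):
        return 0
    # Step 2: the Hindi main verb must be "ho" (the "be" verb).
    if H.get("main_verb_root") != "ho":
        return 0
    if E.get("subject_gloss") == H.get("subject"):
        # Step 3: the subject is preserved across the translation.
        if "PA" in H and "O" in E:
            return 1    # verb + object demoted to a predicative adjunct
        if "PA" in H:
            return 2    # verb + complement demoted to a predicative adjunct
        if "SC" in H and "SC_C" in H:
            return 3    # verb -> adjectival SC, object -> SC_C
        return 0
    # Steps 4-6: subjects differ; SC and SC_C present, E without an object.
    if "SC" in H and "SC_C" in H and "O" not in E:
        return 4
    return 0
```

On the two illustrations that follow, this sketch returns 4 for the “soup” pair and 1 for the “dish” pair.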
Illustration 1.
Consider the English sentence (E) “The soup lacks salt”, and its Hindi translation
(H) “soop mein namak kam hai ”. The SPACs of these sentences and their term
correspondences are given in Figure 3.10.
Here the root form of the main verb of H is “ho” (i.e. “be”), and for E it is “lack”. Hence the conditions of steps 1 and 2 are not satisfied, and the computation therefore proceeds to step 3. The condition of step 3 also fails, as the subjects of E and H are not the same. Steps 4 and 5 check that both SC and SC C are present in the Hindi sentence. Hence, step 6 is considered. Since E has no object, the algorithm returns
4 indicating that the above sentence pair has a demotional divergence of sub-type 4.
Illustration 2.
Consider another example, where E is “This dish feeds four people.”, and H is “yah
pakvaan chaar logon ke liye hai ”. The SPACs of these two sentences and their
correspondences are given in Figure 3.11.
The root forms of the main verbs of E and H are “feed” and “ho”, respectively. Therefore, neither step 1 nor step 2 is satisfied. Further, the algorithm checks the other steps to determine the sub-type of demotional divergence. The subjects of both E and H are the same. The PA is present in H, and the object of E is not null.
This implies that the conditions of step 3 are satisfied. The algorithm, therefore, returns 1, indicating that the sentence pair has a demotional divergence of sub-type 1.
Conflational divergence pertains to the main verb of the source language sentence.
Typically, as characterized in (Dorr, 1993), conflational divergence occurs when some
new words are required to be incorporated in the target language sentence in or-
der to convey the proper sense of the verb of the input. However, with respect to
English to Hindi translation we need to deviate from this definition because of the
following reason. Many English verbs do not have a single-word equivalent in Hindi.
In fact, a large number of English verbs are expressed in Hindi with the help of a
noun followed by a simple verb. Such a combination is called a “Verb Part” (Singh,
2003), where the verb used in the Verb Part is some basic verb such as “honaa” (to
become), “karnaa” (to do) etc. Some examples of Hindi Verb Parts are given below.
For illustration, consider the verb “to begin”. Its Hindi equivalent is “aarambh karnaa”. In Hindi, “aarambh” is an abstract noun meaning “beginning”;
whereas, “karnaa” means “to do”. Thus the verb is realized in Hindi as a com-
bination of noun and verb. In a similar vein, the verbs “denaa” (meaning “to give”)
and “honaa” (meaning “to become”) are used as the basic verbs along with appro-
priate nouns to provide the meanings of the English verbs cited above. There are
also examples of Verb Parts involving other basic verbs, such as “maarnaa” (to hit).
However, there are situations when the action, suggested by the main verb of
an English sentence, needs the help of a prepositional phrase or adverbial phrase to
convey the proper sense of the verb. These cases are encountered occasionally, and
therefore deviate from the normal Hindi verb structure. We call these variations
sub-types of conflational divergence, as described below.
1. Conflational sub-type 1 : Divergence of this type occurs when the new words
are added as adjunct to the verb. Typically, this adjunct is realized as a
prepositional phrase. For illustration, consider the following English sentences
and their Hindi translations:
Ram stabbed John. ∼ ram ne john ko chaaku se maaraa
(Ram) (to John) (knife) (by) (hit)
The sense of the verb “stab” is conveyed through the introduction of the prepo-
sitional phrase “chaaku se”. There are cases when the adjunct appears in the
form of an adverbial phrase instead of a prepositional phrase.
To convey the proper sense of the verb “hurry”, the adverbial phrase “jaldi
se” is used along with the main verb “jaanaa” meaning “to go”. Note that
Although a conflational verb normally adds the new words in the adjunct, in
English to Hindi translation we have found some examples in which the con-
flational verb adds words in the subject of the target language sentence. This
we call sub-type 2 of conflational divergence.
2. Conflational sub-type 2 : Under this sub-type the new word added acts as
the subject of the Hindi translation, and the original subject of the English
sentence becomes the post modifier or possessive case of the subject of the
Hindi sentence.
Example 1. He resembles his mother. ∼
uskii shakal uskii maa se miltii hai
The literal meaning of the translation is: “His face is similar to his mother.”
The subject of the Hindi sentence, viz. “uskii shakal ” (meaning “his face”),
is realized from the source language verb “to resemble”. Here “uskii ” (“his”)
is the possessive pronoun of the original subject (“he”) of the English sentence.
In this example too, the subject of the Hindi sentence “iss pakvaan kaa swaad ”
(the taste of this dish) is realized from the verb “to taste”.
1. If the English sentence E has declension of “be/have” verb at the main verb
position, then conflational divergence cannot occur.
3. If the number of nouns in the SPAC of the subject of E is less than the number
of nouns in the SPAC of the subject of H, and the SPAC of the subject of
H further contains a possessive personal pronoun (PRP$), or a possessive case
The algorithm returns 0 if the translation does not involve any conflational di-
vergence. Otherwise, depending upon the case it returns 1 or 2.
Illustration 1.
Let E be the sentence “I stabbed John.”, and let its translation H be “main ne john
ko chaaku se maaraa”. The corresponding SPAC for both the sentences and the
term correspondences are given in Figure 3.13.
In Step 1, the algorithm finds that the root form of the main verb of the English
sentence is “stab”, which is not a declension of “be/have”. Hence the else condition
of Step 1 applies, and the algorithm proceeds to the subsequent steps to determine
the sub-type of the conflational divergence.
In Step 2, S1 and S2 are 0 and 1, respectively, since E does not have any adjunct
while H has one. This implies that the given translation pair involves conflational
divergence of sub-type 1.
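The adjunct-counting check of the illustration above can be sketched as follows. This is a minimal sketch: the dict-based sentence representation and its field names ("main_verb_root", "adjuncts", "subject_new_word") are assumptions made for illustration, not the thesis's actual data structures.

```python
# Sketch of the conflational-divergence check, using a hypothetical
# dict-based stand-in for the FT/SPAC annotation of a sentence pair.

BE_HAVE = {"be", "have"}

def conflational_subtype(e, h):
    """Return 0 (no conflational divergence), or the sub-type 1 or 2."""
    # Step 1: a be/have main verb in E rules out conflational divergence.
    if e["main_verb_root"] in BE_HAVE:
        return 0
    # Step 2: compare the adjunct counts of E and H (S1 and S2 in the text).
    s1, s2 = len(e["adjuncts"]), len(h["adjuncts"])
    if s2 > s1:
        return 1          # new words appear as an adjunct of the verb
    # Sub-type 2: the new word surfaces in the subject of H instead.
    if h.get("subject_new_word"):
        return 2
    return 0

# "I stabbed John." ~ "main ne john ko chaaku se maaraa"
e = {"main_verb_root": "stab", "adjuncts": []}
h = {"main_verb_root": "maar", "adjuncts": ["chaaku se"]}
print(conflational_subtype(e, h))   # → 1 (sub-type 1)
```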
Possessional divergence deals with English sentences in which a declension of the verb
“have” is used as the main verb. An interesting feature of Hindi is that it has
no possessive verb, i.e. one equivalent to the “have” verb of English. The normal
translation pattern of English sentences with declensions of “have” as main verb is
illustrated below:
The above examples demonstrate that the normal translation pattern of these
sentences is one of the following:
1. The main verb of the translated sentence is “honaa” which means “to be”.
2. The verb is used along with some genitive prepositions (viz. kaa, ke or kii ),
or the locative prepositional phrase, viz. “ke paas”, to convey the meaning of
possession (Kachru, 1980).
3. Which one of the three genitive prepositions is used depends upon the
number and gender of the object. It is “kaa” if the object is masculine singular,
“kii ” if the object is feminine singular, and “ke” if the object is plural, whether
masculine or feminine.
However, there are many examples where the translation structure deviates from
this normal pattern, giving rise to divergence. We call this the “possessional
divergence”. Depending upon how the roles of different FTs change, six different
sub-types are identified. These sub-types are explained below.
1. Possessional sub-type 1 : Here the roles of the subject and the object are
reversed upon translation. Thus this sub-type is akin to thematic divergence.
But in Hindi this pattern is observed only when the main verb of the English
In sentence (a), “he” and “a bad headache” are the subject and the object,
respectively. In the Hindi translation the subject is “tez sirdard ”, i.e. “bad
headache”, and the object of the Hindi sentence is “use” which is the accusative
case of “he”. Thus the roles of subject and object are reversed upon translation.
Similarly, in (b) upon translation the roles of the subject “Ram” and object
“fever” are reversed.
2. Possessional sub-type 2 : In this case the object and its premodifying adjective
in the English sentence are realized as the subject and the SC, respectively, in the
Hindi translation.
The object “voice” and its premodifying adjective “sweet” of the English sen-
tence are realized in Hindi as the subject “aawaaz ” and its adjectival comple-
ment “miithii ”. Note that the subject “these birds” of the English sentence is
3. Possessional sub-type 3 : Here, the object and its post modifier (normally, a
PP) in the English sentence are realized as the subject and the predicative
adjunct, respectively, in the Hindi translation. The subject of the English
sentence also contributes as the possessive case to the predicative adjunct. For
illustration, consider the following:
In the first example, the object (“books”) provides the subject (“kitaaben”) of
the Hindi translation. The post-modifier “in their satchels” of the object of the
Similar transformation takes place in the second example. The object (“two
rupees”) and the post modifier of the object (“in his pocket”) are realized upon
translation as the subject (“do rupaye”) and the predicative adjunct (“raam
kii zeb mein”), respectively. Thus, the literal meaning of the Hindi sentence
The subject of the Hindi sentence is “ek sangrahaalay ”, which comes from
the object of the English sentence. The subject “this city” of the English sentence
translates to “iss shahar ”, which becomes the predicative adjunct “iss shahar
mein” in Hindi.
5. Possessional sub-type 5 : Here, the object of the English sentence, which may
have a premodifier, becomes the main verb of the Hindi sentence. The premodifier
may be an adjective or a noun, which becomes an adjunct of the translated
sentence. Consider, for illustration, the following translations:
In example (a) the main verb of the Hindi sentence (“izzat kartii hai”) is
realized from the object “regards” of the English one. Similarly, in example
(b), the object “escape” of the English sentence is realized as the main verb
(“bache the”) of the Hindi sentence. Further, the premodifying adjective of
the object (“narrow”) is realized as an adjunct (“baal baal ”) in the translated
sentence.
6. Possessional sub-type 6 : Here, the main verb of the translated sentence is not
“ho”. Moreover, this verb does not come from any of the functional tags of
the English sentence. Consider for example the following translations:
In example (a), the main verb of the Hindi sentence is “bitaayaa”, which is
different from “ho” and does not come from any FT of the English
sentence. The literal meaning of the Hindi translation of (a) is “Radha spent a
good time here.” Similarly, in (b) the introduction of a new verb “kiyaa” (meaning
“did”) may be noticed.
1. Possessional divergence cannot occur if the main verb of the English sentence
is not a declension of “have”.
3. If the root form of the main verb of H is not “ho”, then divergence of
sub-type 5 or sub-type 6 will be identified, depending upon whether the object
is present in H.
4. If the root form of the main verb of H is “ho”, the object of H is not present,
and the predicative adjunct is present in H, then the decision between divergence
sub-types 3 and 4 is taken on the basis of the postmodifier of the object of
E.
5. To check the precondition of sub-type 1, one has to first find out the trans-
lations of the subject and the object of the English sentence E with the help
of a bilingual dictionary. If these act as the object and subject of the Hindi
translation (i.e. their roles are reversed) then possessional divergence sub-type
1 occurs.
POSS, PRP$ or P.
7. For sub-type 2 to occur the following three conditions are necessary: the root
form of the main verb of H should be “ho”, the SPAC of the object of H
should contain an Adj (i.e. adjective), and the SC of H should not be
null. When all three conditions are met, possessional divergence of
sub-type 2 is identified.
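Taken together, the observations above suggest a decision procedure along the following lines. This is a minimal sketch: the sentence representation, the toy bilingual dictionary, and the exact reading of observation 3 (object presence in H distinguishing sub-types 5 and 6) are assumptions made for illustration.

```python
# Sketch of the possessional-divergence observations. The dicts and the
# field names ("main_verb_root", "subject", "object", "pa",
# "object_postmodifier") are hypothetical stand-ins for the FT annotation.

def possessional_subtype(e, h, bilingual):
    """Return 0 (no possessional divergence) or a sub-type number."""
    # Obs. 1: E must have a declension of "have" as its main verb.
    if e["main_verb_root"] != "have":
        return 0
    # Obs. 3: a non-"ho" main verb in H points to sub-type 5 or 6.
    if h["main_verb_root"] != "ho":
        return 5 if h.get("object") else 6
    # Obs. 5 / sub-type 1: subject and object roles reversed in H.
    if (h.get("object") and h.get("subject")
            and bilingual.get(e["subject"]) == h["object"]
            and bilingual.get(e["object"]) == h["subject"]):
        return 1
    # Obs. 4: no object but a predicative adjunct in H -> sub-type 3 or 4,
    # decided by the postmodifier of the object of E.
    if not h.get("object") and h.get("pa"):
        return 3 if e.get("object_postmodifier") else 4
    return 0

# "Suresh has fever." ~ "suresh ko bukhaar hai"
e = {"main_verb_root": "have", "subject": "suresh", "object": "fever"}
h = {"main_verb_root": "ho", "subject": "bukhaar", "object": "suresh ko"}
print(possessional_subtype(e, h, {"suresh": "suresh ko", "fever": "bukhaar"}))
# → 1 (sub-type 1)
```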
We have designed our algorithm taking care of the above observations. Fig-
ure 3.14 provides a schematic view of the proposed algorithm. We illustrate the
algorithm with the help of the following examples.
Illustration 1.
Consider the English sentence (E) “Suresh has fever.”. Its Hindi translation (H)
is “suresh ko (suresh) bukhaar (fever) hai (is)”. The SPACs of these sentences and
their term correspondences are given in Figure 3.15.
The root forms of the main verbs of E and H are “have” and “ho”, respectively.
This implies that they do not satisfy the conditions of steps 1 and 2. In step 3, the
algorithm checks the postposition condition for the subject of H, and finds that none
of the relevant postpositions is present for the subject of H. In step 4, the algorithm
finds that the subject of E and the object of H are “suresh” and “suresh ko”,
respectively, which are translations of each other. Further it finds that the object
“fever” of E became the subject “bukhaar ” of the Hindi sentence H. Therefore, step
4 returns 1 indicating the occurrence of possessional divergence of sub-type 1 in the
above translation.
Illustration 2.
Consider the English sentence (E) “This city has a museum.”. Its Hindi translation
H is “iss (this) shahar (city) mein (in) ek (one) sangrahaalaya (museum) hai (is)”.
The SPACs of these sentences and their term correspondences are given in Figure
3.16.
The root forms of the main verbs of E and H are “have” and “ho”, respectively.
The algorithm therefore arrives at step 3, where it finds that the subject of H does
not have any of the postpositions “kaa”, “ke” or “kii ”. Hence the algorithm proceeds
further. Since the conditions of step 4 are not met, the algorithm arrives at step 5.
Here it finds that H has no object, but a PA (“iss shahar mein”) is present. Also,
since there is no postmodifier of the object of E, the algorithm returns 4. Thus, the
algorithm diagnoses possessional divergence of sub-type 4 in the above translation
example.
In this chapter we have discussed the various types of divergences that have been
observed in English to Hindi translation. By analyzing the characteristics of various
examples, we have been able to identify different sub-types under each divergence
type. These observations helped us to design algorithms for their identification.
However, we still have some examples of divergence which do not fall under any of
the above-mentioned types. At the same time we do not have a sufficient number of
examples of these to classify them under some new type or sub-type.

Implementation of the proposed identification algorithms requires the following:

• A cleaned and aligned parallel corpus of both the source and the target lan-
guages.
3.4. Concluding Remarks
• Appropriate parsers have to be designed for source language and target lan-
guage. The parsers should be able to provide the FT and SPAC information
for both the languages. Note that, presently no such parser is available for
Hindi. For our experiments we have used manually annotated Hindi corpora.
This chapter deals with the characterization and identification of different types
of divergence that may occur in English to Hindi translation. We observed that
identification of divergence can be made without going into the semantic details of
the two sentences. This can be achieved by comparing the Functional Tags (FT)
and Syntactic Phrase Annotated Chunks (SPAC) of the source language sentence
and its translation.
The work described here may be broadly classified into two parts.

This work describes in detail the various types (and sub-types) of
divergence that may occur in English to Hindi translation. The work also
identifies three new types of divergence that have hitherto been not found in
translation between any other language pair.
The second part concerns identification of divergence on the basis of the
structural changes in the sentences that occur due to various types of di-
vergence. Seven different types of divergence have been studied, and all of
them have a number of sub-types. The necessary preconditions in the English
sentence corresponding to each of these sub-types have been identified, and
An obvious question that arises at this point is how an EBMT system is expected
to handle divergences. In this regard our suggestions are as follows. Once divergences
have been identified, the possible strategies are:
• To split the system’s example base into two parts: normal and divergence
example base. The translation examples are to be put in the appropriate part
of the example base.
• To devise a scheme by which the EBMT system can heuristically judge whether
the translation of an input sentence may involve any divergence, so that
retrieval may be made accordingly.
• Since translations involving divergence deviate from the normal translation
patterns, their adaptations may need specialized handling that may vary with
the type/sub-type of divergence.
Chapter 4
A Corpus-Evidence Based
Approach for Prior Determination
of Divergence
4.1 Introduction
This chapter presents a corpus-evidence based scheme for deciding whether the trans-
lation of an English sentence into Hindi will involve divergence. Certainly, the occurrence
of divergence poses a great hindrance to efficient adaptation of retrieved sentences.
A possible solution may lie in separating the example base (EB) into two parts:
Divergence EB and Normal EB, so that given an input sentence, retrieval can be
made from the appropriate part of the example base. However, this scheme can
work successfully only if the EBMT system has the capability to judge from the
input sentence itself whether its translation will involve any divergence. However,
making such a decision is not straightforward, since occurrence of divergence does
not follow any patterns or rules. In fact, a divergence may be induced by various
factors, such as, structure of the input sentence, semantics of its constituent words
etc. In this chapter we propose a corpus-evidence based approach to deal with this
difficulty. Under this scheme, upon receiving an input sentence, the system looks into
its example base to glean evidences in support of, as well as against, any type of
divergence that may occur in the translation of the input sentence. Based on these
evidences the system decides whether the retrieval has to be made from the normal
EB, or from the divergence EB.
The algorithm proposed here works for the structural, categorial, conflational, de-
motional, pronominal and nominal types of divergence. Each of these types has
4.2. Corpus-Based Evidences and Their Use in Divergence Identification
been classified into several sub-types depending upon the variations in the role of
different functional tags upon translation to Hindi.
In this chapter, we have identified the necessary FT-features that the source
language (English) sentences should have in order that a particular type/sub-type
of divergence may occur. This, however, does not mean that any sentence having
those FT-features will necessarily produce a divergence upon translation.
This chapter describes all these evidences and how they are to be used for mak-
ing an a priori decision regarding whether the input English sentence will involve
any divergence.

4.2 Corpus-Based Evidences and Their Use in Divergence Identification
The proposed scheme makes use of three different types of evidence to decide whether
a given input sentence will have a normal translation, or whether it will involve one
(or more) type(s) of divergence when translated into Hindi. These evidences are
used in succession to obtain the overall evidence in support of divergence(s)/non-
divergence in the translation of the input sentence. These three steps are explained
below:
Step 1 : Here the Functional Tags (FTs) of the constituent words of the input sentence
are used to discard the divergence types that certainly cannot occur in its translation.
Step 2 : Here the semantic similarities of the constituent word(s) of the input sentence
with the constituent words of sentences in the divergence EB and the normal EB are
determined.

Step 3 : Sometimes the above two steps may suggest more than one type of divergence.
In such a situation the algorithm should consult its knowledge base to ascer-
tain which combinations of divergence types are possible in the translation of
a single sentence. A scrutiny of our example base, and examination of the syn-
tactic rules of the Hindi grammar suggest that only the following combinations
of divergence are possible with respect to English to Hindi translation:
This step discards those suggestions given by the earlier two steps that do not
conform with the knowledge stored in the set CD described just above.
Analysis of the divergence examples suggests that for each divergence type to occur
the underlying sentence needs to have some specific functional tags (FT) and/or some
specific attributes of these FTs. Together we call them the FT-features of a sentence.
Considering all the divergences together, we found that ten different FT-features are
particularly useful for identification of divergence. Table 4.1 provides a list of these
features, which we label f1, f2, ..., f10.
• Its presence in the input sentence is necessary for the corresponding divergence
type to occur;
di   sub-type     f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
d1   -            X   X   P   X   P   A   A   X   A   A
d2   sub-type 1   P   X   A   P   A   P   X   X   A   A
d2   sub-type 2   P   X   A   P   A   P   X   X   A   A
d2   sub-type 3   P   X   A   P   A   A   X   X   A   P
d2   sub-type 4   P   X   A   P   A   A   X   X   A   P
d3   sub-type 1   A   X   P   X   X   X   X   X   X   A
d3   sub-type 2   A   X   P   P   X   X   X   X   X   A
d4   sub-type 1   A   X   P   P   P   A   A   X   A   A
d4   sub-type 2   A   X   P   P   A   A   A   X   P   A
d4   sub-type 3   A   X   P   P   P   A   A   X   A   A
d4   sub-type 4   A   X   P   P   P   A   A   X   A   A
d5   sub-type 1   P   X   A   P   A   P   X   P   X   A
d5   sub-type 2   A   X   P   P   A   X   X   P   X   A
d5   sub-type 3   P   P   A   P   A   P   X   P   A   A
d6   sub-type 1   P   X   A   P   A   P   P   X   A   A
d6   sub-type 2   A   X   P   P   A   P   P   X   A   A
Each row of the Relevance Table provides the necessary conditions on the FT-
features of an input sentence in order that the corresponding divergence may occur.
The advantage of this evidence is that it helps in quick discarding of those types of
divergence that cannot occur in the translation of the given input sentence.
The information given in Table 4.2 may be used in the following way. Given an
input sentence, the algorithm first extracts the values for the ten FT-features, fj ,
j = 1, 2, ..., 10, from the sentence. These values are then compared with the row
entries of the Relevance Table. If the FT-features of the sentence conform with the
entries of some particular row, then evidence is obtained towards occurrence of that
particular divergence for which this row corresponds to one of the sub-types. If a
particular sentence has evidence supporting more than one divergence then all these
possible divergence types are to be considered for step 2 of the algorithm. This set
of possible divergence types for a given input is denoted as D.
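The row-matching procedure just described can be sketched as follows, assuming the usual reading of the entries: "P" requires the FT-feature to be present, "A" requires it to be absent, and "X" leaves it immaterial. The two rows shown are copied from the Relevance Table for the d2 sub-types; the FT-feature extraction itself is taken as given.

```python
# Sketch of the Relevance Table lookup (Step 1). Each row gives, per
# FT-feature f1..f10, one of "P" (must be present), "A" (must be
# absent) or "X" (immaterial).

RELEVANCE = {
    ("d2", 1): "PXAPAPXXAA",   # categorial divergence, sub-type 1
    ("d2", 2): "PXAPAPXXAA",   # categorial divergence, sub-type 2
}

def row_matches(features, row):
    """features: set of 1-based indices of the FT-features present."""
    for j, entry in enumerate(row, start=1):
        if entry == "P" and j not in features:
            return False
        if entry == "A" and j in features:
            return False
    return True

def possible_divergences(features):
    # A divergence type is possible if at least one sub-type row matches.
    return sorted({d for (d, _), row in RELEVANCE.items()
                   if row_matches(features, row)})

# "Ram is friendly to me." has FT-features f1, f4, f6 and f7.
print(possible_divergences({1, 4, 6, 7}))   # → ['d2']
```

With all sixteen rows loaded, the same sentence would also match the d6 rows, yielding the set D = {d2, d6} of the running example.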
For illustration, consider the following input sentence: “Ram is friendly to me.”
As the sentence is parsed (with some unnecessary components edited) one may get
the following:
The notations used here are from the ENGCG parser and are explained in Appendix B.
We can summarize the parsed version as follows. Of the ten FT-features discussed
above (see Table 4.1), only four are present in the above sentence. These are:
Thus in the Hindi translation of this sentence only those divergence sub-types
can occur for which the entries corresponding to FT-features f1 , f4 , f6 , and f7 are
either “P” or “X”. For the other FT-features the entries have to be either “A”
or “X”. This algorithm assumes that the occurrence of a particular divergence type is
possible only if at least one of its sub-types satisfies the above conditions. Thus, for
this sentence, the possible types include:

• Categorial (d2 ), since sub-types 1 and 2 conform with the above requirements.
Also note that, sub-type 1 of d5 has values either “P” or “X” for the FT-features
f1 , f4 , f6 , and f7 . But divergence d5 cannot occur in this case as the sub-type has
an extra requirement that FT-feature f8 should also be present, which is not true
for this sentence. Therefore, the output of this step is the set D = {d2 , d6 }.
It, however, should be noted that the FT-features specified in the Relevance
Table do not provide conclusive evidence towards the presence of some particular
divergence type. For example, consider the following two sentences.
Example (A):
She is in trouble. ∼ wah (she) musiibat (trouble) mein (in) hai (is)
Since both the sentences given in Example (A) have the same FT-features, i.e.
f1 , f4 and f10 , the Relevance Table gives evidence supporting categorial divergence
d2 (check the rows for sub-types 3 and 4) for both the sentences. But of the two
sentences the translation of the first one is a normal one. It is only the second
sentence that involves categorial divergence upon translation to Hindi. Thus, to
determine the possible divergence type(s) in a sentence, the FT-features alone do
not suffice.
From the above example, it can be surmised that it is the prepositional phrase
“in tears” that is instrumental for causing the categorial divergence in the second
sentence. In general, corresponding to each divergence type one can associate some
functional tags that are instrumental in causing the divergence. We call this the
Problematic FT of the corresponding divergence type. Table 4.3 provides the Prob-
lematic FT corresponding to all the six divergence types relevant in the context
of English to Hindi translation. This table has been obtained by examining the
sentences in our example base.
the Relevance Table) then the corresponding problematic FT in the sentence needs
to be examined more carefully. Since both the sentences of Example (A) have the
structures required for categorial divergence, Table 4.3 suggests that to gather more
evidence the scheme should concentrate on the SC or PA of the sentences.
In this respect one major difficulty is that a particular word may convey different
senses in different contexts, even when it is under the same FT. For example, consider
the two sentences and their Hindi translations given in Example (B) below:
Example (B):
Here, the first one is an example of normal translation, while the second one is a
case of structural divergence because of the introduction of the postposition “ko” in
the object of the Hindi sentence. A careful examination suggests that although the
main verb of both the sentences is “beat”, its translation causes divergence when
used in a particular sense, but not when used in some other sense. By referring
to WordNet 2.0 one may find that the first sentence has the 6th sense of the word
“beat”, which is “to make a rhythmic sound ”; while the second sentence has the
1st sense of the word “beat”, which is “to come out better in a competition, race,
or conflict”. Therefore, while dealing with words one needs to pay attention to
2. http://www.cogsci.princeton.edu/cgi-bin/webwn
the particular sense in which a word is being used – in some senses it may cause
divergence, and in some other senses it may not induce any divergence at all.
Since an exhaustive list of words (along with their relevant senses) that lead
to divergence is impossible to make, the proposed algorithm tries to gather more
evidences by using the semantic similarity of the constituent words to the word
senses that are already known to cause divergence, or known to deliver a normal
translation. In order to achieve this two dictionaries have been created: Problematic
Sense Dictionary (PSD) and Normal Sense Dictionary (NSD). The PSD contains the
words along with their senses that have been found to cause divergence. Similarly,
the NSD contains the words along with their senses for which normal translation
has been observed.
These dictionaries are further grouped into six sections, one section correspond-
ing to each divergence type. Section PSDi contains the problematic words occurring
in sentences whose translation involves divergence di ; similarly, section NSDi
contains the words for which di is not observed upon translation. Table 4.4 gives
the number of words in each section of the PSD and the NSD currently present in
our example base.
Each PSD/NSD entry contains, along with the relevant word, its part of speech
and the appropriate sense number (as given by WordNet 2.0). Table 4.5 shows some
entries corresponding to each PSDi and NSDi , i = 1, 2, ..., 6. The entries are stored
in the format word#pos#k, where pos stands for the particular part of speech,
which can be one of n, v, a or r (corresponding to noun, verb, adjective and adverb,
respectively), and k denotes the sense number.
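Parsing such entries is straightforward; the sketch below assumes the PSD/NSD are stored as plain collections of these strings.

```python
# Minimal parser for the word#pos#k entry format of the PSD/NSD.

POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}

def parse_entry(entry):
    """Split a dictionary entry like 'beat#v#1' into its three fields."""
    word, pos, sense = entry.split("#")
    return word, POS_NAMES[pos], int(sense)

print(parse_entry("beat#v#1"))      # → ('beat', 'verb', 1)
print(parse_entry("trouble#n#1"))   # → ('trouble', 'noun', 1)
```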
For illustration, consider the two sentences given in Example (A). Both of them
have the structure required for categorial divergence, i.e. d2 . Problematic FT for
this divergence type is the predicative adjunct (PA), which is a prepositional phrase.
Hence, in PSD2 and NSD2 we store tears#n#1 and trouble#n#1, respectively. Sim-
ilarly, corresponding to Example (B) where the relevant divergence is structural, i.e.
d1 , the entries in PSD1 and NSD1 are beat#v#1 and beat#v#6, respectively.
The possibility of a divergence di is assessed on the
basis of four parameters, viz. sim(ai , wi ), s(di ), sim(ai , wi′ ) and s(ni ), as described
below:
1. sim(ai , wi ) gives the maximum similarity score between ai and the words in
PSDi , where sim(x, y) denotes the semantic similarity between two words x
and y (see Appendix C).
2. The quantity s(di ) is given by

   s(di ) = 0                           if xi = 0;
          = (1/2)(xi /ci + ci /S)       otherwise.        ...(1)
(c) S is the total number of words in the PSD. Note that, currently the
total number of words in PSD is 416 (see Table 4.4). This number will
increase as more divergence examples are obtained, and corresponding
problematic words are added to the dictionary.
3. The quantity sim(ai , wi′ ) is similar to sim(ai , wi ). While computing sim(ai , wi′ ),
the scheme uses NSDi and NSD instead of PSDi and PSD.

4. The quantity s(ni ) is similar to s(di ), and is calculated using NSDi and NSD.
The value used for S here is the cardinality of the NSD, which is at present
3931 (see Table 4.4).
These four quantities are used to determine the possibility of occurrence of di-
vergence di in the translation of the given input sentence.
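Formula (1) can be transcribed directly; in the sketch below xi and ci are the counts used in the formula, and S defaults to the current PSD size of 416.

```python
# Direct transcription of formula (1) for the evidence score s(d_i).

def s_score(x_i, c_i, S=416):
    """Evidence score s(d_i) of equation (1)."""
    if x_i == 0:
        return 0.0
    return 0.5 * (x_i / c_i + c_i / S)

print(s_score(0, 5))    # → 0.0 (no similar words in PSD_i)
print(s_score(4, 4))    # slightly above 0.5: 0.5 * (4/4 + 4/416)
```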
In order to determine whether a given input sentence, say e, may involve some
divergence upon translation, the evidences mentioned in the previous section are
used in the following way. First the input sentence e is parsed, and then using the
Relevance Table a set D is determined that contains the divergence types that may
possibly occur in the translation of e. For each possible divergence type di ∈ D the
problematic word ai is extracted from the sentence e. From PSDi , the word wi is
retrieved that is semantically most similar to ai . The subsequent steps depend upon
the value of sim(ai , wi ). If the value is 1, that implies that ai is present in PSDi .
On the other hand, a small value of sim(ai , wi ) implies that there is not enough
evidence in support of divergence di . Hence it may be concluded that divergence di
4.3. The Proposed Approach
will not occur in the translation of e. Note that, whether the value of sim(ai , wi ) is
sufficiently small is determined by comparing it with a threshold t, which is to be
determined experimentally from the corpus. If the value of sim(ai , wi ) is between t
and 1, then some evidence in support of divergence di is obtained. In order to make
a conclusion from this point the algorithm now refers to NSDi to obtain the word
wi′ that is semantically most similar to ai . Depending upon the values of sim(ai ,
wi ) and sim(ai , wi′ ), a decision is taken regarding whether the translation of e will
involve divergence di or not. Based on this decision, the retrieval is to be made from
the appropriate part of the example base, i.e. the Divergence EB or Normal EB.
The overall scheme, which involves the following four major steps, is explained below.
Step 1: At this stage, the input sentence e is parsed, and its FT-features are
obtained. From these FT-features, using Table 4.2, the set D of possible divergence
types is determined.
The main objective now is to determine which of the divergence types di ∈
D have positive evidence supporting their occurrence in the translation of e.
Steps 2 and 3 are designed for this purpose. A set of flags, Flagi , corresponding to
each di ∈ D is used to store this information. Initially each of these flags is set to
–1. Step 2 and Step 3 are now carried out for each di ∈ D in order to reassign the
value of Flagi . At each iteration the next di with the minimum index i is chosen.

Step 2: First, the set Wi of words belonging to PSDi and having a positive semantic
similarity score with ai is determined. Thus Wi = {b : b ∈ PSDi and sim(ai , b) > 0}.
From Wi the word wi is obtained such that sim(ai , wi ) = max sim(ai , b) ∀b ∈ Wi .
If Wi is empty, then sim(ai , wi ) is taken to be 0. Depending on sim(ai , wi ), one of
the following cases is executed.
Case 2a: If sim(ai , wi ) = 1, then set Flagi = 1. This is because the condition
implies that the word ai is present in PSDi . Hence this sentence will certainly have
divergence di upon translation. Therefore Flagi is set to 1.
Case 2b: If sim(ai , wi ) = 0, i.e. Wi is empty, then there is no evidence that ai
will cause divergence di . Consequently, Flagi is set to 0.

Case 2c: If sim(ai , wi ) < t, where t is some pre-defined threshold, then too it
may be decided that ai will not cause divergence di . Consequently, Flagi is set to 0.
The main difficulty here is to decide upon the right value for the threshold t. After
a sequence of experiments with different values for t, we found that the best results
are obtained for t = 0.5. However, since this value is corpus dependent, for other
corpora the value of t should be determined experimentally.
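The Step 2 decision with the threshold t = 0.5 can be sketched as follows. Returning None to stand for "defer to Step 3" is a representational choice made here for illustration; the similarity computation itself is assumed done elsewhere.

```python
# Sketch of the Step 2 threshold logic on the maximum similarity of the
# problematic word a_i with the words of PSD_i.

T = 0.5  # corpus-dependent threshold; 0.5 gave the best results here

def step2_flag(sim_max):
    """Map max similarity with PSD_i to a Flag_i decision.

    Returns 1 (divergence certain), 0 (ruled out), or None when
    T <= sim_max < 1, in which case Step 3 must consult the NSD.
    """
    if sim_max == 1:      # Case 2a: a_i itself is present in PSD_i
        return 1
    if sim_max < T:       # Cases 2b/2c: too dissimilar to PSD_i words
        return 0
    return None           # defer to Step 3

print(step2_flag(1))      # → 1
print(step2_flag(0.3))    # → 0
print(step2_flag(0.7))    # → None
```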
Since in all the three cases above, the scheme arrives at a decision regarding the
divergence type di , computation may skip Step 3, and go to Step 4 directly. But
there may be cases when the similarity score sim(ai , wi ) lies between t and 1. In
these cases, as mentioned above, the NSD has to be referred to. Hence Step 3 is
executed.
Step 3: Here, first the set Wi′ = {b : b ∈ NSDi and sim(ai , b) > 0} is computed.
From this set the word wi′ is picked such that sim(ai , wi′ ) = max sim(ai , b) ∀b ∈ Wi′ .
If Wi′ is empty then sim(ai , wi′ ) is considered to be 0. Depending on sim(ai , wi′ ), one
of the following cases is executed.
Case 3a: If sim(ai , wi0 ) = 0 then it implies that there is no evidence that the
word will lead to normal translation. Consequently, Flagi is set to 1 indicating that
divergence di has a positive chance of happening.
Case 3b: If sim(ai , wi′ ) = 1 then the evidence suggests that the word ai should
lead to a normal translation. Consequently, Flagi is set to 0.
Case 3c: Decision making becomes most difficult when 0 < sim(ai , wi′ ) < 1. This
implies that a word identical to ai exists neither in the PSD nor in the NSD, although
sufficiently similar words exist in both. Thus, no decision about divergence/non-
divergence can be taken yet.
In this case the scheme proposes to look into how many words similar to ai are
available in PSDi and NSDi . This evidence is given by the scores s(di ) and s(ni ),
computed using formula (1) (given in Section 4.2). Finally, the similarity scores
sim(ai , wi ) and sim(ai , wi′ ) are combined with s(di ) and s(ni ), respectively, to take
both pieces of evidence into consideration. If the evidence supporting divergence di
is stronger, then the value of Flagi is set to 1; otherwise it is set to 0. Thus, in this
case, the combined scores m(di ) = 1/2 (sim(ai , wi ) + s(di )) and m(ni ) =
1/2 (sim(ai , wi′ ) + s(ni )) are computed and compared.
The last case refers to a rare situation when m(di ) and m(ni ) are equal. In
this case, the algorithm cannot recommend whether the translation will involve
divergence di , or whether it will be normal. In such a situation the system can at
best pick the most similar examples from both the normal EB and the divergence
EB, and leave it to the user to make the final decision. Therefore, in such cases,
Flagi is set to 1/2.
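Putting Cases 2a, 2c and 3a to 3c together, the per-divergence flag assignment can be sketched as follows. This is a minimal sketch, not the thesis's code: sim_psd and sim_nsd stand for the precomputed sim(ai, wi) and sim(ai, wi′), s_d and s_n for s(di) and s(ni), and Case 2b, which is not visible in this copy of the text, is assumed to fall into the sim < t branch.

```python
def assign_flag(sim_psd, sim_nsd, s_d, s_n, t=0.5):
    """Decide Flag_i for one divergence type d_i.

    sim_psd: sim(a_i, w_i), best similarity found in PSD_i (0 if none)
    sim_nsd: sim(a_i, w'_i), best similarity found in NSD_i (0 if none)
    s_d, s_n: corpus-evidence scores s(d_i) and s(n_i) from formula (1)
    Returns 1 (divergence likely), 0 (normal), or 0.5 (undecidable).
    """
    # Step 2: evidence from the Problematic Sense Dictionary.
    if sim_psd == 1.0:          # Case 2a: a_i itself occurs in PSD_i
        return 1
    if sim_psd < t:             # Case 2c: too dissimilar to any PSD entry
        return 0
    # t <= sim_psd < 1: Step 3, consult the Normal Sense Dictionary.
    if sim_nsd == 0.0:          # Case 3a: no evidence of normal translation
        return 1
    if sim_nsd == 1.0:          # Case 3b: a_i itself occurs in NSD_i
        return 0
    # Case 3c: combine similarity with frequency-of-occurrence evidence.
    m_d = 0.5 * (sim_psd + s_d)
    m_n = 0.5 * (sim_nsd + s_n)
    if m_d > m_n:
        return 1
    if m_n > m_d:
        return 0
    return 0.5                  # rare tie: leave the decision to the user
```

For instance, with the values quoted for d1 in Illustration 3 (0.660, 0.889, 0.444, 0.151) the function returns 1.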
decision regarding possible divergence types in the translation of the given input
e. Here it should be noted that Flagi = 0 implies that e cannot have
divergence di , while Flagi = 1 implies that upon translation e may have
divergence di . A set D′ is constituted, such that D′ = {di ∈ D : Flagi = 1}, i.e.
the set of divergence types for which evidence of divergence has been found.
Case 4a: If D′ = φ, then the conclusion is that sufficient evidence has not been
obtained for any of the divergence types. Hence, the decision is that the translation
Case 4c: If |D′ | > 1, it implies that there is a possibility of more than one type
of divergence. The algorithm therefore seeks further evidence before making any
decision. The evidence provided by CD (Section 4.2) may be used here. A set
C = {{di , dj } ∈ CD | di , dj ∈ D′ } is constructed. Depending upon |C|, further decision has
• If |C| = 1, it implies that there is evidence for only one permissible combination.
Let it be {dk , dl }. The algorithm suggests that the input sentence e will involve
both divergence dk and dl upon translation to Hindi.
• If |C| > 1, that is, if evidence is obtained in support of more than
one permissible combination of divergences, then the scheme needs to select
one of the combinations.
The flowchart of the proposed scheme is given in Figures 4.1 and 4.2.
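Step 4 can be sketched in the same vein. The tie-breaking by average m-score when more than one permissible pair survives mirrors Illustration 3 in Section 4.4 rather than an explicit rule visible in this copy, and the fallback when no permissible pair is found is my own assumption.

```python
def decide_divergences(flags, cd, m_score=None):
    """Step 4 sketch: combine per-divergence flags with the CD evidence.

    flags:   dict mapping divergence name -> Flag value (0, 1, or 0.5)
    cd:      iterable of permissible divergence pairs (the set CD)
    m_score: optional dict mapping divergence -> combined score m(d),
             used only to choose among several permissible pairs
    Returns the list of divergence names predicted for the input sentence.
    """
    d_prime = {d for d, f in flags.items() if f == 1}   # the set D'
    if not d_prime:                     # Case 4a: expect a normal translation
        return []
    if len(d_prime) == 1:               # single candidate divergence
        return sorted(d_prime)
    # Case 4c: keep only permissible pairs drawn from D' (the set C).
    pairs = [p for p in cd if set(p) <= d_prime]
    if not pairs:                       # fallback; not specified in the text
        return sorted(d_prime)
    if len(pairs) > 1 and m_score:
        # pick the pair with the highest average m-score, as in Illustration 3
        pairs = [max(pairs, key=lambda p: sum(m_score[d] for d in p) / 2)]
    return sorted(pairs[0])
```

With the flags and scores of Illustration 3 this selects the pair {d3, d4}, matching the text's conclusion.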
In this section we first illustrate with examples how the above algorithm works,
and then present the experimental results obtained.
4.4.1 Illustration 1
Consider the input sentence “I am feeling hungry”. Its parsed version is: @SUBJ PRON PERS SG1 “i”, @+FAUXV
V PRES “be”, @-FMAINV V PCP1 “feel”, @PCOMPL-S A ABS “hungry” <$.>.
Of the ten FT-features (see Table 4.1) only four are present in the above sentence.
These are:
• f3 – since the main verb (feel) of the sentence is not “be” or “have”.
Note that the FT-features of the given input sentence conform with both the sub-
types of d3 and only sub-type 2 of d6 (see Table 4.2). Hence the set D of possible
divergence types is obtained as D = {d3 , d6 }, which are the conflational and nominal
types of divergence, respectively.
4.4. Illustrations and Experimental Results
Table 4.3 suggests that the problematic word for d3 is the main verb, i.e. “feel”.
WordNet 2.0 provides thirteen different senses for the word “feel” when used as a
verb, such as:
• feel, find – come to believe on the basis of emotion, intuitions, or indefinite grounds
• sense3 : feel, sense – perceive by a physical sensation, e.g., coming from the
skin or muscles
For the given input sentence the appropriate sense is sense1. Thus a3 is feel#v#1.
A scrutiny of PSD3 reveals that it contains no word w such that sim(w, a3 ) > 0;
hence Flag3 is set to 0. For d6 , the problematic word of the input sentence is
“hungry”. WordNet 2.0 provides two senses for
“hungry” of which the first one “feeling hunger” is appropriate in this case. Thus,
the problematic word a6 is hungry#a#1. PSD6 is then scrutinized to find
the word semantically most similar to a6 . It is found that PSD6 already contains
the word hungry#a#1 itself, so that sim(a6 , w6 ) = 1 and, by Case 2a, Flag6 is set to 1.
Now the set D′ is constructed as D′ = {di ∈ D : Flagi = 1}. Evidently, for the
given input sentence D′ contains a single element, d6 . Thus the algorithm suggests
that the above input sentence will cause nominal divergence upon translation to
Hindi, which is a correct decision.
4.4.2 Illustration 2
Its parsed version is @SUBJ PRON PERS FEM SG3 “she”, @+FMAINV V PRES
Using the Relevance Table the set D of possible divergence types is obtained as
{d2 }.
Table 4.3 suggests that the problematic FT for d2 is the predicative adjunct, i.e. “in
dilemma”. Thus the problematic word is “dilemma”. WordNet 2.0 provides only one
sense for dilemma: “state of uncertainty or perplexity especially as requiring a choice
between equally unfavorable options”. Thus the problematic word a2 is dilemma#n#1.
A search in PSD2 for the word that is semantically most similar to a2 retrieves the
word “motion”.
It may be noted that the similarity between “dilemma” and “motion” is not apparent
at the surface level. However, since in this algorithm the hypernyms of the words
concerned are used for computing the similarity value, a positive semantic score has
been obtained, because the last abstraction level in the hypernym chains of “dilemma”
and “motion” is the same, namely “⇒ state”.
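The effect described here can be imitated with a toy hypernym table. The chains below are illustrative stand-ins, not the thesis's WordNet 2.0 data, and the scoring function is a simple path-length measure rather than the thesis's actual formula.

```python
# Toy hypernym chains, loosely following WordNet's noun hierarchy;
# the entries are illustrative only, not the thesis's data.
HYPERNYMS = {
    "dilemma":   ["dilemma", "quandary", "cognitive_state", "state",
                  "attribute", "entity"],
    "motion":    ["motion", "state", "attribute", "entity"],
    "confusion": ["confusion", "cognitive_state", "state",
                  "attribute", "entity"],
}

def hypernym_similarity(a, b):
    """Path-length score through the most specific shared hypernym:
    the closer the meeting point of the two chains, the higher the score."""
    chain_a, chain_b = HYPERNYMS[a], HYPERNYMS[b]
    shared = [h for h in chain_a if h in chain_b]
    if not shared:
        return 0.0
    lcs = shared[0]                       # most specific shared hypernym
    da, db = chain_a.index(lcs), chain_b.index(lcs)   # steps up to the lcs
    return 1.0 / (1.0 + da + db)
```

On this toy table "dilemma" and "motion" meet only at "state", so their score is positive but lower than the score between "dilemma" and "confusion", which meet at a more specific node, mirroring the relative ordering reported in the text.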
Since 0.5 ≤ sim(a2 , w2 ) < 1, Step 2 of the algorithm suggests that NSD2 has
to be checked for further evidence. From NSD2 , the word w2′ most similar to a2 is
determined, and it is found to be confusion#n#2 with sim(a2 , w2′ ) = 0.960. The
algorithm therefore determines s(d2 ), s(n2 ), m(d2 ) and m(n2 ) (see Case 3c). These
values are found to be 0.086, 0.035, 0.332 and 0.497, respectively. Since m(n2 ) >
m(d2 ), Flag2 is set to 0.
Using Step 4 the algorithm now constructs the set D′ consisting of divergence
types di for which the flags have been set to 1. Evidently, D′ is found to be
empty. Thus the algorithm suggests that the above input sentence does not give
rise to any divergence upon translation to Hindi.
It may be noted that the above decision made by the algorithm is a correct one.
4.4.3 Illustration 3
Consider the input sentence “My house faces east”. Its parsed version is: @GN> PRON PERS GEN SG1 “i”, @SUBJ N SG “house”,
@+FMAINV V PRES “face”, @OBJ N SG “east” <$.>
Note that the main verb of the input sentence is “face” which is not “be” or
“have”. Further, the sentence has a subject “my house” and an object “east”. Thus
the FT-features of the given input sentence are: f3 , f4 and f5 .
According to the Relevance Table the set D is constructed, and it has three
elements: d1 , d3 and d4 .
The problematic FT for d1 is the main verb which is “face”. Nine senses are
provided by WordNet 2.0 for the verb “face” of which sense3 (be oriented in a
certain direction, often with respect to another reference point; be opposite to) is the
relevant one in this case. Thus the problematic word a1 is “face#v#3 ”. From PSD1
the word w1 that is most similar to a1 is retrieved. Note that w1 is obtained as
attend#v#1, and the similarity score sim(a1 , w1 ) is calculated to be 0.660. Since
0.5 ≤ sim(a1 , w1 ) < 1, the algorithm now checks NSD1 . From NSD1 , W1′ is
constructed, and w1′ is found to be cap#v#1 with sim(a1 , w1′ ) = 0.889. In this
case, the algorithm has to determine s(d1 ) and s(n1 ). These are found to be 0.444
and 0.151 respectively. Thus, m(d1 ) = 1/2 (sim(a1 , w1 ) + s(d1 )) = 0.552 and
m(n1 ) = 1/2 (sim(a1 , w1′ ) + s(n1 )) = 0.520. Since m(d1 ) > m(n1 ), the algorithm
sets Flag1 to 1.
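These combined scores can be checked directly from the quoted values:

```python
# Reproducing the combined evidence scores for d1 in Illustration 3.
sim_psd, s_d = 0.660, 0.444   # sim(a1, w1) and s(d1), as given in the text
sim_nsd, s_n = 0.889, 0.151   # sim(a1, w1') and s(n1), as given in the text

m_d = 0.5 * (sim_psd + s_d)   # m(d1) = 0.552
m_n = 0.5 * (sim_nsd + s_n)   # m(n1) = 0.520

assert m_d > m_n              # hence Flag1 is set to 1
```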
The problematic FT for d3 is also main verb (See Table 4.3), and therefore the
problematic word (a3 ) here too is “face#v#3 ”. From PSD3 the word w3 that is
most similar to a3 is retrieved. In this case the same word face#v#3 exists in PSD3 ,
and therefore sim(a3 , w3 ) = 1.0. Therefore, due to Case 2a Flag3 is set to 1.
The problematic word a4 for d4 is also “face#v#3 ”, which too exists in PSD4 . Hence
Flag4 is also set to 1. The relevant m(di ) values are given in Table 4.6. Using the
values given therein the algorithm computes
1/2 (m(d1 ) + m(d3 )) = 0.548 and 1/2 (m(d3 ) + m(d4 )) = 0.673.
Since the latter is larger, the algorithm suggests that the above input
sentence will have divergences d3 and d4 upon translation to Hindi. This
decision of the algorithm is also correct.
Tables 4.7 and 4.8 provide a few more examples with brief explanations. The overall
analysis of each example sentence requires 17 columns. Table 4.7 contains the column
numbers (i ) to (viii ), and Table 4.8 contains the column numbers (ix ) to (xvii ). For
ease of understanding, the column corresponding to serial number (S. No.) and
column number (ii ) are given in both the tables. In these tables, “NA” is used
when a particular condition is not applicable, and “Nil” implies that no word having
semantic similarity score greater than 0 has been found in the PSD/NSD.
Table 4.7 (columns (i )–(viii )):

S.   Sentence      D     Problematic    Most similar    sim(ai , wi )  Is wi a      Most similar   sim(ai , wi′ )
No.                      word, ai       word, wi                       coordinate   word, wi′
                                                                      term?
(i ) (ii )         (iii ) (iv )         (v )            (vi )          (vii )       (viii )
1.   She will      d1    resolve#v#6    calculate#v#1   0.984          No           resolve#v#6    1.0
     resolve       d3    resolve#v#6    Nil             0.0            NA           NA             NA
     this issue.   d4    resolve#v#6    Nil             0.0            NA           NA             NA
2.   I will        d1    attend#v#1     attend#v#1      1.0            NA           NA             NA
     attend this   d3    attend#v#1     look#v#5        0.66           No           ride#v#9       0.66
Continued . . .
Table 4.8 (columns (ix )–(xvii ), fragment):

7.   d2   NA  NA  NA  NA  1    d2 , d5    {d2 , d5 }   NA   d2 , d5
     d5   NA  NA  NA  NA  1
8.   d2   NA  NA  NA  NA  0    d5         NA           NA   d5
     d5   NA  NA  NA  NA  1
9.   d2   NA  NA  NA  NA  NA   NA         NA           NA   Normal, as sim(a2 , w2 ) < 0.5;
                                                            wrong decision
10.  d4   NA  NA  NA  NA  1    d4         NA           NA   d4
In order to evaluate the performance, we have applied the above algorithm to 300
randomly selected sentences that are not present in our example base. Manual
analysis of the translations of these 300 sentences revealed that 32 of them
involve some type of divergence when translated from English to Hindi. The
remaining 268 sentences have normal translations.
overall outcome.
The very high value (above 90%) for precision establishes the efficiency of the
algorithm in detecting possible occurrence of divergence even before the actual trans-
lation is carried out.
There are a few examples where the algorithm failed to produce the correct decision.
These may be put into three categories:
1. Translation of the input sentence actually involves divergence but the algo-
rithm predicts normal translation. Table 4.9 indicates that there is one such
case in our experiments. Although the algorithm suggests that 261 sentences
will be translated normally, it has been found that actually 260 of these are
correct decisions.
2. The input sentence actually has normal translation but the algorithm predicts
divergence. In the experiments carried out by us, we found six such exam-
ples. While the algorithm suggests that 36 sentences will involve some type of
divergence, only 30 of these decisions are actually correct.
3. The algorithm is unable to decide the nature of the translation of the input
sentence. Out of 300 examples tried the algorithm could provide decisions for
only 297 (36+261) sentences. For the remaining three sentences the algorithm
could not arrive at any decision regarding whether they will be translated
normally, or their translations will involve some type of divergence. These are
the situations that fall under Case 3c of the algorithm.
Table 4.7 provides one example of this type. Here the input sentence
and its translation are: “This table weighs 100 kg” ∼ “iss (this) mez kaa vajan
(weight of this table) 100 kilo (100 kg) hai (is)”. This example has demotional
divergence, i.e. d4 . However, the algorithm could not give any decision regarding it.
The algorithm is not able to give the correct result in the first two cases. We feel that
the possible reasons behind the incorrect decisions taken by the algorithm are the
following:
4.5. Concluding Remarks
• Lack of robust PSD and NSD. The present sizes of the PSD and NSD are
416 and 3931 entries, respectively. Evidently, these numbers are not large enough
to deal with all different sentences. As more examples (particularly, those
involving divergence) are collected, both the PSD and NSD may be enriched
with additional entries. This will in turn enable the algorithm to measure
semantic similarity in a more direct way. As a consequence the number of
erroneous decisions will reduce.
• The value of threshold. For our experiments we have used 0.5 to be the value
of the threshold t. This value has been obtained by carrying out a number
of experiments on our example base. However, with more examples this value
of t may have to be reassigned, which may in turn improve the quality of the
results. Further experiments with more examples need to be carried out to
arrive at an optimal value of the threshold t.
sentences in an EBMT system. This can be dealt with efficiently provided an EBMT
system is capable of making a priori decision regarding whether an input sentence
will cause any divergence upon translation. This will enable the EBMT system to
retrieve a past example more judiciously. However, the primary difficulty in handling
divergences is that their occurrences are not governed by any linguistic rules. Hence
no straightforward method exists for determining whether a source language sentence
will involve any divergence upon translation. In this work we attempted to bridge
this gap. We developed a scheme so that an a priori decision may be made by seeking
evidence from the existing example base. In order to achieve this goal we first
analyzed different divergence examples to ascertain the root cause behind occurrence
of a divergence. We found that each divergence type can be associated with some
Functional Tag (FT) that is instrumental in causing this type of divergence. Two
dictionaries, viz. the “problematic sense dictionary” (PSD) and the “normal sense
dictionary” (NSD), have been created.
Given an input, these knowledge bases are referred to in order to seek evidence for
or against divergence. The evidence used is of the following types:
(b) Semantic similarity of these constituent words with words in the PSD and
NSD;
(c) Frequency of occurrence of different divergence types in the example base; and
(d) Which divergence types may co-occur in the translation of an input sentence.
The experiments carried out by us resulted in very high values of precision and
recall. However, more experiments need to be done to establish this scheme as a key
technique for dealing with divergences for an EBMT system.
The following points may be noted with respect to the scheme presented here:
1) The two dictionaries (PSD and NSD) used in this work have been created
manually. Some suitable Word Sense Disambiguation techniques may have to
be developed/used to accomplish this task automatically.
2) The decisions made by the scheme concern divergence types only. We
feel that the scheme may be further extended to deal with various sub-types
that are associated with each divergence type. Our present example base does
not have sufficient number of examples for each sub-type. More examples in-
volving each of these sub-types need to be obtained, and analyzed for any such
extension, and also to improve upon the performance of the present scheme.
Chapter 5
5.1 Introduction
the development of an effective retrieval scheme for it. The closer the retrieved sen-
tence to the input one, the easier is its adaptation towards generating the required
translation. However, no standard technique has been developed for measuring sen-
tential similarity. Typically, similarity between sentences is measured using syntax
and semantics (Manning and Schutze, 1999). But in this chapter we show that when
adaptation is the main concern, neither of them is adequate as a similarity measure.
referred to as “example base” only. The results obtained are compared with two
other algorithms, based on syntactic and semantic metrics respectively. It is shown
that the algorithm proposed in this chapter performs better than the other two.
5.2. Brief Review of Related Past Work
(a) This character based method does not need morphological analysis.
(b) It can retrieve some kind of synonyms without a thesaurus, because syn-
onyms often have the same Kanji character in Japanese.
The character based best match can be determined by defining the distance
or similarity measure between two strings. Considering character order con-
straints, the simple measure of similarity between two strings is the number of
matching characters.

1. A chunk is a segment or substring of words from a sentence or text.

A Cost of Adaptation Based Scheme
in which (i) there are gaps between the words, (ii) the word order is different.
It also considers matching on the basis of subset of the words in the input
chunk, and also takes care of word inflections.
exactness. For example, the penalty for unmatched words is set to 10, the
penalty for disordering is set to 15, etc. Match scores are first calculated sep-
arately for each incomplete match. Then a cumulative score is produced.
The candidate finding procedure retains only those matches whose match scores
score depends to a significant extent on the word weights (word frequency and
sentence frequency; these terms are explained in detail in Section 5.5.1), which
in turn depend on the sentences in the example
base. Thus the schemes become highly subjective. In particular, sentences
having similar structure (in terms of tense, subject, number of objects, etc.)
have higher similarity measurement values for a given input sentence. Differ-
ent weights have been assigned to the similarity of different syntactic tags. For
example, a score of 20 is given to verb or auxiliary verb matching, a score of
5 to adjective or adverb matching, etc.
5. Hybrid retrieval Scheme: This scheme has been used in ReVerb system (Collins
and Cunningham, 1996) (Collins, 1998) that utilizes two different levels of
case retrieval: String matching retrieval (Phase 1), and Activation passing for
syntactic retrieval (Phase 2). In Phase 1, only exact words are matched, and
near morphological neighbours (such as variation due to number or variation due
to tense) are not considered. The highest score is allocated to those cases that
have been activated the greatest numbers of times. In Phase 2, for structural
retrieval, the input sentence is first pre-chunked, such that each chunk has an
explicit head-word. The algorithm initiates activation from each word in the
chunk, giving the head word an increased weight to reflect its pivotal role in
the chunk. The final score is evaluated by summing the above two scores.
6. DP-matching between word sequence (Sumita, 2001): This scheme scans the
source parts of all example sentences in a bilingual corpus. By measuring
the semantic distance between the word sequences of the input and example
sentences, it retrieves the examples with the minimum distance, provided the
distance is smaller than a given threshold. Otherwise, the whole translation
fails with no output. The semantic distance (dist) is calculated in the following
way.
dist = (I + D + 2 · Σ SEMDIST) / (Linput + Lexample )

Here I and D are the numbers of insertions and deletions in the best word
alignment, SEMDIST is the semantic distance charged for each substitution, and
Linput and Lexample are the lengths of the input and the example word sequences.
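A sketch of such a DP word-sequence match follows; sem_dist is a placeholder for the thesaurus-based semantic distance used by Sumita's scheme, which is not reproduced here.

```python
def dp_distance(inp, ex, sem_dist):
    """DP word-sequence matching in the spirit of Sumita (2001):
    dist = (I + D + 2 * sum(SEMDIST)) / (len(inp) + len(ex)),
    where I/D count insertions/deletions and sem_dist in [0, 1] is
    charged (doubled) for each substitution or match."""
    n, m = len(inp), len(ex)
    # cost[i][j] = minimal I + D + 2*sum(SEMDIST) aligning inp[:i] with ex[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                      # i deletions
    for j in range(1, m + 1):
        cost[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1,                               # deletion
                cost[i][j - 1] + 1,                               # insertion
                cost[i - 1][j - 1] + 2 * sem_dist(inp[i - 1], ex[j - 1]),
            )
    return cost[n][m] / (n + m)

# Placeholder semantic distance: 0 for identical words, 1 otherwise.
exact = lambda a, b: 0.0 if a == b else 1.0
```

With the placeholder distance this degenerates to a normalized edit distance; plugging in a thesaurus-based sem_dist recovers the semantic behaviour described above.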
7. Semantic matching procedure (Jain, 1995): This scheme first looks at the verb-
part of the input sentence, and on the basis of the type of verb-part it chooses
a matching procedure. The distance d(I, E) between the input sentence I and an
example sentence E is defined as follows:
d(I, E) = Σ_{p=1}^{n} dp (IG, EG) + dv (IV, EV )
where n is the number of noun syntactic groups in the source langauge sentence.
IG and EG are “input sentence noun syntactic group”, and “example sentence
noun syntactic group”, respectively. Similarly, IV and EV are the input and
the example sentence verb groups, respectively.
Weights in the range of 0 to 1 have been used as the weighting factors for the
above parameters.
jective, verb), modality (request, desire, question) and tense. This method
does not rely on functional words (e.g. conjunction, preposition, auxiliary
verb) information. A thesaurus is utilized to extend the coverage of the ex-
ample base. Two types of content words “identical” and “synonymous” have
been used. Sentences that satisfy the following conditions are recognized as
meaning-equivalent sentences.
• The retrieved sentence should have the same modality and tense as the
input sentence.
• All content words (identical or synonymous) of the candidate are included
in the input sentence. This means that the set of content words of a
meaning-equivalent sentence is a subset of that of the input.
If more than one sentence is retrieved, the algorithm ranks them by introducing
“focus area” to select the most similar one. The focus area has been defined
as the last N words from the word list of an input sentence. The value of N
varies according to the length of the input sentence.
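The two conditions amount to a modality/tense match plus a subset test on content words. A sketch with simplified sentence records follows; the field names and the synonyms mapping are my own, not the cited scheme's data structures.

```python
def is_meaning_equivalent(input_sent, candidate, synonyms=None):
    """Candidate qualifies as meaning-equivalent if (a) its modality and
    tense match the input, and (b) every content word of the candidate
    appears in the input, either identically or via a listed synonym
    (i.e. the candidate's content words form a subset of the input's)."""
    synonyms = synonyms or {}
    if (candidate["modality"] != input_sent["modality"]
            or candidate["tense"] != input_sent["tense"]):
        return False
    inp_words = set(input_sent["content_words"])
    for w in candidate["content_words"]:
        # accept the word itself or any of its listed synonyms
        if w not in inp_words and not (synonyms.get(w, set()) & inp_words):
            return False
    return True
```

Ranking the surviving candidates by the "focus area" described above would be a separate step on top of this filter.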
Many other similarity measurement schemes are found in the literature. Every
metric has its own advantages and disadvantages. The demerits mentioned below
motivated us to define a new metric.
• Character based metrics are highly script dependent. Hence a scheme which
is designed for a specific language may not be suitable for another language.
• Word based metrics are generally dependent on the size of the database. If the
database does not contain sentences having words common to a given sentence,
then these methods may fail to retrieve any similar sentence from the example
base.
• Most importantly, in almost all the schemes described above, adaptation and
retrieval have been dealt with independently. However, we feel that adaptation
and retrieval should go hand in hand. A retrieval scheme should be considered
efficient (for an EBMT system) if the adaptation of the retrieved sentence is
computationally less expensive.
5.3. Evaluation of Cost of Adaptation
tions required for adapting a retrieved example. The total cost may then be com-
puted as the sum of the individual cost of each operation used for the adaptation.
An important point to be noted in this respect is that some adaptation operations
(e.g. constituent word addition and constituent word replacement) requires a search
of an English to Hindi dictionary. Typically, this dictionary will not be stored in the
RAM of the system, and its access requires retrieval from external storage. This cost
is much more than the cost of any operation that can be accomplished in RAM.
Resorting to morpho-word or suffix operations reduces the number of dictionary
searches, since the number of morpho-tags and suffixes is much smaller in
comparison with the total size of the dictionary.
the search will be done using a binary search algorithm. One may consider
some multi-way search trees (e.g. B+-tree) (Loomis, 1997) also. But since a
successful search in a dictionary of size D takes log2 D comparisons for a binary
tree, and logm D for an m-way tree, the difference between the search times in
these two cases is due to a constant factor only (since logm D = log2 D × logm 2).
2. We further assume that the index tree that is used to facilitate the dictionary
search is already in RAM. Typically, the index tree is designed with the help
of a set of keys. In this case, we assume that the keys are actually the English
words, which are used for the search operation. The record corresponding to
each key contains all other relevant information, e.g. the Hindi meaning of the
word, the POS and other information. These records are stored in the external
storage.
3. The search procedure refers to the index tree for identifying the location of
the word in the dictionary. This operation is carried out by accessing the
RAM only. For actual retrieval the external storage is accessed, and this has
its associated factors, e.g. latency, seek time (Weiderhold, 1987). However,
in our analysis we do not consider all these factors. We make a simple but
realistic assumption following the studies on temporal requirements given at
http://www.kingston.com/tools/umg/umg01a.asp. Accordingly, we assume
that the RAM access time for the CPU is 200 ns (nanoseconds), while the
access time for a hard disk is 12,000,000 ns. Thus the time requirements differ
by a constant factor of the order of 10^5 .
4. In order to reduce the search time, instead of using one dictionary, we recom-
mend using different dictionaries for different POS. The dictionaries used for
this work are of the following sizes: Noun - 13953, Adjective - 5449, Adverb -
1027, Preposition - 87, Pronoun - 72 and Verb - 4330. This database has been
taken from the “Shabdanjali” English to Hindi dictionary (http://ltrc.iiit.net/
onlineServices/Dictionaries/Shabdanjali/data source.html). Thus the ap-
proximate search times for all these parts of speech are as follows:
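The exact timing figures are not preserved in this copy, but the number of key comparisons for a successful binary search is about log2 D for each dictionary size quoted above:

```python
import math

# Per-POS dictionary sizes quoted from the Shabdanjali-based database.
sizes = {"Noun": 13953, "Adjective": 5449, "Adverb": 1027,
         "Preposition": 87, "Pronoun": 72, "Verb": 4330}

for pos, d in sizes.items():
    # ~log2(D) key comparisons for a successful binary search; the single
    # external-storage record access (the ~1e5 factor) dominates in practice.
    print(f"{pos}: about {math.log2(d):.1f} comparisons")
```

This makes the motivation for per-POS dictionaries concrete: searching the 87-entry preposition dictionary costs roughly half the comparisons of searching the 13953-entry noun dictionary.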
Hindi). Since one should not expect that all possible examples are available in
the knowledge base of an EBMT system, a pure example-based approach may
not be able to determine the right place for the new word to be added. In the
absence of suitable examples the system needs to wade through a large set of
syntax rules to determine the appropriate position where the new word has to
be added. We denote this cost by ψ, and assume that this cost is much more
than the cost of finding a Hindi equivalent of a given English word from the
dictionary.
6. For applying any constituent word, morpho-word or suffix operation, first one
7. Since dictionary search required for constituent word addition (WA) and con-
before referring to the dictionary the scheme should first check whether the
word to be added is already present in the sentence (may be with a different
functional tag). In that case the Hindi equivalent of the word may be taken
directly from the retrieved sentence, and thereby dictionary search may be
avoided. The cost of this step is proportional to Lp /2, which should be added
to the overall cost of constituent word addition and replacement. Here Lp is the
length of the parsed version of the retrieved English sentence.
(B) is retrieved then for these two positions constituent word replacement op-
erations are required. Typically, this operation demands a dictionary search
to get the Hindi equivalent of these words. However, in the above example
the dictionary search may be avoided since the Hindi equivalent of the desired
words (“car” and “diesel”) may be obtained from the retrieved example itself.
Section 5.3.1 describes how the computational cost of each of the adaptation
operations is evaluated.
Based on the above observations, the costs of the ten different adaptation operations
(discussed in Chapter 2) are estimated in the following way:
1. Constituent Word Deletion (WD): The average cost is (l1 × L/2) + ε, where L
is the length of the retrieved Hindi sentence, and l1 is the constant of
proportionality. ε is a small positive quantity reflecting the cost of the actual
deletion operation (e.g. adjustment of pointers if sentences are stored in a
linked-list structure of words).
D is the size of the relevant dictionary, and c and d are the constants of
proportionality. The two terms correspond to searching the binary tree
of keys and then retrieving the related record from the external storage,
respectively.
• In the second step the position (in the sentence) where the new word
has to be added is located. This requires referring to the syntactic rules
of the target language grammar to find the proper position of the word.
The cost of this operation has been considered ψ (see item 5 of Section
5.3). Thus the overall cost of this step is ψ + (l1 × L/2) + (l2 × Lp /2). Here
l1 and L are the same as in the case of WD discussed above. Lp is the
length of the parsed version of the retrieved English sentence, and l2 is
the corresponding constant of proportionality.
• Finally, the actual addition is done. The cost involved for this is ε, indi-
cating the cost of adding the new word in the retrieved translation.
required to be created for the new word. The cost, therefore, is reduced by ε
and ψ in comparison with constituent word addition. Hence the average cost
is (l1 × L/2) + (l2 × Lp /2) + {(d × log2 D) + (c × 10^5 )}.
5. Morpho-word Addition (MA): For morpho-word addition, the cost for dictio-
nary search and access in constituent word addition (by referring to item 2
above) is replaced with the average cost, which is (m × M/2), M being the size
of the relevant morpho-word set. Moreover, the cost (l2 × Lp /2) in constituent
word addition is not considered for morpho-word addition, as these morpho-words
are not present in the tagged version. Therefore, the average cost of morpho-word
addition is (l1 × L/2) + (m × M/2) + ψ + ε.
It may be noted that M1 and M can be equal if the word is replaced with some
morpho-word from the same set.
7. Suffix Deletion (SD): Here the work involved is first to identify the right suffix,
and then to do the stripping. So the cost is (l1 × L/2) + (k × K/2), where k is the cost
8. Suffix Addition (SA): Suffix addition is done in two steps. First the position
of the word where the suffix has to be added is determined. The average cost
for this operation is (l1 × L/2) (as explained above). Next the suffix database
is searched for obtaining the appropriate suffix. The average cost therefore is
(l1 × L/2) + (k × K/2).
stripping from the word. It may be noted that K1 and K can be equal if the
suffix is replaced with some suffix from the same set.
These individual costs may be used for determining the overall cost of adaptation.
Section 5.4 discusses how the cost may be calculated for adaptation across different
functional slots and kinds of sentences.
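The individual formulas can be collected into a small cost model. All constants (l1, l2, k, m, c, d, ψ, ε) are left symbolic in the thesis; the values below are placeholders chosen only so that the sketch runs.

```python
import math

# Illustrative constants; the thesis leaves all of these symbolic.
l1 = l2 = k = m = 1.0      # proportionality constants
c, d = 1.0, 1.0            # external-storage access vs. key comparison
PSI = 50.0                 # psi: cost of locating a position via syntax rules
EPS = 0.1                  # epsilon: cost of the actual pointer adjustment

def dict_search(D):
    """Binary-tree key search plus one external-storage record access,
    the latter weighted by the ~1e5 RAM/disk speed ratio."""
    return d * math.log2(D) + c * 1e5

def word_delete(L):                # WD: (l1 * L/2) + eps
    return l1 * L / 2 + EPS

def word_add(L, Lp, D):            # WA: dict search, position, insertion
    return dict_search(D) + PSI + l1 * L / 2 + l2 * Lp / 2 + EPS

def word_replace(L, Lp, D):        # WR: psi and eps are not needed
    return l1 * L / 2 + l2 * Lp / 2 + dict_search(D)

def morpho_add(L, M):              # MA: morpho-word set scan, no dictionary
    return l1 * L / 2 + m * M / 2 + PSI + EPS

def suffix_delete(L, K):           # SD: locate word, then strip the suffix
    return l1 * L / 2 + k * K / 2

def suffix_add(L, K):              # SA: locate word, then find the suffix
    return l1 * L / 2 + k * K / 2
```

By construction, constituent word addition is costlier than replacement by exactly ψ + ε, reflecting the extra position-finding and insertion work described above.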
5.4 Cost Due to Different Functional Slots and Kind of Sentences
The adaptation rule Table 2.15 suggests that adapting a particular kind of
sentence into another kind requires one or more of the following operations: either
addition or deletion of the adverb “nahiin”, or addition or deletion of the
morpho-word “kyaa”. Hence the cost table can be generated by computing the
costs with respect to the above four adaptation operations only.
• The cost (k1) of WA for the adverb “nahiin” is (l1 × L/2) + ψ + ε. Here
dictionary search is not required, as the translation “nahiin” may be stored in
of the word in the retrieved Hindi sentence. We call this cost k3.
Table 5.1 gives the adaptation cost due to kind of sentences for different com-
binations of input and retrieved sentence. Cost of adaptation due to variations in
kind of sentences can now be calculated by referring to the required set of adaptation
operations for different cases as given in Table 2.15.
Below we discuss the cost of adaptation for certain types of verb morphological
variations. In particular, we discuss two groups:
(1) the input and the retrieved sentences have the same tense and the same verb form;
(2) the input and the retrieved sentences have the same tense but different verb forms.
In Section 2.3.1 different cases of this group have been discussed. Further, in Table
2.3 different adaptation rules for present indefinite to present indefinite have been
illustrated in detail. It has also been argued that all other cases belonging to this
group can be dealt with in a similar way. Below we discuss the adaptation cost for
present indefinite to present indefinite by referring to the corresponding adaptation
rules (Table 2.3). The above-mentioned table suggests that the relevant adaptation operations are
copy (CP), suffix replacement (SR) and morpho-word replacement (MR). The costs
of these basic operations may be computed in the following way.
Input→ M1S F1S M1P F1P M2S F2S M3S M3P F3S F3P
Ret’d ↓
M1S 0 s s+n s+n s+n s+n n s+n s+n s+n
F1S s 0 s+n n s+n n s+n s+n n n
M1P s+n s+n 0 s n s+n s+n 0 s+n s
F1P s+n n s 0 n+s n s+n s n 0
M2S s+n s+n n s+n 0 s s+n n s+n s+n
F2S s+n n s+n n s 0 s+n s+n n n
M3S n s+n s+n s+n s+n s+n 0 s+n s s+n
M3P s+n s+n 0 s n s+n s+n 0 s+n s
F3S s+n n s+n n s+n n s s+n 0 n
F3P s+n n s 0 s+n n s+n s n 0
Table 5.2: Cost Due to Verb Morphological Variation: Present Indefinite to Present Indefinite
The cost tables for other verb morphological variations under the same tense and the same
verb form can be formulated in a similar way. Some relevant points in this regard
are discussed below.
• The same Table 5.2 works for adaptation from past indefinite to past indefinite
with a slight modification. In this case morpho-word replacement is done from
the morpho-word set {thaa, the, thii} instead of the morpho-word set {hain, ho,
hoon, hai}. Hence if the value of n is replaced by (l1 × L/2) + (m × 3/2) + (m × 3/2) in
the cost Table 5.2, one gets the cost table for past indefinite to past indefinite.
• For future indefinite to future indefinite, the morpho-word replacement (MR)
is to be removed from the entries of Table 5.2. The cost
s of SR in this case is (l1 × L/2) + (k × 8/2) + (k × 8/2), which is obtained by
considering the relevant set of suffixes, viz. {oongaa, oongii, oge, ogii, egaa,
egii, enge, engii}.
• For all other combinations of verb morphological variations of the same group,
one more morpho-word replacement is to be added to the cost Table 5.2 in
place of the suffix replacement cost s (as discussed in items 3 to 6 of Section
2.3.1). Here the costs of these two morpho-word replacements vary according
to the tense and verb form. For example, in the case of present continuous to
present continuous the relevant morpho-word sets are {hain, ho, hoon, hai}
and {rahaa, rahii, rahe}. The average costs of these morpho-word replacements
are (l1 × L/2) + (m × 4/2) + (m × 4/2) and (l1 × L/2) + (m × 3/2) + (m × 3/2), respectively.
The cost of morpho-word replacements for the remaining 5 cases (e.g. future
continuous to future continuous, past perfect to past perfect, etc.) can be
computed in a similar way by referring to the appropriate morpho-word sets.
There are in total 18 verb morphological variations (see Section 2.3.3). To keep the
discussion simple, here we explain the adaptation cost calculations with the case
described in Section 2.3.3 under the heading “same tense, different verb forms”.
In particular, we discuss the case where the input sentence is in future indefinite,
and the retrieved sentence is either in future continuous or future perfect.
Here the cost of verb morphological variations depends on three adaptation oper-
ations. One is suffix addition, and the other two are morpho-word deletions. The
costs of these operations are as follows:
• In the case of future indefinite the appropriate suffix set is {oongaa, oongii,
oge, ogii, egaa, egii, enge, engii}. Hence the average cost of suffix addition is
(l1 × L/2) + (k × 8/2). We denote it as s.
• The two morpho-word deletions are restricted to the sets {chukaa, chukii, chuke} and
{hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively. We denote their costs by m1 and m2.
Therefore, the total cost involved in adaptation from future continuous or future
perfect to future indefinite is (s + m1 + m2). The cost will be the same irrespective of
the variation in number, gender and person of the subject of the input as well as of
the retrieved sentence.
For the reverse case (i.e. the input sentence is either future continuous or future
perfect, and the retrieved sentence is future indefinite) the cost will be the sum of two
morpho-word additions and one suffix deletion. For these adaptation operations the
suffix set and the morpho-word sets are the same as in the above case. The individual
In a similar way, the cost can be evaluated for the rest of the cases of verb mor-
phological variations of same tense different verb forms. One may refer to Sections
2.3.3 and 5.3.1 to get the relevant adaptation operations and their costs, respectively.
The adaptation cost with respect to the other two groups (i.e. “different tenses,
same verb form” and “different tenses, different verb forms”) can be evaluated in a
similar way with the help of the rule tables and the sets of adaptation operations discussed
in Section 2.3.2 and Section 2.3.4. To avoid the stereotyped nature of discussion,
we do not present all the other cases in this report. However, we present below
the adaptation cost table (Table 5.3) for the verb morphological variation from present
indefinite to past indefinite, which belongs to the group “different tenses, same verb
form”. These values are obtained by referring to the adaptation rule Table 2.4.
Input→ M1S F1S M1P F1P M2S F2S M3S M3P F3S F3P
Ret’d ↓
M1S w s+w s+w s+w s+w s+w w s+w s+w s+w
F1S s+w w s+w w s+w w s+w w s+w w
M1P s+w s+w w s+w w s+w s+w s+w w s+w
F1P s+w w s+w w s+w w s+w w s+w w
M2S s+w s+w w s+w w s+w s+w s+w w s+w
F2S s+w w s+w w s+w w s+w w s+w w
M3S w s+w s+w s+w s+w s+w w s+w s+w s+w
F3S s+w w s+w w s+w w s+w w s+w w
M3P s+w s+w w s+w w s+w s+w s+w w s+w
F3P s+w w s+w w s+w w s+w w s+w w
Table 5.3: Adaptation Operations of Verb Morphological Variation: Present Indefinite to Past Indefinite
Here the cost s denotes the cost of suffix replacement between {taa, te, tii}, which
may be formulated as (l1 × L/2) + (k × 3/2) + (k × 3/2).
In this subsection we discuss the adaptation cost mainly for three functional tags
under the subject/object functional slot. These tags are genitive case (@GEN), pre-
A transformation from genitive case to genitive case requires eleven adaptation op-
erations as given in Table 2.8. Below we describe the cost for each of them. Note
that the genitive word can be a proper noun, or a noun, or a pronoun. We denote
this set by P.
1. The average cost of constituent word replacement from the set P with a proper
2. The average cost of morpho-word replacement (MR) from {kaa, ke, kii } with
3. The average cost of WR from the set P with a noun. This cost is denoted
by w3. Note that in this case a noun dictionary search is necessary, for which
the search time is 13.77 (see item 4 of Section 5.3). Further, to access the
4. The average cost of WR from the set P with a pronoun. This is denoted by w4.
Imitating the case just mentioned above, the cost here may be formulated as
(l1 × L/2) + (l2 × Lp/2) + {(d × 6.17) + (c × 10^5)}.
5. The average cost of morpho-word deletion from the set {kaa, ke, kii}. This
cost is denoted by w5, which may be formulated simply as (l1 × L/2) + (m × 3/2) + ψ.
6. The average cost of morpho-word addition from the set {kaa, ke, kii }. We
7. The average cost of suffix replacement for converting a noun into either an oblique
noun form or a plural form (refer Section 2.5.2 and Appendix A). We denote
this cost by s1. Since the number of relevant suffixes is four, s1 may be com-
puted as (l1 × L/2) + (k × 4/2) + (k × 4/2).
8. The average cost of suffix addition for converting a noun into either an oblique noun
form or a plural form. This cost of suffix addition can be formulated in a way
similar to item 7 above. Here the cost is (l1 × L/2) + (k × 5/2) + (k × 5/2), which
we denote as s2.
9. The average cost of suffix addition from the set {kaa, ke, kii} is (l1 × L/2) +
(k × 3/2). We denote it as s3.
10. The average cost of suffix deletion for converting an oblique noun form to a noun,
or a plural to a singular (see Appendix A and Section 2.5.2). This cost is (l1 × L/2)
+ (k × 5/2) + ψ. We denote it as s4.
11. The average cost of suffix replacement from the set {kaa, ke, kii}. We denote
this cost by s5, which is formulated as (l1 × L/2) + (k × 3/2) + (k × 3/2).
The cost table corresponding to genitive case to genitive case is given in Table
5.4. It has been formulated in accordance with the adaptation rule Table 2.8.
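The item-wise formulas above are built from a few recurring pieces, which can be collected as helper functions. This is only a sketch of the cost model, under the simplifying assumption made later in this chapter that the constants l1, l2, m, k, d and c are all approximately equal to a common value α (taken here as 1); the dictionary search times 13.77, 6.17 and 12.08 are the ones quoted in the text.

```python
ALPHA = 1.0   # common value assumed for the constants l1, l2, m, k, d
DISC = 1e5    # hard-disc access factor: c * 10^5, with c ~ alpha

def find_in_sentence(L):
    """Average cost of locating a word in a sentence of length L: l1 * L/2."""
    return ALPHA * L / 2

def dict_search(time):
    """Dictionary lookup: (d * search_time) + (c * 10^5) for the disc access."""
    return ALPHA * time + ALPHA * DISC

def word_replacement(L, Lp, time=None):
    """WR: (l1 * L/2) + (l2 * Lp/2), plus a dictionary search when one is
    needed (13.77 for a noun, 6.17 for a pronoun, 12.08 for a verb)."""
    cost = find_in_sentence(L) + ALPHA * Lp / 2
    return cost + dict_search(time) if time is not None else cost

def set_replacement(L, size):
    """MR or SR over a set of the given size: (l1 * L/2) + 2 * (k * size/2)."""
    return find_in_sentence(L) + 2 * (ALPHA * size / 2)
```

With these helpers, item 2 above (MR over {kaa, ke, kii}) is set_replacement(L, 3), item 7 (SR over four suffixes) is set_replacement(L, 4), and item 3 (WR with a noun dictionary search) is word_replacement(L, Lp, 13.77).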
We have considered four possible cases: a noun, a proper noun, a pronoun and a
gerund (PCP1) at the subject or object position (refer Section 2.5.5). We
denote this set by Q. All possible adaptation operations required in this case are
listed in Table 2.11, which has been referred to for evaluating the cost of adapting
the possible subject/object-to-subject/object variations.
1. The average cost of constituent word replacement of the set Q by a noun. This
cost is denoted by w1. In this case a noun dictionary search is required, and its
search time is 13.77. Hence, w1 is computed as (l1 × L/2) + (l2 × Lp/2) + {(d × 13.77) + (c × 10^5)}.
2. The average cost of constituent word replacement from the set Q to a proper
noun. This cost is denoted by w2. Note that in this case no dictionary search
is required, as proper nouns are not stored in any dictionary. Hence the cost
w2 is computed as (l1 × L/2) + (l2 × Lp/2).
3. The average cost of constituent word replacement from the set Q to a pronoun.
This cost is denoted as w3, which is formulated as (l1 × L/2) + (l2 × Lp/2) + {(d × 6.17) + (c × 10^5)} (same as in item 1 above).
4. The average cost of constituent word replacement from the set Q to a gerund
(PCP1) is (l1 × L/2) + (l2 × Lp/2) + {(d × 12.08) + (c × 10^5)} (same as in item
1). Note that here a verb dictionary search is required, and its search time is
12.08. We denote this cost as w4.
5. For converting the singular form of a noun to the plural form, or vice versa (see
Appendix A), any one of three different suffix operations is required:
suffix replacement (SR), suffix addition (SA) or suffix deletion (SD). The
average costs of these operations are:
6. The average cost of adding the suffix “na” to the verb in PCP1 form is (l1 × L/2).
Note that here only one suffix is applicable in any of the cases; therefore, no
search is required for deciding on the suffix. This cost is denoted by s4.
Similarly, the cost of adaptation for the other sentence pattern components which have
been discussed in Chapter 2 can be formulated. However, to avoid the stereotyped
nature of discussion, we do not present all the other cases in this report. The
primary advantage of the above analysis is that it paves the way for using adapta-
tion cost as a good yardstick for similarity measurement, which may lead to efficient
retrieval from an EBMT perspective. The following subsection describes adaptation
5.5. The Proposed Approach vis-à-vis Some Similarity Measurement Schemes
The input sentence may be compared with the example base sentences in terms
of functional-morpho tags, their discrepancies may be measured, and the adaptation
cost may be estimated using the formulae given above. The example base sentence
having the minimum cost of adaptation may then be considered the most similar
to the input sentence, and may be retrieved for generating the translation of the given
input sentence. Below we compare the proposed scheme with some other similarity
Semantic similarity depends on the similarity of the words occurring in the two sentences
under consideration. Here, we used a purely word-based metric, and developed a
vector space model as suggested in (Manning and Schutze, 1999). However, the
weighting scheme has been modified (Gupta and Chatterjee, 2002) so that the
scheme can be applied to sentences in a meaningful way. Here, each of the example
base sentences and the input are represented in a high-dimensional space, in which
each dimension of the space corresponds to a distinct word in the example base.
Similarity is calculated as the normalized dot product of the vectors. The method
is explained below.
example base. Thus Ej = (ej1, ej2, ..., ejn), for j = 0, 1, 2, ..., N. The similarity
measure between E0 and the example base sentence Ej is defined here as:

m(E0, Ej) = Σ_{i=1}^{n} e0i × eji        (5.1)
This scheme computes how well the occurrences of the word Wi (measured by e0i
and eji) correlate in the input and the example base sentences. The coordinates eji are
called word weights in the vector space model. The basic information used for word
weighting is the word frequency and the sentence frequency.
For a word Wi, the word frequency wji and the sentence frequency si are combined into
a single word weight as:

eji = wji × ((N / si) − 1),   if wji ≥ 1;        (5.2)
eji = 0,                      if wji = 0.
• Word frequency wji is the number of times the word Wi occurs in the j-th
sentence Ej. This indicates how salient a word is within a given sentence. The
higher the word frequency, the more likely it is that the word is a good description
of the content of the sentence.
Given an input sentence, this similarity may be used for retrieving an appropriate
past example from the example base. To achieve this, the similarity of the
input sentence is measured with each of the example base sentences. The one with
the highest similarity score may be considered for retrieval.
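The weighting of Equation (5.2) and the match of Equation (5.1) can be sketched as follows; the whitespace tokenizer and the tiny example base used in testing are illustrative assumptions, not the thesis implementation.

```python
from collections import Counter

def word_weight_vectors(sentences):
    """e_ji = w_ji * ((N / s_i) - 1): word frequency scaled by an inverse
    sentence-frequency factor (Equation 5.2)."""
    N = len(sentences)
    tokenized = [s.lower().rstrip(".?!").split() for s in sentences]
    sf = Counter()                  # s_i: number of sentences containing W_i
    for toks in tokenized:
        sf.update(set(toks))
    return [{w: wf * ((N / sf[w]) - 1) for w, wf in Counter(toks).items()}
            for toks in tokenized]

def similarity(v0, vj):
    """m(E0, Ej): dot product of the two weight vectors (Equation 5.1)."""
    return sum(e0i * vj.get(w, 0.0) for w, e0i in v0.items())
```

Note that a word occurring in every sentence gets weight N/si − 1 = 0, so highly common words contribute nothing to the match, which is the intent of the sentence-frequency factor.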
We have experimented with two input sentences “I work.” and “Sita sings ghaz-
als.”. Tables 5.6 and 5.7 provide the best five matches for them, respectively.
One may note that the drawback of this scheme is that the outcome varies signif-
icantly with the content words, the size of the example base sentences and the occurrence
of the words in the sentences.
Syntactic similarity pertains to the similarity of the structure of two sentences under
consideration. Let Tj be the tagged version of English sentence Ej of the example
base, and T0 is the tagged version of the input sentence E0 . Here too, every sentence
Tj in the example base is expressed as a vector generated from the structure of the
sentence. A matching technique similar to that used for semantic similarity has been
applied to Tj and T0 (instead of Ej and E0, as discussed in the earlier subsection). As
a consequence, similarity measures are computed at the structural level, and not at
the word level.
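A minimal sketch of this tag-level matching is given below. The tag names and the weight values used in testing are placeholders: the actual weights come from Table 5.8, which is not reproduced here.

```python
def tag_vector(tags, weights):
    """Weighted tag-count vector for a tagged sentence."""
    vec = {}
    for t in tags:
        vec[t] = vec.get(t, 0.0) + weights.get(t, 1.0)
    return vec

def syntactic_similarity(t0, tj, weights):
    """Dot product of the weighted tag vectors of input (T0) and example (Tj)."""
    v0, vj = tag_vector(t0, weights), tag_vector(tj, weights)
    return sum(v0[t] * vj.get(t, 0.0) for t in v0)
```

Giving verb-related tags larger weights makes sentences that agree in tense and verb form score higher, which is the behaviour the discussion below requires.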
The key question is whether all the components in determining the structural
similarity are of equal importance. We feel that the contributions of all the con-
stituent words to the formation of the sentence are not the same; in particular, sentences
having a similar structure (in terms of verb, auxiliary, adverb, etc.) should have a higher
similarity measurement value for a given input sentence. Having tried different
weighting schemes, we found that the one given in Table 5.8 provides the best result.
Table 5.9 and Table 5.10 give the similarity measures obtained for the example
base corresponding to the input sentences “I work” and “Sita sings ghazals” when
the above weighted syntactic similarity scheme has been used.
Note that here similarity of words is completely ignored, as the main emphasis
is laid on the similarity of tense. By resorting to a different weighting scheme one
can change the similarity measures to some extent.
The above studies reveal that neither the semantic measure nor the syntactic measure pro-
vides an effective scheme for calculating similarity between two sentences. In both
cases the measurement score depends to a significant extent on the word weights,
which in turn depend on the sentences in the example base. Thus the schemes
become highly subjective. We, therefore, look for a method that provides a more
objective measure. The following
example illustrates how the functional-morpho tags of an input (IE) and a retrieved
example base sentence (RE) can be used for determining the appropriate adaptation
operations.
Table 5.11 gives the functional-morpho tags of the IE and the RE. To generate
the translation “ram bahut tezii se gaadii chalaa rahaa hai ” of the input sentence
the following adaptation operations are required.
IE: Ram is driving the car at a high RE: He is sitting on the chair.
speed.
Ram @SUBJ <Proper> N SG he @SUBJ PRON MASC SG3
“Ram” “he”
is @+FAUXV V PRES “be” is @+FAUXV V PRES “be”
driving @-FMAINV V PCP1 “drive” sitting @-FMAINV V PCP1 “sit”
the @DN> ART “the” the @DN> ART “the”
car @OBJ N SG “car” ... ...
at @ADVL PREP “at” on @ADVL PREP “on”
a @DN> ART “a” ... ...
high @AN> A ABS “high” ... ...
speed @<P N SG “speed” chair @<P N SG “chair”
Table 5.11: Functional-morpho Tags for the Input English Sentence (IE) and the Retrieved English Sentence (RE)
(a) Whenever a functional tag along with the morpho tags matches in both sentences
but the corresponding words are different, a constituent word replacement
needs to be done for the retrieved Hindi translation. For example, “driving”
and “sitting” are both verbs in their present continuous form “@-FMAINV V
PCP1”. But since the root verbs “drive” and “sit” are different, a constituent
word replacement is required in the retrieved Hindi translation RH. Therefore,
the scheme replaces “baith” with “chalaa”5 . In a similar way “chair” and
“speed” have same functional-morpho tag “@<P N SG”. Hence, here too, the
scheme replaces “kursii ” with “tez ”.
(b) If the functional tags match, but corresponding morpho tags do not match,
then either a constituent word replacement or some suffix modifications (or
both) need to be done to modify the retrieved Hindi translation. For example,
“Ram” and “he” are both subjects, but “Ram” is a proper noun while “he” is
a pronoun. Hence a constituent word replacement is required, and the scheme
replaces “wah” with “ram”.
(c) Whenever a functional tag is not present in the sentence RE, but is present in
IE, the corresponding word in Hindi has to be retrieved from an appropriate
word dictionary, and added at the appropriate position in the Hindi sentence
RH. For example, the object “car”, which comes before the preposition “at”
in the IE sentence, is not complementing the preposition, whereas the object “chair”,
which comes after the preposition “on” in the RE sentence, is complementing
it. Thus, the two objects “car - @OBJ N SG” and “chair - @<P
(d) Whenever a functional tag is present in RE, but is not present in IE, the
corresponding word in Hindi has to be deleted from the Hindi sentence RH.

5. The Hindi translation of the word “drive” is to be taken from the verb dictionary.
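Rules (a) to (d) can be sketched as a comparison of two tag maps. The dictionary representation (functional tag mapped to a (morpho tags, root word) pair) and the operation labels are simplifications introduced for illustration; the actual system works on full ENGCG parses.

```python
def plan_adaptation(ie, re_):
    """Derive the adaptation operations for the retrieved Hindi sentence RH by
    comparing the functional-morpho tags of IE and RE (rules (a)-(d) above).
    ie, re_: dict mapping functional tag -> (morpho tags, root word)."""
    ops = []
    for ft, (morpho, root) in ie.items():
        if ft not in re_:
            ops.append(("add", ft, root))                         # rule (c)
        else:
            re_morpho, re_root = re_[ft]
            if morpho == re_morpho and root != re_root:
                ops.append(("replace", ft, re_root, root))        # rule (a)
            elif morpho != re_morpho:
                ops.append(("replace-or-resuffix", ft, re_root, root))  # rule (b)
    for ft in re_:
        if ft not in ie:
            ops.append(("delete", ft))                            # rule (d)
    return ops
```

Running it on a subset of Table 5.11 yields the operations discussed above: a word replacement for the main verb (baith → chalaa), a word addition for the object “car”, and a replacement for the subject slot whose morpho tags differ.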
After the identification of all the adaptation operations required for adapting RE
to IE, we have calculated the cost of each of the adaptation operations. For this purpose
we have referred to the costs of all the different operations listed in Section 5.3.1
and Section 5.4.
the underlying computing system. Hence in our discussion we want to keep them
independent of any particular platform. We further make a few assumptions in
order to make the calculations relatively simple:
• We assume that the linear search operations in the RAM are equally costly
irrespective of the size of each data record. Hence we assume that the constants
(l1, l2, m, k, d) are all equal to a common value α.
• It has already been discussed in Section 5.3 that hard disc operations are
costlier than RAM operations by an order of 10^5. Hence we denote the constant
associated with retrieval from the external storage as c × 10^5, where c ≈ α.
Table 5.12 and Table 5.13 give the best five matches when the retrieval is made
by the cost of adaptation based scheme, using the same input sentences and the
same example base. Cost values here are measured according to the scheme given in
Section 5.4.
To generate the translation “main kaam kartaa hoon” of the input sentence
“I work.”, the adaptation operations required for adapting each of the sentences
given in Table 5.12 above are as follows:
• For adapting “I have been working for four hours.” ∼ “main chaar ghante se
kaam kar rahaa hoon” to the input sentence, five operations are required. These
operations are: SA (kar → kartaa), WD (chaar), WD (ghante), WD (se) and
MD (rahaa).
• In the case of adapting the second retrieved sentence “I have not been working for
four hours.” ∼ “main chaar ghante se kaam nahiin kar rahaa hoon” to the input
sentence, at most six operations are required. These operations are:
SA (kar → kartaa), WD (chaar), WD (ghante), WD (se), WD (nahiin) and
MD (rahaa). Hence the total adaptation cost is (6α) + (4.5α + ψ) + (4.5α + ψ) +
(4.5α + ψ) + (4.5α + ψ) + (6α + ψ) = 30α + 5ψ.
• For adapting “This works.” ∼ “yah kaam kartaa hai” to the input sentence,
at most two operations, WR (yah → main) and MR (hai → hoon), are re-
quired. The total adaptation cost is therefore (9.17α + c × 10^5) + (2α + 2α) =
13.67α + c × 10^5.
• Similarly, one can identify the appropriate adaptation operations for adapting the
last two sentences to the input sentence.
We now consider the costs of adaptation for the best five sentences that are
retrieved using semantic and syntactic similarity schemes as given in Sections 5.5.1
and 5.5.2.
First we consider the input sentence “I work.” Table 5.14 provides the costs of
adaptation of the best five matches under the semantic similarity and syntactic similarity
based measurement schemes. An examination of the adaptation costs suggests that
all five sentences retrieved by the semantic similarity based scheme are costlier
to adapt than all the sentences retrieved by the cost of adaptation scheme
(see Table 5.12). On the other hand, the sentence “I walk.”, which is retrieved as the
best matching sentence under the syntactic similarity based scheme, actually requires
more computational effort than the best four sentences given by the cost of
adaptation based scheme (see Table 5.12).
In a similar way, Table 5.15 provides the costs of adaptation for the best five
matches retrieved by the semantic and syntactic similarity based schemes for the input sentence
“Sita sings ghazals.”. One may note the following by comparing Table 5.13 and Table
5.15.
• “Sita sings ghazals.” is retrieved as the best match by all three schemes
because it is already present in the example base.
• The second best match “Ghazals were nice.” under the semantic similarity based
scheme is actually very expensive to adapt, as its cost contains a term ψ. This term
occurs since the sentence concerned has a structure different from that of the
input sentence.
• The sentences retrieved by the syntactic similarity based scheme are costlier
to adapt than the sentences retrieved by the cost of adaptation based scheme.
The above results clearly demonstrate the superiority of the proposed scheme.
One major drawback of the proposed scheme is that for each input sentence, the
scheme essentially boils down to evaluating the cost of adaptation for all the sentences
in the example base. This makes retrieval from a large example base computa-
tionally very expensive. On the other hand, the use of cost of adaptation as a potential
yardstick for measuring similarity makes too strong an argument to be ignored with
respect to Example Based Machine Translation. This, therefore, necessitates the devel-
opment of some filtration technique so that, given an input sentence, the example
base sentences that are difficult to adapt are discarded. The adaptation scheme can
then be applied only to the remaining sentences of the example base. We have
designed a systematic two-level filtration scheme for this purpose.
constituent word addition and constituent word replacement are the costliest adaptation
operations in terms of computational cost, with the former being costlier than
the latter. Hence the filters are designed to retrieve those example base sentences for
which the adaptation of the given input sentence will require fewer constituent
word addition and constituent word replacement operations.
• In the first level, the algorithm retrieves sentences that are structurally sim-
ilar to the input sentence, thereby reducing the number of constituent word
additions in the adaptation of the retrieved example. Here functional tags
(FTs) are used to determine the structural similarity. We call this step as
• In the second level only the sentences passed by the first filter are considered
for further processing. Here the dissimilarity of each of these sentences with the
input sentence is measured. The lower the dissimilarity score of an example,
the lower will be its adaptation cost for generating the required translation. The
dissimilarity is measured on the basis of tense and POS tag along with its root
word. Henceforth, for notational convenience, we shall denote these features
as characteristic features of a sentence. This step is denoted as “measurement
of characteristic feature dissimilarity”.
The following examples illustrate the necessity of the two levels of the filtration scheme.
Even though there are two common words (beautiful and home) between these two
sentences, adapting the translation of sentence B to generate translation of A is not
an easy task because of their structural difference. Adaptation of the translation
This sentence also has two words (girl and going) common with sentence A. But
Evidently, this cost is much less than the cost of adapting B to A as computed above.
This happens because of the structural similarity and commonality of some charac-
teristic features of sentence C with A.
The above discussion suggests that one of the filters alone is not sufficient. For
appropriate filtration both the levels are required. The next section discusses the
two-level filtration scheme in detail.
5.6. Two-level Filtration Scheme
We have used the following notations to describe the filtration scheme. Let L denote
a natural language (here, English), and let e ∈ L denote an input sentence. S
denotes the example base, which is a finite subset of L, and d ∈ S is an example base
sentence. The following subsections discuss the above-mentioned levels of the filter.
In this step, the aim is to filter the example base S to produce a subset
of S whose sentences are structurally similar to e. The example base is
partitioned into equivalence classes of sentences that have the same functional tags (e.g.
subject, object, verb, etc.). This set of equivalence classes is filtered, and
the classes that are similar in structure to the equivalence class of the input sentence
are identified. Here too, we have used the ENGCG parser for finding the functional tags
(FTs).
functional tags. Let [e] denote the equivalence class corresponding to the sentence
e, i.e. [e] = {e′ ∈ L | e E e′}. For example, the sentences “He drank milk.”, “Sita eats
mangoes.”, “They are playing football.” and “Will Ram marry Sita?” are members of
the same equivalence class, because all these sentences have the same functional tags.
Since our focus is on the example base S, the function φ and the equivalence
For a given input sentence e, and an example base sentence d ∈ S, |φ(e) ∩ φ(d)|
denotes the number of common FTs between [e] and [d]. Let m denote max_{d∈S}(|φ(e) ∩ φ(d)|), i.e. the maximum number of common FTs between φ(e) and φ(d).
From the partitioned example base S′ a new set S′e is constructed such that S′e =
{[d] : |φ(e) ∩ φ(d)| ≥ ⌈m/2⌉}. Here ⌈m/2⌉ denotes the smallest integer greater
than or equal to m/2. Thus, S′e is constructed in such a manner that it contains all
those equivalence classes for which the number of common FTs is between ⌈m/2⌉ and
m. By this step, all the equivalence classes having fewer than ⌈m/2⌉ common FTs
are discarded. We claim that the sentences discarded in this way are those with a
higher cost of adaptation. The proof is given below:
set (S′ − S′e). Of all the examples belonging to this set, the one that will have the least
cost of adaptation should have the following properties:
(a) It should have the maximum number of common FTs with e. We assume that
there exists a sentence with (⌈m/2⌉ − 1) FTs common with e.
(b) We further assume that for all these common FTs, the underlying words are
also the same as in e.
means (n − (⌈m/2⌉ − 1)) constituent word additions are required. Therefore, the cost of
adaptation of any such sentence will be approximately^6: (n − ⌈m/2⌉ + 1) × WA, where
WA is the cost of constituent word addition, i.e. ((l1 × L/2) + (l2 × Lp/2) + {(d × log2 D)
+ (c × 10^5)} + ψ + ). For details of this cost, check item 2 of Section 5.3.1. Let us
denote this cost as C1. This cost will certainly be more than the cost of adaptation
for the sentence having m common FTs with e, i.e. the sentences belonging to the
equivalence classes of the set S′e selected by the first filter. The argument supporting
this fact is as follows:
If all the words corresponding to the m common FTs are different from the constituent
words of the input sentence, then the cost of adaptation will be approximately the
the value of C2 is n × WR + (n − m) × (ψ + ). Now let us consider the difference
C1 − C2.
cost of dictionary search, i.e. {(d × log2 D) + (c × 10^5)} (see Section 5.3).
It can also be noted that the sentence having m common FTs will not necessarily
have the minimum cost of adaptation. For this, consider the following cases:
• The sentence having m common FTs has all different words. The approximate
cost will be the sum of the cost of m constituent word replacements and the cost
of (n − m) constituent word additions, i.e. m × WR + (n − m) × WA.

6. We have not added other costs such as suffix operation and morpho-word operation costs.
• The sentence having ⌈m/2⌉ common FTs has all the same words. In this case the
approximate cost is that of (n − ⌈m/2⌉) constituent word additions, i.e. (n − ⌈m/2⌉) × WA.
By using a similar argument as given above, it can be shown that the cost in
the latter case may be less than in the former one. Hence the sentences having ⌈m/2⌉
common FTs cannot be discarded at this level.
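The first-level filter reduces to a few lines once the functional-tag sets are available (in the thesis they come from the ENGCG parser); the tag sets used in testing are illustrative.

```python
import math

def first_level_filter(input_fts, example_fts):
    """Keep the example sentences whose functional-tag overlap with the input
    is at least ceil(m/2), where m is the largest overlap observed."""
    # input_fts: set of FTs of the input sentence e
    # example_fts: dict mapping each example sentence d to its set of FTs
    overlaps = {d: len(input_fts & fts) for d, fts in example_fts.items()}
    m = max(overlaps.values())
    return [d for d, k in overlaps.items() if k >= math.ceil(m / 2)]
```

The threshold adapts to the example base: if even the best example shares only a few FTs with the input, the filter does not discard everything, but merely the classes farthest in structure.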
The equivalence classes passed by the first filter are subjected to further analysis
in the second level of the filter, as described below.
This filter arranges the sentences of the set S′e on the basis of the characteristic features
(see Section 5.5.4) of a sentence. We have considered the following characteristic
features: POS with its root word, namely main verb (V), noun (N), adverb (ADV), adjec-
tive (A), pronoun (Pron), determiner (DET), preposition (P), gerund (PCP1) and partici-
ples (PCP1, PCP2); and the tense and form of the sentence. Note that we have
considered only those main verbs whose root forms are other than “be” or “have”. We
stick to the notations provided by the ENGCG parser (see Appendix B). For con-
venience of presentation, we denote the above-mentioned ten characteristic features
as p1(y), p2(y), ..., p10(y), where y is the root word of the corresponding character-
istic feature pi. For example, consider the sentence “I am sitting on the old
chair.”. This sentence has six characteristic features, i.e. p5(“I”),
p1(“sit”), p7(“on”), p4(“old”), p2(“chair”) and p10(“present continuous”). Here the
For example, let the input sentence e be “The old man is sitting on the old chair.”,
and let the sentence x from the example base be “He is sitting on my bed.”.
Therefore, η(e, x) = {p4(“old”), p2(“man”), p4(“old”), p2(“chair”)}.
dis_e(d) = ( Σ_{pi(y) ∈ η(e,d)} wi ) + (ψ × (|φ(e)| − |φ(e) ∩ φ(d)|))        (5.3)

dis_e(d) gives the dissimilarity score of d ∈ S′e with respect to e. Here, ψ is the
cost of finding the location of a new word, which has already been explained in item
5 of Section 5.3, and wi is the weight assigned to the characteristic feature pi(·).
The significance of this dissimilarity function dis_e(d) and of the weights wi is explained below.
As mentioned earlier, two of the costliest operations from the adaptation point
of view are constituent word addition and constituent word replacement. Thus, the
dissimilarity measure is designed to focus on these two operations. The second term
in the above-mentioned measure corresponds to the approximate cost involved in
constituent word addition (to find the appropriate position). Further, it should be
noted that the cost of adaptation varies with the POS of the word to be added/replaced.
This is because this cost depends on the dictionary size of the concerned POS. The
bigger the dictionary, the longer the search time, and hence the costlier the
required operation. Thus, for the characteristic features pi(y), i = 1, 2, . . .,
Note that for tense and form (p10) identification cannot be done through dic-
tionary search; appropriate rules have to be developed for this purpose. In our
implementation, we have used 65 rules to take care of the sentences in our example
base. Therefore, the weight 6.02 (log2(65) ≈ 6.02) is assigned to the characteristic
feature p10.
θ(e) = {p6 (“this”), p2 (“girl”), p5 (“I”), p2 (“sister”), p10 (“simple present tense”)}
θ(d1 ) = {p6 (“this”), p2 (“boy”), p5 (“I”), p2 (“brother”), p10 (“simple present tense”)}
θ(d2 ) = {p6 (“that”), p2 (“girl”), p5 (“she”), p2 (“sister”), p10 (“simple present tense”)}
Note that instead of the words “my” and “her”, their root words “I” and “she” are used.
Thus, dise(dj) = ( Σ_{pi ∈ η(e,dj)} wi ) + (ψ × |4 − 4|) = Σ_{pi ∈ η(e,dj)} wi for j = 1, 2. It is
to be noted that the contribution of the second term is zero for both d1 and d2 since
both these sentences have the same FTs as that of e.
2. The weights are taken as given in Table 5.16. In this case, dise(d1) = w2 + w2
= 13.77 + 13.77 = 27.54 and dise(d2) = w5 + w6 = 6.17 + 6.17 = 12.34.
Note that in the first case the dissimilarity scores are the same. But from the adaptation
point of view, the cost involved in adapting d2 to e is much less than that for d1.
This is due to the fact that d1 has a determiner and a pronoun characteristic feature
in common with e, while d2 has two noun characteristic features in common with e.
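The dissimilarity computation for this example can be sketched as follows. The dictionary sizes are the ones reported in item 4 of Section 5.3; the log2 weighting follows the rule stated in the text and only approximates Table 5.16, and the value of ψ and the FT set are placeholders of my own:

```python
import math

# Dictionary sizes from item 4 of Section 5.3 (14000 nouns, 70 pronouns,
# 72 determiners); following the text, each POS weight is taken as the
# log2 of the dictionary size, approximating Table 5.16.
DICT_SIZE = {"N": 14000, "Pron": 70, "DET": 72}
W = {pos: math.log2(size) for pos, size in DICT_SIZE.items()}

PSI = 1.0  # illustrative cost of locating a new word (item 5, Section 5.3)

def dis(eta_pos, ft_e, ft_d):
    """Dissimilarity of d w.r.t. e (Eq. 5.3): weights of the differing
    characteristic features plus psi times the FTs of e missing in d."""
    return sum(W[pos] for pos in eta_pos) + PSI * (len(ft_e) - len(ft_e & ft_d))

fts = {"@SUBJ", "@+FMAINV", "@OBJ"}   # placeholder FT set; d shares all FTs of e
d1_score = dis(["N", "N"], fts, fts)          # two differing noun features
d2_score = dis(["DET", "Pron"], fts, fts)     # differing determiner and pronoun
assert d2_score < d1_score                    # d2 is the cheaper one to adapt
```

With these sizes, d1 scores about 27.55 and d2 about 12.30, reproducing the ordering (and, up to rounding, the values) discussed above.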
Since the search and access time for a dictionary depends upon its size, in this
context one has to look at the sizes of the dictionaries concerned. It is a general
observation that the size of the noun dictionary is much larger than the sizes of the
pronoun and determiner dictionaries. For example, in our case the sizes are 14000,
70 and 72, respectively (see item 4 of Section 5.3). Consequently, retrieval from the
noun dictionary is computationally costlier than retrieval from the pronoun or
determiner dictionaries. This fact is not reflected if equal weights are assigned to
each POS. Hence, in order to assign priorities to the POS features in such a way
that the dissimilarity score reflects the approximate cost of adaptation, weights are
assigned to each POS as given in Table 5.16.
5.7. Complexity Analysis of the Proposed Scheme
Here the dissimilarity metric is so designed that the dissimilarity score is directly
proportional to the approximate cost of adaptation. Finally, the sentences in Se are
arranged in ascending order of dissimilarity score, and a few best sentences are
passed on to the cost of adaptation based scheme, from which the best one is retrieved
as the most similar to the given input sentence. In our experiments, the five best
sentences produced by this two-level filtration scheme are evaluated for their costs of
adaptation.
The filtration scheme discussed above aims at improving the efficiency of the cost of
adaptation based scheme. This improvement can be observed by comparing the
worst-case complexities of the two algorithms: the cost of adaptation based scheme
without the two-level filtration, and the cost of adaptation based scheme after the
two-level filtration. These two similarity measurement schemes are denoted as A1 and
A2, respectively. Table 5.17 gives the notation for the different parameters used in the
analysis, and their maximum sizes with respect to our example base.
In the algorithm A1, for each example base sentence d, the maximum effort
required to adapt the translation of d to the translation of the input sentence
e is the number of comparisons required to identify the adaptation operation(s) in
the worst case. For an example base sentence d, the comparisons required
are as follows:
• Then, the morpho tags of all matching functional tags are compared, and
hence the maximum number of comparisons required is Le × LF.
length of the functional tag set is the same as the length of the sentence (i.e. |φ(e)| =
Le and |φ(d)| = Ld). Hence, the complexity of A1 for all example base sentences is
given by TA1 = N × C1.
comparisons between the functional tags of the equivalence class [e] of the input and
[d] of the example base, where e ∈ L, d ∈ S. In the worst case, this value is given
by C21 = |φ(e)| × |φ(d)| = Le × Ld. So, for the |S′| equivalence classes in S′, the
complexity of the first filter is given by A21 = |S′| × C21.
For the second filter, we need to work on the sentences of the equivalence classes
retrieved (Se′) from the first filter. Suppose there are Pi sentences in the ith equiv-
alence class, i = 1, 2, . . ., |Se′|. The total number of sentences in all the equivalence
classes of Se′ is |Se|. At most two comparisons are required for finding a characteristic
feature: one for POS matching and the other for root word matching between d and e.
The number of POS tags and root words can be at most equal to the length of
a sentence. Thus the total number of comparisons required is computed as follows:
• The POS tags of d and e are compared first, which makes the number of comparisons
Le × Ld.
• Then, the root words of d and e having the same POS are compared. This
requires Le comparisons.
Hence, the total number of comparisons required for POS and root word matching
between e and d is Le × (Ld + 1). Summing over all the sentences of Se, we get
the total complexity as A22 = ( Σ_{i=1}^{|Se′|} Pi ) × (Le × (Ld + 1)) = |Se| × (Le × (Ld + 1)),
where Σ_{i=1}^{|Se′|} Pi ≤ N.
Finally, the cost of adaptation based scheme is applied on the top few sentences
of Se having the minimum dissimilarity score. We have considered a set of the first five
sentences in our experiments. This makes the number of comparisons 5 × C1.
Hence, the total complexity of the algorithm A2 is given by TA2 = A21 + A22 + 5 × C1.
The ratio of the two complexities is thus
c = TA2/TA1 = ((|S′| + 1)/N) × [Ld/(Ld + LF)] + (|Se|/N) × [(Ld + 1)/(Ld + LF)] + 5/N    (as |Se| ≤ |S| = N)
The above ratio shows that in the worst case the improvement achieved by the algorithm
A2 is about 25%, i.e. the cost of adaptation based scheme needs to be applied on
only 75% of the sentences of the example base. But experimentally we found that for 500
different examples, which are not present in our example base, the improvement is
of the order of about 75%, which is quite a significant improvement. This variation
is mainly due to the fact that during our experiments the cardinality of Se has been
obtained to be much less than N, and thus the small ratio |Se|/N reduces the contribution of
(|Se|/N) × [(Ld + 1)/(Ld + LF)], which is the main contributory term towards c.
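A rough numeric illustration of this behaviour can be sketched as follows. The closed form of c is my reconstruction of the garbled ratio above, and all parameter values are hypothetical:

```python
def ratio_c(s_prime, s_e, N, Ld, LF):
    """Worst-case TA2/TA1 under the reconstructed expression; a value
    below 1 means the filtered scheme A2 does less work than A1."""
    t = Ld + LF
    return ((s_prime + 1) / N) * (Ld / t) + (s_e / N) * ((Ld + 1) / t) + 5 / N

# Hypothetical sizes: with |Se| close to N the middle term dominates and
# c stays a sizeable fraction of 1 (the worst case); experimentally
# |Se| << N, so that term shrinks and c drops sharply.
worst = ratio_c(s_prime=50, s_e=1000, N=1000, Ld=10, LF=10)
typical = ratio_c(s_prime=50, s_e=100, N=1000, Ld=10, LF=10)
assert typical < worst < 1
```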
The retrieval scheme has been developed with respect to simple sentences. How-
ever, if the input sentence is complex then its adaptation is not straightforward
(Dorr et al., 1998), (Hutchins, 2003), (Sumita, 2001), (Shimohata et al., 2003)
and (Rao et al., 2000).
A complex sentence contains more than one clause, of which one is the main clause, and the rest are subordinate clauses
(Wren, 1989), (Ansell, 2000). A relative clause is a type of subordinate clause in
which a relative adjective (who, which etc.) or relative adverb (when, where etc.)
is used as a connective. The clauses may be joined by some connectives, but
complex sentences having the same connectives are often translated in different ways
in Hindi. Consequently, for a given complex English sentence, finding its suitable
5.8. Difficulties in Handling Complex Sentences
match from the example base is difficult. And even when it is found its adaptation
may not be straightforward. The following section illustrates the above points.
• Even for complex sentences having the same connective (e.g. who, when, where,
which), the structure of the translations may vary. For illustration, consider the
four examples given below. Each of these English sentences may have at least
four possible variations depending on the position in which the Hindi connectives
are used. It may further be noticed that although the keywords of all these
four sentences are the same7 (subject to morphological variations), their translation
patterns vary according to the role of the connective, and the role of the noun
modified by the relative clause. If the relative adjective “who” plays the role
of subject in the relative clause, then the Hindi relative adjective may be one
of “jo”, “jis” or “jin”, depending upon the tense and form (i.e. present
perfect, past indefinite or past perfect) of the main verb of the relative clause.
The items (A), (B), (C) and (D) below show the four sentences and their Hindi
translations.
7 policeman - sipaahii, thief - chor, to chase - piichaa karnaa, I - main, tall - lambaa, to know - jaannaa
which involve the following keywords: man - aadmii, is working - kaam kar
∼ jo aadmii khet mein kaam kar rahaa hai wah kisaan hai
This man said that he is a farmer.
∼ iss aadmii ne kahaa ki wah kisaan hai
Despite the dissimilarity in their structures, one may notice that the part “wah
kisaan hai” is common to both the translations. Typically this can happen if
the two complex sentences have some similar clauses. The above observation
also implies that sometimes a simple sentence may also be helpful in generating
The above discussion suggests that the retrieval and adaptation strategies for
complex sentences may need to take care of a large number of variations according
to each connective word and its usage. Creating the adaptation rules and implementing
such a large number of possibilities is not an easy task. To overcome this
problem, we propose a “split and translate” scheme for handling complex sentences
in an EBMT framework. The proposed scheme works as follows:
1. First it checks whether the input sentence is complex. If “yes”, then it executes
the following:
2. It splits the input sentence into two simple sentences, RC and MC, correspond-
ing to the relative clause and the main clause of the complex sentence.
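The overall flow of the scheme can be sketched as below. The four helper functions are hypothetical stand-ins for the module-based complexity check and splitting, the simple-sentence retrieval-and-adaptation pipeline, and the recombination step:

```python
# A minimal skeleton of the proposed "split and translate" scheme. The
# helpers passed in are hypothetical: is_complex and split_into_clauses
# stand in for the splitting modules, translate_simple for the
# simple-sentence EBMT pipeline, and join_translations for the
# adaptation step that combines the two translations.

def translate(sentence, is_complex, split_into_clauses,
              translate_simple, join_translations):
    if not is_complex(sentence):
        return translate_simple(sentence)
    rc, mc = split_into_clauses(sentence)  # relative clause, main clause
    # Each half is now a simple sentence and can be translated
    # independently, then the two translations are combined.
    return join_translations(translate_simple(rc), translate_simple(mc))

# Toy usage with dummy helpers:
out = translate(
    "whenever you go, speak",
    is_complex=lambda s: "," in s,
    split_into_clauses=lambda s: tuple(p.strip() for p in s.split(",")),
    translate_simple=str.upper,
    join_translations=lambda r, m: r + " " + m,
)
assert out == "WHENEVER YOU GO SPEAK"
```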
convert a complex sentence into simple sentences, and the adaptation procedure to
obtain the Hindi translation of the given complex sentence using the translations of
the split sentences.
Various approaches have been suggested in the literature for splitting complex sen-
tences. For example:
1. Furuse et al. (1998, 2001) proposed a technique where a sentence is split
3. Doi and Sumita (2003) proposed two methods: Method-T and Method-N.
5.9. Splitting Rules for Converting Complex Sentence into Simple Sentences
Many approaches exist for splitting complex sentences (typically for English),
e.g. (Orăsan, 2000), (Sang and Déjean, 2001) and (Clough, 2001). The technique
used by us is similar in nature to those proposed in (Leffa, 1998) and (Puscasu, 2004).
They suggest three ways in which a sentence can be segmented to the clause level:
(1) Starting with the first word in the sentence, and processing it from left to
(3) Starting with the verb phrase, identifying the verb type, and locating its sub-
ject and complements.
In our approach, we have used the first two methods. We have developed heuristics
to split a complex sentence into two simple sentences, one related to the main clause
and the other to the relative clause. The advantage here is that both simple
sentences can now be translated independently using the retrieval and adaptation
procedures developed for dealing with simple sentences. For this work we made the
following assumptions about the input sentence:
• The sentence has only one relative clause, and a connective must be present.
• The connectives that we have considered are when, where, whenever, wher-
ever, who, which, whose, whom, whoever, whichever, that, whomever, what and
whatever.
• The algorithm makes use of the delimiter of the input sentence as well. We
illustrate this technique with respect to the delimiters “.” and “?”.
• No wh-family word (e.g. who, which, when, where) should be present in the
main clause.
In the following subsections, we discuss the splitting rules for complex sentences
having any of the following connectives: “when”, “where”, “whenever”, “wherever”
and “who”. Since the splitting rules for some of the above connectives are the same,
Module 1
If e is complex, then the module identifies the position of the relative adverb, which
can be one of “when”, “whenever”, “where” or “wherever”. The algorithm considers
the two possible positions of the relative clause: i.e., the relative clause is present
before the main clause, or it is present after the main clause. Depending upon the
- For all ei ∈ e, let Roote(ei) denote the root word corresponding to ei.
Figure 5.1: Schematic View of Module 1 for Identification of Complex Sentence with
Connective any of “when”, “where”, “whenever”, or “wherever”
Example 1:
Let the input sentence e be “Whenever you go to India, speak Hindi.”. Its parsed
version f obtained using the ENGCG parser, is:
“Hindi” <$.>
The length of the input sentence e is 7, and the bag of functional tags is {@ADVL,
@SUBJ, @+FMAINV, @ADVL, @<P, @+FMAINV, @OBJ}. Since f1 is @ADVL,
Example 2:
Consider another input sentence e, “Will you bring anyone along when you return
from town?”. Its parsed version f is:
f1 is not @ADVL, so the module checks for the presence of any of the connectives “when”,
“whenever”, “where” or “wherever” in e. The connective “when” is present at the
6th position, i.e. j = 6. Hence Module 1 concludes that the given input sentence e is
complex, and for splitting e, the algorithm should proceed to Module 3.
Module 2
If the relative adverb is the first word of the given input sentence e, then the sentence
is split in Module 2. Figure 5.2 gives a schematic view of this module. Table
5.19 gives the typical sentence structures that can be handled by this module. The
sentences handled by this module are characterized by the relative clause appearing
at the beginning, before the main clause. In this module, along with
the position of the relative clause, the position(s) of the subject(s) is used to split the
complex sentence. In the following, we assume the length of e to be n. The sub-steps of
this module are as follows:
• If the delimiter of the input sentence e is “?”, or if the input sentence has
only one subject (the possible tags of a subject are @SUBJ and @F-SUBJ) and the
delimiter of the sentence is “.”, then the main verb (i.e. @+FMAINV tag)
or main auxiliary verb (i.e. @+FAUXV tag) decides the splitting point. The
module looks for the second occurrence of the @+FMAINV or @+FAUXV tag9.
Let l be the word position where one of these two tags occurs. If one of the
above two cases is true, then the second word to the (l − 1)th word, and the
lth word to the nth word of e, constitute the two simple sentences, which are
9
The ENGCG parser always assigns either the @+FMAINV or the @+FAUXV tag to the first occurrence
of a verb, whether it is a main verb or an auxiliary verb. All other verbs (main or auxiliary) in the
sentence are denoted with either the @-FMAINV or the @-FAUXV tag.
the parts of the relative clause and the main clause, respectively. We call these
two simple sentences RC and MC, respectively.
• If the delimiter of the input sentence e is “.”, and it has two subjects10, then
the position of the second subject slot gives the decision about the splitting
point. For this purpose, the pre-modifiers (i.e. determiner, article, pre-modifying
adjective, adverb etc.) of the second subject are identified. If the position of
the first pre-modifier of the second subject is k, then the second word to the
(k − 1)th word of e, and the kth word to the nth word of e, constitute the two simple
sentences. The first simple sentence (RC) is a part of the relative clause, and
the second simple sentence (MC) is the main clause.
10
The algorithm works for at most two clauses in a complex sentence; therefore, the maximum
number of subjects in the sentence is taken to be two.
- Delimiter of RC is “.";
IF(delimiter of e is “?")
THEN {delimiter of MC is “?"}
ELSE {delimiter of MC is “."}
ELSE {
    IF(delimiter of e = “." AND K = 2)
    {
        m = 0;
        For(i = 2 to n)
        {
            IF(fi = @SUBJ or @F-SUBJ)
                IF(m = 0)
                THEN m++;
                ELSE {m = i; Break;}
        }
    }
    k = m − 1;
    WHILE((k > 2) AND (fk = @N OR @DN> OR @NN> OR @GN>
            OR @AN> OR @QN> OR @AD-A>))
        k−−;
    - Delimiter of RC is “.";
    - Delimiter of MC is “?";
}
Our discussion of Module 1 concluded with the remark that the complex sentence
given in Example 1 should be split using Module 2. We now continue with the
same example “Whenever you go to India, speak Hindi.” to show how Module 2 splits
this sentence into two simple ones. In this example, the number of subjects is one,
i.e. K = 1, and the delimiter is “.”. The module now proceeds to determine the
position of the second occurrence of the @+FMAINV or @+FAUXV tag, which is found
to be at the 6th position11, i.e. l = 6. Hence the input complex sentence is split
into simple sentences as follows:
• The 2nd to 5th words constitute a simple sentence RC, i.e. “You go to India”, whose
delimiter is “.”. This is a part of the relative clause, and its morpho functional
tags are:
• The 6th and 7th words constitute the simple sentence MC, i.e. “Speak Hindi”, whose
delimiter is also “.”. This is a part of the main clause, and its morpho functional
tags are: @+FMAINV V IMP “speak”, @OBJ <proper> N SG “Hindi” <$.>.
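The finite-verb case of Module 2 worked through above can be sketched as follows. The function and variable names are mine; the word and tag lists are taken from Example 1:

```python
def split_module2(words, tags):
    """Split a complex sentence whose relative adverb is the first word:
    RC = words 2..(l-1), MC = words l..n, where l is the 1-based position
    of the second @+FMAINV/@+FAUXV tag (as in the text)."""
    finite = [i + 1 for i, t in enumerate(tags) if t in ("@+FMAINV", "@+FAUXV")]
    l = finite[1]                  # second finite-verb position
    rc = " ".join(words[1:l - 1])  # part of the relative clause
    mc = " ".join(words[l - 1:])   # part of the main clause
    return rc, mc

# Example 1: "Whenever you go to India, speak Hindi."
words = ["Whenever", "you", "go", "to", "India", "speak", "Hindi"]
tags = ["@ADVL", "@SUBJ", "@+FMAINV", "@ADVL", "@<P", "@+FMAINV", "@OBJ"]
rc, mc = split_module2(words, tags)
assert rc == "you go to India" and mc == "speak Hindi"
```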
Module 3
If the relative adverb (or connective) is not the first word of the given input sentence,
then the sentence is split by this module. In this case, the relative clause is present
after the main clause, i.e. the relative clause is located towards the end of the sentence.
Let the position of the relative adverb (as identified in Module 1) be j. In this case,
11
The second main verb in the given input sentence is “speak”.
the first j − 1 words of e will constitute the first simple sentence MC (which is the main
clause), and the (j + 1)th to nth words will constitute the second simple sentence RC (which
is a part of the relative clause). Module 3 is given in Figure 5.3. Table 5.20 gives
the typical sentence structures that can be handled by this module.
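The Module 3 rule is simple enough to sketch directly; the function name is mine, and the example uses the sentence and connective position from Example 2:

```python
def split_module3(words, j):
    """Module 3: the connective is at 1-based position j; the first j-1
    words form MC and words (j+1)..n form RC."""
    mc = " ".join(words[:j - 1])
    rc = " ".join(words[j:])
    return mc, rc

# Example 2: connective "when" at position j = 6.
words = ["Will", "you", "bring", "anyone", "along",
         "when", "you", "return", "from", "town"]
mc, rc = split_module3(words, 6)
assert mc == "Will you bring anyone along"
assert rc == "you return from town"
```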
Will you bring anyone along when you return from town?
∼ jab tum shahar se waapis aate ho tab kyaa tum kisii ko saath laaoge?
According to Module 1, the rule given in Module 3 will split the complex sentence
discussed in Example 2, i.e. “Will you bring anyone along when you return from town?”.
As discussed in Module 1, for this input sentence e the value of j12 is 6. Hence
IF(delimiter of e is “.")
THEN {delimiter of MC is “."};
ELSE {delimiter of MC is “?"};
• The first five words of e constitute a simple sentence, which is the main clause.
That is, the first simple sentence (denoted by MC) is “Will you bring anyone
along”. Since the delimiter of e is “?”, the delimiter of MC is “?”. Its functional
morpho tags are:
• The 7th to 10th words constitute the other simple sentence RC, i.e. “You return
from town”, which is a part of the relative clause. The delimiter of RC is “.”.
Its functional morpho tags are:
Here we discuss the algorithm for splitting complex sentences when the connective
is “who”. It should be noted that in this case, the relative clause can occur either
embedded within the main clause, or after the main clause. In both cases,
there are two possible functional tags of the connective word “who”, i.e. @SUBJ and
@OBJ. The algorithm takes care of all these possibilities. It is divided
into four modules, which are given in Figures 5.4, 5.6, 5.7 and 5.8. Along with these
four modules, there is a subroutine SPLIT given in Figure 5.5. The brief outline of
• Module 1 checks whether the given input sentence is complex. If
the sentence is complex with the connective “who”, then depending on the
position of the clause and the delimiter of the sentence it routes the algorithm
to the appropriate module.
• Module 2 splits those complex sentences in which the relative clause is embed-
ded in the main clause, and the delimiter of the sentence is “.”. Table 5.21
provides the typical sentence structures considered in this module.
• The complex sentences in which the relative clause follows the main clause are
split in Module 3. Here also the delimiter of the sentence under consid-
eration should be “.”. The sentence structures considered in this module are
exemplified in Table 5.22.
• Irrespective of the position of the relative clause, Module 4 splits those com-
plex sentences for which the delimiter is “?”. Examples given in Table 5.23
• The algorithm for splitting those complex sentences in which the relative clause
is embedded in the main clause is given in subroutine SPLIT. This subroutine
accepts two arguments: an integer x and a character y. x gives the split-
ting point position, and y provides the delimiter of the simple sentence that is
Did not the man, who read the book, like it?
∼ jis aadmi ne kitaab padhii kyaa usne yah pasand nahiin kii?
∼ kyaa wah aadmii, jis ne kitaab padhii, yah pasand nahiin kii?
Let the input sentence e be “Those students, who want to learn Hindi, should study a
- For all ei ∈ e, let Roote(ei) denote the root word corresponding to ei.
For(i = 1 to n)
    IF(Roote(ei) = "who" AND <Rel> ∈ Morpho-tag of fi)
    THEN {Print "Complex sentence"; j = i}
    ELSE {Print "Simple sentence"; Exit;}
IF(Flag = 0)
    THEN GO TO Module 2;
ELSE IF(delimiter of sentence = ".")
    THEN GO TO Module 3;
ELSE IF(fi = @+FAUXV)
    THEN GO TO Module 4;
ELSE Print "Sentence cannot be split";
Figure 5.4: Schematic View of Module 1 for Identification of Complex Sentence with
Connective “who”
Modification:
    - A new word, either "him", "her" or "them", is placed
      after the cth word. The functional-morpho tag of this
      new word will be @OBJ PRON PERS MASC SG3 "he",
      @OBJ PRON PERS FEM SG3 "she" or
      @OBJ PRON PERS PL3 "they";
      The choice of this new word depends on the gender and
      number of el, where l is such that fl = @SUBJ, 1 ≤ l ≤ j − 1.
    - Delimiter of RC is ".";
    - Delimiter of MC is y;
    Exit;
}
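The pronoun choice made in this modification step can be sketched as follows. The function name is mine; the default to masculine mirrors the fallback used elsewhere in the algorithm when no gender is found:

```python
# Sketch of the pronoun substitution in subroutine SPLIT: the inserted
# object pronoun is chosen from the gender (MASC/FEM) and number (SG/PL)
# of the subject word, defaulting to masculine.

def object_pronoun(gender, number):
    if number == "PL":
        return "them"
    return "her" if gender == "FEM" else "him"

assert object_pronoun("MASC", "SG") == "him"
assert object_pronoun("FEM", "SG") == "her"
assert object_pronoun("FEM", "PL") == "them"
```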
IF(count = 1)
break;
}
CALL SUBROUTINE SPLIT(k , ".")
\∗ This module splits sentences in which the relative clause follows the
main clause. ∗\
IF(fj ≠ @OBJ)
THEN {
- Words e1 to ej−1 of the given input sentence e form
the main clause, which is a simple sentence, say MC;
\∗ The gender and number of the first occurrence of the word having any of
the @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tags is determined
below. This word is searched for from the (j − 1)th word down to the first word of e. ∗\
NUMBER = ’ ’;
GENDER = ’ ’;
247
5.9. Splitting Rules for Converting Complex Sentence into Simple Sentences
For(i = j − 1 to 1)
{
IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR
@PCOMPL-O)
{
GENDER = gender of ith word;
NUMBER = number of ith word;
}
IF(NUMBER ≠ ’ ’) break;
}
IF (GENDER = ’ ’) GENDER = ’MASC’;
ELSE {
- Words e1 to ej−1 form the main clause, which
is a simple sentence, say MC;
c = 0; \∗ c stores the position of the first occurrence of the word having
the @+FMAINV or @-FMAINV tag. ∗\
For(i=j + 1 to n)
IF(fi = @+FMAINV OR fi = @-FMAINV)
{c = i; break;}
\∗ The gender and number of the first occurrence of the word having any of
the @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tags is determined
below. This word is searched for from the (j − 1)th word down to the first word of e. ∗\
NUMBER = ’ ’;
GENDER = ’ ’;
For(i = j − 1 to 1)
{
IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR
@PCOMPL-O)
{
GENDER = gender of ith word;
NUMBER = number of ith word;
}
IF(NUMBER ≠ ’ ’) break;
}
IF (GENDER = ’ ’) GENDER = ’MASC’;
\∗Position of the auxiliary verb preceding the second main verb, if any, is
determined below.∗\
s = 0;
For(i = c1 + 1 to c2 − 1)
{
IF(fi = @-FAUXV) s = i;
IF(s ≠ 0) Break;
}
IF(s ≠ 0)
THEN CALL SUBROUTINE SPLIT(s, "?");
ELSE CALL SUBROUTINE SPLIT(c2 , "?");
}
The length of the input sentence is 11, and the bag of functional tags is {@DN>,
@SUBJ, @SUBJ, @+FMAINV, @INFMARK>, @-FMAINV, @OBJ, @+FAUXV,
@-FMAINV, @DN>, @OBJ}. Since Roote(e3) is “who” and its morpho tags contain
<Rel>, the module decides that the given input sentence e is a complex sentence with
the connective “who”. To identify the position of the relative clause in the given
input sentence e, the presence of the @+FAUXV or @+FMAINV tag is checked in the first
two words. It is found that neither of these two functional tags is present in the first
two words (which are those and students). Thus Flag is set to 0, indicating that the
relative clause is embedded within the main clause. Hence Module 1 concludes
For separating the main and the relative clause, Module 2 first locates the position
k of the second occurrence of the @+FAUXV or @+FMAINV tag in the parsed version
(f) of e. It should be noted that since neither the @+FAUXV nor the @+FMAINV tag is
present in the first two words, and the third word is “who”, the algorithm checks the
tags of the 4th to 11th words to determine the value of k. The value of k is found
to be 8: the @+FMAINV tag occurs at the 4th position, and the @+FAUXV tag is present
at the 8th position of the sentence e.
Since the functional tag of the connective “who” is @SUBJ, the module gives the
following output:
• The first two words concatenated with the 8th to 11th words constitute the simple
sentence (MC), which is also the main clause. Thus, the first simple sentence is
“Those students should study a lot.”. The delimiter of MC is “.”. The parsed
version of MC is obtained from the FTs of the corresponding words in the
parsed version f of e. Thus the parsed version of MC is
• The words from the 3rd position to the 7th position of the input sentence e form the rel-
ative clause, i.e. “who want to learn Hindi”. Now the 3rd word is replaced with
“they”, since the 2nd word (i.e. “students”) has the functional
tag @SUBJ and its number is plural (PL). Also, since the gender of “students”
that the algorithm cannot deal with those interrogative complex sentences for which
the root form of the main verb of the main clause is “be”. In this type of sentence
the identification of the main clause and the relative clause is relatively more compli-
cated. For example, consider the complex sentence “Is the man who was reading the
In the above sentence, “in the library” is a preposition phrase (<PP>) and
“upstairs” is an adverb. Since the root form of the main verb of the main clause is “be”,
it can take either “upstairs” alone, or “in the library” along with “upstairs”, as its predica-
tive(s)13. Thus, the main clause can be “Is the man in the library upstairs?” or “Is
the man upstairs?”. The relative clause will also vary accordingly. Hence, in this
situation, formulating the splitting rules is not achievable using this parsing scheme.
The same problem occurs for other variations of this type of sentence (e.g. Is this
the man who saw you with the binoculars?). Thus such sentences are not
handled in this report.
We have developed algorithms for splitting complex sentences using other con-
nectives also, but these rules are not discussed in this report in order to avoid
repetitive discussion. The following subsection discusses the adapta-
tion procedure for obtaining translations of the input complex sentences using the
split simple sentences RC and MC.
5.10. Adaptation Procedure for Complex Sentence
Since the Hindi translation patterns of the complex sentence having connectives one
not indicated in the Hindi translation of complex sentences having any of the above-
mentioned connectives (Bender, 1961), (Kachru, 1980), the correlative adverb is
given in {}.
The adaptation procedure for generating the translation of the complex sentences
3. Add one morpho-word (i.e. the corresponding Hindi relative adverb; refer to Table
5.24) at the beginning of the translation of RC. The other morpho-word (i.e. the cor-
responding Hindi correlative adverb; refer to Table 5.24) may be added at the
beginning of the translation of MC.
It may be noted that the total cost involved in generating the translation of the
given complex sentence depends on the cost of adapting the translations of R1 and
R2 to the translations of RC and MC, respectively. This is due to the fact that the
cost involved in one (or two) morpho-word addition(s) (required in step 3) is fixed, i.e.
 (or 2), as relative and correlative adverbs always occur at the beginning of the
Hindi translations of the RC and MC sentences (refer to Table 5.25), respectively. Hence no
search is required to find the correct position of a morpho-word in the Hindi sentence.
Further, the cost of concatenating these two translations is also fixed, which is .
Assume that the costs of adapting the translations of R1 and R2 to the trans-
lations of RC and MC are c1 and c2, respectively. Then the total cost involved in
generating the translation of the given complex sentence is c1 + c2 + 2 + .
This section discusses the adaptation procedure for complex sentences having the
connective “who”. It may be noted that there are many variations in sentence
structure with this connective (refer Figure 5.19). For illustration, we consider
the sentence pattern given in Table 5.26. In this pattern the connective “who”
plays the role of subject in the relative clause of the English sentence.
“jo” occurs at the beginning of the relative clause, whereas in the other pattern “jo”
occurs before the subject slot of the main clause. The noun in the main clause
which the relative clause modifies14 is represented by “wah” or “we” depending
upon the number of the noun (Bender, 1961), (Kachru, 1980).
14
For the sentences under consideration, this noun is the subject of the main clause.
The adaptation procedure for generating the translation of the complex sentences
under consideration is discussed below. Suppose R1 and R2 are the sentences having the
least cost of adaptation that have been retrieved from the example base corresponding
3. Depending upon the required translation pattern, add two appropriate morpho-
words in RC and/or MC. The first morpho-word to be added is taken from
the set {wah, we}, and the other morpho-word is “jo”. The positions of the
morpho-words in the two patterns are given below:
• For the first pattern, the morpho-word “jo” is added at the beginning of
the translation of RC, and depending on the number of the subject of
MC, the morpho-word “wah” or “we” is added at the beginning of the
translation of MC.
• For the second pattern, the morpho-word “jo” is added before the subject
slot of the translation of MC.
4. Combine the (modified) translations of RC and MC. For both translation
patterns, the total cost of adaptation comprises the following components:
1. Cost of adapting the translation of R1 to the translation of RC. Let this cost
be c1 . In this case, the cost involved for adapting the translation of the subject
slot is not included.
2. Cost of deletion of subject slot from the translation of RC. Let us denote this
cost by w.
3. Cost of adapting the translation of R2 to the translation of MC. Let this cost
be denoted as c2 .
4. Cost of the two morpho-word additions, which is given below for both Hindi
translation patterns.
• For the first translation pattern, the cost of adding these two morpho-
words is (µ) + (0.5α + µ) = 0.5α + 2µ (refer Section 5.3). Here 0.5α is the
dictionary search cost involved in choosing between “wah” and “we”.
• For the second Hindi translation pattern, the cost of adding these two
morpho-words is (µ) + ((L/2) × α + 0.5α + ψ + µ) = (L/2 + 0.5) × α + ψ +
2µ, where L is the length of the translation of MC.
Thus the total cost involved for the two translation patterns is the sum of all the
above-mentioned costs. The two simple sentences R1 and R2 are retrieved from
the example base for generating the translation of the given complex sentence so as
to minimize the total cost of adaptation.
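The retrieval step described above can be sketched in code. The following Python fragment is an illustrative sketch, not the system's implementation: the candidate sentences, their adaptation costs, and the constants MU (one morpho-word addition) and KAPPA (concatenation) are assumed placeholder values.

```python
# Sketch: choose the pair (R1, R2) from the example base that minimizes the
# total cost of adaptation for a split complex sentence. MU and KAPPA are
# assumed fixed costs, standing in for the thesis's calibrated parameters.

MU = 1.0      # assumed fixed cost of adding one morpho-word
KAPPA = 0.5   # assumed fixed cost of concatenating two translations

def total_cost(c1, c2):
    """Total cost for one candidate pair: adapt R1 -> RC (c1),
    adapt R2 -> MC (c2), add two morpho-words, concatenate."""
    return c1 + c2 + 2 * MU + KAPPA

def retrieve_best_pair(rc_candidates, mc_candidates):
    """rc_candidates / mc_candidates: lists of (sentence, adaptation_cost)."""
    best = None
    for r1, c1 in rc_candidates:
        for r2, c2 in mc_candidates:
            cost = total_cost(c1, c2)
            if best is None or cost < best[2]:
                best = (r1, r2, cost)
    return best

rc = [("You go to school.", 4.0), ("He goes to Delhi.", 6.5)]
mc = [("He speaks English.", 3.0), ("You should sing.", 2.5)]
print(retrieve_best_pair(rc, mc))
```

In practice the candidate lists would be the most similar sentences returned by the retrieval scheme, with their estimated costs of adaptation.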
We have formulated the adaptation procedures for other complex sentence struc-
tures having the connective “who” in a similar way. However, due to the similar
nature of the discussion, we do not elaborate on them in this report.
The adaptation procedures discussed above are illustrated in the following sub-
section. In particular, we show, for a given complex sentence, how the scheme retrieves
two similar simple sentences from the example base that can be used to generate
the translation of the input complex sentence.
5.11 Illustrations
The adaptation procedures for complex sentences are explained using two illustra-
tions.
5.11.1 Illustration 1
Suppose the input sentence is “You should speak Hindi when you go to India.” Its
parsed version is:
V PRES “go” , @ADVL PREP “to” , @<P <Proper> N SG “India” < $. >
The algorithm for splitting complex sentences (see Figure 5.1 and Figure 5.2)
results in two simple sentences RC and MC as given below.
RC : You go to India.
MC : You should speak Hindi.
The five most similar sentences for RC and MC, obtained by applying the cost of
adaptation based scheme, are given in Table 5.27 and Table 5.28, respectively.
To obtain the translation of RC and MC, we consider the first sentence of Table
5.27 and Table 5.28, respectively. Thus, R1 is “You go to school.”, and R2 is “He
The morpho-words “jab” and “tab” are to be added in the beginning of the Hindi
translation of RC and MC, respectively. After this modification, these two sentences
are concatenated. Hence the desired translation of the given input sentence is “jab
5.11.2 Illustration 2
Let us consider another input sentence: “The student who wants to learn Hindi should
study this book.” After applying the algorithm for splitting complex sentences (refer
Figure 5.4 and Figure 5.6), two simple sentences RC and MC are obtained. These
are as follows:
MC : The student should study this book. @DN> ART “the”, @SUBJ N
5.12. Concluding Remarks
For all the possible combinations of sentences given in Tables 5.29 and 5.30,
the cost of adaptation involved in generating the translation of the input sentence
is calculated in the way explained in Section 5.10. The minimum cost of
adaptation is (17.58α + c105) + (3α + µ) + (24α + ψ + 2µ) + (0.5α + 2µ) + (3 × α
Hence the translation of the given input sentence, after generating the translation
of RC (i.e. wah (he) hindi (Hindi) siikhanaa (to learn) chaahtaa hai (likes)) and
MC (i.e. vidyarthii (student) ko yah (this) kitaab (book) padhnii (study) chaahiye
(should)), and appending the relative adjective “jo” and the appropriate personal
pronoun from the set {we, wah} at the beginning of the translations of RC and MC,
respectively, is “wah vidyarthii ko jo hindi siikhanaa chaahtaa hai yah kitaab padhnii
chaahiye”.
Similarity measurement between sentences plays a key role in developing an
efficient EBMT system. However, we observed that with respect to EBMT similarity may
have to be defined in a different way. Since the key focus of EBMT is adaptation, we
define “cost of adaptation” as a measure of similarity between sentences. According
to this definition, a sentence d is said to be similar to a given input sentence e if the
cost of adapting the translation of d to generate the translation of e is low.
based on syntax and semantics, as used in text retrieval (Manning and Schutze, 1999),
(Gupta and Chatterjee, 2002). These results have been compared with the results of
the cost of adaptation based scheme, and have shown the superiority of the proposed
scheme over the syntax- and semantics-based schemes. The proposed scheme works on
simple sentences.
One apparent drawback of this scheme is that it needs to compare the input
sentence with all the sentences of the example base. This makes the process com-
putationally very expensive. Hence one needs to filter out irrelevant sentences, and
apply the scheme only on the remaining ones. In this respect, we have proposed a
two-level filtration scheme for measuring dissimilarity. The filtration scheme works
in two steps. The lower the dissimilarity score of an example, the lower will be its
adaptation cost for generating the required translation. Finally, the cost of adaptation
based scheme is applied on the selected set of sentences provided by the filtration
scheme.
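The two-level idea can be sketched as follows. This Python fragment is illustrative only: the dissimilarity score and the cost of adaptation are stand-ins based on simple word overlap, not the actual measures developed in this work.

```python
# Illustrative sketch of two-level filtration: a cheap dissimilarity score
# prunes the example base first, and the expensive cost-of-adaptation
# measure is evaluated only on the survivors. Both scoring functions are
# assumed stand-ins (word overlap), not the thesis's measures.

def dissimilarity(e, s):
    """Cheap first-level score: fraction of input words missing from s."""
    we, ws = set(e.lower().split()), set(s.lower().split())
    return len(we - ws) / len(we)

def cost_of_adaptation(e, s):
    """Expensive second-level measure (stand-in: symmetric difference size)."""
    we, ws = set(e.lower().split()), set(s.lower().split())
    return len(we ^ ws)

def retrieve(e, example_base, threshold=0.5):
    # Level 1: keep only examples whose dissimilarity is below the threshold.
    shortlist = [s for s in example_base if dissimilarity(e, s) <= threshold]
    # Level 2: evaluate the costly measure only on the shortlist.
    return min(shortlist, key=lambda s: cost_of_adaptation(e, s))

base = ["You go to school.", "He reads a book.", "You go to India."]
print(retrieve("You go to the market.", base))
```

The design point is that the first-level score must be much cheaper than the second-level measure for the filtration to pay off.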
The advantage of this filtration scheme is that it reduces the number of example
base sentences that are to be analysed for evaluating the cost of adaptation. In the
worst case it reduced this number by 25%. However, as we repeated our experiments
with 500 different sentences, we found that the average reduction in the number of
sentences subjected to evaluation of the cost of adaptation is about 75%. The proposed
scheme, however, cannot be applied to complex sentences straightaway. This is be-
cause adaptation with respect to complex sentences is more difficult, due to the more
complicated structure of complex sentences in both English and Hindi. Conse-
quently, we suggest that complex sentences may first be split into simple sentences.
Then the adaptation cost based scheme may be applied to retrieve the best matches
for each of the simple sentences. The retrieved translations may then be adapted
to generate the translations of these simple sentences. These translations may then
be combined using linguistic rules to generate the translation of the input complex
sentence. The novelty of the scheme is that it gives an algorithmic way of handling
complex sentences.
In this work we have dealt with complex sentences with a main clause and one
relative clause. We have developed heuristics to first determine whether a sentence
is complex. We use observations from our example base of complex sentences to
validate the heuristics used in the sentence splitting algorithm. In this report, we
have discussed algorithms for splitting complex sentences for five connectives: “who”,
“when”, “where”, “wherever” and “whenever”. We have also developed the splitting
rules for other connectives (e.g. which, whom, whose, whoever, whichever), but to
avoid repetition we have not discussed all of them in this report.
Finally, we have shown the adaptation procedure for adapting given complex sen-
tences with any of the above-mentioned connectives. In particular, we showed for a
given complex sentence how the scheme retrieves two similar simple sentences from
the example base that can be used to generate the translation of the input complex
sentence.
Chapter 6
Discussions and Conclusions
The primary goal of this research is to study various aspects of designing an EBMT
system for translation from English to Hindi. It may be observed that a lot of
information is being generated around the world in various fields. However, since
most of this information is in English, it remains out of reach of people at large
for whom English is not the language of communication. This is particularly true
for a country like India, where the population size is more than a billion, yet only
about 3% of the population can understand English. As a consequence, an increasing
demand for developing machine translation systems from English to the various
languages of the Indian subcontinent is being felt very strongly. However, development
of such systems is far from straightforward. In this research we have studied the
various difficulties that one may face while developing an MT system from English
to Hindi. We feel that the studies made in this work will be helpful not only for
Hindi, but also for other languages that are major regional languages of the Indian
subcontinent, and at the same time prominent “minority” languages of other countries
(e.g. U.K.). Although an increasing demand for MT systems from English to these
languages is clearly evident, development of the necessary computational resources
has not kept pace.
6.2. Contributions Made by This Research
b) Study of divergence for English to Hindi translation, and how translation di-
vergence can be effectively handled within an EBMT framework.
Even an efficient similarity measurement scheme and a quite large example base
cannot, in general, guarantee an exact match for a given input sentence. As a
consequence, the need arises for an efficient and systematic adaptation scheme for
modifying a retrieved example, and thereby generating the required translation. In
this work we developed an adaptation scheme.
In a similar way, we further observed that the declensions of Hindi verbs, nouns
and adjectives often depend on some auxiliary words, called morpho-words. Adapta-
tion using morpho-words also makes the process efficient and computationally cheaper. The
above observations motivated us to design an adaptation scheme comprising nine
different adaptation operations.
Retrieval and Adaptation. However good an adaptation scheme is, its per-
formance is hindered seriously if the example that it attempts to adapt is not quite
similar to the input sentence. But there is no unique way of defining similarity be-
tween sentences. Depending upon the application, the definition of similarity may
vary. In this work we proposed a scheme for defining similarity from the “adaptation”
perspective. We say that a sentence S1 is “similar” to another sentence S2 if
adaptation of the translation of S1 to generate the translation of S2 is
computationally inexpensive. The lower the cost, the greater the similarity.
In this work we have provided appropriate models for prior estimation of the cost of
adaptation. This cost depends not only on the number of basic operations to be per-
formed, but also on the functional slots on which the operations are applied. A thorough
analysis of adaptation costs for different phrasal structures within various functional
slots (e.g., subject, object, verb), and also for different sentence types (e.g., affirma-
tive, negative, interrogative), has been carried out, and models for estimating these
costs have been developed.
We have carried out experiments on retrieval using the proposed scheme, and also
with other major similarity measurement schemes, namely schemes based on com-
monality of words and similarity of syntax. These experiments clearly established
the superiority of the proposed scheme.
Translation divergence occurs when “sentences that are structurally similar in the source
language do not translate into sentences that are similar in structures in the target
language.” In this work we have studied the different types of divergences that may
occur in the context of English to Hindi translation. Our findings are compared with
the divergence types that are obtained in translations among European languages,
for which divergence has been extensively studied. Through this research, we have
been able to discover three new types of divergence that have so far not been reported
with respect to European languages. Altogether we have been able to characterize
seven divergence types that are prevalent in the English to Hindi context.
any of the divergences. This partitioning of the example base is essential for designing
an appropriate retrieval scheme for dealing with divergences efficiently.
We have also proposed a scheme that enables the system to take a prior decision on
whether the translation of a given input sentence will involve divergence or not.
Depending on the decision of the scheme, similar sentences are retrieved from the
appropriate partition of the example base.
In order to resolve this problem we proposed a “split and translate” technique to handle
complex sentences. We have developed heuristics to identify whether a sentence
is complex. We use observations from our example base of complex sentences to
validate the heuristics used in the sentence splitting algorithm. This work is based
on those complex sentences that have at most one subordinate clause.
The “split” algorithm generates two simple sentences out of a given complex sentence
based on its main and relative clauses. These simple sentences are then translated
individually using the proposed approach. We have developed further heuristics
that combine these individual translations to generate the translation of the given
complex sentence.
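The “split and translate” strategy can be sketched end to end. In this Python sketch the naive splitter, the toy lexicon and its Hindi strings are illustrative assumptions (the jab/tab combination rule follows Illustration 1 of Chapter 5); the real system performs retrieval and adaptation rather than a table lookup.

```python
# Minimal end-to-end sketch of "split and translate" for a complex sentence:
# split into relative and main clauses, translate each simple sentence, then
# combine with the connective's Hindi relative/correlative adverb pair.
# TOY_LEXICON holds assumed Hindi glosses for the two simple sentences.

TOY_LEXICON = {
    "You go to India.": "tum bhaarat jaate ho",
    "You should speak Hindi.": "tumhen hindii bolnii chaahiye",
}

def split(sentence, connective="when"):
    """Very naive splitter: 'MC when RC.' -> (RC, MC)."""
    mc, rc = sentence.rstrip(".").split(f" {connective} ")
    return rc[0].upper() + rc[1:] + ".", mc + "."

def translate_simple(s):
    # Stand-in for retrieval from the example base plus adaptation.
    return TOY_LEXICON[s]

def translate_complex(sentence):
    rc, mc = split(sentence)
    # "when" takes the relative/correlative adverb pair jab ... tab ...
    return "jab " + translate_simple(rc) + " tab " + translate_simple(mc)

print(translate_complex("You should speak Hindi when you go to India."))
```

The combination step is where the linguistic rules of Section 5.10 apply; here it is reduced to prefixing “jab” and “tab” and concatenating.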
One major difficulty that we faced while doing this research is that no suitable
English–Hindi parallel corpus was available online at that time. The data
required for this work should be properly aligned at both the sentence level and the
word level. The example base of about 4,500 sentences used for this work has been
created and aligned manually. These sentences have been collected by scrutinizing
about 30,000 translation pairs collected from various sources like story books, recipes,
government notices etc. For efficient working of the proposed EBMT system, the size
of the example base should be increased. Although huge parallel data is now available
online, e.g., the EMILLE data and various web-sites having both English and Hindi
versions (e.g., www.iitd.ac.in, www.statebankofindia.com), this data is not aligned.
In order
6.3. Possible extensions
to use these resources effectively, proper techniques for English–Hindi alignment
should be developed.
The work carried out in this research may be extended in several directions:
• In this work, various adaptation procedures are studied and developed for
the sentence structures (and their components). These are the patterns that
are predominantly found among the sentences present in our example base.
Many other variations in sentence structures are possible which have not been
discussed in this work. Adaptation rules may be developed for such sentence
patterns.
adaptation rules have been developed for these restricted structures only. The
proposed “split and translate” technique needs to be extended for complex sen-
tences having more complicated structures. Further, we have left compound
sentences out of our discussion. Strategies are to be developed for dealing with
such sentences as well.
• Robustness of the scheme proposed for taking a prior decision about the pos-
sible divergence types in the translation of a given input sentence (refer Chapter 4)
depends on the PSD/NSD. These dictionaries contain the proper senses of words,
and are created manually. For automating the construction of the PSD/NSD,
suitable techniques need to be developed for each divergence type.
For an English sentence, this knowledge is extracted from the parsed version, which
is obtained from the parsers available online. But no such resource is available for
Hindi. Thus, in our work we parsed and obtained the FT and SPAC of Hindi
sentences manually. For practical applications of the proposed algorithms, a Hindi
parser is needed to obtain the required information for Hindi sentences.
6.4 Epilogue
There are many issues pertaining to MT that have not been dealt with in this work.
Arguably, the two most important of them are pre-editing and post-editing of text,
and evaluation of MT systems.
These do not fall within the purview of the work reported in this thesis. We include
a brief description of these two topics here in order that future work on English-Hindi
EBMT may take care of these issues too.
Pre-editing is the process of identifying and editing, where necessary, the source
text prior to translation, so that any sentences (segments) of text that the machine
will have problems with are highlighted and removed. In other words, pre-editing
builds, from an existing version of the text (e.g. in paragraph form), new text data
that the MT system is able to handle. The pre-editing required varies according to
the requirements of the MT system.
In the case of our study on the EBMT system, we have also done some pre-editing ac-
cording to the requirements of the problems of retrieval, adaptation and identification
of divergence. Firstly, we have assumed that our original data is aligned sententially,
i.e. one source language sentence corresponds to one target language sentence. For
the retrieval and the adaptation procedures, we have added the parsed version of the
source sentence, which is based on morpho-functional tags, along with the informa-
tion of word alignment at the root level. This minimum information is stored in
our example base for carrying out adaptation and converting complex sentences into
simple ones.
The algorithms for divergence identification require both FT and SPAC (see Appendix
B) information for the parallel corpus. Pre-editing has to be done accordingly.
The task of post-editing is to edit, modify and/or correct the output of an MT
system, i.e. text that has been processed by a machine translation system from a
source language into a target language. In other words, post-editing corrects the
output of the MT system to an agreed standard, e.g. by amending the style of the
output sentences, or making minimal amendments that render the text more readable.
In the case of an EBMT system, post-editing may be required in the following
situations. The system may lack the appropriate examples or grammar rules that it
uses while carrying out the adaptation task. In this situation one has to correct the
translation according to the requirement. Another
situation where post-editing can be useful is when the system does not have a sufficient
number of words in its dictionary. Typically, in these cases the MT system provides
a transliteration of such words in the target language. Post-editing is useful in these
cases too. The amount of post-editing required on the output provides a good
yardstick for measuring the output quality of an MT system.
In recent years, various methods have been proposed to automatically evaluate
machine translation quality. Typically, these methods take the help of some “refer-
ence” translation of some pre-selected test data. Reference translation is also known
as “gold-standard translation”. By comparing the output produced by the system
under consideration (with respect to the pre-selected test data) with the reference
translation, an estimate of possible discrepancy is arrived at. This in turn gives a
measure of the translation quality of the said system. Examples of such methods are
Word Error Rate (WER), Position-independent word Error Rate (PER)(Tillmann
et al., 1997), and multi-reference Word Error Rate (mWER)(Nießen et al., 2000).
Below we describe the above-named methods.
• WER: The word error rate is based on the Levenshtein distance. It is com-
puted as the minimum number of substitution, insertion and deletion oper-
ations that have to be performed to convert the generated sentence into the
reference sentence.
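The WER computation described above can be sketched as a standard dynamic program over words. Normalizing by the reference length, as done here, is one common convention and is an assumption of this sketch.

```python
# WER sketch: word-level Levenshtein distance between hypothesis and
# reference, here normalized by the reference length.

def word_edit_distance(hyp, ref):
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits to turn the first i words of h into the first j words of r
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(h)][len(r)]

def wer(hyp, ref):
    return word_edit_distance(hyp, ref) / len(ref.split())

print(word_edit_distance("he goes to the school", "he went to school"))
```

Here the distance is 2: one substitution (“goes” for “went”) and one deletion (“the”).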
• PER: A major shortcoming of the WER is the fact that it requires a perfect
word order. In order to overcome this problem, the position-independent word
error rate (PER) was introduced as an additional measure. It compares the words
in the two sentences without taking the word order into account; words that
have no matching counterpart in the other sentence are counted as errors.
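A minimal sketch of such a position-independent comparison follows, using one common bag-of-words formulation; this is an assumption of the sketch, and the exact definition of Tillmann et al. differs in details.

```python
# PER sketch: bag-of-words comparison that ignores word order. Words shared
# by hypothesis and reference (counted via multiset intersection) match;
# everything else is treated as an error.

from collections import Counter

def per(hyp, ref):
    h, r = Counter(hyp.split()), Counter(ref.split())
    matches = sum((h & r).values())        # multiset intersection
    errors = max(sum(h.values()), sum(r.values())) - matches
    return errors / sum(r.values())

print(per("to school he went", "he went to school"))   # word order ignored
```

Note that for this reordered hypothesis the PER is 0, while the WER would be large.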
• mWER: For the multi-reference word error rate (mWER), a set of reference
translations for each test sentence is built. For each translation hypothesis, the
Levenshtein distance to the most similar reference sentence in this set is
calculated. This yields a more reliable error measure, and is a lower bound for
the WER.
In later years, n-gram based schemes have been proposed to evaluate transla-
tion quality. The most prominent among them are BLEU (Bilingual Evaluation
Understudy) and NIST.
• BLEU: This scheme was proposed by IBM (Papineni et al., 2001). It is
based on the notion of modified n-gram precision, for which all candidate n-
gram counts are collected. The geometric mean of the n-gram precisions of
various lengths between a hypothesis and a set of reference translations is
computed. This score is multiplied by a brevity penalty (BP) factor to penalize
too-short translations. Therefore,
1. A fluent sentence is one that is well-formed grammatically, contains correct
spellings, adheres to common use of terms, is intuitively acceptable, and can be
sensibly interpreted by a native speaker.
2. The judge is presented with the gold-standard translation, and should evaluate how
much of the meaning expressed in the gold-standard translation is also expressed in
the target translation.
BLEU = BP × exp( (1/N) × Σ_{n=1}^{N} log p_n )
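The formula can be mirrored in a minimal single-sentence, single-reference sketch. Real BLEU is a corpus-level score that uses multiple references and smoothing; those aspects are omitted in this assumed simplification.

```python
# BLEU sketch for one hypothesis and one reference: modified n-gram
# precisions p_n for n = 1..N, their geometric mean, and a brevity penalty.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, N=4):
    h, r = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, N + 1):
        hc, rc = ngrams(h, n), ngrams(r, n)
        clipped = sum((hc & rc).values())   # clipped (modified) n-gram matches
        total = max(sum(hc.values()), 1)
        if clipped == 0:
            return 0.0                      # unsmoothed: any zero p_n gives 0
        log_p += math.log(clipped / total) / N
    bp = 1.0 if len(h) > len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_p)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 3))
```

An identical hypothesis and reference give a score of 1, the maximum; any hypothesis sharing no n-gram of some order with the reference scores 0 in this unsmoothed form.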
• NIST: This score was proposed by the National Institute of Standards and Tech-
nology in 2002. It reduces the effect of longer n-grams. This criterion computes
an arithmetic average over n-gram counts instead of a geometric mean, mul-
tiplied by a factor BP that penalizes short sentences. Both NIST and BLEU
are accuracy measures, and thus larger values reflect better translation quality.
No one of them can be considered to be the best from all perspectives. Hence typically
translation quality is expressed in terms of four scores, viz. WER, PER, BLEU and
NIST.
No consensus has yet been reached regarding the best way of designing a system.
In this work we have made contributions on various aspects of MT. Some of them
are specific
3. Matches of shorter n-grams (n = 1, 2, ...) capture adequacy.
4. Matches of longer n-grams (n = 3, 4, ...) capture fluency.
to EBMT, while some others, such as the study of divergences, are relevant for other
paradigms as well. We hope that the contributions made in this thesis will be useful
for designing an English to Hindi MT system, and also for many other language pairs
at large.
Appendix A
The English and Hindi languages are of two different origins, so a study of their
general structural properties is necessary. In this discussion, some of the basic
concepts of translation from English to Hindi are briefly outlined. Some of the
general structural properties of English and Hindi (Kachru, 1980), (Kellogg and
Bailey, 1965), (Singh, 2003), (Quirk and Greenbaum, 1976) are described below.
For example:
• Sentence Pattern: The basic sentence pattern in English is Subject (S) Verb
(V) Object (O), whereas it is SOV in Hindi. Consider for example “Radha
eats mango”: here “Radha” is the subject, “eats” is the verb, and “mango” is
the object, so the words occur in the order SVO. But in Hindi the corresponding
sentence follows the SOV order.
In English, the grammatical relations in a sentence are mainly shown by the relative
positions of the components. Consider this example:
A.1. English and Hindi Language Variations
The above-mentioned differences are structural differences between English and Hindi.
Some differences lie in the part-of-speech properties of the two languages.
These discrepancies are as follows:
• Noun: Hindi nouns are affected by gender, number and case ending (Kellogg
and Bailey, 1965). These are as follows:
2. Number: As in English, Hindi also has two numbers: singular and plural.
There are some possible suffixes for singular-to-plural conversion in Hindi,
which are as follows (Kellogg and Bailey, 1965). For example:

Singular                    Plural
ladkaa - boy (MASC)         ladke - boys
ghar - house (MASC)         ghar - houses (no change)
kapadaa - cloth (MASC)      kapade - clothes
ladkii - girl (FEM)         ladkiyaan - girls
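The patterns visible in the examples above can be sketched as a small function. This sketch is an assumption covering only the three patterns shown; real Hindi pluralization has further cases (e.g. feminine nouns not ending in “ii”).

```python
# Sketch of the singular-to-plural patterns in the examples above:
# masculine nouns in -aa change to -e, feminine nouns in -ii take -iyaan,
# and other masculine nouns are unchanged.

def pluralize(noun, gender):
    if gender == "MASC" and noun.endswith("aa"):
        return noun[:-2] + "e"          # ladkaa -> ladke
    if gender == "FEM" and noun.endswith("ii"):
        return noun[:-2] + "iyaan"      # ladkii -> ladkiyaan
    return noun                         # ghar -> ghar (no change)

for noun, gender in [("ladkaa", "MASC"), ("ghar", "MASC"),
                     ("kapadaa", "MASC"), ("ladkii", "FEM")]:
    print(noun, "->", pluralize(noun, gender))
```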
3. Case ending: There are eight case endings in Hindi, which are given below
in Table A.2. All these are appended to the oblique form of the noun,
where such a form exists. There are some rules for making oblique nouns.
Some of them are as follows:
(a) Masculine singular nouns ending in “aa” change it into “e” when some
case-ending follows them.

Case               Case-endings
Nominative case    ne
Accusative case    ko
Agent case         se (by, with, through)
Dative case        ko (to), ke liye (for), ke waaste
Ablative case      se (from, since)
Possessive case    kaa, ke, kii
Locative case      mein, par (in, on)
Vocative case      he, ajii, are

Table A.2: Different Case Endings in Hindi
No postposition is used with the nominative and the vocative. Here we will discuss
three cases: the nominative, accusative and possessive. The other cases work in the
same way as English case endings. These cases are as follows:
1. Nominative case: “ne” is the sign of this case; it is used with the subject of a
transitive verb in the perfective forms (past indefinite, present perfect and past
perfect). The use of this case is to make a noun or pronoun act as the subject
of a verb. In that case, the verb agrees with the object in gender and number.
For example:
Ram narrated a story. ∼ ram ne kahaanii sunaayii
Here the objects are “kahaanii” and “biij”; the number and gender of these
nouns are singular feminine and plural masculine, respectively.
2. Accusative case: “ko” is the sign of this case, and it is generally added
only to animate objects. Sometimes it is also added to inanimate ob-
jects, either to intensify the effect or to express a special significance. For
example:
3. Possessive case: The signs of this case are “kaa”, “ke” and “kii”. These
words are used with a noun according to the gender, number and case-ending
of the following noun. This case ending has already been discussed in
detail in Section 2.5.2 of Chapter 2.
• Article: As Hindi has no articles, the distinction indicated in English by the def-
inite and indefinite articles cannot always be expressed in Hindi. Thus “ghodha”
may be either “a horse” or “the horse”; “istriyaan” may be “women” or “the
women”.
Every language has its own grammar rules; in other words, the same sentence may
follow different grammatical aspects in each language concerned. For example,
consider the English sentence “He will be sleeping at the moment”. Its translation
in Hindi is “wah iss samay so rahaa hogaa”. As per English grammar rules, the
verb phrase follows the future tense and progressive aspect (or continuous aspect),
but the verb phrase of the Hindi sentence comes under the definite potential type
of mood according to Hindi grammar. For the translation work, we have followed
the English grammar categorization for verb phrase structure (Quirk and Greenbaum,
1976).
A.2. Verb Morphological and Structure Variations
Verb morphological variations in Hindi depend on four aspects: tense and form of
the sentence, gender of the subject, person of the subject and number of the subject.
All these variations affect the root verb of a sentence. Since there are three tenses
(i.e. Present, Past and Future) and four forms (i.e. Indefinite, Continuous, Perfect,
and Perfect Continuous), in all one can have 12 different conjugations. In Hindi,
these conjugations are realized using suffixes attached to the root verbs, and/or
adding some auxiliary verbs, which we call “Morpho-Words” (MW). Table A.3 gives
the total number of morphological words and suffixes in Hindi, for all the tenses and
their forms.
The above suffixes and morphological words in the present perfect, past indefinite
and past perfect are used for literal translation of a sentence. Actually, the
conjugation of the root verb is “aa”, “e” and “ii”. It has been observed that,
according to Table A.3, the suffixes {taa, te, tii} are added to the root form in
the past indefinite tense form. According to the tense forms, the morpho-words
{thaa, the, thii}, {chukaa, chukii, chuke} and {hoon, hai, ho, hain} are added after
the main verb of the sentence. Another possible way of expressing these three
tenses and forms in Hindi is that, in place of the above-mentioned suffixes, a
different conjugation of the verb is used. The morpho-words {thaa, the, thii} or
{hoon, hai, ho, hain} are then added, depending upon the tense, towards the end
of the sentence.
Some rules for these conjugations of verbs are as follows (Sastri and Apte, 1968):
• If the root of the verb ends in “a” (silent), lengthen it to “aa” in the masculine
singular and change it into “e” for the masculine plural; in the feminine singular
it becomes “ii” and in the feminine plural “iin”. For example, the verb “play”
(“khel”) becomes khelaa (masculine singular), khelii (feminine singular),
khele (masculine plural) and kheliin (feminine plural).
• If the root ends in “aa” or “oo”, “yaa” is added, which changes according
to the “aa”, “ai” and “ii” rule1. Sometimes “e” is used in place of “ye”, and
“ii” and “iin” in place of “yii” and “yiin”, respectively. For example, for the
verb “come” (“aa”), the masculine forms are “aayaa” (singular) and “aaye” or
“aae” (plural), and the feminine forms are “aayii” or “aaii” (singular) and
“aayiin” or “aaiin” (plural).
1. The “aa”, “ai”, “ii” Rule (Sastri and Apte, 1968): Masculine words ending in
“aa” form their plurals by changing the “aa” into “e”, and their feminines by
changing “aa” into “ii”.
• If the verb-root ends in “uu”, change it into “u” and add “aa” and “e” in the
masculine, and “ii” and “iin” in the feminine. For example, for the verb “touch”
(“chhuu”), the masculine forms are “chhuaa” and “chhue”, and the feminine
forms are “chhuii” and “chhuiin”.
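The three rules quoted above can be sketched as a function. This is an assumed simplification: only these three rule classes are modeled, and the “e”/“ye” alternants mentioned above are not.

```python
# Sketch of the perfective conjugation rules quoted above (Sastri and Apte):
# consonant-final roots take -aa/-e/-ii/-iin; roots in -aa or -oo insert
# -y- before the ending; roots in -uu shorten to -u first.

ENDINGS = {("M", "SG"): "aa", ("M", "PL"): "e",
           ("F", "SG"): "ii", ("F", "PL"): "iin"}

def conjugate(root, gender, number):
    ending = ENDINGS[(gender, number)]
    if root.endswith(("aa", "oo")):
        return root + "y" + ending        # e.g. aa -> aayaa, aaye, aayii
    if root.endswith("uu"):
        return root[:-2] + "u" + ending   # e.g. chhuu -> chhuaa, chhuii
    return root + ending                  # e.g. khel -> khelaa, kheliin

print(conjugate("khel", "M", "SG"), conjugate("khel", "F", "PL"))
print(conjugate("aa", "M", "SG"), conjugate("chhuu", "F", "SG"))
```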
The above-mentioned morpho-words and suffixes constitute the morphological trans-
formations in Hindi of the English verb group. Table A.4 provides some conjugations
of the verb “write”, in order to systematize the knowledge given in Table A.3.
A similar discussion can be carried out for the passive verb form also. The passive
form can be formulated for transitive verbs only. The morphological variation
depends on the gender and number of the object
of the active form of the sentence, which is basically the subject in the passive form
(Sastri and Apte, 1968). The subject of the active form occurs in the passive form
in the instrumental case, marked by “by”, whose Hindi equivalent is “se”, “ke
duwaraa” or “duwaraa”. In the passive form, the changes in the main verb follow
the rules of the PCP form of the verb as discussed in Section A.1. Moreover, an
extra verb “jaa” is introduced after the main verb, and the suffixes given in Table
A.3 are added to this additional verb instead of the main verb of the sentence. The
morpho-words are added after the conjugation of the verb “jaa”. Consider the
following examples:
Appendix B
In this work we have used the ENGCG parser for parsing the English sentences.
Most of the FTs that are relevant for this work are obtained directly from the
parser. Descriptions of these FTs are given below:
B.1. Functional Tags
(e.g. He is a fool.)
@I-OBJ – Indirect Object
(e.g. He gave Mary a book.)
@ADVL – Adverbial
(e.g. She came home late. She is in the car.)
@<NOM-OF – Postmodifying of
(e.g. Five of you will pass.)
@<NOM-FMAINV – Postmodifying Nonfinite Verb
(e.g. He has the licence to kill. John is easy to please.)
@CS – Subordinator
(e.g. If John is there, we shall go, too.)
@NEG – Negative Particle
(e.g. It is not funny.)
@DN> – Determiner
(e.g. He read the book.)
@AN> – Premodifying Adjective
(e.g. The blue car is mine.)
Some of the functional tags that are required for the divergence identification
algorithms are not directly given by the available parsers. These FTs are Adjunct
(A), Predicative Adjunct (PA) and Verb Complement (VC) (refer to Appendix C for their
definitions). We have formulated rules for obtaining these FTs using the information
available in the morpho tags of the underlying sentence.
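As a hedged illustration of such rules (the tag names V-COP, V-INTR and PREP and the decision logic below are simplified assumptions for this sketch, not the thesis's actual rule set), one rule assigning A, PA or VC to a prepositional-phrase head from the morpho tags of its context might look like:

```python
# Hypothetical sketch of rule-based FT assignment from morpho tags.
# The tag inventory and conditions are simplified assumptions; the actual
# rules use the full ENGCG morpho-tag set.

def assign_extra_ft(word_index, words, morpho_tags):
    """Assign A / PA / VC to a preposition based on the preceding verb tag."""
    if morpho_tags[word_index] != "PREP":
        return None
    # copula before the PP -> the PP is a predicative adjunct (PA)
    if word_index > 0 and morpho_tags[word_index - 1] == "V-COP":
        return "PA"
    # intransitive main verb before the PP -> verb complement (VC)
    if word_index > 0 and morpho_tags[word_index - 1] == "V-INTR":
        return "VC"
    # otherwise treat the PP as a plain adjunct (A)
    return "A"

words = ["He", "lives", "in", "Brazil"]
tags = ["PRON", "V-INTR", "PREP", "N"]
print(assign_extra_ft(2, words, tags))  # VC
```

The same dispatch, with a copular verb tag in position, yields PA instead, mirroring the definitions in Appendix C.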
B.2. Morpho Tags
Part-of-speech tags
A adjective (small)
ADV adverb (soon)
CC coordinating conjunction (and)
N noun (house)
NEG-PART negative particle (not)
NUM numeral (two)
PCP1 -ing form (writing)
PL plural (few)
POST postdeterminer (much)
PRE predeterminer (all)
SG singular (much)
SG singular (car)
SG/PL singular or plural (means)
PL plural (fewer)
PL1 1st person plural (us)
PL2 2nd person plural (yourselves)
PL3 3rd person plural (them)
Appendix C
The definitions of some non-typical functional tags that we have used in our algo-
rithms are given below.
For example:
He lives in Brazil.
2. Predicative Adjunct (PA): If the copula (linking verb) is present, and it allows
an adverbial as complementation, then the complementation is called predica-
tive adjunct.
For example:
Here, the underlined prepositional phrase and adverb are examples of predicative
adjuncts.
C.1. Definitions of Some Non-typical Functional Tags and SPAC Structures
For example :
4. Verb Complement (VC): Sometimes prepositional phrases act as the complementation
of a verb; we use the generalized term “verb complement” to denote them.
This happens when the main verb of the sentence is intransitive.
In the case of transitive and ditransitive verbs, direct and/or indirect objects are
used to complete the sense of the sentence; these are not considered
verb complements here, but are referred to by their actual names, object and indirect
object.
For Example :
Although many English parsers are available on-line, none of them gives full
information on both FTs and SPACs. In order to glean both kinds of information, we had
to combine the outputs of two parsers. In particular, we have used the following two
parsers:
of the algorithms. Some of the functional tags that are required for the divergence
identification algorithms are not directly given by the available parsers. These FTs are
Adjunct (A), Predicative Adjunct (PA), Postmodifier of the Subjective Complement
(SC C), and Verb Complement (VC). We have formulated rules for obtaining these
FTs using the information available in the morpho tags of the underlying sentence.
The SPAC structure has been taken from MBSP. No specific rules are required
to capture the SPAC structure of a sentence. However, we made small structural
changes so that it can be manipulated easily by the program. For example, consider
the sentence “The student is weak in his studies”. The MBSP gives the following
output:

[NP the/DT student/NN NP] [VP is/VBZ VP] [ADJP weak/JJR ADJP] {PNP [Prep in/IN Prep] [NP his/PRP$ studies/NNS NP] PNP}

This is restructured as:

[NP the/DT student/N] [VP is/V] [ADJP weak/Adj] [PP in/IN [NP his/PRP$ studies/N]]
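For illustration, a bracketed structure of this kind can be read into a nested form that a program can manipulate directly. The minimal “[LABEL child ...]” grammar assumed below is our simplification of the restructured output, not MBSP's exact format:

```python
# A minimal sketch of reading a bracketed SPAC-style string into nested
# lists. The bracket grammar here is an assumption chosen for illustration.
import re

def parse_spac(s):
    """Parse '[NP the/DT student/N]'-style brackets into nested lists."""
    # tokens are '[', ']', or any run of non-space, non-bracket characters
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "["
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ']'
        return [label] + children

    out = []
    while pos < len(tokens):
        out.append(node())
    return out

tree = parse_spac("[NP the/DT student/N] [PP in/IN [NP his/PRP$ studies/N]]")
print(tree[1])  # ['PP', 'in/IN', ['NP', 'his/PRP$', 'studies/N']]
```

Once in this nested form, chunks can be traversed or rewritten without further string manipulation.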
For Hindi, no parser is available online. We have therefore tagged the Hindi
sentences in accordance with the English parser information, i.e., we have used the same FTs
1 http://www.lingsoft.fi/cgi-bin/engcg
2 http://pi0657.kub.nl/cgi-bin/tstchunk/demo.pl
Appendix D
Semantic similarity between two words is computed on the basis of their semantic
distance. The semantic similarity score lies between 0 and 1. The semantic distance
(Stetina et al., 1998) between two words, say a and b, is computed as:

sd(a, b) = (1/2) × [(Ha − H)/Ha + (Hb − H)/Hb]

Further, sd(a, b) = 0.5 for the same synsets in the same similarity cluster and for the
antonym relation ant(a, b).
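The computation can be sketched directly from the formula. In Stetina et al.'s formulation Ha and Hb are the hierarchy depths of the two words and H is the depth of their deepest common ancestor; treating similarity as 1 − sd is our assumption here, since the fragment above does not state that step explicitly, and the depth values used are made-up inputs:

```python
# sd(a, b) = 1/2 * ((Ha - H)/Ha + (Hb - H)/Hb), as given above.
# Ha, Hb: hierarchy depths of the two words; H: depth of their deepest
# common ancestor (per Stetina et al., 1998). Inputs below are made up.

def semantic_distance(h_a, h_b, h_common):
    return 0.5 * ((h_a - h_common) / h_a + (h_b - h_common) / h_b)

def semantic_similarity(h_a, h_b, h_common):
    # assumption: similarity in [0, 1] taken as 1 - distance
    return 1.0 - semantic_distance(h_a, h_b, h_common)

print(semantic_distance(4, 4, 4))  # 0.0  (identical position in hierarchy)
print(semantic_distance(4, 2, 0))  # 1.0  (no shared ancestor depth)
```

The two extreme cases show the score staying within the stated [0, 1] range.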
Appendix E
E.1. Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective
Here the transformation from genitive case to genitive case requires fourteen adaptation
operations. Below we describe the cost of each of them. Note that the pre-modifier
word can be either ABS, A(PCP1) or A(PCP2) (see Section 2.5.4). We denote
this set by R.
1. The average cost of word replacement from the set R to ABS. This cost is
2. The average cost of word replacement from the set R to either A(PCP1) or A(PCP2).
We denote this cost as w2. Here also, dictionary search is required. Note that the
adjective forms A(PCP1) and A(PCP2) are derived from the verb part of
speech; therefore, in this case the dictionary search time is 12.08. Hence the total
average cost is (l1 × L/2) + (l2 × Lp/2) + {(d × 12.08) + (c × 10^5)}.
3. The average cost of morpho-word addition from the set {huaa, huye, huii}.
This cost is denoted as w3. Since there are three morpho words in total, the average
cost may be formulated as (l1 × L/2) + (m × 3/2) + ψ.
4. The average cost of morpho-word deletion from the set {huaa, huye, huii }.
5. The average cost of suffix replacement from the set {aa, e, ii} is (l1 × L/2) +
(k × 3/2) + (k × 3/2). We denote it as s1.
6. The average cost of suffix addition from the set {taa, tai, tii}. This cost is
denoted as s2, which is computed to be (l1 × L/2) + (k × 3/2).
7. The average cost of suffix replacement in the verb form of A(PCP2), using the PCP
form of the verb (see Appendix A). We denote it as s3. Hence, the total average
cost is (l1 × L/2) + (k × 6/2) + (k × 6/2).
8. The average cost of suffix addition in verb form of A(PCP2) by using PCP
9. We denote the average cost of suffix replacement from the set {taa, te, tii} as
s5, which is formulated as (l1 × L/2) + (k × 3/2) + (k × 3/2).
10. The average cost of suffix replacement from the set {aa, ye, ii} is (l1 × L/2) +
(k × 3/2) + (k × 3/2). We denote it as s6.
11. The average cost of suffix replacement from the suffix set {taa, te, tii} to any
of the suffixes required for the verb form of A(PCP2) (using the PCP verb-form
rule, see Appendix A). We denote it as s7. Since the number of suffixes required
for the verb form of A(PCP2) is fourteen, the average cost of this operation may
be formulated as (l1 × L/2) + (k × 14/2) + (k × 3/2).
12. The average cost of suffix replacement from any of the suffixes required for
converting the verb form of A(PCP2) to {taa, te, tii} is (l1 × L/2) + (k × 3/2) +
(k × 14/2) (here also fourteen suffixes, as in item 11 above). We denote it as s8.
13. The average cost of suffix replacement for verb form of A(PCP2) to verb form
14. The average cost of suffix addition for verb form of A(PCP2) to verb form
of A(PCP2). We denote it as s10, which may be formulated as (l1 × L/2) +
Table E.1 gives the cost of each pairwise modification from pre-modifier adjective
to pre-modifier adjective, with reference to the adaptation rules of Table 2.10.
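As a hedged illustration, the recurring cost pattern in the items above can be expressed as small functions. The symbols l1, L, k and the suffix-set sizes are the document's parameters, whose actual values are fixed elsewhere in the thesis; the numbers passed in below are made up:

```python
# Illustrative computation of the generic average-cost patterns listed
# above. Parameter values here are made-up placeholders.

def cost_suffix_replacement(l1, L, k, n_from, n_to):
    # pattern of items 5-12: (l1 x L/2) + (k x n_from/2) + (k x n_to/2)
    return l1 * L / 2 + k * n_from / 2 + k * n_to / 2

def cost_suffix_addition(l1, L, k, n):
    # pattern of item 6: (l1 x L/2) + (k x n/2)
    return l1 * L / 2 + k * n / 2

s1 = cost_suffix_replacement(l1=1, L=10, k=2, n_from=3, n_to=3)  # item 5 shape
s2 = cost_suffix_addition(l1=1, L=10, k=2, n=3)                  # item 6 shape
print(s1, s2)  # 11.0 8.0
```

Items 11 and 12 follow the same replacement pattern with one set size equal to fourteen instead of three.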
Bibliography
Ansell, M.: 2000, English Grammar: Explanations and Exercises, Second edn,
http://www.fortunecity.com/bally/durrus/153/gramdex.html.
Arnold, D. and Sadler, L.: 1990, Theoretical basis of MiMo, Machine Translation
5(3), 195–222.
Bender, E.: 1961, HINDI Grammar and Reader, University of Pennsylvania Press,
University of Pennsylvania South Asia Regional Studies, Philadelphia, Pennsyl-
vania.
Bennett, W. S.: 1990, How much semantics is necessary for MT systems?, Pro-
ceedings of the Third International Conference on Theoretical and Methodological
Issues in Machine Translation of Natural Languages, Vol. TX, Linguistics Re-
search Center, The University of Texas, Austin, pp. 261–269.
Bharati, A., Sriram, V., Krishna, A. V., Sangal, R. and Bendre, S.: 2002, An
algorithm for aligning sentences in bilingual corpora using lexical information,
International Conference on Natural Language Processing, Mumbai.
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., Lafferty, J. D. and Mercer, R. L.:
1992, Analysis, statistical transfer, and synthesis in machine translation, Pro-
ceedings of the Fourth International Conference on Theoretical and Methodolog-
ical Issues in Machine Translation of Natural Languages, Montreal, Canada,
pp. 83–100.
Brown, P. F., Pietra, S. A., Pietra, V. J. D., Pietra, D. and Mercer, R. L.: 1993, The
mathematics of statistical Machine Translation: parameter estimation, Compu-
tational Linguistics 19(2), 263–311.
Brown, P., Lai, J. C. and Mercer, R. L.: 1991, Aligning sentences in parallel cor-
pora, Proc. of 29th Annual Meeting of the Association for Computational Linguistics,
Berkeley, pp. 169–176.
pp. 22–32.
Carl, M. and Hansen, S.: 1999, Linking translation memories with Example-Based
Machine Translation, Proceedings of Machine Translation Summit VII, Singapore,
pp. 617–624.
Series: Text, Speech and Language Technology, Vol. 21, Kluwer Academic Pub-
lishers, Netherlands.
Choueka, Y., Conley, E. S. and Dagan, I.: 2000, A comprehensive bilingual word
alignment system: Accommodating disparate languages: Hebrew and English, J.
Vronis (ed.): Parallel text processing, Kluwer Academic Publishers, Dordrecht.
Clough, P.: 2001, A Perl program for sentence splitting using rules,
www.ayre.ca/library/cl/files/sentenceSplitting.ps.
Daelemans, W., Zavrel, J., Berck, P. and Gillis, S.: 1996, MBT: A memory-based
part of speech tagger-generator, Proceedings of the Fourth Workshop on Very
Large Corpora, E. Ejerhed and I. Dagan (eds.), Copenhagen, Denmark, pp. 14–
27.
Dave, S., Parikh, J. and Bhattacharya, P.: 2002, Interlingua Based English-Hindi
Machine Translation and language divergence, Journal of Machine Translation
(JMT) 17.
Doi, T. and Sumita, E.: 2003, Input sentence splitting and translating, HLT-NAACL
2003 Workshop: Building and Using Parallel Texts Data Driven machine Trans-
lation and Beyond, Edmonton, pp. 104–110.
Dorr, B. J.: 1993, Machine Translation: A View from the Lexicon, MIT Press,
Cambridge, MA.
Dorr, B. J., Jordan, P. W. and Benoit, J. W.: 1998, A survey of current paradigms
Dorr, B. J., Pearl, L., Hwa, R. and Habash, N. Y. A.: 2002, DUSTer: A method for
unraveling cross-language divergences for statistical word level alignment., Pro-
ceedings of the Fifth Conference of the Association for Machine Translation in the
Americas, AMTA-2002, Tiburon, CA.
Fung, P. and McKeown, K.: 1996, A technical word-and term-translation aid using
noisy parallel corpora across language groups, The Machine Translation Journal,
Furuse, O., Yamada, S. and Yamamoto, K.: 1998, Splitting long or ill-formed input
for robust spoken-language translation, Proceedings of the Thirty-Sixth Annual
Meeting of the ACL and Seventeenth International Conference on Computational
Linguistics, pp. 421–427.
Gale, W. A. and Church, K. W.: 1991b, A program for aligning sentences in bilingual
Gale, W. and Church, K.: 1993, A program for aligning sentences in bilingual cor-
pora, Computational Linguistics 19(1), 75–102.
Technology.
Germann, U.: 2001, Building a Statistical Machine Translation system from scratch:
How much bang for the buck can we expect?, ACL 2001 Workshop on Data-Driven
Machine Translation, Toulouse, France.
Goyal, S., Gupta, D. and Chatterjee, N.: 2004, A study of Hindi translation pat-
terns for English sentences with have as the main verb, Proceedings of Interna-
tional Symposium on MT, NLP and Translation Support Systems: iSTRANS-
2004, CDEC and IIT Kanpur, Tata McGraw-Hill, New Delhi, pp. 46–51.
Grishman, R. and Kosaka, M.: 1992, Combining rationalist and empiricist ap-
proaches to Machine Translation, Proceedings of the Fourth International Confer-
Gupta, D. and Chatterjee, N.: 2002, Study of similarity and its measurement for
English to Hindi EBMT, Proceedings of STRANS-2002, IIT Kanpur.
Gupta, D. and Chatterjee, N.: 2003c, A morpho-syntax based adaptation and re-
trieval scheme for English to Hindi EBMT, Proceedings of Workshop on Com-
putational Linguistic for the Languages of South Asia: Expanding Synergies with
Europe, Budapest, Hungary, pp. 23–30.
Güvenir, H. A. and Cicekli, I.: 1998, Learning translation templates from examples,
Information Systems 23, 353–363.
Habash, N. and Dorr, B. J.: 2002, Handling translation divergences: Combining sta-
Han, C.-h., Benoit, L., Martha, P., Owen, R., Kittredge, R., Korelsky, T., Kim,
N. and Kim, M.: 2000, Handling structural divergences and recovering dropped
Kachru, Y.: 1980, Aspects of Hindi Grammar, Manohar Publications, New Delhi.
Kellogg, R. S. and Bailey, T. G.: 1965, A Grammar of the Hindi Language, Rout-
ledge and Kegan Paul Ltd., London.
Kit, C., Pan, H. and Webster, J.: 2002, Example-Based Machine Translation: A
Leffa, V. J.: 1998, Clause processing in complex sentences, Proceedings of the First
International Conference on Language Resources and Evaluation, Vol. 1, pp. 937–
943.
Loomis, M. E. S.: 1997, Data Management and File Structures, second edn, Prentice
Hall of India Private Limited, New Delhi-110001.
McEnery, A. M., Oakes, M. P. and Garside, R.: 1994, The use of approximate
string matching techniques in the alignment of sentences in parallel corpora, in
A. Vella (ed.), The Proceedings of Machine Translation: 10 Years On, University
of Cranfield.
McTait, K.: 2001, Translation Pattern Extraction and Recombination for Example-
Based Machine Translation, PhD thesis, Centre for Computational Linguistics
Department of Language Engineering, UMIST.
Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, An evaluation tool for machine
translation: Fast evaluation for machine translation research, Proceedings
of the Second Int. Conf. on Language Resources and Evaluation (LREC), Athens,
Nirenburg, S., Grannes, D. and Domashnev, K.: 1993, Two approaches to matching
in Example-Based Machine Translation, Proceedings of TMI-93, Kyoto,
Japan.
Oard, D. W.: 2003, The surprise language exercises, ACM Transactions on Asian
Orăsan, C.: 2000, A hybrid method for clause splitting in unrestricted English texts,
Proceedings of ACIDCA 2000, Monastir, Tunisia.
Papineni, K. A., Roukos, S., Ward, T. and Zhu, W.-J.: 2001, Bleu: a method
for automatic evaluation of machine translation, Technical Report
Piperidis, S., Boutsis, S. and Papageorgiou, H.: 2000, From sentences to words and
clauses, Parallel text processing, Kluwer Academic Publishers, Dordrecht.
Puscasu, G.: 2004, A multilingual method for clause splitting, Proceedings of CLUK
2004, Birmingham, UK, pp. 199 – 206.
Rao, D.: 2001, Human aided Machine Translation from English to Hindi: The
MaTra project at NCST, Proceedings Symposium on Translation Support Sys-
tems, STRANS-2001, I.I.T. Kanpur.
Rao, D., Mohanraj, K., Hegde, J., Mehta, V. and Mahadane, P.: 2000, A practi-
cal framework for syntactic transfer of compound-complex sentences for English-
Resnik, P. and Yarowsky, D.: 2000, Distinguishing systems and distinguishing senses:
New evaluation methods for word sense disambiguation, Natural Language Engi-
Sastri, S. and Apte, B.: 1968, Hindi Grammar, Dakshina Bharat Hindi Prachar
Sabha, Madras, India.
Workshop: Building and Using Parallel Texts Date Driven Machine Translation
and Beyond, Edmonton, pp. 50–56.
Shiri, S., Bond, F. and Takahashi, Y.: 1997, A Hybrid Rule and Example-Based
Method for Machine Translation, Proceedings of the 4th Natural Language Pro-
cessing Pacific Rim Symposium: NLPRS-97, Phuket, Thailand, pp. 49–54.
Singh, S. B.: 2003, English- Hindi Translation Grammar, first edn, Prabhat
Prakashan, 4/19 Asaf Ali Road, New Delhi-110002.
Sinha, R. M. K., Jain, R. and Jain, A.: 2002, An English to Hindi machine aided
Somers, H.: 1997, Machine Translation and minority languages, Translating and the
Computer 19: Papers from the Aslib conference, London.
Somers, H.: 2001, EBMT seen as case-based reasoning, MT Summit VIII Workshop
on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 56–
65.
Stetina, J., Kurohashi, S. and Nagao, M.: 1998, General word sense disambigua-
tion method based on a full sentential context, Proceedings of COLING-ACL
Sumita, E., Iida, H. and Kohyama, H.: 1990, Translating with examples: A new
approach to Machine Translation, TMI-1990, pp. 203–212.
Sumita, E. and Tsutsumi, Y.: 1988, A translation aid system using flexible text
retrieval based on syntax matching, Proceedings of TMI-88, CMU, Pittsburgh.
Thurmair, G.: 1990, Complex lexical transfer in METAL, Proceedings of the Third
International Conference on Theoretical and Methodological Issues in Machine
Tillmann, C., Vogel, S., Ney, H., Zubiaga, A. and Sawaf, H.: 1997, Accelerated dp
based search for statistical translation, In European Conf. on Speech Communi-
Uchida, H. and Zhu, M.: 1998, The Universal Networking Language (UNL) specifications
version 3.0, Technical report, United Nations University, Tokyo,
http://www.unl.unu.edu/unlsys/unl/unls30.doc.
Vikas, O.: 2001, Technology development for Indian languages, Proceedings of Sym-
posium on Translation Support Systems STRANS-2001, IIT Kanpur.
Watanabe, H., Kurohashi, S. and Aramaki, E.: 2000, Finding structural correspon-
dences from bilingual parsed corpus for Corpus-Based Translation, Proceedings
Wiederhold, G.: 1987, File Organization for Database Design, McGraw-Hill Inc., New
York, USA.
Wren, P., Martin, H. and Rao, N.: 1989, High School English Grammar, S. Chand
& Co. Ltd., New Delhi.
About the Author
Ms. Deepa Gupta was born on July 5, 1977. She obtained a Bachelor's Degree
in Mathematics (Honours) from L.B. College, University of Delhi, in 1997 with an over-
all score of 73.00%. She completed her post-graduation in Mathematics in 1999 from the
Indian Institute of Technology Delhi with a C.G.P.A. of 7.70. She joined the Ph.D.
programme of the Department of Mathematics at IIT Delhi in July 1999 as a Junior Re-
search Fellow. Thereafter, in July 2001, she was awarded a Senior Research Fellowship.
During her research tenure she participated in and presented various research articles,
and published seven research papers in different national/international journals and
conferences.
List of Publications
Communicated Paper(s)
• Tata Infotech Research Fellowship of Rs. 12,000, during July, 2002 to April,
2004.