
14TH MAY 2007, MONDAY 9:30-12:30

MACHINE TRANSLATION 2
RAJEEV SANGAL, IIIT-HYDERABAD

FLUENCY VS ACCURACY

The goal of a Machine Translation system can be realised in two ways:

1. The Target sentence transfers the meaning from the source sentence.

2. The target sentence is translated in a similar way as the source sentence.

Thus, there are two issues to be handled while making a Machine Translation system. They
are:

 Fluency – How close the target sentence is to normal sentences in the Target Language.

 Accuracy – How closely the semantics of the target sentence models the source sentence.

A good Machine Translation system needs to obtain a balance with regard to both these
factors.
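The balance can be pictured as ranking candidate translations by a weighted combination of the two scores. The candidates, scores and weight below are invented for illustration; real systems derive such scores from language models and translation models.

```python
# Toy illustration of the fluency/accuracy trade-off: candidate translations
# are ranked by a weighted combination of the two scores. All numbers here
# are made up to show the mechanics only.

def combined_score(fluency, accuracy, weight=0.5):
    """Linear interpolation between fluency and accuracy, both in [0, 1]."""
    return weight * fluency + (1 - weight) * accuracy

# Hypothetical candidates: (translation, fluency, accuracy)
candidates = [
    ("very literal rendering", 0.4, 0.9),   # accurate but awkward
    ("smooth but loose rendering", 0.9, 0.5),
    ("balanced rendering", 0.7, 0.8),
]

best = max(candidates, key=lambda c: combined_score(c[1], c[2]))
print(best[0])  # balanced rendering
```

Under equal weighting the balanced candidate wins; shifting the weight towards fluency or accuracy changes the ranking, which is exactly the design choice a system must make.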

A PEEP INTO THE INTRICACIES IN MACHINE TRANSLATION

Let us consider an English sentence:

She ran towards the bank.

This sentence, when translated to Hindi becomes:

Vaha kinaare kii ora dauRii

- From the above example, we can understand some of the issues to be taken into
account while designing a Machine Translation system. Some of the observations from
the above example are:

1. The difference in word order between English and Hindi. Hindi is a relatively free
word order language.

2. Gender information is transferred from pronoun in English to verb in Hindi.

3. The word Bank is ambiguous. The ambiguity is to be resolved using the context.
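The third observation can be sketched as context-based disambiguation: pick the sense of "bank" whose signature words overlap the context most. The sense signatures below are invented; real systems use sense inventories and annotated data.

```python
# A minimal sketch of resolving the "bank" ambiguity from context.
# The sense signatures are hand-written for illustration.

SENSES = {
    "river_bank": {"river", "shore", "water", "ran", "edge"},
    "money_bank": {"money", "deposit", "loan", "account", "cash"},
}

def disambiguate(context_words):
    """Pick the sense whose signature overlaps the context most."""
    context = set(w.lower() for w in context_words)
    return max(SENSES, key=lambda s: len(SENSES[s] & context))

print(disambiguate("she ran towards the river".split()))  # river_bank
```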

SHAKTI APPROACH
The SHAKTI system follows an InterLingua approach to the MT problem.

<< Insert Picture of “Interlingua Approach” from slides here>>

This approach consists of three components

1. Source Language Analysis

a. To “Inter-Lingua” (The intermediate “Language” or “Representation”)

2. Transfer from source to target language

a. The level at which transfer has to be done depends on the depth of analysis.

3. Target Language Generation (& surface generators)
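The three components can be sketched as a pipeline. This is a structural sketch only: each stage is a placeholder that passes tokens through, and the function names are illustrative, not Shakti's actual API.

```python
# Skeleton of the interlingua pipeline: source analysis into an
# intermediate representation, transfer, then target generation.
# Each stage is a placeholder to show the data flow.

def analyse(source_sentence):
    """Source-language analysis into an interlingua-like structure."""
    return {"tokens": source_sentence.split(), "lang": "en"}

def transfer(interlingua):
    """Map source-side structure to target-side structure."""
    return dict(interlingua, lang="hi")

def generate(target_structure):
    """Target-language surface generation."""
    return " ".join(target_structure["tokens"])

def translate(sentence):
    return generate(transfer(analyse(sentence)))
```

The depth of the intermediate structure decides how much work transfer has to do: a deeper analysis leaves less for the transfer stage.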

IN SHAKTI APPROACH:

The various stages in SHAKTI system are mentioned below:

1. English sentence analysis (uses Morphology, Chunking, Word Sense
Disambiguation and Parsing)

2. Transfer grammar from English to Hindi (uses the word reordering system and
substitution of equivalent Hindi words in the sentence)

3. Hindi sentence generation (the rules of agreement and word generation are
used).
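Stage 2 can be sketched as reordering followed by dictionary substitution. The flat three-word sentence, the toy dictionary and the transliterations are illustrative assumptions; real transfer works on parse trees, not token lists.

```python
# A sketch of the transfer stage: reorder English SVO to Hindi-like SOV,
# then substitute dictionary equivalents. Dictionary entries are rough
# illustrative transliterations.

GLOSSARY = {"ram": "raama", "ate": "khaayaa", "mangoes": "aama"}

def svo_to_sov(tokens):
    """Swap verb and object of a simple subject-verb-object sentence."""
    subj, verb, obj = tokens
    return [subj, obj, verb]

def substitute(tokens):
    return [GLOSSARY.get(t.lower(), t) for t in tokens]

print(" ".join(substitute(svo_to_sov("Ram ate mangoes".split()))))
# raama aama khaayaa
```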

VS. STATISTICAL MACHINE TRANSLATION

- In SMT, the analysis starts with POS tagging / parse tree generation etc. on the source
side (and perhaps also on the target side), and translations are then "learned".

- In this approach, we start with analysis first, for each of the components above (1-
3). However, now even here, “machine learning” approaches are being pursued for
each of these components.

So, currently, "machine learning" principles are being introduced into the Interlingua
approach, just as rules are being introduced into SMT systems, to improve quality. The
difference lies only in which is emphasized in each approach; they are converging in
the tool-set that they use for translation. Shakti can be viewed as a combination of both Rule
Based and Statistical approaches to MT.

HIGH-LEVEL ISSUES

A Machine Translation system needs to handle the ambiguities that exist in the language
structure.

- Word Ambiguity (and/or POS Tagging ambiguity)

o Metaphors make Translation a hard task.

o “Time flies like an Arrow.”

- Sense Ambiguity

o WSD based on the context is very difficult. (This may require large amounts
of annotated data, and this is hard to come by).

o “I charge the criminal.”, vs, “I charge the battery.”, etc.

- Structural Ambiguity

o “I saw a man on the hill with the telescope.”

- Other kinds of ambiguities, like NP boundaries, Verb arguments, Verb-Verb
relations, Anaphora resolution etc., also exist.

- One of the most important issues in MT relates to the question of faithfulness vs.
fluency.

SHAKTI MODULES & GENERAL ARCHITECTURE

Shakti has a "Dashboard Architecture" that makes the modules work together on a common representation.

GENERAL STEPS:

1. POS Tagging

2. Morphological Analysis

3. Chunking

4. Parsing

5. Semantic Analyzer

6. WSD

7. S-T Language Dictionary Lookup

8. Transfer Rules

9. [Lang. Specific] Target language Generation
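The steps above can be sketched in the dashboard style: each module reads a shared sentence record, adds its own annotation layer, and hands the record on. The module internals here are placeholders, not the real Shakti modules.

```python
# A minimal sketch of the dashboard idea: modules communicate through one
# shared record, each adding its own layer of annotation.

def pos_tagger(record):
    record["pos"] = ["NN"] * len(record["tokens"])  # placeholder tags
    return record

def chunker(record):
    record["chunks"] = [record["tokens"]]  # one chunk covering everything
    return record

def run_pipeline(sentence, modules):
    record = {"tokens": sentence.split()}
    for module in modules:
        record = module(record)
    return record

record = run_pipeline("the child plucked the flower", [pos_tagger, chunker])
```

Because every module sees the same record, modules can be swapped, reordered, or replaced by learned components without changing the rest of the pipeline.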

PARSING

Parse trees are important and capture the important "information" from the sentences. A
set of Context Free Grammar rules may be written to parse the Source language sentences
properly.
Note that as the rules grow in number:

1. The semantic capture becomes more refined.

2. Ambiguity increases, as more parses may be applicable for a given sentence.

Eg: "The child plucked the flower in the plant" and "The child plucked the flower in the
park" should each get a different structure, but a refined rule base will yield both
structures for each sentence.

Probabilistic Parsing does not associate a single probability with a rule per se, but associates
a probability with a rule depending on the head verb (a lexicalized version of these rules).
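A minimal sketch of such lexicalized rule probabilities, with invented counts:

```python
# Sketch of lexicalization: the probability of a rule is conditioned on the
# head verb rather than being a single number per rule. Counts are invented
# to show the mechanics.

from collections import defaultdict

# (head_verb, rule) -> count, e.g. competing VP expansions
counts = defaultdict(int)
counts[("plucked", "VP -> V NP PP")] += 8
counts[("plucked", "VP -> V NP")] += 2

def rule_prob(head, rule):
    """Relative frequency of a rule among all rules seen with this head."""
    total = sum(c for (h, _), c in counts.items() if h == head)
    return counts[(head, rule)] / total if total else 0.0

print(rule_prob("plucked", "VP -> V NP PP"))  # 0.8
```

With different head verbs the same rule gets different probabilities, which is what lets the parser prefer the right attachment for each verb.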

CFG-based formalisms for Parsing had reached 90-92% accuracy by 2000-2001; it was
lexicalization that boosted the accuracy to this level.

While building a grammar using CFG, the rules have to be designed in such a way that they
consider different issues, like handling proper nouns & pronouns, handling participle
verbs (both adjectival and adverbial), etc.

For Indian languages, due to the free movement of phrases in sentences, the issue becomes
identifying the "Subject" and "Objects". Hence, a Phrase Structure tree kind of framework using
CFG is not a feasible option for parsing Indian Languages. Dependency Parsing is
one option, which parses based on the relations between words.

DEPENDENCY PARSING

Dependency parsing takes care of all word-order variations of a parse. It is suited to the needs of Indian
languages, which have a relatively free word order, and it handles the different linguistic
phenomena that exist in Indian Languages quite seamlessly: all word-order variants of one
sentence receive the same dependency parse. The approach is based on the framework proposed by Panini, the ancient Indian
grammarian. It is sometimes called the Syntactico-Semantic approach.

Basic Framework:

- Treats a sentence as modifier-modified relations and provides a blue-print to
identify these relations.

o All “free word order” may be taken care of here; an additional order ID may
be provided to take care of order.

- Syntactic cues help in identifying the relation types
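The order-ID idea above can be sketched as follows: the parse is stored as a set of (head, modifier, relation) triples, and surface order is kept separately. The tokens are rough transliterations of the earlier Hindi example and the relation labels are illustrative guesses.

```python
# A sketch of the order-ID idea: the dependency parse is a set of
# (head, modifier, relation) triples, while surface word order is kept
# separately as position IDs. Two word-order variants of the same sentence
# then share one parse and differ only in the order map.

parse = {
    ("dauRii", "vaha", "karta"),         # "she" is the karta of "ran"
    ("dauRii", "kinaare", "direction"),  # "bank/shore" modifies the verb
}

order_a = {"vaha": 1, "kinaare": 2, "dauRii": 3}  # canonical order
order_b = {"kinaare": 1, "vaha": 2, "dauRii": 3}  # scrambled variant

# Both orders cover the same words, and the parse set is unchanged.
print(sorted(order_a) == sorted(order_b))  # True
```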

Vivaksha: In the Paninian model, the language expresses "a speaker's viewpoint/intention" in
the sentences.

Panini talked about the "modifier-modified" relationships and called them the "visheshya-
visheshan" relationship. There are six such relationships, which Panini referred to as KARAKA
roles. They are: Karta, Karma, Kaarana, Sampradaana, Apaadaana, Adhikarana.

1. Karta refers to the locus of the activity referred to in the sentence.

2. Karma refers to the locus of the result of the activity.

3. Kaarana refers to the instrument used in performing the activity.

4. Sampradaana refers to the beneficiary.

5. Apaadaana refers to the source. (Ex: The flower fell from the tree.)

6. Adhikarana refers to the time-place location. (Ex: I went to Delhi.)
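The six roles can be listed against one illustrative sentence, e.g. "Ram cut the fruit with a knife for Sita from the branch in the garden." The role assignments below are hand-written for illustration, not parser output.

```python
# The six karaka roles applied to one illustrative sentence:
# "Ram cut the fruit with a knife for Sita from the branch in the garden."

karakas = {
    "karta": "Ram",          # locus of the activity
    "karma": "fruit",        # locus of the result
    "kaarana": "knife",      # instrument
    "sampradaana": "Sita",   # beneficiary
    "apaadaana": "branch",   # source
    "adhikarana": "garden",  # time-place location
}

for role, word in karakas.items():
    print(f"{role}: {word}")
```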

Other dependency relations do exist, which Panini referred to as non-Karaka
relations. They are: purpose, reason, direction etc. Different cases like causatives,
associatives, comparatives and genitives also come under dependency relations.

“Shabdhabodh” is a rule-base that can be used for “parsing” a sentence.

Vibhakti: These are the non-dependency relations which provide syntactic cues in the
sentence. They are the post-position markers in the Indian languages. However, there is no
one-to-one correspondence between relations and the relation markers.
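The lack of one-to-one correspondence can be sketched as a cue table: each marker suggests several candidate relations, so the marker alone does not determine the relation. The mappings below are simplified illustrations.

```python
# Sketch of vibhakti markers as syntactic cues: one marker can signal
# several relations, so context is still needed to pick one.

VIBHAKTI = {
    "ne":  ["karta"],
    "ko":  ["karma", "sampradaana"],  # one marker, several relations
    "se":  ["kaarana", "apaadaana"],
    "men": ["adhikarana"],
}

def candidate_relations(marker):
    """Return the relations a post-position marker may indicate."""
    return VIBHAKTI.get(marker, [])

print(candidate_relations("ko"))  # ['karma', 'sampradaana']
```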

TAGSET DESIGN

IIIT-H has a rule-based dependency parser (DP) whose coverage is more than "narrow"; also, a tree-banking effort
is going on. A 26-tag tagset is now being used for POS tagging tasks.

Some of the issues involved in designing a Tagset are:

1. Coarseness Vs Fineness

2. The Number of tags to be included in the tagset

3. Simplicity in acquiring it

4. Less effort in Manual tagging

TRANSFER RULES

In English, the parsing could be "Phrase-structured", but in Indian languages, you need a
"Dependency parse" to get the surface form generated properly.

The phrase structure tree is constructed based on the head propagation in the tree. Using
this PST along with the head information, a Dependency Parse is constructed for the
sentence.

In short, this kind of translation can be explained in the following steps:

1. Construct a PST for the English sentence.

2. Convert it to the Dependency Tree

3. Transfer phase (from Source to Target language)


4. Choose the right word, TAM (Tense-Aspect-Modality) and agreement for the Target
language dependency tree.
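Steps 1-2 above can be sketched as head propagation over a phrase-structure tree: each non-terminal marks a head child, and every non-head child attaches (via its head word) to the head child's head word. The tree shape and head marks below are hand-written assumptions for the example sentence.

```python
# A minimal sketch of converting a phrase-structure tree to dependencies
# by propagating heads upward. A node is either a word (leaf) or a tuple
# (label, head_index, children) where head_index marks the head child.

def head_word(node):
    """Return the lexical head of a node by following head children."""
    if isinstance(node, str):  # leaf = word
        return node
    _, head_idx, children = node
    return head_word(children[head_idx])

def to_dependencies(node, deps=None):
    """Collect (head, dependent) pairs from a headed phrase-structure tree."""
    if deps is None:
        deps = []
    if isinstance(node, str):
        return deps
    _, head_idx, children = node
    h = head_word(children[head_idx])
    for i, child in enumerate(children):
        if i != head_idx:
            deps.append((h, head_word(child)))
        to_dependencies(child, deps)
    return deps

# (S (NP Ram) (VP borrowed (NP a book) (PP from (NP the library))))
tree = ("S", 1, [
    ("NP", 0, ["Ram"]),
    ("VP", 0, ["borrowed",
               ("NP", 1, ["a", "book"]),
               ("PP", 0, ["from", ("NP", 1, ["the", "library"])])]),
])

deps = to_dependencies(tree)
print(deps)
```

For "Ram borrowed a book from the library." this yields "borrowed" as the root, with "Ram", "book" and "from" attached to it, which is the shape the transfer phase then operates on.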

<<How to construct a DP parse – Example that Prof Sangal gave>>

“Ram borrowed a book from the library.”

Some of the tasks performed by transfer grammar include:

1. Choosing the right word for a given source language word using lexical transfer
rules.

2. Selecting TAM for the verb
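These two tasks can be sketched as a rule-table lookup. The rule format, the Hindi glosses and the TAM labels below are illustrative assumptions, not Shakti's actual transfer rules.

```python
# Sketch of a lexical transfer rule: the target verb stem is chosen together
# with its TAM (tense-aspect-modality) label. Entries are illustrative.

TRANSFER_RULES = {
    # (source_lemma, tense) -> (target_stem, tam_label)
    ("borrow", "past"): ("le", "yaa"),
    ("borrow", "present"): ("le", "taa_hai"),
}

def transfer_verb(lemma, tense):
    """Look up the target stem and TAM label for a source verb."""
    return TRANSFER_RULES.get((lemma, tense))

print(transfer_verb("borrow", "past"))  # ('le', 'yaa')
```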

CONCLUSION

A Machine Translation system, SHAKTI, was implemented at IIIT, Hyderabad. SHAKTI is
a rule based system in which Machine Learning plays an important role. Machine Learning
is used in various stages of SHAKTI, like POS Tagging, Chunking etc.

DISCUSSION

The possibility of automatic learning of Transfer rules can be explored, subject to
the availability of a parallel corpus.
