MACHINE TRANSLATION 2
RAJEEV SANGAL, IIIT-HYDERABAD
FLUENCY VS ACCURACY
1. The Target sentence transfers the meaning from the source sentence.
Thus, there are two issues to be handled while making a Machine Translation system. They
are:
Fluency – How well the sentence reads compared with normal sentences in the target language.
A good Machine Translation system needs to strike a balance between these two
factors.
- From the above example, we can understand some of the issues to be taken into
account while designing a Machine Translation system. Some of the observations from
the above example are:
1. The difference in word order between English and Hindi. Hindi is a relatively free
word order language.
3. The word Bank is ambiguous. The ambiguity is to be resolved using the context.
SHAKTI APPROACH
The SHAKTI system follows an Interlingua approach to the MT problem.
IN SHAKTI APPROACH:
2. Transfer grammar from English to Hindi (uses a word reordering system and
substitution of equivalent Hindi words in the sentence).
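The transfer step above (reordering followed by word substitution) can be sketched roughly as follows. This is a minimal illustration, not the SHAKTI implementation: the role labels `S`, `V`, `O`, the tiny lexicon, and the romanized Hindi words are all assumptions made for the example, and case markers and agreement are ignored.

```python
# Toy English-to-Hindi transfer: reorder SVO to SOV, then substitute words.
# LEXICON and the role labels are illustrative assumptions, not SHAKTI data.
LEXICON = {"Ram": "raam", "eats": "khaataa hai", "mango": "aam"}

def transfer(analysed):
    """analysed: list of (word, role) pairs, role in {'S', 'V', 'O'}."""
    by_role = {role: word for word, role in analysed}
    # Hindi is verb-final, so emit subject, object, verb in that order.
    reordered = [by_role[r] for r in ("S", "O", "V") if r in by_role]
    # Substitute each English word with its Hindi equivalent when known.
    return " ".join(LEXICON.get(w, w) for w in reordered)

result = transfer([("Ram", "S"), ("eats", "V"), ("mango", "O")])
print(result)
```

A real transfer grammar would operate on parsed structures rather than on three flat roles.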
- In SMT, the analysis starts with POS tagging, parse tree generation, etc. on the source
side (and perhaps also on the target side), and the system then “learns” translations.
- In this approach, we start with analysis first, for each of the components above (1-3).
However, “machine learning” approaches are now being pursued even here, for
each of these components.
So, currently, “machine learning” principles are being introduced into the Interlingua
approach, just as rules are being introduced into SMT systems, to improve quality. The
difference lies only in what each approach emphasizes; the two are converging in
the tool-set that they use for translation. Shakti can be viewed as a combination of both
Rule Based and Statistical approaches to MT.
HIGH-LEVEL ISSUES
A Machine Translation system needs to handle the ambiguities that exist in the language
structure.
- Sense Ambiguity
o Word sense disambiguation (WSD) based on context is very difficult. (It may require
large amounts of annotated data, which is hard to come by.)
- Structural Ambiguity
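To make the sense-ambiguity item concrete, here is a minimal sketch of context-based disambiguation for the word "bank", in the spirit of a simplified Lesk overlap count. The sense names and signature word sets are invented for the example; a practical system would need the annotated data mentioned above.

```python
# Simplified Lesk-style disambiguation: pick the sense whose signature words
# overlap most with the sentence. Sense names and signatures are made up.
SENSES = {
    "bank": {
        "financial_institution": {"money", "account", "loan", "deposit"},
        "river_side": {"river", "water", "boat", "shore"},
    }
}

def disambiguate(word, sentence):
    context = set(sentence.lower().split())
    # Score each sense by how many signature words occur in the context.
    scores = {sense: len(sig & context) for sense, sig in SENSES[word].items()}
    return max(scores, key=scores.get)

sense_a = disambiguate("bank", "He went to the bank to deposit money")
sense_b = disambiguate("bank", "The boat reached the bank of the river")
print(sense_a, sense_b)
```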
GENERAL STEPS:
1. POS Tagging
2. Morphological Analysis
3. Chunking
4. Parsing
5. Semantic Analyzer
6. WSD
8. Transfer Rules
PARSING
Parse trees capture the important structural “information” in sentences. A
set of Context Free Grammar rules may be written to parse the Source language sentences
properly.
Note that the more the rules are:
Eg: “The child plucked the flower in the plant” and “The child plucked the flower in the
park” should receive two different structures, but a refined rule base will give both
structures for each sentence.
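The ambiguity in this example can be reproduced with a small grammar. The sketch below counts parse trees with a CKY-style chart; the grammar and lexicon are assumptions written for this one example, with the rules `NP -> NP PP` and `VP -> VP PP` providing the two attachment sites for "in the park".

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for this example).
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("PP", "P", "NP"),
]
LEXICAL = {"the": "Det", "child": "N", "flower": "N", "park": "N",
           "plant": "N", "plucked": "V", "in": "P"}

def count_parses(sentence, root="S"):
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][X] = number of distinct trees rooted in X over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        chart[(i, i + 1)][LEXICAL[w]] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += (
                        chart[(i, k)][left] * chart[(k, j)][right]
                    )
    return chart[(0, n)][root]

park = count_parses("The child plucked the flower in the park")
plant = count_parses("The child plucked the flower in the plant")
print(park, plant)
```

Because the grammar has no lexical knowledge, both sentences receive the same number of parses, which is the problem the text describes.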
Probabilistic Parsing does not associate a single probability with a rule per se, but
associates a probability with a rule depending on the head verb (a lexicalized version of
the rule).
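One way to picture a lexicalized rule probability is a table keyed by both the rule and the head verb. The sketch below is hypothetical: the numbers are invented rather than estimated from a treebank, and the fallback to an unlexicalized default for unseen verbs is an assumption added for the example.

```python
# Hypothetical probabilities for a rule, conditioned on the head verb.
# The numbers are invented for illustration, not estimated from data.
RULE_PROB = {
    ("VP -> VP PP", "plucked"): 0.3,
    ("VP -> VP PP", "put"): 0.9,
}
# Unlexicalized fallback used when the head verb was never seen (assumption).
DEFAULT_PROB = {"VP -> VP PP": 0.5}

def rule_probability(rule, head_verb):
    return RULE_PROB.get((rule, head_verb), DEFAULT_PROB[rule])

p_put = rule_probability("VP -> VP PP", "put")
p_unseen = rule_probability("VP -> VP PP", "saw")
print(p_put, p_unseen)
```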
By 2000-2001, the accuracy of CFG-based formalisms for parsing had been boosted to
90-92%.
While building a grammar using CFG, the rules have to be designed so that they
handle issues such as proper nouns and pronouns, participle verbs (both adjectival
and adverbial), etc.
For Indian languages, due to the free movement of phrases within a sentence, the issue
becomes identifying the “Subject” and “Objects”. Hence, a Phrase Structure tree framework
using CFG is not a feasible option for parsing Indian Languages. Dependency Parsing is
one option, which parses based on the relations between words.
DEPENDENCY PARSING
Dependency parsing takes care of all variations of a parse. It is suited to the needs of
Indian languages, which have a relatively free word order. It handles the different
linguistic phenomena that exist in Indian Languages quite seamlessly. A dependency parser
gives the same parse for all word-order variants of one sentence. The approach is based on
the framework proposed by Panini, the ancient Indian grammarian. It is sometimes called
the Syntactico-Semantic approach.
Basic Framework:
o All “free word order” variation may be taken care of here; an additional order ID may
be attached to each word to record the surface order.
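The idea of an order-independent parse with a separate order ID can be sketched as follows. The representation (a set of head-relation-dependent triples plus a word-to-position map) and the romanized Hindi words are assumptions for illustration; the relation labels karta and karma are used here simply as names.

```python
# A dependency parse as a set of (head, relation, dependent) triples.
# Surface order is stored separately as an order ID per word.
def make_parse(triples, words_in_order):
    return {
        "relations": frozenset(triples),
        "order": {word: i for i, word in enumerate(words_in_order, start=1)},
    }

triples = [("khaataa", "karta", "raam"), ("khaataa", "karma", "aam")]
# Two word orders of the same (illustrative) sentence.
parse_a = make_parse(triples, ["raam", "aam", "khaataa"])
parse_b = make_parse(reversed(triples), ["aam", "raam", "khaataa"])

same_relations = parse_a["relations"] == parse_b["relations"]
same_order = parse_a["order"] == parse_b["order"]
print(same_relations, same_order)
```

The relations compare equal for both word orders, while the order IDs differ.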
Panini talked about the “modifier-modified” relationships and called them
“visheshya-visheshan” relationships. There are six such relationships, which Panini
referred to as KARAKA roles. They are: Karta, Karma, Kaarana, Sampradaana, Apaadaana,
Adhikarana.
5. Apaadaana refers to the source. (Ex: The flower fell from the tree.)
Vibhakti: These are the non-dependency relations which provide syntactic cues in the
sentence. They are the post-position markers in the Indian languages. However, there is no
one-to-one correspondence between the relations and the relation markers.
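The lack of a one-to-one correspondence can be illustrated with a small lookup from Hindi postpositions to candidate karaka roles. The table below is a simplified assumption for the example, not a complete or authoritative analysis; it only shows that one marker may signal more than one relation.

```python
# Illustrative mapping from Hindi postpositions (vibhakti) to candidate
# karaka roles. Simplified assumption, not a complete analysis.
VIBHAKTI_TO_KARAKA = {
    "ne": ["karta"],
    "ko": ["karma", "sampradaana"],
    "se": ["kaarana", "apaadaana"],
    "mein": ["adhikarana"],
    "par": ["adhikarana"],
}

def candidate_roles(marker):
    """Return the karaka roles a marker may signal (empty if unknown)."""
    return VIBHAKTI_TO_KARAKA.get(marker, [])

# Markers that cannot be resolved without further context.
ambiguous = [m for m, roles in VIBHAKTI_TO_KARAKA.items() if len(roles) > 1]
print(ambiguous)
```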
TAGSET DESIGN
IIIT-H has a rule-based parser (DP) that is more than “narrow”; a tree-banking effort
is also under way. A 26-tag tag-set is currently used for POS tagging tasks.
1. Coarseness Vs Fineness
3. Simplicity in acquiring it
TRANSFER RULES
For English, the parse can be “Phrase-structured”, but for Indian languages you need a
“Dependency parse” to generate the surface form properly.
The phrase structure tree (PST) is constructed based on head propagation in the tree.
Using this PST along with the head information, a Dependency Parse is constructed for the
sentence.
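The conversion from a phrase structure tree to a dependency parse through head propagation can be sketched as below. The head rules, the tuple-based tree encoding, and the example tree are all assumptions made for the illustration; the actual rules used in the system are not given in these notes.

```python
# Head rules: which child category passes its head word up to the parent.
# Both the rules and the example tree are illustrative assumptions.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def to_dependencies(tree, deps):
    """tree is (label, word) for a leaf or (label, [children]) otherwise.
    Appends (head, dependent) pairs to deps and returns the head word."""
    label, body = tree
    if isinstance(body, str):
        return body
    heads = [(child[0], to_dependencies(child, deps)) for child in body]
    head_index = next(
        i for i, (lab, _) in enumerate(heads) if lab == HEAD_CHILD[label]
    )
    head_word = heads[head_index][1]
    # Every non-head child becomes a dependent of the head child.
    for i, (_, word) in enumerate(heads):
        if i != head_index:
            deps.append((head_word, word))
    return head_word

tree = ("S", [
    ("NP", [("Det", "the"), ("N", "child")]),
    ("VP", [("V", "plucked"),
            ("NP", [("Det", "the"), ("N", "flower")])]),
])
deps = []
root = to_dependencies(tree, deps)
print(root, deps)
```

The verb ends up as the root, with the heads of the two noun phrases as its dependents.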
1. Choosing the right word for a given source language word using lexical transfer
rules.
CONCLUSION
DISCUSSION