• Embed Doc
  • Readcast
  • Collections
  • CommentGo Back
Download
 
Maryam Zafar et al / International Journal on Computer Science and Engineering Vol.1(3), 2009, 275-282275
Interactive English to Urdu Machine Translationusing Example-Based Approach
Maryam Zafar
Department of Computer ScienceMilitary College of Signals, NUSTRawalpindi, Pakistanmaryamzafar@yahoo.com 
Asif Masood
Department of Computer ScienceMilitary College of Signals, NUSTRawalpindi, Pakistanamasood@mcs.edu.pk 
 Abstract
—This work is first attempt towards English to UrduMachine Translation (MT) using example based approach. Wehave developed an interactive MT system to facilitate the user tocustomize the translation to his needs, thereby improving theperformance of the translation. Our MT system supports idioms,homographs, and some other features in addition to the ability of the bilingual corpus to evolve. In the end, we compare thefeatures of the MT system developed by the center for research inUrdu Language Processing (CRULP) with those of our MTsystem.
 Keywords-Machine Translation; Bilingual Corpus
I.
 
I
NTRODUCTION
 Knowledge is the key to progress and English is onelanguage that preserves a tremendous amount of knowledge.There is a huge literature of Sciences and Engineering availablein English. Rightly so, English is believed to be an internationallanguage. It becomes more important to be able to understandEnglish.Urdu is a language that is spoken in all over Pakistan andmany parts of India and some other South Asian countries [1].This makes Urdu a very important language.Only 5% to 10% people of Pakistan are familiar withEnglish [2]. To be able to move with the world, people of Pakistan must get the latest knowledge but the current standingin terms of literacy and the understanding of English finds agulf that needs to be abridged. This suggests that there is a needof the interface between the two languages, i.e. English andUrdu.Solution comes in the form of Machine Translation (MT).MT refers to the use of computing resources to facilitate thetranslation of the content available in one language to itsequivalent content in another language [3].There are three main techniques for MT known as Rulebased Machine translation (RBMT), Statistical MachineTranslation (SMT), and Example Based Machine Translation(EBMT). RBMT depends upon linguistic rules to carry outtranslation from one language to another. It performsmorphological, syntactic and semantic analysis on the languageand then transfers it into target language [4]. SMT is a corpusbased approach that uses statistical models to carry outtranslation [5].Third approach for MT is Example Based MachineTranslation (EBMT). It was introduced by Nagao [6] in 1984.Using EBMT, translation is carried out by decomposingEnglish sentence into fragments, then finding correspondingtranslation for those fragments in Urdu and then recombiningtranslated fragment into Urdu sentence.English and Urdu are structurally different languages [1].For structurally different languages EBMT is a better choicethan others [7]. This suggests us to use Example basedapproach for our MT system.A quality translation cannot be developed now unless theuser gives a feedback to the MT system as discussed in [8], and[9]. An interactive system that takes the feedback of the user toimprove the translation quality is developed in our MT system.The scheme of this paper is as follows. We present relatedwork in section 2, our proposed method in section 3, discussbilingual corpus in section 4, compare the work that we havedone with that done by CRULP in section 5, after whichconclusion and future work is provided in section 6.II.
 
R
ELATED
W
ORK
 EBMT has been attempted for many languages, includingEnglish, French, Spanish, German, Japanese, Chinese, Turkish,Arabic, Indian, and even Sign language, as for instance in [10]– [16].An MT system from English to Urdu is recently developedby CRULP that is available online [17]. This MT system usesrule based technique, i.e. this MT system uses syntactic parsingtrees, the parts of speech tagger, and the grammatical rules forthe translation purpose.Work has been done on a web-based interactive MT system[18]. The system is a chat-style MT. The idea is to provide abroad coverage machine translation using the user’s response toimprove the results inspired by [8], and [9].To the best of our knowledge, no work is available inliterature for English to Urdu MT using example basedapproach. Also, interactive features are not available for an MTsystem from English to Urdu. This has motivated us to build aninteractive EBMT system for the translation from English toUrdu. Also, we aim to overcome the discrepancies that arepresent in the CRULP MT system. The proposed method isdiscussed in the next section.
ISSN : 0975-3397
 
Maryam Zafar et al / International Journal on Computer Science and Engineering Vol.1(3), 2009, 275-282276III.
 
PROPOSED
 
METHODThe proposed method utilizes the phrase based set of examples accompanied by interactivity with the user. Themethodology is divided into the following main phases.Sentence FragmentationSearching in CorpusN-ary Product based RetrievalOrdering of Translated Text
 A.
 
Sentence Fragmentation
Fragmentation of sentences into phrases is important toimprove the scope of the sentences that the translator canhandle. This result can be obtained alternatively by keepingsentences in corpus and by gaining a broad coverage byfragmentation and combination to get new sentences using thegenetic algorithm at run time [19]. The problem of fragmentinga sentence into simpler sentences and phrases is handled usingthe concept of idioms, the connecting words and the cutterpoints that are explained as follows.
1)
 
 Idioms
An idiom is a phrase whose meaning is not determinedword for word, but is instead related to the common use of thenative speakers, for details consult any standard text in Englishgrammar, for instance [20]. A list of idioms is developed tohandle idiomatic phrases. The idioms are picked from [20].
2)
 
Connecting Words
Connecting words are the words that separate two or moresentences that are present in a large sentence e.g. and, becauseetc available at [21].
3)
 
Cutter Points
Cutter points are the words that separate differentmeaningful phrases in a sentence. These words are auxiliaryverbs, possessive pronouns, and prepositions.F
RAGMENTATION
A
LGORITHM
 
 Input: A sentence in EnglishOutput: A set of English phrases Algorithm body
:
For each idiom, in the list of IdiomsFind idiom in the input sentence If idiom is found Tag it as an idiom If endsFor loop endsFind connecting wordsFor each connecting word Separate the fragments of sentences intonew sentencesFor loop endsFind cutter pointsFor each cutter point Separate the fragments of sentences into phrasesFor loop ends
E
XAMPLE
1:I cannot back out of my promise.This sentence contains an idiom ‘back out of’; therefore thisis first tagged as an idiom, and the word ‘of’ in the idiom is notconsidered as a cutter point. Thus, the sentence is fragmentedinto the following phrases.
 
I
 
Cannot back out of 
 
My promiseE
XAMPLE
2:I love to play football because it thrills.This sentence is first decomposed into two sentencesnamely,
 
I love to play football
 
Because it thrillsThe first sub-sentence is further divided into phrases usingcutter points as follows.
 
I love
 
To play footballIts fragmented phrases are:
 
I love
 
To play football
 
Because it thrills
 B.
 
Source Text 
The source text in English is input and is fragmented intophrases using the fragmentation algorithm.
C.
 
Searching in Corpus
Searching comprises of the idea of finding whether or notan input phrase is available in the bilingual corpus, discussed insection IV. If it doesn’t find the exact match, it attempts to findthe close match. The measure of closeness is done using thethreshold at two levels; one for the exact match, the other forthe close match.This is done in two ways using Levenshtein Algorithm, andsemantic distance algorithm. The user is given the option tochoose from the both approaches.L
EVENSHTEIN
A
LGORITHM
 This is a character based algorithm that compares an inputstring with the target string [22]. We have used the concept of Levenshtein algorithm for the word based comparison of twophrases or two sentences as follows.
TABLE I. D
ESCRIPTION OF USING
L
EVENSHTEIN
A
LGORITHM
 
ISSN : 0975-3397
 
Maryam Zafar et al / International Journal on Computer Science and Engineering Vol.1(3), 2009, 275-282277
Is Eating
0 1 2
Was
1 1 2
Eating
2 2 1
1 in the bottom right corner indicates that one operation isneeded to convert the source text ‘is eating’ into target text‘was eating’.The 0 Levenshtein distance means the exact match. Apositive finite Levenshtein distance means that it requires somefinite operations to get the target text from the input text. Athreshold 0<T<n/2 means a closer match between two strings.We provide interactivity for the string having closer match.Semantic distance measures how close the two strings aresemantically. Semantic distance algorithm is an open sourceproject [23]. This algorithm returns score, a coefficient measureof semantic closeness between two strings. A score of 1 meansexact match, a score 0.9<s<1 means a synonymous word,where closeness is towards 1. We provide interactivity for0.8<s<0.98. These thresholds have been set after testing onsome examples. They can be adjusted in some cases assuitable.
TABLE II. T
RADE
-
OFF BETWEEN USING
L
EVENSHTEIN ALGORITHM AND
S
EMANTIC DISTANCE ALGORITHM
 Figure 1. Screenshot for interactive searching of phrases
We calculate the number of phrases for which a match isfound divided by the total number of phrases in the inputsentence, if the ratio is greater than or equal to 0.75, theprogram attempts to find the meanings of all the words in theremaining phrases from the dictionary. If the meanings of allthe words in a phrase are found, the phrase is added to thecorpus with its Urdu equivalent, otherwise, the phrase isreturned to the user with whatever portion of the phrase isfound in the dictionary accompanied by the word whosemeanings are not found.S
EARCHING
A
LGORITHM
 
 Input: English PhraseOutput: OK or SENTENCE IS NOT SUPPORTED Algorithm body:For each phraseCompare the phrase with the set of phrasesin the corpus If exact match found  Return OK  ElseShow options similar to the phrasetaken from Corpus If user selects an option Return OK  Else Return NOT FOUND If ends If endsFor loop ends If total number of OK responses divided by total phrases >= 0.75For each word of the phraseFind meaning of the word from thedictionary If all found  Add the phrase to thecorpus Return OK  Else Return the found words’meanings or words If endsFor loop ends Else Return SENTENCE IS NOT SUPPORTED If ends
The tool returns a string “SENTENCE IS NOTSUPPORTED” if the ratio of successful matches < 0.75.
 D.
 
 N-ary Product Based Retrieval
This phase comprises of the steps used to retrieve thetranslation of the source text. There is a possibility of manytranslations available for an input sentence. We gather thepossibilities and we use the idea of n-ary product to list all thepossible sentences. The idea is discussed as follows.R
ETRIEVAL
A
LGORITHM
 
 Input: Set of all English phrases of the input sentenceOutput: Set of all possible Urdu translations Algorithm body:Set i=1 // i is a counter of the number of phrasesFor each English phrase e
i
 Find the phrase e
i
in the corpus
Processing CoverageLevenshtein
Fast Low
Semantic
Slow High
ISSN : 0975-3397
of 00

Leave a Comment

You must be to leave a comment.
Submit
Characters: ...
You must be to leave a comment.
Submit
Characters: ...