Evaluation of English-Telugu and English-Tamil Cross Language Information Retrieval System using Dictionary Based Query Translation Method

A Cross Lingual Information Retrieval (CLIR) system lets users pose a query in one language and retrieve documents in another language. We developed a CLIR system for the computer science domain that retrieves documents in Telugu and Tamil for a given English query. We opted to translate queries for the English-Tamil and English-Telugu language pairs using bilingual dictionaries. Named entities present in the query are transliterated. Finally, the translation and transliteration results are combined, and the resulting query is passed to the searching module to retrieve target language documents. For Telugu we achieve a Mean Average Precision (MAP) of 0.3835, and for Tamil a MAP of 0.3665.
Published by: ijcsis on Jun 12, 2010
Copyright:Attribution Non-commercial

Keywords: Cross Lingual Information Retrieval; Translation; Transliteration; Ranking.
P. Sujatha, P. Dhavachelvan, V. Narasimhulu
Department of Computer Science, Pondicherry Central University, Pondicherry-605014, India
spothula@gmail.com, dhavachelvan@gmail.com, narasimhavasi@gmail.com

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, May 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

I. INTRODUCTION

CLIR can be defined as a subfield of Information Retrieval (IR) that deals with searching and retrieving information written or recorded in a language different from the language of the user's query. The major purpose of a CLIR system is thus to facilitate finding relevant documents written in one natural language with automated systems that accept queries expressed in other language(s). The process is bilingual when dealing with a language pair, that is, one source language and one target or document language. In multilingual information retrieval the target collection is multilingual, and topics are expressed in one language [1]. In any such case, CLIR is expected to support queries in one language against a collection in another language(s) [2].

According to Peters and Sheridan [3], CLIR is a complex multidisciplinary research area in which methodologies and tools developed in the fields of IR and natural language processing converge. IR is traditionally based on matching the words of a query with the words of document collections. Because the query and the document collection are in different languages, this kind of direct matching is impossible in CLIR. Translation is needed: either the query has to be translated into the language of the documents, or the documents have to be translated into the language of the query. Translating the whole document collection is more demanding, as it requires scarcer resources such as a full-fledged Machine Translation (MT) system, which is not available for a number of languages in developing countries. Hence query translation techniques are more feasible and more common in the development and implementation of CLIR systems. The present paper discusses a CLIR system based on query translation.

The paper is organized as follows. Section II describes related work on CLIR systems for Indian languages. Section III gives a brief overview of the CLIR system architecture. Evaluation results are described in Section IV. The conclusion and future enhancements are given in Section V.

II. RELATED WORK

Many organizations in India are working on CLIR systems for different Indian languages [13]. IIIT, Hyderabad has developed a Hindi and Telugu to English CLIR system [4]. They used a vector based ranking model with a bilingual lexicon using word translations, combined with a set of heuristics for query refinement after translation. Jagadeesh and Kumaran [5] built a CLIR system with the help of a word alignment table learned from a parallel corpus, primarily for statistical machine translation. They participated in the Cross Language Evaluation Forum (CLEF) competition, in the Indian language sub-task of the main Ad-Hoc monolingual and bilingual track. This track tests the performance of systems in retrieving the relevant documents in response to a query in the same language as the document set or in a different one. In the Indian context, documents are provided in English (corpus) and queries are specified in different languages including Hindi, Telugu, Bengali, Marathi and Tamil on the CLEF dataset. A cross-language query focused multi-document summarization for the Telugu-English language pair was described in [6]. The authors used a cross-lingual relevance based language modeling approach to generate an extraction based summary. It provides a syntactically well formed set of sentences in the summary to enable easy machine translation. Another benefit of the system is that its output is easily translatable content (minimizing ambiguities). Mandal et al. [7] described two cross-lingual and one monolingual English text retrieval runs at CLEF in the Ad-Hoc track. The cross-language task includes the retrieval of English documents in response to queries in the two most widely spoken Indian languages, Hindi and Bengali. Here, the authors adapted automatic query generation and a machine translation approach to develop the system.

An Indian Language Information Retrieval System [8] exploits the significant overlap in vocabulary across the Indian languages. Cognates are identified using some of the well-known similarity measures and incorporated into the traditional bilingual dictionary approach. The effectiveness of the retrieval system was compared on various models. The results show that using cognates with the existing dictionary approach leads to a significant increase in the performance of the system. Language independent information retrieval is one of the major issues in web access by regional populations of any kind. Language Independent Information Retrieval from Web (LIIRW) was described in [9]. By giving the user the freedom to type the query in any language of choice and to get the results in any language or any combination of languages, it is intended to make the multilingual content of the web easily available and more noticeable. It addresses the implementation of the LIIRW concept in Indian languages (Hindi and Tamil). A Tamil-English CLIR system was developed in [10]. This system is mainly developed for the farmers of Tamilnadu in the agriculture domain. It helps them to specify their information need in Tamil and retrieve the documents in English (corpus). Here, the query in Tamil is translated syntactically and semantically to English using a statistical machine translation approach, which gives better results. The system exhibits a dynamic learning approach.

This paper presents a CLIR system which translates English queries into Tamil and into Telugu using bilingual dictionaries for the computer domain. It also transliterates the named entities present in the query, as distinct from the words which can be translated.

III. SYSTEM ARCHITECTURE
The overview of the CLIR system is shown in Fig. 1. It mainly contains the following modules: Text Processing, Verification, Translation, Transliteration, and Retrieval and Ranking.
Text Processing: For a given source (user) query, this module preprocesses the query before it is translated into the target query. The text processing steps are tokenization, stop word removal, morphological analysis and stemming. Tokenization is the task of dividing the query into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. These tokens are referred to as terms or words. After tokenization, stop words, which do not help in retrieving target documents, are removed from the source query; examples are the, is, was, that, etc. The morphological analyzer analyzes the structure of the words in the query and identifies the category of each word, for example verb, adverb or adjective. Stemming is the process of reducing inflected words to their base or root form. For example, fishing, fished and fisher are inflected words, which a stemmer reduces to the root form fish. After text processing, the source query (SQ) becomes the preprocessed source language query (PSQ), which consists of the preprocessed source language query words {PSW1, PSW2, …, PSWn}.
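The text processing steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop word list, the suffix list and the function names (`tokenize`, `remove_stop_words`, `stem`, `preprocess`) are assumptions introduced here, the stemmer is a naive suffix stripper, and morphological analysis is left out.

```python
import re

# Assumed, deliberately small stop word list; the paper names only
# "the, is, was, that" as examples.
STOP_WORDS = {"the", "is", "was", "that", "a", "an", "of", "in", "for", "to"}

# Assumed suffixes for a naive stemmer; they are tried in this order.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def tokenize(query):
    """Lowercase the query and keep alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", query.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(source_query):
    """Map a source query SQ to the preprocessed words PSW1..PSWn."""
    return [stem(t) for t in remove_stop_words(tokenize(source_query))]


print(preprocess("The sorting of linked lists."))
```

With this suffix list, fishing, fished and fisher all reduce to fish, matching the example in the text. A production system would replace the stemmer with an established algorithm such as Porter's.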
 
Verification Module: This module checks whether the source language words occur in Machine Readable Dictionaries (MRD), that is, bilingual (source to target language) dictionaries. MRDs are electronic versions of printed dictionaries, and may be general dictionaries, specific domain dictionaries, or a combination of both. After the text processing module, the verification module accepts the input query words {PSW1, PSW2, …, PSWn} and performs a database lookup to check whether each query word is directly present in the bilingual dictionary. The words found in the dictionary, {PSW1, PSW2, …, PSWi}, are given to the Translation Module, and the words not found in the dictionary, {PSW1, PSW2, …, PSWj}, are given to the Transliteration Module.
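The lookup performed by the verification module can be sketched as a simple partition of the query words. In this sketch the bilingual dictionary is assumed to be an in-memory mapping rather than a database, and `verify`, `TOY_DICTIONARY` and the placeholder translations are hypothetical names and values used only for illustration.

```python
def verify(query_words, dictionary):
    """Split preprocessed words into (found, not_found) by dictionary lookup."""
    found = [w for w in query_words if w in dictionary]
    not_found = [w for w in query_words if w not in dictionary]
    return found, not_found


# Hypothetical dictionary: English headwords mapped to placeholder
# target language entries (not real Tamil or Telugu translations).
TOY_DICTIONARY = {
    "network": ["<target:network>"],
    "memory": ["<target:memory>"],
}

found, not_found = verify(["network", "linux", "memory"], TOY_DICTIONARY)
print(found)      # words routed to the Translation Module
print(not_found)  # words routed to the Transliteration Module
```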
Translation Module: In this paper, the language of the user query is English and the documents considered for retrieval are in Tamil and Telugu. These documents are a set of articles based on computer science terminology, so we have concentrated on queries with computer terminology. This module, also called the machine translation module, follows the dictionary based translation method, which translates the query words using bilingual dictionaries. Words found in the dictionaries are called vocabulary words. We have developed English-Tamil and English-Telugu bilingual dictionaries that contain most of the words related to the computer science domain. The dictionaries had to be built from scratch as no resource was available for this domain. After each intermediary step in the Morphological Analyzer, the extracted word is looked up in the bilingual dictionary to check whether it is a root word. If it is available, the meaning of the word is returned. If not, the word is passed on to the subsequent stages of the Morphological Analyzer. The words which are not found in the dictionary are called Out Of Vocabulary (OOV) words. This module takes {PSW1, PSW2, …, PSWi} as input and, using the bilingual dictionary, outputs the translated dictionary words {TDW1, TDW2, …, TDWi}.
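The staged lookup described above, where a word is checked against the dictionary after each step of morphological analysis, might look roughly like this. The stages, the dictionary contents and the names `STAGES`, `translate_word` and `translate` are assumptions made for this sketch; the paper does not list the actual stages of its Morphological Analyzer.

```python
# Hypothetical analyzer stages; each strips one kind of English suffix.
STAGES = [
    lambda w: w[:-1] if w.endswith("s") else w,    # plural
    lambda w: w[:-3] if w.endswith("ing") else w,  # progressive
    lambda w: w[:-2] if w.endswith("ed") else w,   # past tense
]


def translate_word(word, dictionary):
    """Return the dictionary entry for `word`, or None if it is OOV.

    The word is looked up as given, then again after each analyzer stage.
    """
    candidate = word
    if candidate in dictionary:
        return dictionary[candidate]
    for stage in STAGES:
        candidate = stage(candidate)
        if candidate in dictionary:
            return dictionary[candidate]
    return None


def translate(words, dictionary):
    """Return (translated target words, out-of-vocabulary source words)."""
    translated, oov = [], []
    for word in words:
        entry = translate_word(word, dictionary)
        if entry is None:
            oov.append(word)
        else:
            translated.extend(entry)
    return translated, oov


# Placeholder entries, not real Tamil or Telugu translations.
toy_dictionary = {"network": ["<target:network>"], "sort": ["<target:sort>"]}
print(translate(["networks", "sorted", "ubuntu"], toy_dictionary))
```

Each dictionary entry is a list so that a headword with several translations can contribute all of them to the target query; whether the authors kept all senses or selected one is not stated in the text.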
Transliteration Module: The translation module can handle only vocabulary words, not OOV words. Previous studies suggest that OOV words must be handled properly; otherwise the retrieval performance of a CLIR system can drop by up to 60% [11]. OOV terms can be of many types: newly formed words, loan words, abbreviations, domain specific terms, etc. One possible and effective way of handling OOV terms is transliteration, the process of transforming a word in a source language into a target language without the aid of a resource like a bilingual dictionary.

This work follows the grapheme based transliteration model [12], which is one of the major transliteration techniques. A grapheme is the basic unit of written language, or the smallest contrastive unit. In the grapheme based transliteration model, the spelling of the original string is the basis for transliteration. It is referred to as the direct method because it directly transforms source language graphemes into target language graphemes without any phonetic knowledge of the source language words. This module takes {PSW1, PSW2, …, PSWj} as input. It is designed with the following steps: dividing
Input Information Flow
I1 = {SQ} = {Source Language Query}
I2 = {PSQ} = {PSW1, PSW2, …, PSWn} = {Processed Source Language Query Words}
I3 = {PSW1, PSW2, …, PSWi} = {Words Found in the Bilingual Dictionary}
I4 = {PSW1, PSW2, …, PSWj} = {Words Not Found in the Bilingual Dictionary}
I5 = {TQ} = {Target Language Query}
I6 = {TD1, TD2, …, TD} = {Retrieved Relevant Target Documents}

Output Information Flow
O1 = {PSQ} = {Processed Source Query}
O2 = {PSW1, PSW2, …, PSWi} = {Processed Source Language Query Words}
O3 = {TDW1, TDW2, …, TDWi} = {Translated Words Found in the Bilingual Dictionary}
O4 = {TDW1, TDW2, …, TDWj} = {Transliterated Words Not Found in the Bilingual Dictionary}
O5 = {TD1, TD2, …, TD} = {Retrieved Relevant Target Documents}
O6 = {TR} = {TRD1, TRD2, …, TRDm} = {Topmost m Relevant Ranked Documents}

Figure 1. System Architecture. Components shown: Source Query, Text Processing, Processed Source Query, Verification Module (Verifying Words), Bilingual Dictionary (Source to Target), Translation Using Bilingual Dictionary (Translating Words), Transliteration Process, Target Query Formation, Searching and Retrieving, Target Documents Collection, Ranking Topmost Documents, Target Result (TR). Words included in the bilingual dictionary (Yes) go to translation; words not included (No) go to transliteration.
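The Transliteration Process and Target Query Formation boxes of Fig. 1 can be illustrated with a short sketch. Everything specific here is an assumption: the greedy longest-match segmentation is one simple way to realize a direct grapheme-to-grapheme mapping, not necessarily the authors' procedure, and `GRAPHEME_TABLE` is a tiny illustrative Latin-to-Tamil table, not a complete or validated resource. The names `transliterate` and `form_target_query` are hypothetical.

```python
# Illustrative Latin-to-Tamil grapheme table (an assumption for this sketch).
GRAPHEME_TABLE = {
    "ma": "ம",
    "la": "ல",
    "ni": "னி",
    "n": "ன்",
    "i": "இ",
}


def transliterate(word, table=GRAPHEME_TABLE):
    """Map source graphemes to target graphemes by greedy longest match.

    No phonetic knowledge is used; only the spelling of `word` matters.
    Characters with no table entry are copied through unchanged.
    """
    max_chunk = max(len(key) for key in table)
    out = []
    pos = 0
    while pos < len(word):
        for size in range(min(max_chunk, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in table:
                out.append(table[chunk])
                pos += size
                break
        else:
            out.append(word[pos])
            pos += 1
    return "".join(out)


def form_target_query(translated_words, oov_words):
    """Combine translated words (O3) with transliterated OOV words (O4) into TQ."""
    return list(translated_words) + [transliterate(w) for w in oov_words]


print(form_target_query(["<target:network>"], ["mani"]))
```

The resulting list plays the role of I5 = {TQ}, the target language query handed to the searching and retrieving step.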