multi-document summarization for the Telugu-English language pair was described in [6]. The authors used a cross-lingual, relevance-based language modeling approach to generate an extraction-based summary. The approach provides a syntactically well-formed set of sentences in the summary to enable easy machine translation. Another benefit of the system is that its output is easily translatable content (minimizing ambiguities). Mandal et al. [7] described two cross-lingual and one monolingual English text retrieval tasks at CLEF in the Ad-Hoc track. The cross-language task involves retrieving English documents in response to queries in the two most widely spoken Indian languages, Hindi and Bengali. The authors adopted automatic query generation and a machine translation approach to develop the system.
An Indian Language Information Retrieval System was described in [8]; it exploits the significant overlap in vocabulary across the Indian languages. Cognates are identified using well-known similarity measures and combined with the traditional bilingual dictionary approach. The effectiveness of the retrieval system was compared across various models. The results show that using cognates with the existing dictionary approach leads to a significant increase in the performance of the system. Language-independent information retrieval is one of the major issues in web access by regional populations of any kind. Language Independent Information Retrieval from Web (LIIRW) was described in [9]. By giving the user the freedom to type a query in any language of his or her choice and to get results in any language or combination of languages, it is intended to make the multilingual content of the web easily available and more visible. The work addresses the implementation of the LIIRW concept for Indian languages (Hindi and Tamil). A Tamil-English CLIR system was developed in [10]. This system was developed mainly for the farmers of Tamil Nadu in the agriculture domain. It helps them specify their information need in Tamil and retrieve documents in English (corpus). The Tamil query is translated syntactically and semantically into English using a statistical machine translation approach, which gives better results. The system exhibits a dynamic learning approach.
This paper presents a CLIR system that translates English queries into Tamil and into Telugu using bilingual dictionaries for the computer domain. It also transliterates the named entities in the query, that is, the words other than those that can be translated.
III. SYSTEM ARCHITECTURE
An overview of the CLIR system is shown in Fig. 1. It mainly contains the following modules: Text Processing, Verification, Translation, Transliteration, and Retrieval and Ranking.
Text Processing: For a given source (user) query, this module performs preprocessing. That is, before the source query is translated into the target query, some text processing steps need to be performed: tokenization, stop word removal, morphological analysis, and stemming. Tokenization is the task of dividing the query into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. These tokens are referred to as terms or words. After tokenization, some words in the query, called stop words, need to be removed because they do not help in retrieving target documents. Examples of such words are the, is, was, that, etc. The stop word removal step removes these words from the source query. The morphological analyzer analyzes the structure of the words in the query, identifying, for example, verbs, adverbs, adjectives, etc. That is, the vocabulary of the words in the query can be identified. Stemming is the process of reducing inflected words to their base or root form. For example, fishing, fished, and fisher are inflected words that a stemmer can reduce to their root form fish. After text processing, the source query (SQ) becomes the preprocessed source language query (PSQ), which consists of the preprocessed source language query words {PSW_1, PSW_2, …, PSW_n}.
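The text processing steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the stop word list and the suffix list are hypothetical, the stemmer is a naive suffix stripper rather than the stemmer used in the system, and the morphological analysis step is omitted. The function names (tokenize, remove_stop_words, stem, preprocess) are likewise illustrative.

```python
import re

# Illustrative stop word list; the actual list used by the system is not given.
STOP_WORDS = {"the", "is", "was", "that", "a", "an", "of", "in", "to"}

# Naive suffix list for illustration only; not the system's stemmer.
SUFFIXES = ("ing", "ed", "er", "s")


def tokenize(query):
    """Split a query into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", query.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(source_query):
    """Map a source query (SQ) to its preprocessed words {PSW_1, ..., PSW_n}."""
    tokens = tokenize(source_query)
    return [stem(t) for t in remove_stop_words(tokens)]
```

With these assumptions, preprocess("The fisher was fishing.") returns ['fish', 'fish']: the stop words the and was are dropped, and the two remaining words are reduced to the root fish.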
Verification Module: This module checks whether source language words occur in Machine Readable Dictionaries (MRDs) or bilingual (source-to-target language) dictionaries. MRDs are electronic versions of printed dictionaries, and may be general dictionaries, domain-specific dictionaries, or a combination of both. After the text processing module, the verification module accepts the input query words {PSW_1, PSW_2, …, PSW_n} and performs a database lookup to check whether each query word is directly present in the bilingual dictionary. The words found in the dictionary, {PSW_1, PSW_2, …, PSW_i}, are passed to the Translation Module, and the words not found in the dictionary, {PSW_1, PSW_2, …, PSW_j}, are passed to the Transliteration Module.
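The routing performed by the verification step can be sketched as a simple partition of the query words. This is a hedged illustration, not the system's implementation: an in-memory Python dict stands in for the database lookup, and the function name verify, the sample words, and the placeholder target strings are all hypothetical.

```python
def verify(psq_words, bilingual_dict):
    """Partition preprocessed query words by dictionary membership.

    Returns (found, not_found): found words would go to the Translation
    Module, and not_found words to the Transliteration Module.
    """
    found, not_found = [], []
    for word in psq_words:
        if word in bilingual_dict:
            found.append(word)
        else:
            not_found.append(word)
    return found, not_found


# Hypothetical toy dictionary; the values are placeholders, not real entries.
toy_dict = {
    "network": "<target word for network>",
    "memory": "<target word for memory>",
}

found, oov = verify(["network", "linux", "memory"], toy_dict)
```

Here found is ['network', 'memory'] and oov is ['linux'], so the named entity would be handed to transliteration while the other two words would be translated.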
Translation Module: In this paper, the language of the user query is English and the documents considered for retrieval are in Tamil and Telugu. These documents are a set of articles based on computer science terminology. Hence we have concentrated on queries with computer terminology. This module is also called the machine translation module. It follows the dictionary-based translation method, which translates the query words using the bilingual dictionaries. These words are called vocabulary words. We have developed English-Tamil and English-Telugu bilingual dictionaries that contain most of the words related to the computer science domain. The dictionaries had to be built from scratch, as no resource was available for this domain. After each intermediate step in the Morphological Analyzer, the extracted word is looked up in the bilingual dictionary to check whether it is a root word. If it is found, the meaning of the word is returned. If not, the word is passed on to the subsequent stages of the Morphological Analyzer. The words that are not found in the dictionary are called Out Of Vocabulary (OOV) words. This module takes {PSW_1, PSW_2, …, PSW_i} as input and translates them into target
(IJCSIS) International Journal of Computer Science and Information Security,Vol. 8, No. 2, May 2010315http://sites.google.com/site/ijcsis/ISSN 1947-5500