

Amharic-English bilingual web search engine

Conference Paper · October 2012


DOI: 10.1145/2457276.2457284


2 authors:

Mequannint Munye Solomon Atnafu


University of Oslo Addis Ababa University


Amharic-English Bilingual Web Search Engine
Mequannint Munye
Department of Computer Science
College of Engineering and Technology
Jijiga University
P.O.Box 1020, Jijiga, Ethiopia
Tel: (+251) 91 211 99 15
email: mokemz24@yahoo.com

Solomon Atnafu
Department of Computer Science
College of Natural Sciences
Addis Ababa University
P.O.Box 1176, Addis Ababa, Ethiopia
Tel: (+251) 91 140 69 46
email: solomon.atnafu@aau.edu.et

ABSTRACT
As non-English content is growing exponentially on the Web, the number of online non-English speakers who realize the importance of finding information in different languages is growing enormously. However, the major general purpose search engines such as Google and Yahoo have been lagging behind in providing indexes and search features to handle non-English languages. Amharic, which belongs to the family of Semitic languages and is the official working language of the federal government of Ethiopia, is one of these languages with rapidly growing content on the Web. As a result, the need to develop a bilingual search engine that handles the specific characteristics of the users' native language query (Amharic) and retrieves documents in both Amharic and English becomes more apparent.
In this research work, we designed a model for an Amharic-English search engine and developed a bilingual Web search engine based on the model that enables Web users to find the information they need in Amharic and English. In doing so, we identified different language dependent query preprocessing components for query translation. We have also developed a bidirectional dictionary-based translation system which incorporates a transliteration component to handle proper names, which are often missing in bilingual lexicons. We have used an Amharic search engine and an open source English search engine (Nutch) as our underlying search engines for Web document crawling, indexing, searching, ranking and retrieving.
To evaluate the effectiveness of our Amharic-English bilingual search engine, precision was measured on the top 10 retrieved Web documents. The experimental results showed that the Amharic-English cross-lingual retrieval engine achieved 74.12% of the performance of its corresponding English monolingual retrieval engine and the English-Amharic cross-lingual retrieval engine achieved 78.82% of the performance of its corresponding Amharic monolingual retrieval engine. The bilingualism advantage of the system is also evaluated by comparing its results with general purpose search engines. The overall evaluation results of the system are found to be promising.

Categories and Subject Descriptors:
H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval - Search process

General terms: Design, Algorithms and Language

Keywords: Bilingual search engines, cross-lingual information retrieval, query preprocessing, query translation, transliteration.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MEDES'12, October 28-31, 2012, Addis Ababa, Ethiopia.
Copyright © 2012 ACM 978-1-4503-1755-9/10/10...$10.00.

1. INTRODUCTION
As non-English Web content, written in different languages and encoding schemes, becomes available online, the number of non-English speakers that use the Web as their major source of information and means of communication is rapidly growing. However, general purpose search engines, such as Google and Yahoo, often ignore the special characteristics of non-English languages although information search and retrieval needs language specific treatment [2, 7]. Non-English speaking users use these search engines, which do not take into account the structure and the special characteristics of the specific language they use, because they may not have alternatives. This has led to the need to develop language specific search engines instead of general purpose ones. However, search engines that support only a specific language and writing scheme do not allow users to access all relevant documents available on the Web, because there may be documents relevant to the user's query that are not in the same language and script as the query.
In recent years, multilingual information retrieval (MLIR), or bilingual IR for the case of two languages, has received considerable attention because it searches and presents results in a second or third language that the user knows [6]. An MLIR system involves providing a query in one language and searching document collections in one or more different languages. Since the query and the documents are written in different languages, any Bilingual Information Retrieval (BLIR) system integrates a machine translation system as its core component to translate either the query or the document. According to [5], many BLIR systems have used a query translation approach since translating the documents needs a large bilingual corpus, particularly in morphologically rich languages such as Amharic.
Although machine translation (MT) systems use different approaches, the dictionary based approach is preferable for under resourced language bilingual Web search engines that use the query translation approach, for several reasons. We can see these reasons
from two perspectives. The first is from the perspective of the advantages of the machine readable dictionary translation approach [13]:
• Machine-readable dictionaries that are used in the dictionary-based translation approach are more widely available and easier to use than the parallel corpora required by the corpus-based approach.
• The limited availability of existing parallel corpora cannot meet the requirements of practical retrieval systems in today's diverse and fast-growing Web environment.
• The dictionary-based Cross-Lingual Information Retrieval (CLIR) approach is more flexible, easier to develop, and easier to control when compared with Machine Translation based CLIR, which has little space for users to modify it for their specific purposes.
Secondly, from the current IR models perspective [17]:
• Most IR models are based on bag-of-words models. Therefore, they don't take the syntactic structure and word order of queries into account and hence they are easy to develop.
• Queries submitted to IR systems are usually short. Therefore, they don't describe the user's information need in an unambiguous and precise way.
This means that a simpler translation approach may suffice to implement the translation process. Having these ideas and the scarcity of linguistic resources for Amharic in mind, we believe that developing the query translation system based on a word-by-word translation method, looking up a general-purpose Amharic-English bilingual dictionary, can perform better.
However, the dictionary based approach has one serious problem: the limited word coverage of dictionaries, which shows up as Out-Of-Vocabulary (OOV) words in the query. This often occurs because most queries contain proper names and borrowed words that are often not present in bilingual lexicons [6]. As a result, the translation component should integrate a transliteration component to alleviate the problem of OOVs and to translate them into the target language without using a dictionary.
Amharic is the second most-spoken Semitic language in the world, next to Arabic, and the official working language of the Federal Democratic Republic of Ethiopia (a country with a population of more than 80 million). As a result, there is a significant amount of Amharic documents on the Web. Although the percentage of Amharic content on the Web is much smaller than that of English, the Amharic content could be different from content published in English, as it is related mostly to Ethiopian events. Thus, using only one of the languages does not allow access to all the available relevant documents. There are a number of works on Amharic-English information retrieval [1, 3, 4] and a few on Amharic search engines [2]. However, to our knowledge, there is no work on Amharic-English information retrieval on the Web that accepts user queries according to the user's language preferences and returns results in both languages.
This research work thus aims to design and develop a generic model for a bilingual search engine that integrates a query translation system for Amharic and English.

2. RELATED WORK
We reviewed and discussed some related works on CLIR, multilingual/bilingual search engines, and Amharic-English cross-lingual information retrieval by focusing on the translation approaches used, the information retrieval model or tool used, the language pairs used, and whether they are Web applications or not.
Mohammed et al. [14] evaluated the use of a Machine Translation based approach for query translation in an Arabic-English CLIR system. The work experimented with three query types to determine the effects of query length on the performance of the machine translation based method for CLIR. The results showed that the machine translation achieved 61.8%, 64.7%, and 60.2% for title, description, and narrative fields, respectively. As Arabic is a relatively widely researched Semitic language and shares a number of properties with Amharic, the tools developed for this language can be customized and used for Amharic.
Another work on Cross-Lingual Information Retrieval is that of P. L. Nikesh et al. [15], which describes an English-Malayalam Cross-Lingual Information Retrieval system. The system retrieves Malayalam documents in response to a query given in English or Malayalam. The authors used a bilingual dictionary developed in house for translation. For document ranking and retrieval, a system based on the vector space model (VSM) was used. As Malayalam is an under resourced language like Amharic, the query translation approach that the authors follow and the Information Retrieval (IR) tool used are the important points that we have learnt from to develop our system.
Joanne Capstick et al. [12] developed MULINEX, a fully implemented multilingual search engine available in German, French and English. MULINEX is a multilingual Internet search engine that supports selective information access, navigation and browsing in a multilingual environment. MULINEX incorporates Web spiders, concept-based indexing, relevance feedback, translation disambiguation, document categorization, and summarization functionalities. In this system, queries are morphologically analyzed and then translated by making use of multilingual dictionaries.
Another work on multilingual search engines is that of Wen-hui Zhang et al. [11]. This work is a multilingual Chinese-English search engine developed by a project of the Chinese Academy of Sciences. The work was conducted with the intention of developing systems that can efficiently search, index and retrieve multilingual information for Chinese (mother tongue) and English. The system has uniform query interfaces for both Chinese and English to allow the user to conduct searches.
Another work, on Multilingual Web Retrieval with an Experiment in English–Chinese Business Intelligence, was developed by Jialun Qin et al. [13]. This work dealt with developing and evaluating a multilingual English–Chinese Web portal that incorporates various CLIR techniques for use in the business domain. According to the paper [13], the dictionary based approach to query translation is the most promising for Web applications. The authors strongly recommended the use of dictionary based approaches for Web based applications, especially for under resourced languages like Amharic. The paper has also shown the effectiveness of the approach, as queries are usually short and IR models are based on bag-of-words.
Other related works are on Amharic-English Information Retrieval. Atelach Alemu, in her three consecutive research works on Dictionary-based Amharic - English Information Retrieval [4],
Amharic-English Information Retrieval [3], and Amharic-English Information Retrieval with Pseudo Relevance Feedback [1], discussed cross lingual information retrieval between Amharic and English. In all of these works, the authors used Amharic-English machine readable dictionaries and an online Amharic-English dictionary for query term translation, with some additional enhancements for query translation, indexing and searching from one work to the next. The Amharic topic set used in all the experiments was constructed by manually translating the English topics. As the experimental results showed, progressive performance improvements were observed from the first work to the next, and the related challenges were discussed. These Amharic-English cross lingual information retrieval works did not address information retrieval on the Web and were not used to retrieve both Amharic and English documents. However, cross lingual information retrieval on the Web has additional overheads that should be considered compared to traditional CLIR, due to several characteristics of Web documents. Multilingual Web retrieval differs from traditional CLIR because [13]:
• Web pages are more unstructured and very diverse in terms of document content and document format (such as HTML or ASP).
• Traditional CLIR usually focuses on effectiveness, measured in recall and precision. However, Web retrieval is also concerned with efficiency to end users (i.e. response time and query length).

3. ANALYSIS OF THE AMHARIC LANGUAGE
Amharic is the second most-spoken Semitic language in the world, next to Arabic, and the official working language of the Federal Democratic Republic of Ethiopia. Amharic uses a unique script, which originated from the Ge'ez alphabet; Ge'ez is the ancient liturgical language of the Ethiopian Orthodox Church. Amharic has been the working language of the government, the military, and the Ethiopian Orthodox Tewahedo Church throughout modern times. Outside Ethiopia, Amharic is the language of some 2.7 million emigrants (primarily in Egypt, Israel, Sweden, Eritrea, and the United States) [8, 9, 10]. Thus, it has official status and is spoken by many people as their native or second language. In addition, it is a language with a rich literature. Of those who speak Amharic, a significant number (usually the educated class) can understand and speak English as well.
Amharic has a complex morphology which combines consonantal roots and vowel intercalation. Amharic and English differ substantially in their morphology, syntax and the writing system they use. Therefore, search engines which are mainly developed for English cannot efficiently be used to retrieve Amharic documents. The Amharic language has the following characteristics that should be considered in information retrieval and translation systems.

• Morphological variations
As Amharic is morphologically complex, words are inflected with prefixes, suffixes and infixes. For example, the Amharic word “በላ” has morphological variations such as: “በሉ”፣ “በላን”፣ “በላች”፣ “በላሁ”፣ “በላችሁ”፣ “አስበላ”, etc.

• Character Redundancy
Amharic has some redundant symbols with the same sound in its alphabet. For example, አ, ኣ, ዐ and ዓ, ሰ and ሠ, and ጸ and ፀ have the same sound. These characters can be used interchangeably without any difference in meaning.

• Short Words
In Amharic, it is common to shorten some words using the forward slash ('/') and the English full stop (ነጥብ(.)). For example, the Amharic words ወታደር, ዓመተ ምህረት, ወልደ ስላሴ can be shortened as ወ/ር, ዓ.ም, ወ/ስላሴ, respectively. These shortened words need to be expanded to their normal forms for dictionary lookup or IR systems.

• Variations due to Pronunciations
Some Amharic words have multiple spelling variants, like ጧት፣ ጥዋት፣ ጡዋት፣ ጠዋት፣ etc. The main reason for such variants is the multiple regional dialects of the spoken language. In addition, most of the words which are adopted from English or other languages are written in different forms, like ሚሊዮን፣ ሚልዮን፣ ሚሊየን, etc. There is no standardized spelling for such words, which in turn results in huge variation in transliteration. These variations in spelling are difficult for Web information retrieval applications such as Google, which retrieve documents based on exact term matching.

• Stop-words
Most search engines and query translation systems do not consider extremely common words, in order to save disk space or to speed up search results. Although Amharic does not have standard stop word lists, ሁሉ, ሁኔታ, ሆነ, በኋላ, በጣም, ብቻ, ወደ, ናቸው, etc. are considered to be stop words.

• Amharic Punctuation Marks
There are different punctuation marks in Amharic, and among them ።(አራት ነጥብ), ፣(ነጠላ ሰረዝ), ፤(ድርብ ሰረዝ), ?(የጥያቄ ምልክት) are the most commonly used ones. They are used in query or document tokenization as word delimiters in addition to white spaces.

4. THE ARCHITECTURE OF AMHARIC-ENGLISH BILINGUAL SEARCH ENGINE
As can be surmised from its name, our bilingual search engine is designed to pull up results in not one, but two languages by accepting queries in either of the languages (Amharic or English). Like most bilingual/multilingual search engines, our proposed Amharic-English bilingual search engine has basic components such as Query Preprocessing, Query Translation, Amharic Search Engine, and English Search Engine, as shown in Figure 1.

Query Preprocessing
The query posed by the user must pass through the query preprocessing stage before it is translated for cross language retrieval; at the same time, it can be directly submitted to the appropriate search engine for monolingual retrieval. Since query preprocessing is a language specific task, it is done in two different query preprocessing modules.

Figure 1: The Architecture of the Amharic-English Bilingual Search Engine

The first is the Amharic query preprocessing module, which in turn consists of subcomponents such as a tokenizer, normalizer, stop-words remover, and stemmer. This module is responsible for tokenizing the Amharic queries into words using white spaces and Amharic punctuation marks, normalizing Amharic redundant symbols which have the same sound (such as: አ, ኣ, ዐ and ዓ, ሰ and ሠ, and ጸ and ፀ), expanding short forms of Amharic words such as ወ/ሮ፣ ዓ.ም፣ ወ/ሚካኤል፣ etc., eliminating stop-words (less informative words) such as: ነው፣ ሆነ፣ ወደ፣ ናቸው፣ ውስጥ፣ etc., and stemming inflectional and some derivational Amharic morphemes. For example, the Amharic words “በሉ”፣ “በላን”፣ “በላች”፣ “በላሁ”፣ “በላችሁ”፣ “አስበላ”, etc. can be reduced to their citation form “በላ” using the stemming subcomponent. This helps the query translation component to handle morphological variations and to find matches in the dictionaries for as many of the query words as possible. The output of the Amharic query preprocessing module is a preprocessed bag of Amharic words, which is used as input for the Amharic query translation component.
The second is the English query preprocessing module. Like the Amharic query preprocessing module, it has subcomponents such as a tokenizer, stop-words remover, and stemmer. This module is responsible for tokenizing the English queries into words using white spaces and some English punctuation marks as word delimiters, eliminating English stop-words (such as: “a”, “an”, “are”, “be”, “for”, etc.), and stemming inflectional and some derivational English morphemes. For example, English words such as “charger”, “charging”, “charged”, “charges” can be reduced to their base word “charge”. In general, these subcomponents have similar functionalities to their corresponding Amharic query preprocessing subcomponents, except that they are used for English queries.

Query Translation
The query translation process also uses two components (Amharic Query translation and English Query translation), which are responsible for lexical transfer, that is, dictionary lookup in the bilingual Amharic-English dictionary, and for transliterating out-of-dictionary words on the assumption that they are proper names or borrowed words, as shown in Figure 2.
The Amharic Query translation component in turn has two subcomponents: a lexical transfer and a transliterator. The lexical transfer receives Amharic words from the query preprocessing component and automatically translates them into English words using only the bilingual dictionary.

Transliteration
All the words that are not found in the bilingual dictionary pass through the transliterator module. As shown in the algorithm in Figure 3, this module works by segmenting each stemmed Amharic word into Amharic phonemes (or characters), transforming each character into its corresponding English characters based on the convention for the transcription of Ethiopic script in ASCII, and concatenating the translated English phonemes (or characters) into a single English word. In the transliteration convention, a single Amharic character is replaced by one, two or three Latin characters. For instance, መ: me, ሙ: mu, ሚ: mi, ማ: ma, ሜ: mie, and ሞ: mo. However, the mapping of sadis (sixth-order) characters such as “ም” varies according to their position. “ም” is mapped to “mi” when it comes first, or simply “m” when it comes at the end of the word. These consonant and vowel combinations are called “Xi” in the algorithm.

Figure 2: Amharic Query Translation Component
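The Amharic query path described in this section (tokenizing, expanding short forms, normalizing homophone characters, removing stop-words, dictionary lookup, and transliteration of out-of-dictionary words) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny tables, the function names, and the simplified sadis rule ("mi" at the start of a word, "m" elsewhere) are not the paper's actual resources or implementation, and stemming is omitted.

```python
import re

# Toy resources for illustration only (assumptions, not the paper's data).
SHORT_FORMS = {"ዓ.ም": "ዓመተ ምህረት", "ወ/ር": "ወታደር"}
# Only first-order homophones are listed; a fuller table would cover every order.
HOMOPHONES = str.maketrans({"ኣ": "አ", "ዐ": "አ", "ዓ": "አ", "ሠ": "ሰ", "ፀ": "ጸ"})
STOP_WORDS = {"ነው", "ሆነ", "ወደ", "ናቸው", "ውስጥ"}
DICTIONARY = {"ዜና": "news", "ምርጫ": "election"}
LATIN = {"መ": "me", "ሙ": "mu", "ሚ": "mi", "ማ": "ma", "ሜ": "mie", "ሞ": "mo"}
SADIS = {"ም": "m"}


def preprocess(query):
    """Tokenize, expand short forms, normalize homophones, drop stop-words."""
    # White space and the common Amharic punctuation marks act as delimiters.
    tokens = [t for t in re.split(r"[\s።፣፤?]+", query) if t]
    words = []
    for token in tokens:
        expanded = SHORT_FORMS.get(token, token)
        for word in expanded.split():
            word = word.translate(HOMOPHONES)
            if word not in STOP_WORDS:
                words.append(word)
    return words


def transliterate(word):
    """Map each character to Latin letters; simplified sadis handling."""
    out = []
    for i, ch in enumerate(word):
        if ch in SADIS:
            # Simplification: "Xi" form at the start of the word, bare consonant elsewhere.
            out.append(SADIS[ch] + "i" if i == 0 else SADIS[ch])
        else:
            # Characters missing from the toy table are passed through unchanged.
            out.append(LATIN.get(ch, ch))
    return "".join(out)


def translate_query(query):
    """Dictionary lookup first, transliteration as the fallback."""
    return [DICTIONARY.get(w) or transliterate(w) for w in preprocess(query)]


print(translate_query("ዜና ማም።"))  # ['news', 'mam']
```

In this sketch the dictionary keys would have to be stored in normalized form, since lookup happens after homophone normalization; the fuller position rules for sadis characters are the ones listed in Figure 3.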
Input: Amharic bag of words
Output: English bag of words
For each Amharic stemmed word not found in the dictionary
    Split the word into its Amharic characters
    For each character in the word
        If (the character is at the beginning of the word and the character is sadis)
            Replace it with its conventional Latin form 'Xi'
            Put the characters at the beginning of the word
        Else if (the character is sadis and the character before it is also sadis)
            Replace it with its conventional Latin form 'Xi'
            Append the characters in their appropriate place
        Else if (the character is sadis and it is before the last character)
            Replace it with its conventional Latin form 'Xi', except for some exceptions
            Append the characters in their appropriate place
        Else
            Transliterate the character based on the transliteration convention
        End if
    End for
    Concatenate the characters
End for
Figure 3: Algorithm for Amharic-English transliteration

The English Query Translation component also has two subcomponents and works like the Amharic query translation component, except that the transliterator subcomponent uses the transliteration database as its knowledge base instead of direct character mappings.
Like the algorithm in Figure 3, some other algorithms such as the Amharic-English translator, English-Amharic translator, English-Amharic transliterator, etc. have been developed for translation and transliteration from Amharic to English and vice versa.

Amharic and English Search Engines
The results of our bilingual search engine are highly dependent on the underlying English and Amharic search engines used to collect documents from the Web. Each search engine performs crawling, indexing and ranking according to its own mechanism. The Amharic search engine component is responsible for crawling, indexing, and ranking of the Amharic Web documents [2]. The English search engine component is responsible for crawling, indexing, and ranking of English Web documents [16].

5. EXPERIMENTAL RESULTS
In order to evaluate the effectiveness of our search engine, an experiment which measures the precision of the system was designed and conducted using both Amharic and English queries and on both Amharic and English Web document collections. We also conducted an experiment to show the significance of our bilingual search engine. This is done by comparing the relevant bilingual retrieval results of our system (Amharic and English documents) with the relevant results of Google monolingual search. We also conducted experiments to evaluate the performance of some individual subcomponents which have a direct impact on the effectiveness of our system.

Results of Translation Component
To assess the effectiveness of the Amharic-English translation component, an in-house developed bilingual dictionary with 5000 Amharic words and their corresponding translations is used, and 15 Amharic queries, which have a total of 36 words, are selected for the cross-lingual evaluation. This component properly translates 10 of the 15 queries. Word-wise, out of the 36 words, 31 (88.57%) words are properly translated. To test the effectiveness of the English-Amharic query translation component, 15 English queries, which are the manual translations of the 15 Amharic queries, are used. Out of the 15 English queries, 14 (93.33%) queries are properly translated.

Results of the Transliteration Component
The transliteration component is also independently evaluated for its performance. To do this, we have collected 1100 (one thousand one hundred) proper names, most of which are person names and some of which are the names of countries, places, hotels, well known cities, etc. The manual transliteration of the gazetteer is made with the help of linguists. Out of the 1100 proper names, 887 (80.64%) names are properly transliterated.

Bilingual Retrieval Performance Evaluation
Since our system is bidirectional, a single query and its corresponding manual translation yield four different search results. For the Amharic query “የአማርኛ ቋንቋ”, the Amharic-English search engine retrieved both Amharic and English Web documents in the same browser window, separated by frames, as shown in Figure 4. In the same manner, the English-Amharic results are as shown in Figure 5 for the English query “Election 2002”.
The Amharic-English cross-lingual retrieval engine is evaluated using the 15 Amharic queries and their manual translations. Since it was impossible to read all Web pages collected during crawling to judge the relevance of the documents, we focused on precision only for the top 10 retrieved Web pages for each query. Based on the precisions calculated for the English monolingual and Amharic-English cross-lingual retrieval for each query, the precision of the English monolingual retrieval engine is 85% and that of the Amharic-English retrieval engine is 63%. From this, we can observe that the Amharic-English cross-lingual retrieval system achieved 74.12% of the performance of its corresponding monolingual retrieval engine.
The English-Amharic cross-lingual retrieval engine is also evaluated using the 15 English queries and their manual translations. The Amharic monolingual retrieval engine has an average precision of 85% and its corresponding English-Amharic cross-lingual retrieval engine has an average precision of 67%, as calculated for only the top 10 retrieved Web pages. This experimental result also showed us that the English-Amharic cross-lingual retrieval engine achieved 78.82% of the performance of its corresponding monolingual retrieval engine.
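The relative figures reported in this section follow from dividing the cross-lingual precision by the corresponding monolingual precision. The short Python sketch below reproduces that arithmetic; the function names and the example relevance judgments are hypothetical, and only the 63%, 67% and 85% averages come from the text.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant (list of booleans)."""
    return sum(relevance[:k]) / k


def relative_performance(cross_lingual, monolingual):
    """Cross-lingual precision as a percentage of monolingual precision."""
    return 100 * cross_lingual / monolingual


# Hypothetical judgments for one query: 7 of the top 10 pages are relevant.
print(precision_at_k([True] * 7 + [False] * 3))  # 0.7

# Average precisions reported in the text, in percent.
print(round(relative_performance(63, 85), 2))  # 74.12
print(round(relative_performance(67, 85), 2))  # 78.82
```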
Figure 4: A Snapshot of the Search Result for the Amharic query “የአማርኛ ቋንቋ”

Figure 5: A Snapshot of the Search Result for the English query “Election 2002”

The Significance of Our System Compared to General Purpose Search Engines
To evaluate the significance of our system when compared to general purpose search engines, some Amharic queries which have morphological variations, character variations, and short words are given to our system and to the Google search engine. For the experiment, our search engine worked on seven hand-picked seed URLs provided to the crawler. Domain experts were also involved to judge the relevance of the retrieved documents to the query.
As shown in Table 1, our system has two advantages over general purpose search engines. First, our system considered the characteristics of the Amharic language and retrieved the same results for different variations of the words in the query, whereas Google did not. Secondly, our system retrieved a larger number of relevant documents by simultaneously searching both Amharic and English Web documents for the given Amharic query, whereas Google retrieved only Amharic Web documents.

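Table 1: Google versus our system results for queries that show different characteristics of Amharic language.
(For each pair of query variants, "Documents Similarity" is a single value shared by the two rows of the pair.)

Language Characteristics | Query | Google: No. of Rel. Documents | Google: Documents Similarity | Our System, Amharic: No. of Rel. Documents | Our System, Amharic: Documents Similarity | Our System: No. of Rel. English Documents | Our System: Total No. of Relevant Documents
Morphological Variations | የኢትዮጵያ ፓርቲዎች | 10 | 0 | 7 | 7 | 5 | 12
Morphological Variations | ኢትዮጵያ ፓርቲ | 7 | 0 | 7 | 7 | 5 | 12
Morphological Variations | የኢትዮጵያ ምርጫዎች | 9 | 0 | 8 | 8 | 7 | 15
Morphological Variations | ኢትዮጵያ ምርጫ | 6 | 0 | 8 | 8 | 7 | 15
Character Variations | አማርኛ ዜና | 9 | 0 | 10 | 10 | 7 | 17
Character Variations | ዐማርኛ ዜና | 2 | 0 | 10 | 10 | 7 | 17
Character Variations | መለስ ዜናዊ | 10 | 0 | 6 | 6 | 4 | 10
Character Variations | መለሥ ዜናዊ | 3 | 0 | 6 | 6 | 4 | 10
Short words | የትምህርት ጥራት | 10 | 1 | 8 | 8 | 5 | 13
Short words | የት/ት ጥራት | 6 | 1 | 8 | 8 | 5 | 13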

6. CONCLUSION AND FUTURE WORKS
In this work, we designed a generic model and developed a bilingual Web search engine for Amharic and English. The search engine has different components which address the basic language specific issues of the two languages in query translation and information retrieval. These major components are: Amharic query preprocessing, English query preprocessing, Amharic query translation, English query translation, and the Amharic and English search engines.
To evaluate the effectiveness of our Amharic-English bilingual search engine, precision was measured on the top 10 retrieved Web documents. The precision was calculated for each query and the average precision showed promising results for both the Amharic-English and English-Amharic cross-lingual evaluations.
With this work, several novel ideas and designs are proposed as contributions. Among these contributions, we:
• Identified the major components of an Amharic-English bilingual search engine which should be considered in developing such an engine.
• Identified the critical issues that need to be dealt with in cross-lingual search engines and proposed appropriate techniques.
• Proposed a general model for an Amharic-English bilingual search engine.
• Paved the way for further large scale research projects on Web based bilingual search engines by identifying language dependent components.
• Developed algorithms for query translation and transliteration. These algorithms can also be adopted for other cross-lingual information retrieval.
• Designed and created a bilingual Amharic-English dictionary for query translation and a repository reference table for transliteration of words in queries.
• Customized the Amharic stemmer developed for information retrieval to fit translation systems.
• Designed and developed an Amharic-English bilingual search engine based on the proposed model.
• Evaluated the prototype developed for its effectiveness.
7. REFERENCES
[1] Atelach Alemu Argaw. "Amharic-English Information Retrieval with Pseudo Relevance Feedback". In: Peters, C., et al. (eds.) Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007.
[2] Tessema Mindaye, Hassen Redwan, Solomon Atnafu. "Searching the Web for Amharic Content". International Journal of Multimedia Processing and Technologies (JMPT), 2010. Print ISSN: 0976-4127.
[3] Atelach Alemu Argaw and Lars Asker. "Amharic-English Information Retrieval". Working Notes of CLEF 2006, Alicante, Spain. September 2006.
[4] Atelach Alemu Argaw, Lars Asker, Rickard Cöster and Jussi Karlgren. "Dictionary-based Amharic - English Information Retrieval". In Proceedings of Cross Language Evaluation Forum (CLEF 2004), Bath, UK. September 2004.
[5] Kristen Parton, Kathleen R. McKeown, James Allan, and Enrique Henestroza. "Simultaneous Multilingual Search for Translingual Information Retrieval". In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, California, USA. ACM New York, NY, USA, 2008.
[6] Karunesh Arora, Ankur Garg, Gour Mohan, Somiram Singla, and Chander Mohan. "Cross Lingual Information Retrieval Efficiency Improvement through Transliteration". In Proceedings of ASCNT – 2009, CDAC, Noida, India, pp. 65 – 71, 2009.
[7] Judit Bar-Ilan and Tatyana Gutman. "How do search engines handle non-English queries? - A case study". WWW (Alternate Paper Tracks), Budapest, Hungary, 2003.
[8] Amharic - Wikipedia, the free encyclopaedia. Available at: http://en.wikipedia.org/wiki/Amharic, Accessed on 21 September 2009.
[9] Amharic Ethiopia Language. Available at: http://www.free-press-release.com/news/200907/1248234344.html, Accessed on July 21, 2009.
[10] Amharic Language. Available at: http://multilingualbooks.com/amharic.html, Accessed on 11 August 2010.
[11] Wen-hui Zhang, Hua-lin Qian, Wei Mao and Guo-nian Sun. "A Multilingual (Chinese, English) Indexing, Retrieval, Searching Search Engine". Available at http://www.isoc.org/inet99/proceedings/posters/210/index.htm, Accessed on August 10, 2009.
[12] Joanne Capstick, Abdel Kader Diagne, Gregor Erbach and Hans Uszkoreit. "MULINEX: Multilingual Web Search and Navigation". Published on 08.02.99. Available at http://eprints.kfupm.edu.sa/52030/1/52030.pdf, Accessed on August 25, 2009.
[13] Jialun Qin, Yilu Zhou, Michael Chau and Hsinchun Chen. "Multilingual Web retrieval: An experiment in English–Chinese business intelligence". John Wiley & Sons, Inc. New York, NY, USA, 2006.
[14] Mohammed Aljlayl, Ophir Frieder, and David Grossman. "On Arabic-English Cross-Language Information Retrieval: A Machine Translation Approach". IEEE Computer Society Washington, DC, USA, 2002.
[15] P. L. Nikesh, Sumam Mary Idicula, S. David Peter. "English-Malayalam Cross-Lingual Information Retrieval – an Experience". In Proceedings of IEEE International Conference on Electro/Information Technology, Ames, Iowa State University, May 2008.
[16] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. "Searching the Web". ACM Transactions on Internet Technology (TOIT), 2001.
[17] Wessel Kraaij, Jian-Yun Nie, and Michel Simard. "Embedding Web-based statistical translation models in cross-language information retrieval". MIT Press Cambridge, MA, USA, 2003.
