Rule Based Hindi to English Transliteration System for Proper Names

Monika Bhargava #1, M. Kumar *2, Sujoy Das #3

#1 M.Tech Scholar, CSE Department, SIRT, Bhopal, India
*2 Professor, CSE Department, SIRT, Bhopal, India
#3 Associate Professor, Department of Computer Application, MANIT, Bhopal, India
Abstract: Some Cross Language Information Retrieval (CLIR) systems use a bilingual dictionary to translate a user query from one language to another. A problem arises when a query term is not available in the bilingual dictionary. Such words are called Out of Vocabulary (OOV) words and should be transliterated during the translation process. OOV words are mainly proper nouns, named entities, and technical terms. We have developed a rule based transliteration system from Hindi to English script. We have also created a database of specialized spellings, e.g. some city names, person names, etc., which has considerably improved the performance of our system.
Keywords: CLIR, OOV, Transliteration

I. INTRODUCTION

In the past 20 years, the area of Information Retrieval (IR) has grown well beyond its primary goals of indexing text and searching for useful documents in a collection. Nowadays, research in IR includes modelling, document classification and categorization, data visualization, filtering, etc. The Web is becoming a universal repository of human knowledge and culture, which has allowed unprecedented sharing of ideas and information on a scale never seen before. The Web is now seen as a publishing medium accessible to everybody. Web content is growing very rapidly and contains information written in many languages. Often a Web user needs information written in an unfamiliar language but wishes to get it in his or her native language. This is possible through a process called Cross Language Information Retrieval (CLIR).

Several methods are used to convert text from one language to another. Machine translation systems are available for many language pairs and are widely used. Bilingual dictionaries are also frequently used to convert text from one language to another. A major problem arises when a word of the text is not available in the bilingual dictionary. Such words are called Out of Vocabulary (OOV) words and should be transliterated.

Transliteration is the task of transcribing a word or text from one writing system into another such that the pronunciation of the word remains the same and a person reading the transcribed word can read it as in the original language. In other words, transliteration converts a text from its customary orthography into another writing system. It is different from translation, which conveys the meaning of the input text, e.g. book is translated as किताब but transliterated as बुक. OOV words are problematic in Cross Lingual Information Retrieval. Common sources of error in CLIR are out of vocabulary words, named entities and technical terms.

Among OOV words, proper nouns pose a major problem in transliteration. This is because a proper noun (e.g. the name of a person) is written by different people with different spellings. This research has developed a rule based Hindi-English transliteration system, especially for proper nouns, with a fair degree of accuracy.
 
II. LITERATURE SURVEY

The problem of transliteration has been studied by a number of researchers during the last decade. Knight and Graehl [1] use five probability distributions at various phases of transliteration for the language pair English to Katakana (a Japanese writing system). Al-Onaizan and Knight [2] have studied a transliteration system from Arabic to English, which uses an existing named entity recognition system. Asif et al. [3] have considered a Bengali to English transliteration scheme and used a supervised training set to obtain a direct orthographic mapping. Lehal and Saini [4] have developed a Hindi to Urdu transliteration system by improving on the work of Bushra and Tafseer [5]. Lehal and Saini have claimed an accuracy of 99.46% when Hindi Unicode text is transliterated to Urdu.

Haung et al. [6] have developed a system which extracts Hindi-English named entity pairs through alignment of a parallel corpus. Here, Chinese-English pairs are first extracted using dynamic programming string matching. This model is then adapted for Hindi-English named entity pairs. Sinha et al. [7] have developed a simple yet powerful method for mining Hindi-English names from a parallel text corpus. The Hindi text written in Devanagari is first converted to IITK-Roman form, which is a direct representation of the UTF-8 or ISCII-8 coding scheme; they claimed an accuracy of nearly 93%.

In this paper, an effort is made to develop a rule based transliteration scheme for proper names of the Hindi-English language pair. The system is under extensive experimentation and testing.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 10, No. 8, August 2012. http://sites.google.com/site/ijcsis/ ISSN 1947-5500
 
III. OVERVIEW OF HINDI AND ENGLISH SCRIPTS

Hindi, an official language of India, is an Indo-Aryan language with about 487 million speakers. Hindi is written in the Devanagari script, which uses 52 symbols to represent 10 vowels, 40 consonants and 2 modifiers. The vowels are written in two forms, independent and dependent. The dependent form is also known as matraa. The former is used when the vowel letter appears alone at the beginning of a word or is immediately followed by another vowel. The latter is used when the vowel follows a consonant [8].

English is the most widely used language in the world. Approximately 375 million people speak English. It has been referred to as a 'world language'. English speakers have many different accents, which often signal the speaker's native dialect or language. English is derived from the West Germanic branch of the Indo-European family. The English alphabet has 21 consonant letters and 5 vowel letters.

The Indian languages TRANSliteration (ITRANS) is an ASCII transliteration scheme for Indic scripts, particularly for the Devanagari script [9]. It is a pre-processor that converts English-encoded text into various Indian language scripts and uses a 7-bit ASCII encoding scheme (see [10]).

Mapping from Hindi to English

There is no one-to-one correspondence between the Hindi and English scripts. Tables I and II show the ITRANS mapping from the source language (Hindi) to the target language (English). We have used these mappings to transliterate proper names from Hindi to English.
TABLE I. MAPPING OF VOWELS FROM HINDI TO ENGLISH

S.No   Hindi Vowel (Dependent Form)   Hindi Vowel (Independent Form)   English Vowel
1      -                              अ                                A
2      ◌ा                             आ                                AA
3      ◌ि                             इ                                I
4      ◌ी                             ई                                II
5      ◌ु                             उ                                U
6      ◌ू                             ऊ                                UU
7      ◌ृ                             ऋ                                RRI
8      ◌े                             ए                                E
9      ◌ै                             ऐ                                AI
10     ◌ो                             ओ                                O
11     ◌ौ                             औ                                AU
12     ◌ं                             अं                               AM
13     ◌ः                             अः                               AH

TABLE II. MAPPING OF MOST CONSONANTS FROM HINDI TO ENGLISH

S.No   Hindi   English
1      क       KA
2      ख       KHA
3      ग       GA
4      घ       GHA
5      ङ       NA
6      च       CHA
7      छ       CHHA
8      ज       JA
9      झ       JHA
10     ञ       NA
11     ट       TA
12     ठ       THA
13     ड       DA
14     ढ       DHA
15     ण       NA
16     त       TA
17     थ       THA
18     द       DA
19     ध       DHA
20     न       NA
21     प       PA
22     फ       PHA
23     ब       BA
24     भ       BHA
25     म       MA
26     य       YA
27     र       RA
28     ल       LA
29     व       VA or WA
30     श       SHA
31     ष       SHA or SHHA
32     स       SA
33     ह       HA
34     क्ष      KSHA
35     ज्ञ      GYA
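
The "Through Mapping" column of the sample tables can be approximated by applying Tables I and II character by character. The following is a minimal sketch under stated assumptions, not the authors' Java implementation: the function name `map_transliterate` and the dictionary names are hypothetical, `V` and `SH` are picked where Table II offers two alternatives, and the handling of the two modifiers as a trailing `M` or `H` is an assumption.

```python
# Consonant bases from Table II without the inherent vowel 'A'.
CONSONANTS = {
    "क": "K", "ख": "KH", "ग": "G", "घ": "GH", "ङ": "N",
    "च": "CH", "छ": "CHH", "ज": "J", "झ": "JH", "ञ": "N",
    "ट": "T", "ठ": "TH", "ड": "D", "ढ": "DH", "ण": "N",
    "त": "T", "थ": "TH", "द": "D", "ध": "DH", "न": "N",
    "प": "P", "फ": "PH", "ब": "B", "भ": "BH", "म": "M",
    "य": "Y", "र": "R", "ल": "L", "व": "V", "श": "SH",
    "ष": "SH", "स": "S", "ह": "H",
}
# Rows 34 and 35 of Table II are conjuncts: consonant + virama + consonant.
CONJUNCTS = {"\u0915\u094d\u0937": "KSH", "\u091c\u094d\u091e": "GY"}
# Independent vowel forms from Table I.
INDEPENDENT_VOWELS = {
    "अ": "A", "आ": "AA", "इ": "I", "ई": "II", "उ": "U", "ऊ": "UU",
    "ऋ": "RRI", "ए": "E", "ऐ": "AI", "ओ": "O", "औ": "AU",
}
# Dependent vowel signs (matraas) from Table I, written as escapes.
MATRAS = {
    "\u093e": "AA", "\u093f": "I", "\u0940": "II", "\u0941": "U",
    "\u0942": "UU", "\u0943": "RRI", "\u0947": "E", "\u0948": "AI",
    "\u094b": "O", "\u094c": "AU",
}
# Assumption: anusvara and visarga add 'M' / 'H' after the current vowel.
MODIFIERS = {"\u0902": "M", "\u0903": "H"}


def map_transliterate(word):
    """Apply the table mappings left to right, with no spelling rules."""
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 3] in CONJUNCTS:
            base, i = CONJUNCTS[word[i:i + 3]], i + 3
        elif word[i] in CONSONANTS:
            base, i = CONSONANTS[word[i]], i + 1
        else:
            # Independent vowel, modifier, or an unmapped character (kept as is).
            ch = word[i]
            out.append(INDEPENDENT_VOWELS.get(ch, MODIFIERS.get(ch, ch)))
            i += 1
            continue
        # A consonant takes the following matraa, else the inherent 'A'.
        if i < len(word) and word[i] in MATRAS:
            out.append(base + MATRAS[word[i]])
            i += 1
        else:
            out.append(base + "A")
    return "".join(out)


print(map_transliterate("पलक"))  # PALAKA
```

The trailing and medial 'A' characters produced this way are what the spelling rules in Section IV then remove.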
 
IV. EXPERIMENTAL SETUP

We have developed a rule based transliteration system using the Java SE Development Kit (JDK), version 6, with MySQL Server 5.5 as the database. The program flow and system architecture of the Hindi to English rule based transliterator are shown in Fig. 1. A Hindi string is first searched for in the database created for specialized spellings used in proper names; if a match is found, its transliteration equivalent is produced. If the string is not found in the database, it is transliterated using the nine rules described below.
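
The lookup-then-rules flow described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory dictionary stands in for the MySQL table of specialized spellings, the single entry in it is a hypothetical example, and `placeholder_rules` is a stand-in for the actual rule engine.

```python
def transliterate(word, special_spellings, rule_based):
    """Return the stored special spelling if present, else apply the rules."""
    known = special_spellings.get(word)
    if known is not None:
        return known
    return rule_based(word)


# Hypothetical entry for illustration: the Hindi spelling of Delhi.
DELHI_HINDI = "\u0926\u093f\u0932\u094d\u0932\u0940"
special_spellings = {DELHI_HINDI: "DELHI"}


def placeholder_rules(word):
    # Stand-in for the rule based transliterator; it only tags its input.
    return "<rule-based:" + word + ">"


print(transliterate(DELHI_HINDI, special_spellings, placeholder_rules))
```

Checking the special spellings first means that names with conventional English spellings never reach the rules, which only need to cover regular cases.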
Observations and Rules Creation

Sample results of our experimentation are shown in Tables III-XI, which list sample Hindi strings, the expected English transliteration (the commonly used English spelling), the transliteration through ITRANS mappings, and the results produced by our transliteration system.
TABLE III. SAMPLE 1 (TOTAL 45 STRINGS TESTED)

Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
पलक            PALAK                      PALAKA            PALAK
अभय            ABHAY                      ABHAYA            ABHAY
भरत            BHARAT                     BHARATA           BHARAT
करण            KARAN                      KARANA            KARAN

1) Rule 1: The observation of Table III gives the rule that if the proper noun is 3 characters long and contains no vowel (matraa), then 'A' is removed from the last position.
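
Rule 1 can be expressed as a small post-processing step on the mapped string. This is a sketch, not the authors' code: the names `has_matra` and `apply_rule1` are hypothetical, and "matraa" is assumed to mean any Devanagari dependent vowel sign in the range U+093E to U+094C.

```python
def has_matra(word):
    # Assumption: a matraa is a dependent vowel sign, U+093E to U+094C.
    return any("\u093e" <= ch <= "\u094c" for ch in word)


def apply_rule1(hindi, mapped):
    """Drop the final 'A' for a 3-character Hindi name without matraas."""
    if len(hindi) == 3 and not has_matra(hindi) and mapped.endswith("A"):
        return mapped[:-1]
    return mapped


print(apply_rule1("पलक", "PALAKA"))  # PALAK
```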
 
TABLE IV. SAMPLE 2 (TOTAL 45 STRINGS TESTED)

Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
हरशद           HARSHAD                    HARASHADA         HARSHAD
तनमय           TANMAY                     TANAMAYA          TANMAY
अरनव           ARNAV                      ARANAVA           ARNAV
अकबर           AKBAR                      AKABARA           AKBAR

2) Rule 2: The observation of Table IV gives the rule that if the proper noun is 4 characters long and contains no vowel (matraa), then 'A' is removed from the second and last positions.
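
One possible reading of Rule 2, consistent with Table IV, is that the inherent 'A' of the second and the fourth Hindi character is dropped. The sketch below assumes that reading and takes the mapped output as one segment per Hindi character; the names `apply_rule2`, `is_consonant` and `has_matra` are hypothetical.

```python
def is_consonant(ch):
    # Devanagari consonant letters occupy U+0915 to U+0939.
    return "\u0915" <= ch <= "\u0939"


def has_matra(word):
    # Assumption: a matraa is a dependent vowel sign, U+093E to U+094C.
    return any("\u093e" <= ch <= "\u094c" for ch in word)


def apply_rule2(hindi, segments):
    """segments holds the mapped string of each Hindi character, in order."""
    if len(hindi) != 4 or has_matra(hindi):
        return "".join(segments)
    trimmed = list(segments)
    for idx in (1, 3):  # second and last character, 0-based
        if is_consonant(hindi[idx]) and trimmed[idx].endswith("A"):
            trimmed[idx] = trimmed[idx][:-1]
    return "".join(trimmed)


print(apply_rule2("हरशद", ["HA", "RA", "SHA", "DA"]))  # HARSHAD
```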
 
TABLE V. SAMPLE 3 (TOTAL 45 STRINGS TESTED)

Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
कपिल           KAPIL                      KAPILA            KAPIL
अभिनव          ABHINAV                    ABHINAVA          ABHINAV
विपिन          VIPIN                      VIPINA            VIPIN
मुकुल          MUKUL                      MUKULA            MUKUL

3) Rule 3: The observation of Table V gives the rule that if the proper noun ends with a consonant, then 'A' should be removed from the last position in the English spelling.
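
Rule 3 generalises the final-'A' removal to names of any length. A minimal sketch, assuming the check is made on the last Hindi character and using the hypothetical names `is_consonant` and `apply_rule3`:

```python
def is_consonant(ch):
    # Devanagari consonant letters occupy U+0915 to U+0939.
    return "\u0915" <= ch <= "\u0939"


def apply_rule3(hindi, mapped):
    """Drop the trailing 'A' when the Hindi name ends in a bare consonant."""
    if hindi and is_consonant(hindi[-1]) and mapped.endswith("A"):
        return mapped[:-1]
    return mapped


print(apply_rule3("कपिल", "KAPILA"))  # KAPIL
```

A name that ends in a matraa, such as the entries of Table VI ending in 'AA', is left unchanged by this step.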
TABLE VI. SAMPLE 4 (TOTAL 45 STRINGS TESTED)

Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
विमला          VIMLA                      VIMALAA           VIMLA
पदमा           PADMA                      PADAMAA           PADMA
बुशरा          BUSHRA                     BUSHARAA          BUSHRA
बलदेव          BALDEV                     BALADEVA          BALDEV

4) Rule 4: The observation of Table VI gives the rule that if two consonants occur in succession in a proper noun, the latter consonant is followed by a vowel, and the index of the first consonant is greater than 1, then 'A' is removed from the first consonant during transliteration.
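
The condition of Rule 4 can be checked directly on the Hindi string. The sketch below is an interpretation, not the authors' code: it assumes that "vowel" means a matraa, that the index is 1-based (so the first character of the name never qualifies), and it returns 0-based positions of the consonants whose inherent 'A' would be dropped. The function names are hypothetical.

```python
def is_consonant(ch):
    # Devanagari consonant letters occupy U+0915 to U+0939.
    return "\u0915" <= ch <= "\u0939"


def is_matra(ch):
    # Assumption: a matraa is a dependent vowel sign, U+093E to U+094C.
    return "\u093e" <= ch <= "\u094c"


def rule4_drop_positions(hindi):
    """0-based positions of consonants that lose their inherent 'A'."""
    drops = []
    # Start at 1 so the first character of the name is never selected.
    for i in range(1, len(hindi) - 2):
        if (is_consonant(hindi[i]) and is_consonant(hindi[i + 1])
                and is_matra(hindi[i + 2])):
            drops.append(i)
    return drops


print(rule4_drop_positions("बलदेव"))  # [1]
```

For the second character of बलदेव this turns BALADEVA into BALDEVA; the final 'A' is then removed by Rule 3. The index condition keeps a name that begins with two consonants, such as the one mapped to GARIMAA in Table VII, from losing the 'A' of its first letter.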
TABLE VII. SAMPLE 5 (TOTAL 45 STRINGS TESTED)

Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
आसिफ           AASIF                      AASIPHA           AASIF
आमना           AAMNA                      AAMANAA           AAMNA
हिमानी         HIMANI                     HIMAANII          HIMANI
गरिमा          GARIMA                     GARIMAA           GARIMA

5) Rule 5: The observation of Table VII gives the rule that if the proper noun begins with आ, it is replaced with 'AA', and i
[Figure: flowchart with the blocks Hindi Word, Database Look Up, Special Spellings Database, Mapping Tables, Transliteration Rules, and English Output]
Fig. 1. Program flow of the transliteration system