http://www.cisjournal.org
ABSTRACT
Web content is growing day by day and contains information in many languages. To access information written in a language other than one's native language, one needs a Cross Language Information Retrieval (CLIR) system. CLIR systems generally use bilingual dictionaries to translate a user query from one language to another. A problem arises when a query term is not found in the dictionary; such words are transliterated using character mappings. The problem of transliteration from Hindi to Urdu script has been studied by several authors. Bushra and Tafseer have reported that proper names pose major problems in transliteration. In this research, we consider the problem of transliterating proper nouns from Hindi to Urdu. We have developed a rule-based transliteration system with a fair degree of accuracy.
1. INTRODUCTION
The Web is becoming a universal repository of human knowledge and culture, which has allowed unprecedented sharing of ideas and information on a scale never seen before. Web content is growing very rapidly and contains information written in many languages, so there is often a need to access information written in a language other than the native language. To access such information, we need CLIR systems. The process of writing a query in the native language and retrieving documents in a target language is called Cross Language Information Retrieval (CLIR).

There are various methods by which the text of one language can be converted into another language. Bilingual dictionaries are frequently used for this purpose. A major problem arises when a word of the text is not available in the bilingual dictionary. Such words are mainly proper names, culture-specific words, etc., and are called Out Of Vocabulary (OOV) words. Such words should be transliterated.

Transliteration is the practice of converting a text from one script to another. It strives only to represent the characters accurately. Transliteration is very useful whenever there is a need to display information in two languages, for example railway and airline reservation charts, signboards, and the publication of parliament proceedings in two languages, especially in bilingual countries.

2. LITERATURE SURVEY

The problem of transliteration has been studied by a number of researchers during the last decade. Knight and Graehl [1] use five probability distributions at various phases of transliteration for the language pair English to Katakana (a Japanese writing system). Huang et al. [2] have developed a system which extracts Hindi-English named entity pairs through alignment of a parallel corpus.

Al-Onaizan and Knight [3] have studied a transliteration system from English to the Arabic writing system. Bushra and Tafseer [4] discussed the issues in Hindi to Urdu transliteration and identified where enhancement is required. Lehal and Saini [5] have developed a Hindi to Urdu transliteration system by improving on the work of Bushra and Tafseer.

In this paper, we have made an effort to develop a rule-based transliteration scheme for proper names of the Hindi-Urdu language pair. The system is under extensive experimentation and testing.

3. A BRIEF OVERVIEW OF HINDI AND URDU

Hindi is an official language of India and is written in Devanagari script. Urdu is one of the state languages of India, and is written in Perso-Arabic script. Urdu has borrowed many words from Arabic and is written from right to left instead of left to right. There are 11 vowels and 36 consonants in Hindi. With 38 letters, the Urdu alphabet is typically written in the calligraphic Nasta'liq script. Vowels in Urdu are represented by letters that are also considered consonants [6]. In Urdu, some Hindi characters may be represented by more than one character, whereas some characters may change their shape due to the presence of certain characters. They may acquire initial, isolated, medial or final shape. The characters that change their shape can
VOL. 3, NO. 8 Aug, 2012 ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2012 CIS Journal. All rights reserved.
http://www.cisjournal.org
acquire all these shapes, whereas the rest of the characters can have only isolated and final shapes [5].

3.1 Character mapping from Hindi to Urdu

Transliteration is a process in which an input string of one language is converted to a string of the target language while maintaining the phonetics of the original word. The most commonly used mappings of Hindi consonants to Urdu characters are given in Table 1. The mappings of Hindi vowels to Urdu characters are shown in Table 2. These mappings are taken from the paper of Bushra and Tafseer [4]. It is apparent from these tables that 3 Hindi characters have multiple mappings in Urdu. We performed a statistical analysis of proper names in Hindi to determine the percentage of times each such Hindi character was mapped to each similar-sounding Urdu character. The findings are shown in Table 3. The Urdu character with the highest occurrence is used as the default mapping, i.e. ﺕ [te] for त, ﺱ [sin] for स and ﻩ [choti he] for ह.

4. CHALLENGES IN HINDI-URDU TRANSLITERATION

On the basis of morphology and phonology, Hindi shares some similarity with Urdu, but there are still a few challenges that need to be handled for a precise Hindi-Urdu transliteration system. Urdu uses the diacritical marks of Arabic script and has short and long vowels. A diacritic is placed with a consonant, and it precedes the syllable in the case of short vowels. Diacritic marks are also used for consonant elongation [7]. Issues related to Hindi to Urdu transliteration have been discussed in [4, 5]. Some of these issues have been tackled by framing rules and are as follows:

4.1 Ambiguous Character

Some Hindi characters have more than one representation in Urdu script, which creates ambiguity at the character level as shown in Table 1. For example, सोमालिया is written in Urdu as ﺻﻮﻣﺎﻟﻴہ, where स [s] is mapped to ﺹ, which occurs less frequently.

4.2 Non existence of nukta Symbol in Hindi script

Some words in Hindi are borrowed from Arabic, Persian, etc. To pronounce these borrowed words, consonants with nukta (क़ ख़ ग़ ज़) are normally used in Hindi. But for ease of printing and typing, nukta are slowly disappearing from Hindi script. As a result, character-to-character mappings produce wrong transliterations. This has been taken care of in our system.

4.3 Transliteration of Proper Nouns

Table 1: Mapping of Hindi and Urdu Consonants

S/NO   Hindi Letter   Name   Equivalent Urdu Characters (I  II  III)
1      क              Ka     ک
2      ख              Kha    ﮐﻬ
3      ग              Ga     گ
4      घ              Gha    ﮔﻪ
5      ङ              Nga    ﻥ
6      च              Ca     چ
7      छ              Cha    ﭼﻪ
8      ज              Ja     ﺝ
9      झ              Jha    ﺟﻬ
10     ञ              Nya    ﻳﺎں
11     ट              Tta    ٹ
12     ठ              Ttha   ﮢﻬ
13     ड              Dda    ڈ
14     ढ              Ddha   ﻫڈ
15     ण              Nna    ڈﺍں
16     त              Ta     ﺕ ﻁ
17     थ              Tha    ﺗﻬ
18     द              Da     ﺩ
19     ध              Dha    ﺩﻫ
20     न              Na     ﻥ
21     प              Pa     پ
23     ब              Ba     ﺏ
24     भ              Bha    ﺑﻬ
25     म              Ma     ﻡ
26     य              Ya     ے
27     र              Ra     ﺭ
28     ल              La     ﻝ
29     व              Va     ﻭ
30     श              Sha    ﺵ
31     ष              Sha    ﺵ
32     स              Sa     ﺱ ﺙ ﺹ
33     ह              Ha     ﺡ ﻩ ﻫ
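The default-mapping idea of Section 3.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `AMBIGUOUS`, `UNAMBIGUOUS`, `default_letter` and `map_consonants` are hypothetical, the Urdu letters are written with standard Unicode code points chosen for this example, only a small subset of Table 1 is included, and vowels are ignored. The frequencies are those reported in Table 3; a further 1.53% entry listed there is not reproduced.

```python
# Hindi consonants with several Urdu candidates, weighted by the
# frequencies (%) reported in Table 3 of the paper.
AMBIGUOUS = {
    "त": {"ت": 88.09, "ط": 11.90},             # te, to'e
    "स": {"س": 75.38, "ص": 18.46, "ث": 4.61},  # sin, swad, se
    "ह": {"ہ": 86.66, "ح": 13.34},             # choti he, badi he
}

# A few one-to-one consonant mappings from Table 1 (illustrative subset).
UNAMBIGUOUS = {"क": "ک", "ग": "گ", "म": "م", "र": "ر", "ल": "ل", "न": "ن"}


def default_letter(hindi_char: str) -> str:
    """Return the most frequent Urdu candidate for an ambiguous consonant."""
    candidates = AMBIGUOUS[hindi_char]
    return max(candidates, key=candidates.get)


def map_consonants(word: str) -> str:
    """Map each character independently; unknown characters pass through."""
    out = []
    for ch in word:
        if ch in AMBIGUOUS:
            out.append(default_letter(ch))
        else:
            out.append(UNAMBIGUOUS.get(ch, ch))
    return "".join(out)


print(default_letter("स"))
print(map_consonants("कमल"))
```

Choosing the most frequent candidate keeps the mapping deterministic, but it will be wrong for the less frequent cases, which is why the paper supplements the mapping with rules and a database.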
The transliteration of proper nouns poses a challenge, as some Urdu words are written in different ways. Therefore we have tried to formulate some rules based on our observations of the outputs of our experiments. For some very specialized spellings, we have created a database, shown in Fig 5.1.

Table 2: Mapping of Hindi and Urdu Vowels

S.No   Hindi Vowel Letter   Hindi Diacritical Mark   Urdu Vowel
1      अ                    (none)                   ﺍ
2      आ                    ◌ा                       ﺁ
3      इ                    ि◌                       ◌ِ
4      ई                    ◌ी                       ی
5      उ                    ◌ु                       ُ◌
6      ऊ                    ◌ू                       ﺅ
7      ऋ                    ◌ृ                       ◌ِ +ﺭ (consonant+vowel)
8      ए                    ◌े                       ﺍے
9      ऐ                    ◌ै                       ﺁے
10     ओ                    ◌ो                       ُﺍ
11     औ                    ◌ौ                       ﺁﻭ

Table 3: Frequency of different Mapping Characters

Hindi Character   Urdu Equivalent   Name       Frequency (%)
त                 ﺕ                 te         88.09
त                 ﻁ                 to'e       11.90
स                 ﺍ                 alif       1.53
स                 ﺙ                 se         4.61
स                 ﺱ                 sin        75.38
स                 ﺹ                 swad       18.46
ह                 ﺡ                 badi he    13.34
ह                 ﻩ                 choti he   86.66

As discussed in section 4.2, some Hindi words may be spelled wrongly because of nukta characters. Removal of nukta is not so important for Hindi readers, but when such words are transliterated as they are, they produce wrong Urdu spellings. To handle this issue, we have developed a keyboard using the facilities of .NET which contains additional consonants like क़ ख़ ग़ ज़ etc. क ख ग ज are replaced by their nukta forms in suitable cases.

During the transliteration process, we observed certain anomalies in respect of the orthography of proper nouns. They have been corrected with the help of the rules described in this section. Sample results of our experimentation are shown in Tables 4-11, which list sample Hindi strings, the expected Urdu transliteration (commonly used Urdu spellings), the transliteration produced through mappings, and the result produced by our transliteration system.

Fig 5.1: Program Flow of the Transliteration system (Hindi Word → Database Lookup, using the Specialized Spellings Database → Transliteration, using Mapping Tables and Rules → Urdu Word)

Table 4: Sample 1 (Total 25 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
आदिल           ﻋﺎﺩﻝ              ﺁﺩﻝ             ﻋﺎﺩﻝ
आमिर           ﻋﺎﻣﺮ              ﺁﻣﺮ             ﻋﺎﻣﺮ
आसिफ           ﻋﺎﺻﻒ              ﺁﺳﻒ             ﻋﺎﺻﻒ
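The program flow of Fig 5.1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' .NET implementation: the function name `transliterate`, the `Rule` type, the contents of `SPECIAL_DB` and `CHAR_MAP`, and the choice to apply rules after the character mapping are all hypothetical. The database entry reuses the spelling of सोमालिया quoted in Section 4.1; whether the authors' database contains it is not stated.

```python
from typing import Callable, Dict, Iterable

# A rule receives the Hindi word and the Urdu string built so far,
# and returns a (possibly corrected) Urdu string.
Rule = Callable[[str, str], str]


def transliterate(
    word: str,
    special_db: Dict[str, str],
    char_map: Dict[str, str],
    rules: Iterable[Rule] = (),
) -> str:
    # Step 1: database lookup for very specialized spellings.
    if word in special_db:
        return special_db[word]
    # Step 2: character-by-character mapping; unknown characters pass through.
    urdu = "".join(char_map.get(ch, ch) for ch in word)
    # Step 3: apply the orthographic rules in order (ordering is an assumption).
    for rule in rules:
        urdu = rule(word, urdu)
    return urdu


# Hypothetical data, for illustration only.
SPECIAL_DB = {"सोमालिया": "صومالیہ"}
CHAR_MAP = {"क": "ک", "म": "م", "ल": "ل"}

print(transliterate("सोमालिया", SPECIAL_DB, CHAR_MAP))  # served by the database
print(transliterate("कमल", SPECIAL_DB, CHAR_MAP))       # served by the mapping
```

Checking the database first means an irregular spelling never reaches the mapping tables, so the rules only need to cover regular patterns.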
Table 5: Sample 2 (Total 35 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
याफरा          ﻳﺎﻓﺮﻩ             ﻳﺎﻓﺮﺍ           ﻳﺎﻓﺮﻩ
इटारसी         ﺍِﻁﺎﺭﺳﯽ           ِ◌ﻁﺎﺭﺳﯽ         ﺍِﻁﺎﺭﺳﯽ
उपेन           ﺍُﭘﻴﻦ             ◌ُ ﭘﺎےﻥ         ﺍُﭘﻴﻦ

Rule 5: The observation of Table 8 gives the rule that if the Hindi word starts with हा then it is replaced with ﺣﺎ in the Urdu word.

Table 10: Sample 7 (Total 30 strings tested)
Table 11: Sample 8 (Total 15 strings tested)
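Rule 5 above amounts to a prefix rewrite, which might be sketched as below. The helper names `map_chars` and `apply_rule5`, the toy mapping `TOY_MAP`, and the decision to drop the short-vowel mark ि in the output are assumptions made for this example; the sample name हामिद is likewise illustrative and is not taken from the paper's tables.

```python
# Toy character mapping, for illustration only. The short-vowel mark "ि"
# is dropped here, assuming undiacritized Urdu output.
TOY_MAP = {"म": "م", "द": "د", "क": "ک", "ल": "ل", "ि": ""}


def map_chars(text: str) -> str:
    """Character-by-character mapping; unknown characters pass through."""
    return "".join(TOY_MAP.get(ch, ch) for ch in text)


def apply_rule5(hindi_word: str) -> str:
    """Rule 5: a word-initial 'हा' is written as 'حا' in the Urdu word."""
    prefix = "हा"
    if hindi_word.startswith(prefix):
        # Emit the fixed Urdu prefix, then map the remainder normally.
        return "حا" + map_chars(hindi_word[len(prefix):])
    return map_chars(hindi_word)


print(apply_rule5("हामिद"))
print(apply_rule5("कमल"))
```

The rule is applied to the Hindi prefix before mapping, so it overrides the default choice of choti he for ह only at the start of a word.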
are transliterated using the mappings discussed in the paper of Bushra and Tafseer. Some challenges, such as ambiguous characters and nukta-related errors, have been handled by formulating special rules and using a database. The transliteration system is under extensive testing, and further rules will be formulated to enhance the performance of the system.

REFERENCES

[1] Knight K. and Graehl J., “Machine Transliteration”, Computational Linguistics, 24(4): pp 599-612, 1998.

[2] Huang Fei, Stephan Vogel, and Alex Waibel, “Extracting Named Entity Translingual Equivalence with Limited Resources”, ACM Transactions on Asian Language Information Processing (TALIP), 2(2): pp 124–129, 2003.

[3] Al-Onaizan Y. and Knight K., “Translating Named Entities Using Monolingual and Bilingual Resources”, Proceedings of ACL 2002, pp 400-408, July 2002.

[4] Bushra and Tafseer, “Hindi to Urdu Conversion: Beyond Simple Transliteration”, Proceedings of the Conference on Language and Technology, pp 24-31, 2009.

[5] Lehal G.S. and Saini T.S., “A Hindi to Urdu Transliteration System”, Proceedings of ICON, pp 235-240, 2010.

[6] http://en.wikipedia.org/wiki/Urdu_alphabet

[7] Karthik Visweswariah, Vijil Chenthamarakshan, Nandakishore Kambhatla, “Urdu and Hindi: Translation and sharing of linguistic resources”, Coling 2010: Poster Volume, pp 1283–1291, 2010.