
VOL. 3, NO. 8 Aug, 2012 ISSN 2079-8407


Journal of Emerging Trends in Computing and Information Sciences
©2009-2012 CIS Journal. All rights reserved.

http://www.cisjournal.org

Rule Based Hindi to Urdu Transliteration System


1 Bushra Baig, 2 M. Kumar, 3 Sujoy Das
1 M.Tech Scholar, CSE Department, SIRT, Bhopal, India
2 Professor, CSE Department, SIRT, Bhopal, India
3 Associate Professor, Department of Computer Applications, MANIT, Bhopal, India
1 bushra78622@yahoo.in, 2 prof.mkumar@gmail.com, 3 sujdas@gmail.com

ABSTRACT
Web content is increasing day by day and contains information in different languages. To access information written in a
language other than one's native language, one needs Cross Language Information Retrieval (CLIR) systems. CLIR systems
generally use bilingual dictionaries to translate a user query from one language to another. A problem arises when a
query term is not found in the dictionary. Such words are transliterated using character mappings. The problem of
transliteration from Hindi to Urdu script has been studied by several authors. Bushra and Tafseer have reported that proper
names pose major problems in transliteration. In this research, we consider the problem of transliterating proper nouns
from Hindi to Urdu. We have developed a rule-based transliteration system with a fair degree of accuracy.

Keywords: Out of Vocabulary (OOV), CLIR, Transliteration

1. INTRODUCTION
The Web is becoming a universal repository of human knowledge and culture, which has allowed
unprecedented sharing of ideas and information on a scale never seen before. Web content is growing very
rapidly and contains information written in many languages. Often there is a need to access information
written in a language other than the native language. To access such information, we need CLIR systems.
The process of writing a query in the native language and retrieving documents in a target language is
called Cross Language Information Retrieval (CLIR).

There are various methods by which we can convert the text of one language into another language.
Bilingual dictionaries are frequently used for this purpose. A major problem arises when a word of the
text is not available in the bilingual dictionary. Such words are mainly proper names, culture-specific
words, etc. and are called Out Of Vocabulary (OOV) words. Such words should be transliterated.

Transliteration is the practice of converting a text from one script to another. It only strives to
represent the characters accurately. Transliteration is very useful whenever there is a need to display
information in two languages, for example railway and airline reservation charts, signboards, and the
publication of parliament proceedings in two languages, especially in bilingual countries.

2. LITERATURE SURVEY

The problem of transliteration has been studied by a number of researchers during the last decade.
Knight and Graehl [1] use five probability distributions at various phases of transliteration for the
language pair English to Katakana (a Japanese script). Huang et al. [2] have developed a system which
extracts Hindi-English named entity pairs through alignment of a parallel corpus.

Al-Onaizan and Knight [3] have studied a transliteration system from English to the Arabic writing
system. Bushra and Tafseer [4] discussed the issues and identified where enhancement is required. Lehal
and Saini [5] have developed a Hindi to Urdu transliteration system by improving on the work of Bushra
and Tafseer.

In this paper, we have made an effort to develop a rule-based transliteration scheme for proper names
of the Hindi-Urdu language pair. The system is under extensive experimentation and test.

3. A BRIEF OVERVIEW OF HINDI AND URDU

Hindi is an official language of India and is written in Devanagari script. Urdu is one of the state
languages of India, and is written in Perso-Arabic script. Urdu has borrowed many words from “Arabic hue”
and is written from right to left instead of left to right. There are 11 vowels and 36 consonants in
Hindi. With 38 letters, the Urdu alphabet is typically written in the calligraphic Nasta'liq script.
Vowels in Urdu are represented by letters that are also considered as consonants [6]. In Urdu some Hindi
characters may be represented by more than one character, whereas some characters may change their shapes
due to the presence of certain characters. They may acquire initial, isolated, medial or final shape. The
characters that change their shape can

acquire all these shapes, whereas the rest of the characters can have only isolated and final shapes [5].

3.1 Character mapping from Hindi to Urdu

Transliteration is a process in which an input string of a language is converted to a string of the
target language while maintaining the phonetics of the original word. The most commonly used mappings of
consonants of the Hindi language to characters of the Urdu language are given in Table 1. The mappings of
vowels of the Hindi language to characters of Urdu are shown in Table 2. These mappings are taken from
the paper of Bushra and Tafseer [4]. It is apparent from these tables that 3 characters of Hindi have
multiple mappings in Urdu. We have done a statistical analysis of proper names in Hindi to determine the
percentage of times each such Hindi character was mapped to each similar-sounding Urdu character. The
findings are shown in Table 3. The Urdu character with the highest occurrence is used for default mapping,
i.e. ﺕ [te] for त, ﺱ [sin] for स and ﻩ [choti he] for ह.

Table 1: Mapping of Hindi and Urdu Consonants

S/NO   Hindi Letter   Name   Equivalent Urdu Characters (I   II   III)
1      क              Ka     ک
2      ख              Kha    ﮐﻬ
3      ग              Ga     گ
4      घ              Gha    ﮔﻪ
5      ङ              Nga    ﻥ
6      च              Ca     چ
7      छ              Cha    ﭼﻪ
8      ज              Ja     ﺝ
9      झ              Jha    ﺟﻬ
10     ञ              Nya    ﻳﺎں
11     ट              Tta    ٹ
12     ठ              Ttha   ﮢﻬ
13     ड              Dda    ڈ
14     ढ              Ddha   ﻫڈ
15     ण              Nna    ڈﺍں
16     त              Ta     ﺕ   ﻁ
17     थ              Tha    ﺗﻬ
18     द              Da     ﺩ
19     ध              Dha    ﺩﻫ
20     न              Na     ﻥ
21     प              Pa     پ
22     फ              Pha    ﭘﻬ
23     ब              Ba     ﺏ
24     भ              Bha    ﺑﻬ
25     म              Ma     ﻡ
26     य              Ya     ے
27     र              Ra     ﺭ
28     ल              La     ﻝ
29     व              Va     ﻭ
30     श              Sha    ﺵ
31     ष              Sha    ﺵ
32     स              Sa     ﺱ   ﺙ   ﺹ
33     ह              Ha     ﺡ   ﻩ   ﻫ

4. CHALLENGES IN HINDI-URDU TRANSLITERATION

On the basis of morphology and phonology, Hindi shares some similarity with Urdu, but there are still a
few challenges that need to be handled for a precise Hindi-Urdu transliteration system. Urdu uses the
diacritical marks of Arabic script and has short and long vowels. A diacritic is placed with a consonant
and it precedes the syllable in the case of short vowels. Diacritic marks are also used for consonant
elongation [7]. Issues related to Hindi to Urdu transliteration have been discussed in [4, 5]. Some of
these issues have been tackled by framing rules and are as follows:

4.1 Ambiguous Character

For some sounds in Hindi there are multiple representations in Urdu script, which creates ambiguity at
the character level as shown in Table 1. For example, सोमालिया in Urdu is written as ﺻﻮﻣﺎﻟﻴہ, where
स [s] is mapped to ﺹ, which occurs less frequently.

4.2 Non existence of nukta Symbol in Hindi script

Some words in Hindi are borrowed from Arabic, Persian etc. For the pronunciation of these borrowed words,
consonants with nukta (क़ ख़ ग़ ज़) are normally used in Hindi. But for the ease of printing and typing,
nukta are slowly disappearing from Hindi script. As a result, character to character mappings produce
wrong transliterations. This has been taken care of in our system.

4.3 Transliteration of Proper Nouns

The transliteration of proper nouns poses a challenge, as some Urdu words are written in different ways.
Therefore we have tried to formulate some rules based on our observations of the outputs of our
experiments. For some very specialized spellings, we have created a database, shown in Fig 5.1.

Table 2: Mapping of Hindi and Urdu Vowels

S.No   Hindi Letter   Hindi Diacritical mark   Urdu Vowel
1      अ              -                        ﺍ
2      आ              ◌ा                        ﺁ
3      इ              ि◌                        ◌ِ
4      ई              ◌ी                        ی
5      उ              ◌ु                        ُ◌
6      ऊ              ◌ू                        ﺅ
7      ऋ              ◌ृ                        ◌ِ +ﺭ (consonant + vowel)
8      ए              ◌े                        ﺍے
9      ऐ              ◌ै                        ﺁے
10     ऒ              ◌ो                        ُﺍ
11     औ              ◌ौ                        ﺁﻭ

Table 3: Frequency of different Mapping Characters

Hindi Character   Urdu Equivalent   Name       Frequency (%)
त                 ﺕ                 te         88.09
त                 ﻁ                 to'e       11.90
स                 ﺍ                 alif       1.53
स                 ﺙ                 se         4.61
स                 ﺱ                 sin        75.38
स                 ﺹ                 swad       18.46
ह                 ﺡ                 badi he    13.34
ह                 ﻩ                 choti he   86.66

5. EXPERIMENTAL SETUP

The transliteration system has been developed using .net version 3.5. The input is given to the system
as a simple Hindi word in Unicode format. The Hindi word is first searched in the database. If the word
is found, the system generates its transliterated equivalent; otherwise the word is transliterated using
the mappings shown in Tables 1 and 2 and the eight rules described below. The program flow is shown in
Fig 5.1.

[Figure: flowchart. Hindi Word -> Database Lookup (Specialized Spellings Database) -> Transliteration (Mapping Tables, Rules) -> Urdu Word]
Fig 5.1: Program Flow of the Transliteration system

As discussed in section 4.2, some Hindi words may be spelled wrongly because of nukta characters. The
omission of nukta is not so important for Hindi readers, but words transliterated as such produce wrong
Urdu spellings. For handling this issue, we have developed a keyboard using the facilities of .net which
contains additional consonants like क़ ख़ ग़ ज़ etc. क, ख, ग and ज are replaced by these in suitable cases.

During the transliteration process, we observed certain anomalies in respect of the orthography of
proper nouns. They have been corrected with the help of the rules described in this section. Sample
results of our experimentation are shown in Tables 4-11, which list sample Hindi strings, the expected
Urdu transliteration (commonly used Urdu spellings), the transliteration produced through mappings alone,
and the result produced by our transliteration system:

Table 4: Sample 1 (Total 25 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
आदिल           ﻋﺎﺩﻝ              ﺁﺩﻝ             ﻋﺎﺩﻝ
आमिर           ﻋﺎﻣﺮ              ﺁﻣﺮ             ﻋﺎﻣﺮ
आसिफ           ﻋﺎﺻﻒ              ﺁﺳﻒ             ﻋﺎﺻﻒ
आबिश           ﻋﺎﺑﺶ              ﺁﺑﺶ             ﻋﺎﺑﺶ

Rule 1: The observation of Table 4 gives the rule that if आ in a word is followed by ि◌, then it is
replaced with ﻋﺎ in Urdu script.
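The lookup-then-map flow of Section 5, combined with Rule 1, can be sketched roughly as follows. This is a minimal Python illustration and not the authors' .net implementation: the function names, the small mapping subset, the single database entry, and the reading of Rule 1 as "आ, then one consonant, then ि" are all assumptions made for the example.

```python
# Illustrative sketch only; the paper's system was written in .net.
# All names and table contents below are assumptions for this example.

# Hypothetical specialized-spellings database with one illustrative entry:
# the Hindi spelling of "Somalia" mapped to an Urdu spelling.
SPECIAL_SPELLINGS = {
    "\u0938\u094B\u092E\u093E\u0932\u093F\u092F\u093E":
        "\u0635\u0648\u0645\u0627\u0644\u06CC\u06C1",
}

# Small subset of the character mappings of Tables 1 and 2,
# written with base Unicode letters rather than presentation forms.
MAPPING = {
    "\u0906": "\u0622",  # Hindi letter AA  -> alef with madda
    "\u0926": "\u062F",  # Hindi letter DA  -> dal
    "\u0932": "\u0644",  # Hindi letter LA  -> lam
    "\u092E": "\u0645",  # Hindi letter MA  -> meem
    "\u0930": "\u0631",  # Hindi letter RA  -> re
    "\u093F": "\u0650",  # Hindi vowel sign I -> zer (kasra)
}

AA = "\u0906"              # Hindi letter AA
I_SIGN = "\u093F"          # Hindi vowel sign I
AIN_ALEF = "\u0639\u0627"  # ain + alef, the Rule 1 replacement


def map_only(word):
    """Character-by-character mapping; unmapped characters pass through."""
    return "".join(MAPPING.get(ch, ch) for ch in word)


def transliterate(word):
    """Database lookup first, then mapping with Rule 1 applied."""
    if word in SPECIAL_SPELLINGS:
        return SPECIAL_SPELLINGS[word]
    out = []
    for i, ch in enumerate(word):
        # Assumed reading of Rule 1: AA, one character, then vowel sign I.
        if ch == AA and word[i + 2:i + 3] == I_SIGN:
            out.append(AIN_ALEF)
        else:
            out.append(MAPPING.get(ch, ch))
    return "".join(out)


if __name__ == "__main__":
    adil = "\u0906\u0926\u093F\u0932"  # first sample string of Table 4
    print(map_only(adil))       # mapping-only result
    print(transliterate(adil))  # rule-based result
```

For the first sample string of Table 4, the mapping-only path starts with alef madda while the rule-based path starts with ain followed by alef, which is the contrast between the "Using Mapping" and "Rule-Based" columns. The sketch keeps the zer mark for the short vowel, following Table 2.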

Table 5: Sample 2 (Total 35 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
याफरा          ﻳﺎﻓﺮﻩ             ﻳﺎﻓﺮﺍ           ﻳﺎﻓﺮﻩ
वाहिबा         ﻭﺍﮨﺒہ             ﻭﺍﮨﺒﺎ           ﻭﺍﮨﺒہ
आहना           ﺁﮨﻨہ              ﺁﮨﻨﺎ            ﺁﮨﻨہ
मलिका          ﻣﻠﮑہ              ﻣﻠﮑﺎ            ﻣﻠﮑہ

Rule 2: The observation of Table 5 gives the rule that if the diacritical form of आ, i.e. ◌ा, occurs at
the end of the Hindi string, then it is replaced with ﻩ (choti heh).

Table 6: Sample 3 (Total 25 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
इटली           ﺍِﻁﻠﯽ             ِ◌ﻁﻠﯽ            ﺍِﻁﻠﯽ
इटारसी         ﺍِﻁﺎﺭﺳﯽ           ِ◌ﻁﺎﺭﺳﯽ          ﺍِﻁﺎﺭﺳﯽ
इन्दौ          ﺍِﻧﺪﻭﺭﺭ           ِ◌ﻧﺪﻭﺭ           ﺍِﻧﺪﻭﺭﺭ
ईटानगर         ﺍِﻁﺎﻧﮕﺮ           ﻳﻄﺎﻧﮕﺮ          ﺍِﻁﺎﻧﮕﺮ

Rule 3: The observation of Table 6 gives the rule that if the Hindi word starts with the letter form of
इ or ई, then it is replaced with ِﺍ in the Urdu string.

Table 7: Sample 4 (Total 25 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
उदय            ﺍُﺩے              ◌ُ ﺩے            ﺍُﺩے
उपेन           ﺍُﭘﻴﻦ             ◌ُ ﭘﺎےﻥ          ﺍُﭘﻴﻦ
उमेश           ﺍُﻣﻴﺶ             ◌ُ ﻣﺎےﺵ          ﺍُﻣﻴﺶ
उदित           ﺍُﺭﺗﺎ             ◌ُ ِﺩﺗﺎ           ﺍُﺭﺗﺎ

Rule 4: The observation of Table 7 gives the rule that if the Hindi word starts with the letter form of
उ or ऊ, then it is replaced with ُﺍ in the Urdu string.

Table 8: Sample 5 (Total 25 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
हारिस          ﺣﺎﺭﺙ              ﮨﺎﺭﺱ            ﺣﺎﺭﺙ
हामिद          ﺣﺎﻣﺪ              ﮨﺎﻣﺪ            ﺣﺎﻣﺪ
हाशिर          ﺣﺎﺷﺮ              ﮨﺎﺷﺮ            ﺣﺎﺷﺮ
हादिर          ﺣﺎﺩﺭ              ﮨﺎﺩﺭ            ﺣﺎﺩﺭ

Rule 5: The observation of Table 8 gives the rule that if the Hindi word starts with हा, then it is
replaced with ﺣﺎ in the Urdu word.

Table 9: Sample 6 (Total 30 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
फलक            ﻓﻠﮏ               ﭘﻬﻠﮏ            ﻓﻠﮏ
फहीम           ﻓﮩﻴﻢ              ﭘﻬﮩﻴﻢ           ﻓﮩﻴﻢ
फाल्गुन        ﻓﺎﻟﮕﻨﯽ            ﭘﻬﺎﻟﮕﻮﻧﯽ        ﻓﺎﻟﮕﻨﯽ
फनिश्व         ﻓﻨﺸﻮﺭ             ﭘﻬﻨﺸﻮﺭ          ﻓﻨﺸﻮﺭ

Rule 6: The observation of Table 9 gives the rule that if the Hindi word contains the consonant फ, then
it is replaced with ﻑ in Urdu script.

Table 10: Sample 7 (Total 30 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
सत्य           ﺳﺘﻴﻢ              ﺳﺘﮯﻡ            ﺳﺘﻴﻢ
शयाम           ﺷﻴﺎﻡ              ﺷﮯﺍﻡ            ﺷﻴﺎﻡ
टिया           ﻁﻴﺎ               ﻁﻴﮯﺍ            ﻁﻴﺎ
नियती          ﻧﻴﺘﯽ              ﻧِﮯﺗﯽ            ﻧﻴﺘﯽ

Rule 7: The observation of Table 10 gives the rule that if य occurs at the beginning or in the middle of
the Hindi word, then it is replaced with ی in the Urdu word.

Table 11: Sample 8 (Total 15 strings tested)

Hindi String   Expected output   Using Mapping   Rule-Based
तनमय           ﺗﻨﻤﮯ              ﺗﻨﻤﮯ            ﺗﻨﻤﮯ
अजय            ﺍﺟﮯ               ﺍﺟﮯ             ﺍﺟﮯ
अभय            ﺍﺑﻬﮯ              ﺍﺑﻬﮯ            ﺍﺑﻬﮯ
संजय           ﺳﻨﺠﮯ              ﺳﻨﺠﮯ            ﺳﻨﺠﮯ

Rule 8: The observation of Table 11 gives the rule that if य occurs as the last character in the Hindi
word, then it is replaced with ے (as in the outputs of Table 11) in Urdu script.

6. CONCLUSION

In this research paper, a Hindi to Urdu transliteration system is presented with good accuracy. We
have tried to overcome the shortcomings when proper names

are transliterated using the mappings discussed in the paper of Bushra and Tafseer. Some challenges, such
as ambiguous characters and nukta related errors, have been handled by formulating special rules and
using a database. The transliteration system is under extensive test, and further rules will be
formulated to enhance the performance of the system.

REFERENCES

[1] Knight K. and J. Graehl, “Machine Transliteration”, Computational Linguistics, 24(4): pp 599-612, 1998.

[2] Huang Fei, Stephan Vogel, and Alex Waibel, “Extracting Named Entity Translingual Equivalence with
Limited Resources”, ACM Transactions on Asian Language Information Processing (TALIP), 2(2): pp 124-129,
2003.

[3] Al-Onaizan Y. and Knight K., “Translating Named Entities Using Monolingual and Bilingual Resources”,
Proceedings of ACL 2002, pp 400-408, July 2002.

[4] Bushra and Tafseer, “Hindi to Urdu Conversion: Beyond Simple Transliteration”, Proceedings of the
Conference on Language and Technology, pp 24-31, 2009.

[5] Lehal G.S and Saini T.S., “A Hindi to Urdu Transliteration System”, Proceedings of ICON, pp 235-240,
2010.

[6] http://en.wikipedia.org/wiki/Urdu_alphabet

[7] Karthik Visweswariah, Vijil Chenthamarakshan, Nandakishore Kambhatla, “Urdu and Hindi: Translation
and sharing of linguistic resources”, COLING 2010: Poster Volume, pp 1283-1291, 2010.
