
VOL. 3, NO. 8 Aug, 2012

Journal of Emerging Trends in Computing and Information Sciences


2009-2012 CIS Journal. All rights reserved. http://www.cisjournal.org

ISSN 2079-8407

Rule Based Hindi to Urdu Transliteration System


1 Bushra Baig, 2 M. Kumar, 3 Sujoy Das
1 M.Tech Scholar, CSE Department, SIRT, Bhopal, India
2 Professor, CSE Department, SIRT, Bhopal, India
3 Associate Professor, Department of Computer Applications, MANIT, Bhopal, India
1 bushra78622@yahoo.in, 2 prof.mkumar@gmail.com, 3 sujdas@gmail.com

ABSTRACT
Web content is increasing day by day and contains information in different languages. To access information written in a language other than one's native language, one needs Cross Language Information Retrieval (CLIR) systems. CLIR systems generally use bilingual dictionaries to translate a user query from one language to another. A problem arises when a query term is not found in the dictionary. Such words are transliterated using character mappings. The problem of transliteration from Hindi to Urdu script has been studied by several authors. Bushra and Tafseer have reported that proper names pose major problems in transliteration. In this research, we consider the problem of transliterating proper nouns from Hindi to Urdu. We have developed a rule-based transliteration system with a fair degree of accuracy.
Keywords: Out of Vocabulary (OOV), CLIR, Transliteration

1. INTRODUCTION
The Web is becoming a universal repository of human knowledge and culture, which has allowed unprecedented sharing of ideas and information on a scale never seen before. Web content is growing very rapidly and contains information written in many languages, so there is often a need to access information written in a language other than one's native language. To access such information, we need CLIR systems. The process of writing a query in the native language and retrieving documents in a target language is called Cross Language Information Retrieval (CLIR). There are various methods by which text in one language can be converted into another language. Bilingual dictionaries are frequently used for this purpose. A major problem arises when a word of the text is not available in the bilingual dictionary. Such words are mainly proper names, culture-specific words, etc., and are called Out Of Vocabulary (OOV) words. Such words should be transliterated. Transliteration is the practice of converting a text from one script to another; it strives only to represent the characters accurately. Transliteration is very useful whenever there is a need to display information in two languages, for example in railway and airline reservation charts, on signboards, and in the publication of parliament proceedings in two languages, especially in bilingual countries.

2. LITERATURE SURVEY
The problem of transliteration has been studied by a number of researchers during the last decade. Knight and Graehl [1] use five probability distributions at various phases of transliteration for the language pair English to Katakana (a Japanese writing system). Huang et al. [2] have developed a system which extracts Hindi-English named entity pairs through alignment of a parallel corpus. Al-Onaizan and Knight [3] have studied a transliteration system from English to the Arabic writing system. Bushra and Tafseer [4] discussed the issues in Hindi to Urdu conversion and identified where enhancements are required. Lehal and Saini [5] have developed a Hindi to Urdu transliteration system by improving on the work of Bushra and Tafseer. In this paper, we have made an effort to develop a rule-based transliteration scheme for proper names of the Hindi-Urdu language pair. The system is under extensive experimentation and testing.

3. A BRIEF OVERVIEW OF HINDI AND URDU
Hindi is an official language of India and is written in Devanagari script. Urdu is one of the state languages of India and is written in Perso-Arabic script. Urdu has borrowed many words from Arabic and is written from right to left instead of left to right. There are 11 vowels and 36 consonants in Hindi. With 38 letters, the Urdu alphabet is typically written in the calligraphic Nasta'liq script. Vowels in Urdu are represented by letters that are also considered consonants [6]. In Urdu, some Hindi characters may be represented by more than one character, whereas some characters may change their shape due to the presence of certain neighbouring characters. They may acquire an initial, isolated, medial or final shape. The characters that change their shape can acquire all these shapes, whereas the rest of the characters can have only the isolated and final shapes [5].

3.1 Character mapping from Hindi to Urdu
Transliteration is a process in which an input string of a language is converted to a string of the target language while maintaining the phonetics of the original word. The most commonly used mappings of consonants of Hindi to characters of Urdu are given in Table 1. The mappings of vowels of Hindi to characters of Urdu are shown in Table 2. These mappings are taken from the paper of Bushra and Tafseer [4]. It is apparent from these tables that 3 characters of Hindi have multiple mappings in Urdu. We have done a statistical analysis of proper names in Hindi to determine the percentage of times each such Hindi character was mapped to a similar-sounding Urdu character. The findings are shown in Table 3. The Urdu character with the highest occurrence is used as the default mapping, i.e. [te] for Ta, [sin] for Sa and [choti he] for Ha.

Table 1: Mapping of Hindi and Urdu Consonants
S/No    Hindi Letter    Name    Urdu Equivalent Characters (I, II, III)
1 Ka, 2 Kha, 3 Ga, 4 Gha, 5 Nga, 6 Ca, 7 Cha, 8 Ja, 9 Jha, 10 Nya, 11 Tta, 12 Ttha, 13 Dda, 14 Ddha, 15 Nna, 16 Ta, 17 Tha, 18 Da, 19 Dha, 20 Na, 21 Pa, 22 Pha, 23 Ba, 24 Bha, 25 Ma, 26 Ya, 27 Ra, 28 La, 29 Va, 30 Sha, 31 Sha, 32 Sa, 33 Ha

4. CHALLENGES IN HINDI-URDU TRANSLITERATION
On the basis of morphology and phonology, Hindi shares some similarity with Urdu, but there are still a few challenges that need to be handled for a precise Hindi-Urdu transliteration system. Urdu uses the diacritical marks of Arabic script and has short and long vowels. A diacritic is placed with a consonant, and it precedes the syllable in the case of short vowels. Diacritic marks are also used for consonant elongation [7]. Issues related to Hindi to Urdu transliteration have been discussed in [4, 5]. Some of these issues have been tackled by framing rules and are as follows:

4.1 Ambiguous Character
For some sounds in Hindi there are multiple representations in Urdu script, which creates ambiguity at the character level as shown in Table 1, e.g. [Hindi word] in Urdu is written as [Urdu word], where [s] is mapped to [Urdu character], which occurs less frequently.

4.2 Non-existence of nukta Symbol in Hindi script
Some words in Hindi are borrowed from Arabic, Persian, etc. For the pronunciation of these borrowed words, consonants with a nukta (a dot placed below the letter) are normally used in Hindi. But for ease of printing and typing, nuktas are slowly disappearing from Hindi script. As a result, character-to-character mappings produce wrong transliterations. This has been taken care of in our system.

4.3 Transliteration of Proper Nouns


The transliteration of proper nouns poses a challenge, as some Urdu words are written in different ways. We have therefore tried to formulate some rules based on our observations of the outputs of our experiments. For some very specialized spellings, we have created a database, as shown in Fig 5.1.

Table 2: Mapping of Hindi and Urdu Vowels
S.No    Hindi Vowels (Letter, Diacritical mark)    Urdu Vowels    + (consonant + vowel)

As discussed in Section 4.2, some Hindi words may be spelled wrongly because of missing nukta characters. The removal of the nukta is not so important for Hindi readers, but words transliterated without it produce wrong Urdu spellings. To handle this issue, we have developed a keyboard using the facilities of .NET which contains the additional nukta consonants, so that the affected characters can be replaced by the suitable forms. During the transliteration process, we observed certain anomalies in the orthography of proper nouns. They have been corrected with the help of the rules described in this section. Sample results of our experiments are shown in Tables 4-11, which list sample Hindi strings, the expected Urdu transliteration (commonly used Urdu spellings), the transliteration produced through mappings alone, and the result produced by our transliteration system.
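The effect of a missing nukta can be illustrated at the level of Unicode code points. The sketch below is only an illustration and is not the .NET keyboard described above. It assumes Unicode input and relies on U+093C being DEVANAGARI SIGN NUKTA; the helper names `has_nukta` and `strip_nukta` are hypothetical.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def has_nukta(word: str) -> bool:
    """Return True if the word contains a nukta, precomposed or combining."""
    # NFD splits precomposed letters such as U+095B (za) into base + nukta.
    return NUKTA in unicodedata.normalize("NFD", word)


def strip_nukta(word: str) -> str:
    """Simulate nukta loss in print by dropping every nukta from the word."""
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed.replace(NUKTA, "")


# U+095B is the precomposed letter za; U+091C is plain ja.
za_precomposed = "\u095b"
za_combining = "\u091c\u093c"
ja_plain = "\u091c"
```

Under these assumptions, both spellings of za collapse to plain ja once the nukta is dropped, which is how a character-to-character mapping can end up choosing the wrong Urdu letter.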
Fig 5.1: Program Flow of the Transliteration system
[Flow: Hindi Word, Database Lookup (Specialized Spellings Database), Transliteration Rules (Mapping Tables), Urdu Word]

Table 3: Frequency of different Mapping Characters
Hindi Character    Urdu Equivalent    Frequency (%)
Ta                 te                 88.09
                   toe                11.90
Sa                 alif               1.53
                   se                 4.61
                   sin                75.38
                   swad               18.46
Ha                 badi he            13.34
                   choti he           86.66
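The choice of a default mapping from the Table 3 frequencies can be sketched as follows. This is a minimal illustration, not the authors' code: the percentages are copied from Table 3, while the keys "Ta", "Sa" and "Ha" are romanized labels assumed here for the three ambiguous Hindi characters, and `default_mapping` is a hypothetical helper name.

```python
# Frequencies (%) copied from Table 3. The keys "Ta", "Sa", "Ha" are
# romanized labels assumed here for the three ambiguous Hindi characters.
FREQUENCIES = {
    "Ta": {"te": 88.09, "toe": 11.90},
    "Sa": {"alif": 1.53, "se": 4.61, "sin": 75.38, "swad": 18.46},
    "Ha": {"badi he": 13.34, "choti he": 86.66},
}


def default_mapping(frequencies):
    """Pick, for each Hindi character, the Urdu letter seen most often."""
    return {
        hindi: max(candidates, key=candidates.get)
        for hindi, candidates in frequencies.items()
    }


DEFAULTS = default_mapping(FREQUENCIES)
```

With the Table 3 values, this selects te, sin and choti he, which agrees with the default mappings stated in Section 3.1.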

Table 4: Sample 1 (Total 25 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

5. EXPERIMENTAL SETUP
The transliteration system has been developed using .NET version 3.5. The input to the system is a simple Hindi word in Unicode format. The Hindi word is first searched for in the database. If the word is found, the system returns its transliterated equivalent; otherwise the word is transliterated using the mappings shown in Tables 1 and 2 and the eight rules described below. The program flow is shown in Fig 5.1.

Rule 1: The observation of Table 4 gives the rule that if, in a word, [Hindi character] is followed by [Hindi character], then it is replaced with [Urdu character] in Urdu script.
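The lookup-then-map flow described in this section can be sketched as below. This is a rough illustration under stated assumptions, not the authors' .NET implementation: the function name `transliterate`, the tiny sample database and the three-entry character map are all hypothetical, and the order in which rules and mappings interact is assumed (rules are modelled as callables that correct the draft Urdu string).

```python
from typing import Callable, Dict, Sequence

# Hypothetical sample data; the paper's database and mapping tables
# are not reproduced here.
SPECIAL_SPELLINGS: Dict[str, str] = {
    "\u0905\u0932\u0940": "\u0639\u0644\u06cc",  # "Ali", spelled with ain in Urdu
}
CHAR_MAP: Dict[str, str] = {
    "\u0930": "\u0631",  # ra -> re
    "\u093e": "\u0627",  # aa matra -> alif
    "\u092e": "\u0645",  # ma -> mim
}

# A rule receives the Hindi word and the draft Urdu string (assumed signature).
Rule = Callable[[str, str], str]


def transliterate(
    word: str,
    database: Dict[str, str],
    char_map: Dict[str, str],
    rules: Sequence[Rule] = (),
) -> str:
    # Step 1: specialized spellings in the database take priority.
    if word in database:
        return database[word]
    # Step 2: character-by-character mapping; unmapped characters pass through.
    urdu = "".join(char_map.get(ch, ch) for ch in word)
    # Step 3: let each rule correct the draft produced by the mappings.
    for rule in rules:
        urdu = rule(word, urdu)
    return urdu


from_database = transliterate("\u0905\u0932\u0940", SPECIAL_SPELLINGS, CHAR_MAP)
from_mapping = transliterate("\u0930\u093e\u092e", SPECIAL_SPELLINGS, CHAR_MAP)
```

Keeping the database check first means a specialized spelling is never altered by the mappings or rules, which matches the order shown in Fig 5.1.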


Table 5: Sample 2 (Total 35 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 2: The observation of Table 5 gives the rule that if the diacritical form of [Hindi vowel], i.e. [diacritical mark], occurs at the end of the Hindi string, then it is replaced with [Urdu character] (choti heh).

Table 6: Sample 3 (Total 25 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 3: The observation of Table 6 gives the rule that if the Hindi word starts with the consonant form of [Hindi character] or [Hindi character], then it is replaced with [Urdu character] in the Urdu string.

Table 7: Sample 4 (Total 25 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 4: The observation of Table 7 gives the rule that if the Hindi word starts with the consonant form of [Hindi character] or [Hindi character], then it is replaced with [Urdu character] in the Urdu string.

Table 8: Sample 5 (Total 25 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 5: The observation of Table 8 gives the rule that if the Hindi word starts with [Hindi character], then it is replaced with [Urdu character] in the Urdu word.

Table 9: Sample 6 (Total 30 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 6: The observation of Table 9 gives the rule that if the Hindi word contains the consonant [Hindi character], then it is replaced with [Urdu character] in Urdu script.

Table 10: Sample 7 (Total 30 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 7: The observation of Table 10 gives the rule that if [Hindi character] occurs at the beginning or in the middle of the Hindi word, then it is replaced with [Urdu character] in the Urdu word.

Table 11: Sample 8 (Total 15 strings tested)
Hindi String    Expected output    Using Mapping    Rule-Based

Rule 8: The observation of Table 11 gives the rule that if [Hindi character] occurs as the last character in the Hindi word, then it is replaced with [Urdu character] in Urdu script.
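Several of the rules above depend on where a character occurs in the word (at the start, at the end, or anywhere). One way such position-dependent overrides could be represented is sketched below. This is an assumption-laden illustration rather than the authors' implementation: `PositionRule` and `apply_rules` are hypothetical names, and plain ASCII letters stand in for the Hindi and Urdu characters, since the actual characters of each rule are not available here.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PositionRule:
    position: str  # "start", "end", or "any"
    source: str    # character the rule looks for (stand-in for a Hindi letter)
    target: str    # replacement that overrides the default mapping


def apply_rules(word: str, char_map: Dict[str, str],
                rules: Sequence[PositionRule]) -> str:
    """Map each character, letting the first matching rule override the default."""
    out = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        mapped = char_map.get(ch, ch)  # default mapping, or pass-through
        for rule in rules:
            at_position = (
                rule.position == "any"
                or (rule.position == "start" and i == 0)
                or (rule.position == "end" and i == last)
            )
            if at_position and ch == rule.source:
                mapped = rule.target
                break
        out.append(mapped)
    return "".join(out)


# ASCII stand-ins: "a" and "b" play the role of Hindi characters.
DEMO_MAP = {"a": "A", "b": "B"}
DEMO_RULES = [
    PositionRule("end", "a", "H"),    # word-final override, in the spirit of Rules 2 and 8
    PositionRule("start", "b", "X"),  # word-initial override, in the spirit of Rules 3 to 5
]

final_example = apply_rules("aba", DEMO_MAP, DEMO_RULES)
initial_example = apply_rules("bab", DEMO_MAP, DEMO_RULES)
```

Rules that depend on a neighbouring character (Rule 1) or on a combination of positions (Rule 7) would need a richer condition than the single `position` field used in this sketch.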

6. CONCLUSION
In this research paper, a Hindi to Urdu transliteration system with good accuracy is presented. We have tried to overcome the shortcomings that arise when proper names are transliterated using the mappings discussed in the paper of Bushra and Tafseer. Some challenges, such as ambiguous characters and nukta-related errors, have been handled by formulating special rules and using a database. The transliteration system is under extensive testing, and further rules will be formulated to enhance the performance of the system.

REFERENCES
[1] Knight K. and J. Graehl, Machine Transliteration, Computational Linguistics, 24(4): pp 599-612, 1998.
[2] Huang Fei, Stephan Vogel, and Alex Waibel, Extracting Named Entity Translingual Equivalence with Limited Resources, ACM Transactions on Asian Language Information Processing (TALIP), 2(2): pp 124-129, 2003.
[3] Al-Onaizan Y. and Knight K., Translating Named Entities Using Monolingual and Bilingual Resources, Proceedings of ACL 2002, pp 400-408, July 2002.
[4] Bushra and Tafseer, Hindi to Urdu Conversion: Beyond Simple Transliteration, Proceedings of the Conference on Language and Technology, pp 24-31, 2009.
[5] Lehal G.S and Saini T.S., A Hindi to Urdu Transliteration System, Proceedings of ICON, pp 235-240, 2010.
[6] http://en.wikipedia.org/wiki/Urdu_alphabet
[7] Karthik Visweswariah, Vijil Chenthamarakshan, Nandakishore Kambhatla, Urdu and Hindi: Translation and sharing of linguistic resources, Coling 2010: Poster Volume, pp 1283-1291, 2010.
