Machine Translation
B. Hettige1, A. S. Karunananda2
1 Department of Statistics and Computer Science, Faculty of Applied Sciences, University of Sri Jayawardenepura, Sri Lanka
2 Faculty of Information Technology, University of Moratuwa, Sri Lanka
budditha@yahoo.com1, asoka@itfac.mrt.ac.lk2
Abstract – Machine translation is a challenging task in natural language processing. Handling
out-of-vocabulary words, proper nouns and technical terms is a major issue common to all machine
translation systems. This paper presents a transliteration approach to machine translation from
English to Sinhala. We have used finite state automata to develop transducers for English to
Sinhala transliteration. This approach can transliterate both original English words and Sinhala
words that are written using English letters. The transliteration system has been developed using
SWI-Prolog and Prolog Server Pages (PSP). The English WordNet and a Sinhala Chatbot were used to
test the transducers, and reasonable results were achieved.

I INTRODUCTION

Machine translation has been a potential solution for giving access to the world knowledge
available in English to those who have different mother tongues. There are a number of Machine
Translation (MT) systems available for many languages. Among others, Electronic Dictionary
Research (EDR) [27] is the most powerful MT system in the Asian region. The EDR system translates
English to Japanese and vice versa, and it uses a knowledge-based machine translation approach. In
the Asian region, a number of MT systems have been developed to translate English to languages of
the Indian family. Some of these systems are Anglabharati, Anusaaraka, MaTra and Mantra [6]. The
Anglabharati MT system translates English to Indian languages, primarily Hindi, using a rule-based
transfer approach. The Anusaaraka MT system [3] is used for language access between Indian
languages. This system uses the Paninian Grammar (PG) model [1] for its language analysis. The
Anusaaraka project has developed language accessors from Punjabi, Bengali, Telugu, Kannada and
Marathi into Hindi. The approach and lexicon are general, but the system has mainly been applied to
children’s stories. MaTra is a human-assisted translation project for English to Indian languages,
and the Mantra project, which is based on the TAG formalism, has been developed for the domain of
gazette notifications pertaining to government appointments [2].

The machine translation process can be described simply as decoding the meaning of the source text
and re-encoding this meaning in the target language [11]. An MT system contains a source language
morphological analyzer, a source language parser, a translator, a target language morphological
generator, a target language composer and lexical databases [8][9]. At present a number of
approaches are used for machine translation, including dictionary-based machine translation,
statistical machine translation and example-based machine translation [25].

However, machine translation cannot be achieved merely by handling languages at the morphological,
syntactic and semantic levels. This is because the handling of out-of-vocabulary words, proper
nouns and technical terms becomes crucial when the source and target languages differ considerably
in alphabet, pronunciation, etc. Various transliteration approaches have been taken to solve these
issues.

In general, transliteration is the practice of transcribing a word or text written in one writing
system into another writing system [25]. In other words, machine transliteration is a method for
the automatic conversion of words in one language into phonetically equivalent ones in another
language. For example, the English word ‘machine’ is transliterated into Sinhala as ueIska. At
present there are a number of machine transliteration approaches available, namely the
grapheme-based, phoneme-based, hybrid and correspondence-based transliteration models [18].
Grapheme-based transliteration is a direct orthographical mapping from source graphemes to target
graphemes. Several transliteration methods have been proposed using this approach, including the
channel model and the decision tree model [18]. The phoneme-based transliteration model is based on
pronunciation, or the source phoneme, rather than spelling, or the source grapheme. This model uses
a source-grapheme-to-source-phoneme transformation followed by a source-phoneme-to-target-grapheme
transformation. Using this model, Knight and others have developed a Japanese-to-English
transliteration system with weighted finite state transducers (WFSTs) [19]. Arabic-to-English and
English-to-Chinese transliteration systems have also used this model. Hybrid and
correspondence-based transliteration models use both the source grapheme and the source phoneme
for transliteration.

Furthermore, there are two types of transliteration, namely forward transliteration and backward
transliteration. Forward transliteration means transliteration of a name from its native script to a foreign
one. Backward transliteration is the restoration of a previously transliterated name to its native
script. However, any transliteration model can be modeled as phoneme-based, letter-based or
substring-based transliteration. Among others, Weerasinghe and others have developed a rule-based
syllabification algorithm that can be considered letter-based transliteration for the Sinhala
language [24]. This is a backward transliteration approach that reads Sinhala text and its
pronunciation.

We present a transliteration approach that comes under the letter-based tradition and uses the
theory of finite automata. At this stage, the project has considered only forward transliteration
from English to Sinhala.

The rest of this paper is organized as follows. Section II describes the structures of Sinhala and
English words. Section III reports on the design and implementation of the proposed transliteration
models for English to Sinhala machine translation. Section IV gives the conclusion and further
work.

II THE SINHALA AND ENGLISH LANGUAGES STRUCTURES

Machine transliteration is not a simple task. This is mainly because there are many ambiguities in
the relationship between the spelling of a word and its pronunciation. Further, some letters cannot
be pronounced in isolation, but need to be considered together with adjacent letters. Therefore, a
word-level analysis is required before going into transliteration approaches.

TABLE I
The Sinhala Alphabet

Letter Type        Sinhala Letters
Vowels             w, wd, we, wE, b, B, W, W! ,Ì, Ï iD, iDD, t, ta, ft, T, ´, T!
Consonants         l, L, . , >, V, Õ, p, P, c, Cv [, {, P, g, G, v, V, K, Ë, ; , : , o, O, k, |, m, M, n, N,
                   u, U, h, r, ,, j, Y, I, i, y, <, *
Semi-Consonants    x, (

TABLE II
Sinhala Strokes and their positions

No   Stroke   Name                Position   E.g.
1    A        Al-lakuna1          Upper      ia
     A        Al-lakuna2          Upper      ¾
2    d        Aela-pilla          Right      ld
3    e        Ketti aeda pilla    Right      le
4    E        Diga aeda pilla     Right      lE
5    s        Ketti is pilla      Upper      ls
6    S        Diga is pilla       Upper      lS
7    Q        Ketti paa pilla1    Lower      nq
     =        Ketti paa pilla2    Lower      l=
8    Q        Diga paa pilla1     Lower      nQ
     +        Diga paa pilla2     Lower      l+
9    D        Gaetta pilla        Right      iD
10   f        Kombuva             Left       fu
11   !        Gayanukitta         Right      T!

TABLE III
The consonant ‘l’ with vocalic strokes
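The E.g. column of Table II suggests a simple composition rule for the font-encoded layout used in these tables: the key of a stroke is written after the key of its base letter, except for the Kombuva, which is written before it. The short Python sketch below illustrates that rule only. The names `STROKE_POSITION` and `compose` are hypothetical, only a subset of the strokes is included, and the code is an illustration rather than part of the authors' Prolog system.

```python
# Position of each stroke key, copied from a subset of the rows of Table II.
STROKE_POSITION = {
    "d": "Right",  # Aela-pilla
    "e": "Right",  # Ketti aeda pilla
    "E": "Right",  # Diga aeda pilla
    "s": "Upper",  # Ketti is pilla
    "S": "Upper",  # Diga is pilla
    "D": "Right",  # Gaetta pilla
    "f": "Left",   # Kombuva
}


def compose(base: str, stroke: str) -> str:
    """Join a base-letter key and a stroke key in the order the layout writes them."""
    if STROKE_POSITION[stroke] == "Left":
        return stroke + base  # a left-positioned stroke precedes the base letter
    return base + stroke      # every other stroke follows the base letter


if __name__ == "__main__":
    # Reproduces four of the E.g. entries of Table II: ld, ls, iD, fu.
    for base, stroke in [("l", "d"), ("l", "s"), ("i", "D"), ("u", "f")]:
        print(base, "+", stroke, "=", compose(base, stroke))
```

A stroke that is not listed raises a `KeyError`, which keeps the sketch from silently guessing a position for strokes it does not cover.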
It appears that the transliteration method proposed in the Godage dictionary [23] is very readable
compared with others. Hence, our transliteration system uses this method for transliteration.

Using Type 2 transliteration, a Sinhala proper noun such as ‘Dambadeniya’ (oUfoksh) can be
recognized. In this example, it should be noted that the English letters ‘mba’ are transliterated
as U in Sinhala. This is why we cannot use Type 1 transliteration for Sinhala words written using
English letters.
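One way to see why a multi-letter unit such as ‘mba’ needs its own rule is a greedy longest-match substitution, sketched below in Python. This is a simplification, not the authors' FST. Only the rule ‘mba’ to U is stated in the text; the remaining entries of the hypothetical table `RULES` are read off the single example ‘Dambadeniya’ (oUfoksh) and are assumptions, as is the function name `transliterate_sketch`.

```python
# Illustrative rule table: Latin letter sequences -> font-encoded Sinhala keys.
# Only "mba" -> "U" is stated in the text; the other entries are assumptions
# derived from the one example 'Dambadeniya' -> 'oUfoksh'.
RULES = {
    "mba": "U",
    "da": "o",
    "de": "fo",
    "ni": "ks",
    "ya": "h",
}
MAX_LEN = max(len(key) for key in RULES)


def transliterate_sketch(word: str) -> str:
    """Greedy longest-match substitution over RULES."""
    text = word.lower()
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that 'mba' wins over shorter units.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            out.append(text[i])  # no rule applies: copy the letter unchanged
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate_sketch("Dambadeniya"))
```

Processing the word one letter at a time would never see ‘mba’ as a unit, which is the limitation the paragraph above attributes to Type 1 transliteration for Sinhala words.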
Fig. 2 FST for Consonants in Type 1 transliteration
q0 = {b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z}

Fig. 3 FST for Consonants in Type 2 transliteration
Q1 = {a, e, i, o, u, Ǐ, ŕ}, Q2 = {a, e, i}

C Implementation

In order to develop a full English to Sinhala transliteration system, we must implement both the
Type 1 and Type 2 FSTs. In a typical English to Sinhala machine translation system, transliteration
comes after the recognition of proper nouns, out-of-vocabulary words and technical terms by the
English morphological analyzer and the English dictionary. As such, the transliteration system
becomes a plug-in for a standard machine translation system. This plug-in implements two major
tasks, namely string manipulation and the application of Type 1 and Type 2 transliteration. String
manipulation covers tasks such as changing case, converting the letters of an English word into a
list, and generating a string of characters. This English character string is then sent to the FST
implementations of Type 1 and Type 2 transliteration.

We have used SWI-Prolog [26] and PSP to integrate the proposed transliteration approach into the
English to Sinhala machine translation system.

D Approach in Practice

The whole transliteration process can be described as follows. Having accepted an English word, the
system first removes capitalization to obtain lowercase text and generates an atom list in Prolog
[20]. After that, the finite state transducer is used, together with the information in Table VII,
to return the appropriate Sinhala vowels and consonants. Then the phonetic font layout (Table III)
is used to store this information. In the phonetic layout, letters are stored using a base letter
and its vowel sound. For example, the letter ‘fld’ (koo) is stored as the letter ‘la’ (k) and the
vowel ‘T’ (o). This method is used to generate Sinhala letters such as j + s = ú, u + a = ï, etc.
The Finite State Transducer (FST) contains approximately one hundred ‘arcs’ to represent the above
models. Fig 6 shows how our approach transliterates the proper noun ‘Saman Kumara’.
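The first steps of this process (lowercasing, splitting into a list of letters, and storing each unit as a base letter plus a vowel sound) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' SWI-Prolog implementation. The two states `q0` and `C`, the vowel set, and the function names are hypothetical. The mapping from these Latin pairs to Sinhala letters (Table VII in the text) is deliberately left out because it is not reproduced here, and long vowels and multi-letter consonants are not handled.

```python
VOWELS = set("aeiou")  # simplifying assumption: single-letter vowels only


def to_symbols(word: str) -> list:
    """String manipulation step: lower-case the word and split it into letters."""
    return list(word.lower())


def to_base_vowel_pairs(symbols: list) -> list:
    """Two-state transducer: 'q0' (nothing pending) and 'C' (consonant pending).

    Emits (base, vowel) pairs; an empty string marks a missing base or vowel.
    """
    state, pending, pairs = "q0", "", []
    for ch in symbols:
        if ch in VOWELS:
            # A vowel closes the pending consonant, or stands alone from q0.
            pairs.append((pending, ch))
            state, pending = "q0", ""
        else:
            if state == "C":
                # Two consonants in a row: the first one carries no vowel.
                pairs.append((pending, ""))
            state, pending = "C", ch
    if state == "C":
        pairs.append((pending, ""))  # word-final consonant without a vowel
    return pairs


def transliterate_phrase(phrase: str) -> list:
    """Apply both steps to every whitespace-separated word of the phrase."""
    return [to_base_vowel_pairs(to_symbols(word)) for word in phrase.split()]


if __name__ == "__main__":
    for word_pairs in transliterate_phrase("Saman Kumara"):
        print(word_pairs)
```

Keeping the base letter and the vowel as a pair, rather than as a finished glyph, mirrors the phonetic layout described above, where a later step combines the two into the final letter.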
Fig. 4 FST for Consonants in Type 2 transliteration
Q1 = {k, g, c, j, t, d, b, m, y, r, f, v, s, h, l, n, p}
Q2 = {k, g, c, j, t, d, b, s, p}

Fig. 5 English to Sinhala Transliteration System

We have also linked the transliteration system with our Sinhala Chatbot [10]. Additionally, people
who are not fluent in Sinhala typing [7], but are good at English typing, can communicate with the
Sinhala Chatbot.

IV CONCLUSION AND FURTHER WORK

Our objective in this project was to develop a transliteration model that can enhance the usage of
our English to Sinhala machine translation system. In achieving this, we have developed two
FST-based models (Type 1 and Type 2) for the transliteration of proper nouns, out-of-dictionary
words and technical terms. The Type 1 model was tested using the well-known WordNet [22]. The Type
2 model was tested using the Sinhala Chatbot. The results of these experiments were very
encouraging.

However, we have also identified the following main limitation. Handling the pronunciation of an
English word is a critical problem in English to Sinhala transliteration. For example, the English
letter ‘a’ represents the different sounds ‘w’, ‘we’ and ‘wE’ in Sinhala (ago – wf.da, America –
wefursld and ant – wEkaá). This leaves some ambiguity in transliteration. At present we can tackle
this issue by showing the English word next to the transliterated word, but we intend to resolve it
by incorporating the English IPA into our system. This will be further work.

ACKNOWLEDGMENT

The authors wish to thank Dr. Neranjan Bandara from the Department of Sinhala and Pali Bauddha,
University of Sri Jayawardenepura, for his great support in solving some linguistic problems.

REFERENCES

[1] B. Akshar, V. Chaitanya, R. Sangal, Natural Language Processing: A Paninian Perspective, Prentice Hall of India, New Delhi, India, 1995.
[2] B. Akshar, V. Chaitanya, R. Sangal, “Computational linguistics in India: an Overview”, available at: http://ucrel.lancs.ac.uk/acl/P/P00/P00-1077.pdf
[3] A. Bharati, V. Chaitanya, A. P. Kulkarni, R. Sangal, “Anusaaraka: Overcoming language barrier in India”, to appear in "Anuvad: Approaches to Translation", Rukmini Bhaya Nair (editor), Sage, New Delhi, 2001.
[4] C. Carter, “A Sinhalese-English Dictionary”, Asian Education Services, New Delhi, Chennai, 2004.
[5] J. B. Disanayaka, “Basaka Mahima:2 Akuru ha pili”, S. Godage & Bros, 661, P. D. S. Kularathna mawatha, Colombo 10, Sri Lanka, 2000.
[6] Durgesh R., “Machine Translation in India: A Brief Survey”, National Centre for Software Technology, Mumbai, India. http://www.elda.org/en/proj/scalla/scalla2001/scalla2001Rao.pdf
[7] G. Dias, A. Goonetilleke, "Development of Standards for Sinhala Computing", 1st Regional Conference on ICT and E-Paradigms, Colombo, Sri Lanka, 2004.
[8] B. Hettige, A. S. Karunananda, “A Parser for Sinhala Language – First Step Towards English to Sinhala Machine Translation”, to appear in the proceedings of International Conference on Industrial and Information Systems (ICIIS2006), IEEE, Sri Lanka, 2006.
[9] B. Hettige, A. S. Karunananda, “A Morphological analyzer to enable English to Sinhala Machine Translation”, Proceedings of the 2nd International Conference on Information and Automation (ICIA2006), Colombo, Sri Lanka, pp 21-26, 2006.
[10] B. Hettige, A. S. Karunananda, "First Sinhala chatbot in action", Proceedings of the 3rd Annual Sessions of Sri Lanka Association for Artificial Intelligence (SLAAI), University of Moratuwa, 2006.
[11] D. Jurafsky, J. H. Martin, “Speech and Language Processing”, Pearson Education Pte Ltd, Indian branch, 482, F.I.E. Patparganj, Delhi, India, 2005.
[12] A. Stevenson, J. Elliott, R. Jones, “The Little Oxford English Dictionary”, Oxford University Press, 2002.
[13] K. Gunaratne, “Ratna English-Sinhalese Dictionary”, Ratna Poth Prakasakayo, 513, Maradana road, Colombo 10, Sri Lanka, 2006.
[14] G. P. Malalasekera, “English-Sinhala Dictionary”, M. D. Gunasena saha Samagama, 217, Olkote mawatha, Colombo 11, Sri Lanka, 2005.
[15] S. Maitipe, “Sinhala-English Pocket Dictionary”, M. D. Gunasena saha Samagama, 217, Olkote mawatha, Colombo 11, Sri Lanka, 2005.
[16] S. Ranaweera, “Wasana English-Sinhala Dictionary”, Wasana Prakashakayo, Dankotuwa, Sri Lanka, 2004.
[17] A. M. Gunasekara, “A Comprehensive grammar of the Sinhala Language”, Asian Educational Services, New Delhi, Madras, 1999.
[18] O. Jong-Hoon, C. Key-sun, I. Hitoshi, “A comparison of Different Machine Transliteration models”, Journal of Artificial Intelligence Research, pp 119-151, 2007.
[19] K. Knight, J. Graehl, “Machine transliteration”, in Proceedings of the 35th Annual Meetings of the Association for Computational Linguistics, pp. 128–135, 1997.
[20] A. C. Micheal, “Natural Language processing for Prolog Programmers”, Prentice Hall, Upper Saddle River, New Jersey, 2002.
[21] Sri Lanka Standards Institute, Sri Lanka Standard SLS1134:1996 – Sinhala Character Code for Information Interchange; available at http://www.fonts.lk.lk/doc/sls1134.pdf, 2004.
[22] WordNet: a lexical database for English Language; available at http://wordnet.princeton.edu/index.shtml.
[23] A. Weerasinghe, C. P. Weerasinghe, “Godage English-Sinhala-Tamil Dictionary”, S. Godage and brothers, Godage book shop, 661, Maradana road, Colombo 10, Sri Lanka, 1999.
[24] R. Weerasinghe, A. Wasala, K. Gamage, "A Rule Based Syllabification Algorithm for Sinhala", Proceedings of 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), pp. 438-449, Jeju Island, Korea, 2005.
[25] Wikipedia, the free encyclopedia, http://en.wikipedia.org.
[26] SWI-Prolog home page; http://www.swi-prolog.org
[27] Y. Toshio, “The EDR electronic dictionary”, Communications of the ACM, Volume 38, Issue 11 (November 1995), Pages: 42-44, 1995.