The Prague Bulletin of Mathematical Linguistics


JULY 2012


English to Amharic Machine Translation Using SMT

Ambaye Tadesse, Yared Mekuria
Bahir Dar University, Institute of Technology, School of Computing and Electrical Engineering

This paper deals with translation from English to Amharic using statistical methods. Although Amharic is among the most widely spoken Semitic languages and is used extensively within Ethiopia, it remains one of the most resource-scarce languages. This paper has two main targets. The first is to test how far we can go with the limited parallel corpus available for the English-Amharic language pair, and to assess the applicability of existing Statistical Machine Translation (SMT) systems such as Moses to this pair. The second is to analyze the translation output in order to identify the challenges that still need to be tackled, such as reordering in the phrase-based translation model, since the two languages have different sentence structures. The paper also aims to serve as a starting point for similar research. After preprocessing the raw data, we used a limited corpus of 37,970 bilingual sentences and 68,815 monolingual sentences and achieved a translation accuracy of 18.74 BLEU. With the same data, the hierarchical translation model scored 8.43 BLEU, even though it handles reordering better than the phrase-based model. Finally, after adding 12,537 simple English phrases that translate to single Amharic words, the score improved to 23.16 for the phrase-based model and 11.24 for the hierarchical model.

1. Introduction
Amharic is the official working language of the Ethiopian government and is spoken by 25-30 million people, yet it lacks sufficient computational linguistic resources. Most work done to date on natural language processing of Amharic has taken rule-based approaches, yielding limited prototypes of morphological analyzers, part-of-speech (POS) taggers, and parsers, all of which suffered from the fact that
c 2012 PBML. All rights reserved. Corresponding author: ambayet03@gmail.com Cite as: Ambaye Tadesse, Yared Mekuria. English to Amharic Machine Translation Using SMT. The Prague Bulletin of Mathematical Linguistics No. ???, 2012, pp. 1–10. doi: HERE-WILL-BE-A-DOI-NUMBER


no lexicon (dictionary), annotated corpus, or treebank existed, as shown by previous research on corpus processing [Gamback, Fredrik, and Atelach]. Even though English is rich in resources and is the main language of instruction from high school through higher education for Amharic speakers, most people's, and even most students', command of English is poor. Amharic is not nearly as rich in resources as English: it is used as a mother-tongue language only in elementary schools, serves as an official language, and has a limited body of literature such as history, fiction, law, and news. This resource scarcity leads some professional translators to translate English-language resources for Amharic users. For example, international laws, news, magazines, and historical and fiction books are still translated into Amharic; the quality of these translations is inconsistent, depending on the translator's skill, and translation incurs high costs and takes a long time. In addition, most websites targeted at Amharic speakers are hosted in English, even though they contain information important to those speakers. There are also organizations and individuals that routinely perform translation through professional translators, for example news agencies, magazine producers, FM radios, private translators of books, and newspaper producers, including government law-announcement publishers that print their papers in both English and Amharic versions.
Therefore, translating documents from English to Amharic is necessary: it makes useful online documents and resources accessible to Amharic-language users, eases the work of organizations that need day-to-day translation, and gives direction to future researchers who want to improve translation quality; this experiment shows how better translation can be acquired.

2. Software Used for the Experiment
• Operating platform: we used the Ubuntu operating system, since our SMT tool, Moses, runs on this platform with full functionality.
• GIZA++: for the translation model, we used GIZA++ to train word alignments between the parallel English-Amharic corpus.
• IRSTLM: for the language model, we used IRSTLM, free and open-source software, to train a language model on the Amharic target corpus.
• Moses: for decoding, we used the Moses decoder, which runs after the translation model and language model have been trained.
• multi-bleu: for evaluating the translation system.
• MERT: for tuning, i.e., improving the performance of the translation.


Figure 1. The architecture of the English-Amharic SMT system

3. Experimentation
To develop the SMT system we used the Moses engine, which combines the basic components: the language model, translation model, decoder, tuning, reordering, and evaluation.
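These components map onto concrete Moses toolkit invocations. The following sketch only assembles the command lines for illustration (it does not run them); the installation paths, file names, and n-gram order are assumptions, not the authors' exact setup.

```python
# Build (but do not execute) the command lines of a typical Moses baseline pipeline.
# MOSES and IRSTLM paths, corpus names, and the LM order are illustrative assumptions.
MOSES = "/opt/moses"
IRSTLM = "/opt/irstlm"

def moses_pipeline(corpus="corpus.clean", src="en", tgt="am"):
    return [
        # 1. Language model: train an n-gram LM on the Amharic target side with IRSTLM.
        f"{IRSTLM}/bin/build-lm.sh -i mono.{tgt} -n 3 -o lm.{tgt}.gz",
        # 2. Translation model: GIZA++ alignment and phrase extraction via train-model.perl.
        f"{MOSES}/scripts/training/train-model.perl "
        f"--corpus {corpus} --f {src} --e {tgt} --external-bin-dir {MOSES}/tools",
        # 3. Tuning: MERT adjusts the model weights on a held-out set.
        f"{MOSES}/scripts/training/mert-moses.pl tune.{src} tune.{tgt} "
        f"{MOSES}/bin/moses model/moses.ini",
        # 4. Evaluation: score decoder output against the reference with multi-bleu.
        f"{MOSES}/scripts/generic/multi-bleu.perl ref.{tgt} < hyp.{tgt}",
    ]

for cmd in moses_pipeline():
    print(cmd)
```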

4. Data Collection and Preparation
For the English to Amharic translation, we prepared a parallel corpus from the Ethiopian constitution, the Negarit Gazeta proclamations, Bible books, some international regulations, and Ethiopian governmental portals. We also used some words from a dictionary, as well as English phrases that translate to single Amharic words. We performed sentence alignment on the parallel corpus and applied preprocessing such as removing characters other than Amharic characters, extraneous numbers, and short Bible verse names from the Amharic texts. We collected monolingual data for language modeling in addition to the target side of the parallel corpus; we added material from well-known web news agencies such as Ethiozena, Ethiopian Reporter, Walta Information Center, and Addis Admas using web mining.

The data used in training the SMT system consist of sentence-aligned parallel data for the translation model and monolingual data in the target language (Amharic only) for the language model. For this experiment we finally used a corpus of 37,515 bilingual sentences and 68,815 monolingual sentences.

One of the cleaning steps in data preparation is eliminating sentences with more than ninety words, which speeds up training in GIZA++. Another is lowercasing, so that words differing only in case (some words may be capitalized, causing inconsistency) are treated as the same. Moses also has a cleaning step that removes empty lines and redundant space characters. A further issue is that the Bible data contain archaic English words that are no longer in use; these were replaced with their current standard English equivalents, for example:
• ye - you
• thy - your
• hath - has
In the Bible data, some verses on the Amharic side are merged with the previous verse while the corresponding English text is split into two or more separate verses. We therefore combined the English-side verses when they were not too long, and segmented an overly long single Amharic verse into two or more verses; chapter 1 of the book of Jude, verses 14 and 15, is an example from the Bible portion of the corpus. In addition, many English-side Bible words carried extra characters surrounded by < >, and these, along with non-English words not useful for translation, were removed during preprocessing.
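A minimal sketch of the cleaning steps described above. The 90-word limit and the archaic-word list follow the text, but the exact regular expressions and verse-reference pattern are assumptions:

```python
import re

# Keep only Ethiopic-script characters (U+1200-U+137F) and whitespace on the
# Amharic side; this also strips stray numbers and verse numerals.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]")
# A leading short book name plus a chapter:verse reference, e.g. "ዮሐ 3:16".
VERSE_REF = re.compile(r"^\s*\S+\s*\d+[:፡]\d+\s*")

# Archaic King James words mapped to current standard English (as in the text).
ARCHAIC = {"ye": "you", "thy": "your", "hath": "has"}
MAX_WORDS = 90  # longer sentences are dropped to speed up GIZA++ training

def clean_amharic(line: str) -> str:
    line = VERSE_REF.sub("", line)        # drop a leading verse reference
    line = NON_ETHIOPIC.sub(" ", line)    # drop non-Ethiopic characters
    return re.sub(r"\s+", " ", line).strip()

def clean_english(line: str) -> str:
    # Lowercase and replace archaic words with their modern equivalents.
    return " ".join(ARCHAIC.get(t, t) for t in line.lower().split())

def keep_pair(en: str, am: str) -> bool:
    # Drop empty sentences and sentences with more than MAX_WORDS words.
    return (0 < len(en.split()) <= MAX_WORDS) and (0 < len(am.split()) <= MAX_WORDS)
```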

5. Specific Experimentation
We applied the standard procedures for building a baseline system with the Moses decoder to English-Amharic SMT in a series of repeated experiments. In the first experiment we tested whether the Moses tool works well for Amharic using a small parallel corpus drawn from the Ethiopian constitution.


Figure 2. Preprocessing of the King James Version Bible



In this first experiment, 2,615 sentences of parallel data were used. The resulting translation was not good, either in translation output or in reordering. After observing the low reordering performance, we decided to run a parallel experiment with the hierarchical translation model, which handles reordering well. We did not test or tune this first experiment, and it applied only to the phrase-based translation model.

The second experiment used the corpus from the Ethiopian constitution, the Negarit Gazeta, and Bible documents: 34,173 parallel English-Amharic sentences for training the translation model, 37,970 Amharic sentences for training the language model, 2,500 parallel sentences for tuning, and 3,797 parallel sentences for evaluating the system with multi-bleu. This parallel corpus was not clean, however; it contained many numbers within sentences that were inconsistent between the two languages (only the Amharic side has short verse names and verse numbers), which caused word misalignment. Training with the phrase-based model produces a configuration file whose initial weights for the translation model, language model, word penalty, and reordering model (distortion weight) are questionable. After a tuning run of 9 cycles using Moses MERT, which assigns the best weights for each model to be used in decoding, the configuration weights improved and yielded better translations. To evaluate the system, we compared the reference human-translated Amharic sentences against the machine-translated Amharic sentences; the multi-bleu score was 9.87 for the phrase-based model and 3.54 for the hierarchical model.
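The multi-bleu score used here is corpus-level BLEU with a single reference. As a rough illustration of what the metric computes, the following is a minimal reimplementation (uniform 4-gram weights and brevity penalty, in the style of multi-bleu.perl); it is a sketch, not the actual evaluation script:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count all n-grams of the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), single reference, uniform n-gram weights."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # candidate n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:   # some n-gram order has zero matches
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```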
This result shows that the system performs poorly on the biblical and legal evaluation domain, getting roughly 10 words correct out of 100 on the tested domain. The reasons for this low quality are:
• the limited amount of training corpus, drawn from only two specific domains;
• the large number of numerals and Bible verse titles on the Amharic side that have no English counterpart;
• alignment problems between parallel sentences in the corpus, discovered only after evaluation: Ubuntu inserts empty lines at random positions in text files that come from Windows XP.

In the third experiment we used the same English-Amharic corpus as in experiment two. Although we could not obtain a large dictionary, we added a small number of dictionary words (174 words), since using a dictionary compensates for unknown words and increases translation performance. This gave about 37,520 parallel sentences for training the model, tokenized and cleaned so that sentences contain at most 90 words. We also added a much larger monolingual Amharic corpus of 68,258 sentences from well-known newspapers for language modeling. In this experiment we addressed the preprocessing problems found in the second experiment: we fixed the sentence-alignment problem caused by the inserted empty lines, removed the short Bible verse titles and numbers, and removed English words and other unknown characters from the target-side (Amharic) monolingual data, especially the newspaper material. We also increased the size of the training corpus. After tuning (with the same tuning set size as in the second experiment), we tested the model on 3,797 parallel sentences from both domains (legal and biblical) and obtained a multi-bleu score of 15.8 for the phrase-based translation model and 5.61 for the hierarchical phrase-based model, a great improvement over the second experiment.

In the fourth experiment we used the same sentence-aligned parallel data as experiment three, but unlike the previous experiments we chose our test set systematically: 10% of each block of 1,000 sentences, that is, 100 sentences out of every 1,000, from both language sides in the biblical and legal domains (except for the simple phrases). In the previous three experiments the test set had been the final 10% of the corpus in each domain. In this experiment the phrase-based translation model scored 18.74 BLEU and the hierarchical translation model 8.43 BLEU, a good result given that testing stays within the domain of the training corpus. The increase in BLEU for these two baseline systems comes from the systematic choice of test data, which makes the test domain overlap much more with the training domain: for example, choosing 10% of each 1,000 sentences of the King James Version makes it highly probable that test and training data fall in the same book, and in some cases the same chapter, which deals with the same or closely related subject matter.
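The systematic test-set selection of experiment four (100 sentences out of every 1,000) can be sketched as follows; the block sizes are those stated in the text:

```python
def systematic_split(pairs, block=1000, test_per_block=100):
    """Take the first `test_per_block` pairs of every `block` pairs as the test
    set, so test sentences are spread across the whole corpus and its domains."""
    train, test = [], []
    for i, pair in enumerate(pairs):
        # Position within the current block decides the destination.
        (test if i % block < test_per_block else train).append(pair)
    return train, test
```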
In the final experiment, in addition to the data of experiment four, we used 12,537 simple English phrases, containing more than 1,000 unique words, each translated as a single Amharic word (for example, English 'he kick' and Amharic 'metta'). Selecting the test data systematically as in experiment four, we obtained a BLEU score of 23.16 for the phrase-based translation model and 11.24 for the hierarchical translation model.

6. Results
The result we found when evaluating on our test data is good translation, but with low reordering performance in the phrase-based model, even though we set an appropriate distortion factor and phrase penalty for SVO-SOV language pairs through the tuning process. We conclude that the reordering inefficiency comes from the training data, drawn from the King James Bible and some legal documents, which do not consistently follow standard, current English SVO structure: the alignment is done correctly, but the source data do not preserve a normal, consistent structure, so they cannot match the structure of the relatively large language model, which is trained with more data prepared from current news websites than the target side of the parallel training corpus. The translation quality of the hierarchical model is better than that of the phrase-based model, and in reordering the hierarchical model is much better than phrase-based translation.

Figure 3. Summary of experiments and BLEU scores

7. Recommendations
The SMT system can be improved by applying linguistic features using a morphological analyzer and syntactic information using language parsers for Amharic and English. This exposes the language structures and improves reordering in the translation output, raising quality. The system could also be improved with an efficient Amharic POS tagger. We suggest that a research center be established to develop a larger parallel corpus; a good application for automatic web mining would also help, retrieving parallel data online from websites that periodically publish their information in both languages. For the best reordering from English (SVO) to Amharic (SOV), we need approaches beyond tuning the distortion factor as we did in our experiments, since the output often reads as a monotonic translation, i.e., words translated without any reordering. One such approach is to move English verbs to the end of the sentence so that it matches Amharic structure, i.e., SVO -> SOV.
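The verb-movement idea above can be illustrated with a toy preordering function over POS-tagged English. The Penn-style tag set and the example sentence are illustrative only, not part of the authors' system:

```python
# Toy SVO -> SOV preordering: move the verbs of a POS-tagged English clause
# to the end, approximating Amharic word order.
def reorder_svo_to_sov(tagged):
    """tagged: list of (word, pos) pairs; verbs carry tags starting with 'VB'."""
    verbs = [(w, t) for w, t in tagged if t.startswith("VB")]
    rest = [(w, t) for w, t in tagged if not t.startswith("VB")]
    return [w for w, _ in rest + verbs]

sent = [("he", "PRP"), ("kicked", "VBD"), ("the", "DT"), ("ball", "NN")]
print(" ".join(reorder_svo_to_sov(sent)))  # he the ball kicked
```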


8. Acknowledgments
We would like to thank the Moses support team, especially the staff at the University of Edinburgh, for helping us complete our experiments quickly; our advisor Mr. Tsegaw Kelelaw for his encouraging advice throughout the experiments; and everyone else who helped make this work a success.

9. Bibliography
- Kevin Knight and Philipp Koehn. What's New in Statistical Machine Translation. Information Sciences Institute, University of Southern California.
- David Chiang. Hierarchical Phrase-Based Translation. Computational Linguistics, USC Information Sciences Institute.
- Kevin Knight. A Statistical MT Tutorial Workbook, 1999.
- Philipp Koehn. Statistical Machine Translation. Cambridge University Press.
- LetsMT. D3.6 Training and Evaluation of Initial SMT Systems, 2011.
- Peter F. Brown, Vincent Della Pietra, et al. The Mathematics of Statistical Machine Translation. IBM T.J. Watson Research Center, USA.
- Zahurul Islam. English to Bangla Phrase-Based Statistical Machine Translation. Saarland University, 2009.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL, IBM T.J. Watson Research Center, USA, 2002.
- Philipp Koehn, Hieu Hoang, Nicola Bertoldi, Brooke Cowan, et al. Moses: Open Source Toolkit for Statistical Machine Translation. University of Edinburgh, University of Maryland, Charles University, 2007.
- M. Federico, N. Bertoldi, and M. Cettolo. IRST Language Modeling Toolkit Version 5.20.00 User Manual. FBK-irst, Trento, Italy.
- Philipp Koehn. Moses Statistical Machine Translation System User Manual and Code Guide. University of Edinburgh, 2008.


- Daniel Yacob. Developments towards an Electronic Amharic Corpus. Ge'ez Frontier Foundation.
- Björn Gambäck, Lars Asker, Fredrik Olsson, and Atelach Alemu Argaw. Collecting, Processing and Testing an Amharic Corpus. Userware Laboratory, Norwegian University, Stockholm University, and Addis Ababa University.
- George Foster, Simona Gandrabur, Philippe Langlais, Pierre Plamondon, Graham Russell, and Michel Simard. Statistical Machine Translation: Rapid Development with Limited Resources. University of Montreal.
- Sisay Adugna and Andreas Eisele. English-Oromo Machine Translation: An Experiment Using a Statistical Approach. Haramaya University and DFKI GmbH, 2009.

