Postgraduate Office
MSc Program
By Tarekegn Yohannes
July 2020
Declaration
I, Tarekegn Yohannes, the undersigned, declare that this thesis entitled “Bi-directional
English-Hadiyyisa Machine Translation” is my original work. I have undertaken the research
independently with the guidance and support of my research advisor. This study has not
been submitted for any degree or diploma program in this or any other institution, and all
sources of materials used for the thesis have been acknowledged.
Declared by
Name_________________________________
Signature: _____________________________
Department ____________________________
Date _________________________________
Certificate of Approval of Thesis
Admas University
This is to certify that the thesis prepared by Tarekegn Yohannes, entitled “Bi-directional
English-Hadiyyisa Machine Translation” and submitted in partial fulfilment of the requirements
for the Degree of Master of Science in Computer Science (MSc), complies with the regulations
of the University and meets the accepted standards with respect to originality and quality.
Table of Contents
Abbreviations...............................................................................................................................viii
Acknowledgement..........................................................................................................................ix
Abstract............................................................................................................................................x
CHAPTER ONE..............................................................................................................................1
INTRODUCTION...........................................................................................................................1
Introduction..................................................................................................................................1
1.1 Background............................................................................................................................1
1.3 Objective................................................................................................................................5
CHAPTER TWO.............................................................................................................................9
LITERATURE REVIEW...................................................................................................9
2.1 Introduction............................................................................................................9
2.4.2 Corpus-Based Machine Translation Approach (CBMT)..............................................16
2.4.3 Neural Machine Translation (NMT)................................................................17
2.4.4 Hybrid Machine Translation (HMT).............................................................................20
2.5 Evaluation of Machine Translation Systems.......................................................................20
CHAPTER THREE.......................................................................................................................29
HADIYYISA LANGUAGE..........................................................................................................29
3.1 Introduction..........................................................................................................................29
3.3 Phonology............................................................................................................................30
3.4 Alphabets.............................................................................................................................31
3.9 Verbs....................................................................................................................................37
3.11 Adjectives...........................................................................................................................39
3.12 Punctuations.......................................................................................................................40
3.13 Numerals............................................................................................................................41
CHAPTER FOUR.........................................................................................................................43
SYSTEM DESIGN AND DEVELOPMENT................................................................................43
4.1 Introduction..........................................................................................................................43
4.2.2 System Architecture of English-Hadiyyisa (NMT).......................................................47
1. Encoder and Decoder Stacks...........................................................................................50
2. Attention:........................................................................................................................50
4.3 Data collection.....................................................................................................................53
4.6.1 Preprocessing.................................................................................................................60
4.6.2 Training.........................................................................................................................60
4.6.3 Translation.....................................................................................................................61
4.6.4 Testing...........................................................................................................................62
4.6.5 Prototype of the System................................................................................................62
4.6.6 Tools for NMT..............................................................................................................63
CHAPTER FIVE...........................................................................................................................64
DISCUSSION OF THE RESULT.................................................................................................64
5.1 Introduction..........................................................................................................................64
5.3 Corpus...................................................................................................................64
5.7 Discussion............................................................................................................................72
CHAPTER SIX..............................................................................................................................73
CONCLUSION AND RECOMMENDATION............................................................................73
6.1 Conclusion...........................................................................................................................73
6.2 Recommendation.................................................................................................................74
References..................................................................................................................................75
Appendix: III..............................................................................................................................82
List of Tables
List of Figures
Abbreviations
Acknowledgement
Many people contributed, directly and indirectly, unforgettable assistance to this research
work. I could not have finalized this study without their help.
Firstly, my advisor Dr. Michael Melesse relentlessly guided me from the beginning to the
finalization of this work. I cannot forget his sincere help whenever I faced confusion. Indeed,
he created in me the enthusiasm that I can do it.
I would like to thank my family for their constant encouragement. Their prayer and patience
for my success are beyond my words. May the Lord bless them all!
My God, Lord of heaven and earth, who is my shield and all my strength, exalted be His name.
He answered my prayer. He gave me full health. He led me through ways that were too difficult
for me. Worthy is He to receive glory, thanks, power, blessings and every praise. Amen!
Abstract
Machine translation is one of the application areas of natural language processing (NLP), in
which computers are used to translate from a source language to a target language. Because of
the need to share information available in resource-rich languages, there is a great demand for
translation. The purpose of this study is therefore to develop a bi-directional English-Hadiyyisa
machine translation system.
To implement this study, statistical and neural machine translation approaches were applied.
The data were collected from two different sources and organized into two categories: a
religious and an educational domain. These data were preprocessed into a format suitable for
machine translation. During experimentation, both approaches were applied to the two data
sets separately; a third experiment was conducted on the combined data set. For statistical
machine translation (SMT), the Moses decoder was used. Given a source-language sentence,
Moses searches for the highest-probability target-language sentence, based on the parallel
corpus. The same parallel corpora were then used with a neural machine translation (NMT)
method, the Transformer model with attention, which enhances parallelization during training.
The model uses self-attention and feed-forward networks in each encoder and decoder layer,
taking the source and target corpora as input. All experiments were conducted in both
directions because the translation is bidirectional. The models produced during training were
evaluated with the bilingual evaluation understudy (BLEU) score.
The experimental results of each approach were recorded separately. On the combined data
set, the English-to-Hadiyyisa BLEU scores are 2.07% for SMT and 1.47% for NMT, and the
Hadiyyisa-to-English scores are 3.29% and 3.81% respectively. On the Bible domain, English
to Hadiyyisa scores 2.16% (SMT) and 1.47% (NMT), and Hadiyyisa to English scores 3.58%
(SMT) and 3.59% (NMT). On the educational domain, English to Hadiyyisa scores 0.33%
(SMT) and 0.31% (NMT), and Hadiyyisa to English scores 0.68% (SMT) and 0.83% (NMT).
According to these results, Hadiyyisa to English performs slightly better than English to
Hadiyyisa. Finally, future experimentation with a larger data set is recommended to improve
these scores.
Keywords: Machine Translation, Neural Machine Translation, Statistical Machine Translation,
Moses, Transformer model with attention.
CHAPTER ONE
INTRODUCTION
Introduction
This chapter gives general information about the Bi-directional English-Hadiyyisa machine
translation study. The background of the study, which highlights natural language processing,
machine translation (MT) and the Hadiyyisa language, is discussed first. This is followed by
the statement of the problem on which the study is based, the objectives of the study, the
methodology followed, the scope and limitations of the study, the significance of the study,
and the organization of the paper.
1.1 Background
Since the emergence of machine translation with Warren Weaver's influential work [1],
Natural Language Processing (NLP) has been an area of huge demand for information sharing
[2]. The ability of computers (machines) to understand human language, together with the
advancement of ICT and the internet, has contributed to the ever-increasing development of
NLP [3] [4]. NLP is a subfield of artificial intelligence and linguistics whose goal is to achieve
human-like language processing by analyzing and understanding texts, based on theories and
technologies for communication [5].
With this in mind, natural language processing has several applications: text simplification,
text-to-speech synthesis, word sense disambiguation, sign language processing, automatic
speech recognition, speech-to-text automation, text categorization, sentence boundary
detection, optical character recognition (OCR), natural language generation (NLG), question
answering, etc. [1] [6] [7]. Machine translation is one of the areas where NLP is most widely
applied.
Machine translation (MT) refers to computer-aided systems that produce a natural language
translation from one language (the source language) into another (the target language) with
minimal or no human intervention, so as to create the best possible translation [8] [4]. It is
more than word-for-word substitution: the source text must be translated while preserving the
grammar, syntax and semantics of the target language [3]. Translations can be speech-to-text,
text-to-speech, text-to-text or of other types. Even though machines are capable of translating
human languages, no machine is like a human [9]. So linguists sometimes have to intervene to
edit anomalies in machine-created translations.
To accomplish the machine translation task, researchers use different types of translation
approaches. These are corpus-based machine translation, rule-based machine translation
(RBMT), and hybrid approaches [2]. RBMT builds linguistic rules based on morphological,
syntactic and semantic information related to the source and target languages [9]. It includes
interlingua, dictionary-based and transfer-based approaches. The corpus-based approach
generates translations from bilingual text corpora; statistical, neural and example-based
machine translation are categorized as corpus-based machine translation. The hybrid method
[2] is an advanced method that combines features of the corpus-based and rule-based
approaches.
There are around 7,097 living languages in the world [3]. Understanding these languages
without a translator is not easy, and human translation is resource-intensive, so machine
translation is needed in order to share scientific, educational, legal, web and other resources
found in specific languages such as English. About 71% of web pages are in English, and
German, Japanese, French and Spanish are among the other most common languages [10].
The continent of Africa contributes 30% of these 7,097 languages, which are resource-scarce,
including the Ethiopian languages [3]. As a result of language barriers, most people in the
world are not able to use resources on the internet in their own language [11]. Using the
resources that reside in the most common languages, like English, through human translation
is not an easy task: human experts who know each element of a language, and how the words
affect the meaning of the translation, are costly. So there is a need for machine translation
using different types of machine learning approaches [12] [7], especially for developing
countries like Ethiopia, with more than 80 different languages.
Hadiyyisa is one of Ethiopia's more than 80 languages, spoken by the Hadiyya people in the
Hadiyya zone of the Southern Nations, Nationalities and Peoples Region (SNNPR) [13].
According to the central statistical census report, the population of the Hadiyya people was
1.6 million [14] [15]. However, as of March 2018 the CIA estimated the Hadiyya population
at 1.8 million, that is, 1.7% of the 105,350,020 Ethiopian population [14].
Hadiyyisa belongs to the Afro-Asiatic language family, which is widely spoken in West
Asia/the Middle East, the Horn of Africa, North Africa, and parts of the Sahel. It thus belongs
to the same family as Arabic, Oromo, Hausa, Amharic, Somali, Hebrew, Tigrinya, Kabyle,
Tamazight, and many more [14] [16]. Within this family it is categorized in the Cushitic
subfamily, which also includes Oromo, Somali, Sidamo, Afar, Agew, Gedeo, Saho, and Beja
[17].
Hadiyyisa was one of the 15 Ethiopian languages in which a literacy campaign was conducted
using the Saba script in the 1970s and 1980s. That script is essentially a syllabary, each of
whose characters represents a consonant and a vowel. In 1994, the Latin script was adopted
for Hadiyyisa, as for some other Cushitic languages of Ethiopia. Currently, the Latin-based
orthography is in use, especially at school, college and university level [13].
Like other Ethiopian languages, Hadiyyisa is a severely resource-scarce language. Around the
world, machine translation is currently being used to address such problems. That is what this
research study, bidirectional English-Hadiyyisa machine translation, attempts to do using
statistical and neural machine translation approaches.
1.2 Statement of the Problems
Nowadays, the borderless economy demands communication between people to facilitate their
day-to-day business [2] [18]. In addition, people want to communicate with others in their own
language, or want to learn a new language, which may require translation from one language
to another. Besides this, the internet is a source of information available only in certain
well-resourced, technologically supported languages such as English and other European and
Asian languages [19] [20]. In contrast, under-resourced Ethiopian languages suffer from a lack
of digital resources and cannot take advantage of the technology, as discussed by different
researchers [3]. Ethiopia has 83 registered languages; among these, Hadiyyisa is one of the
languages that require digital resources to facilitate communication between different people.
Currently, Hadiyyisa is given as a subject at primary, secondary and tertiary levels of
education [13]. However, resource availability is very small. This unavailability of resources
is also leading Hadiyyisa to lose ground among native speakers, and it is not functionally used
in zonal administrative offices [14] [16]. Therefore, to prevent language endangerment, written
documents encompassing new and traditional contents of the language should be translated to
and from other languages [21]. Compared to other languages, English is technologically
supported and well-resourced, and its content needs translating to the under-resourced one.
The driving factor for developing English-Hadiyyisa machine translation is the absence of an
existing automatic machine translation tool that translates documents between English and
Hadiyyisa in both directions. Translating languages manually is impractical due to time, cost,
accuracy and other factors; automated translation helps minimize the time and cost of
translation by making resources available in the respective language. Therefore, there must be
an automated system that bridges this language barrier: a tool that translates from one
language to the other.
Thus, the main objective of this research study is to design and develop bi-directional
English-Hadiyyisa machine translation. In an attempt to solve the above problems, the
following research questions were formulated:
How can a parallel corpus for an English-Hadiyyisa machine translation system be
developed?
How can an English-Hadiyyisa machine translation system contribute a vital role for the
under-resourced language Hadiyyisa?
1.3 Objective
1.3.1 General Objective
The general objective of this study is to design and develop a bidirectional English-Hadiyyisa
machine translation system to facilitate communication between English and Hadiyyisa
language speakers.
To review the state-of-the-art literature on machine translation between resourced and
under-resourced languages.
1.4 Methodology
To achieve the proposed bidirectional English-Hadiyyisa machine translation, different types
of literature and related works from national and international perspectives have been studied,
and different tools and techniques were applied. Data collection, data preparation,
experimentation and translation performance evaluation were carried out.
In this research study, both statistical (SMT) and neural (NMT) translation approaches were
applied. The software tools used for machine translation, evaluation and data preparation in
each approach were as follows:
For the statistical method, Linux (Ubuntu 16.04.4) was used on a laptop with 4 GB of RAM
and a Core i5 processor. Working with statistical machine translation requires tools such as
Moses, MGIZA and SRILM to be installed on Linux. Moses is an SMT system that trains a
translation model for any language pair; MGiza++ was used for word-aligning the parallel
corpus; and SRILM was used for language modeling. The bilingual evaluation understudy
(BLEU) metric was used to evaluate the performance of the translation models. A Linux text
editor was used to prepare data in the format usable for machine translation. During data
preparation, the Nova Aligner app was used to prepare the parallel corpus from the original
sources: it shows the data side by side and allows larger sentences to be broken into smaller
ones. Microsoft Excel 2013 was also used for data preparation, and Microsoft Office Word
2013 was used to write the report.
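The cleaning and alignment steps described above can be sketched in Python. The
normalization rules and the length limit below are illustrative assumptions, not the exact
steps used in this study:

```python
import re

def normalize(sentence):
    """Lowercase, fix spacing before punctuation, and collapse whitespace.
    These rules are illustrative; the study's exact cleaning may differ."""
    sentence = sentence.lower().strip()
    sentence = re.sub(r"\s+([.,!?;:])", r"\1", sentence)  # no space before punctuation
    sentence = re.sub(r"\s+", " ", sentence)              # collapse runs of whitespace
    return sentence

def clean_parallel(src_lines, tgt_lines, max_len=80):
    """Keep only sentence pairs where both sides are non-empty and short
    enough, mirroring the usual corpus-cleaning step before training."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = normalize(src), normalize(tgt)
        if src and tgt and len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            pairs.append((src, tgt))
    return pairs
```

Each surviving pair would then be written out one sentence per line, the format that both
Moses and OpenNMT-py expect.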
Using these same data, OpenNMT-py, an open-source neural machine translation platform
with a PyTorch backend, was used. It was installed in a Google Colab Jupyter notebook,
which was used for code editing and running. Google Colab is a free online application that
provides access to a graphics processing unit (GPU) for data processing during training, and
it also provides storage for the running data. The BLEU score, MS Office and the data
preparation tools were common to the NMT experiments.
As with other African and Ethiopian languages, resources for Hadiyyisa are scarce; as a
result, only a small amount of data was collected, from different sources. Websites, Wachamo
University, Hosanna TTC College and the Bible Society of Ethiopia (BSE) are some of the
sources the data were collected from. A total of 40,492 English-Hadiyyisa parallel sentences
were collected in two categories: religious and educational. The religious category contributes
around 83% of all the data; in this domain, after data extraction and cleaning, 33,880
preprocessed bilingual parallel sentences were prepared. The second category, from
educational resources, consists of 6,610 preprocessed bilingual parallel sentences prepared
from teaching and learning materials for the Hadiyyisa language.
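To illustrate how such a collection is typically divided for experimentation, the following
sketch splits the parallel sentence pairs into training, development and test portions. The
90/5/5 proportions and the fixed seed are assumptions for illustration; the study's actual
split is not restated here:

```python
import random

def split_corpus(pairs, dev_frac=0.05, test_frac=0.05, seed=13):
    """Shuffle sentence pairs and split them into train/dev/test portions.
    The proportions and seed are illustrative assumptions."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    n = len(pairs)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    dev = pairs[:n_dev]
    test = pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test
```

A fixed split like this lets the SMT and NMT systems be evaluated on exactly the same
held-out sentences.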
1.4.4 Evaluation
The performance of the system in this study was evaluated with the BLEU score, the most
widely used metric for machine translation performance evaluation. It is inexpensive and
automatic, and its results correlate reasonably well with human judgment.
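As an illustration of how BLEU works, the following is a simplified single-reference,
sentence-level BLEU: the geometric mean of modified n-gram precisions (up to 4-grams)
multiplied by a brevity penalty. Real toolkits add corpus-level pooling and smoothing, so
this sketch will not reproduce their scores exactly:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # "modified" precision: clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, a zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate sharing no words scores 0.0.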
To obtain a well-performing machine translation system, a large bilingual corpus is needed.
In this study, however, only a limited bilingual corpus from specific religious and educational
domains was available, and several attempts to obtain a monolingual corpus for Hadiyyisa
failed. Resource unavailability, and preparing this small data set from Bible and educational
sources, was thus the most challenging part of the work. Besides all this, the lack of machines
for training neural machine translation models was another problem: Google Colab was used,
but while training the neural machine translation it limits access and closes for hours, or at
most for a whole day. Because of these and other constraints, human evaluation of the system,
model evaluation for SMT and corpus validation by linguists were not included in this study.
and neural machine translation. The test results of the experimentation and the corpus used
are briefly explained. Finally, chapter six presents the conclusion and recommendations.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
In this chapter, the literature in the field of machine translation is reviewed. Section 2.2
discusses the concept of machine translation, followed by the historical development of
machine translation in section 2.3. Section 2.4 then discusses different approaches to
machine translation, including rule-based, corpus-based, hybrid and neural machine
translation systems. Finally, related work on translation from resourced to under-resourced
languages and between under-resourced languages is reviewed in section 2.5, followed by a
discussion of the challenges of machine translation in section 2.6.
Because human language is a medium for communication, there is a need for translation:
language carries different types of information to be shared, and the need for translation
among languages is currently increasing [2]. Today we live in an information age with
numerous resources that have to be translated, such as newspapers, legal documents, working
manuals and instructions, different types of books, scientific publications and others. Because
of the abundance and variety of these resources, professional translators cannot fulfil the
increasing demand of users. Therefore, there must be an automatic tool (MT) that easily
translates these resources from resource-rich languages.
To address translation problems, there are different types of machine translation [22]:
machine translation for watchers (MT-W), machine translation for revisers (MT-R), machine
translation for translators (MT-T) and machine translation for authors (MT-A). MT-W is for
readers who need to get information written in a foreign language; it came into use when the
pioneers of machine translation translated written military and technological documents.
MT-R produces an automatic raw translation comparable to an original human-translated
draft; since it frees professional translators from the difficult and time-consuming work that
would otherwise be passed to revisers, it is taken as a brush-up. MT-T is used by human
translators and belongs to the category of personal computer (PC) based translation tools that
run on standard platforms, integrated with numerous text processing tools; it provides online
dictionaries and translation memories for professional translators. The last one, MT-A, helps
authors translate their texts into one or many other languages; the author agrees to assist the
machine wherever there could be ambiguity, in order to get a satisfying translation without
revision.
Machine translation has several advantages besides its limitations [6]. Some of the
advantages:
Quick translation: compared with human translation, which takes much time, MT is
preferable because it saves time and effort when translating large amounts of text.
Low price: hiring a professional translator means paying whatever he or she requires,
which can be too costly; using machine translation instead is reliable and cost-effective.
Universality: machines translate any text into the required language regardless of the
domain it belongs to, whereas human translators are limited to the specific fields in
which they specialized.
Online and web page content translation: today there are many online machine
translation tools that help translate information from one language into the intended
language, including tools that translate search engine queries and web page contents.
However, machine translation has some disadvantages that only human translation can
solve, such as lower exactness and inferior translation quality when texts contain ambiguous
words and sentences.
By 1952 there had been few examples of machine translation, and only the few who were
interested were invited to the first machine translation conference [23]. Bar-Hillel's paper,
together with Weaver's 1949 memorandum, was circulated as foundational reading to all
members of this first Conference on Mechanical Translation, hosted by Bar-Hillel at MIT on
17-20 June 1952, and the summaries of the conference were published in the newly created
journal Mechanical Translation [24]. At the close of the conference there was a discussion of
what the next step should be, and it was agreed that sources of funding had to be investigated.
Accordingly, Duncan Harkin of the US Department of Defense suggested that his department,
and likely other US organizations, would be forthcoming with funds for projects, and Jerome
Wiesner added that funding and help might also be forthcoming from the Research Laboratory
of Electronics at MIT [23].
At that time Leon Dostert, who had been Eisenhower's personal interpreter during the war
(1944-1945) and liaison officer to the French commander, was invited for his experience with
mechanical aids for translation. Dostert contacted Watson at IBM, and they agreed to
collaborate; International Business Machines (IBM) donated the equipment for the activities.
After a constellation of linguists and engineers confirmed that the work was “feasible”, he
could not get a computer mathematician for the processing side; instead he decided that the
demonstration should translate Russian into English, because of the politics of the time: the
enemy then was Russia, not Germany. The demonstration was made with a vocabulary of 250
lexemes on 7 January 1954 at the New York headquarters of IBM [23].
Much of the early work in MT was disappointing because of the meaning problems generated
and the ambiguities left unhandled. The translation system made at Georgetown University,
when given the English sentence
“The spirit is willing, but the flesh is weak”, translated it as “the vodka
is good but the meat is rotten” [1].
The studies continued: in 1962 the Association for Machine Translation and Computational
Linguistics was established in the U.S., and many researchers joined it. The Automatic
Language Processing Advisory Committee (ALPAC) followed in 1964. The 1966 ALPAC
report concluded that machine translation could not deliver the expected results, so funding
for the field was reduced and its development slowed. However, web applications of MT
started in 1996 with SYSTRAN, which translated small texts. In 2003 and 2007 came further
innovations, such as Moses, an open-source SMT tool. Later, text- and SMS-translating
mobile phones appeared in 2008, and speech-to-speech translating mobile devices for
English, Japanese and Chinese were introduced in Japan in 2009. In 2012 Google announced
Google Translate (GT) [2].
features there is also another approach, called hybrid [2]. The rule-based approach works
with built-in linguistic rules and bilingual dictionaries for both languages, while corpus-based
approaches such as statistical machine translation (SMT) translate from source to target
language based on a given corpus, taking it as example. The following sections discuss these
approaches in detail, and the Vauquois triangle in figure 2.1 shows the MT approaches: the
direct approach, interlingua, and transfer-based translation, arranged according to the depth
of source language analysis and target language generation, from the zero level, where text
is just strings of characters, up to higher, more understandable levels [24].
1. Direct Machine Translation (DMT)
2. Transfer-Based Machine Translation
3. Interlingua
In the direct approach, source-language content is translated without passing through an internal representation. Words are translated much as a dictionary would translate them, often without much relationship of meaning between them. Dictionary lookup may be done with or without morphological analysis. This translation approach is bilingual and unidirectional and requires little syntactic and semantic analysis [9]. There is no analysis of the source data; source phrases are translated directly into their matching target-language translations through encoded dictionaries [24]. Figure 2.1 presents direct machine translation in the Vauquois triangle.
Example: let us consider the language divergence in the following sentence pairs in Hadiyyisa and English.
Because of the drawbacks observed in DMT, the transfer-based RBMT approach was developed [2]. It explicitly arranges the machine translation process in three stages. First, source analysis produces an intermediate representation of the source language, such as a syntax tree or a semantic representation (SL representation). Second, the source-language intermediate representation is transferred to the corresponding target-language intermediate representation (TL representation). Finally, the target-language text is generated from its intermediate representation [24]. Figure 2.1 above in Section 2.4, with its horizontal, downward and upward arrows, represents transfer-based machine translation.
This paradigm also has certain limitations, like DMT: rules must be defined at every stage of translation, modules are not reusable, and the transfer modules are not simple [2].
2.4.1.3 Interlingua approach
The aim of this approach is linguistic homogeneity: the source language is translated into an intermediate representation that does not depend on any particular language, and the target language is generated from this intermediate representation [2]. It is the most suitable approach for multilingual systems. It works in two phases: analysis (from the SL to the Interlingua) and generation (from the Interlingua to the TL) [22].
The corpus-based approach is also called the data-driven approach, and it overcomes the anomalies of RBMT. CBMT uses a bilingual parallel corpus to translate the source language into the target language: it learns the required knowledge from the corpus, and a large amount of data is therefore needed [2][22]. Statistical machine translation (SMT) and example-based machine translation (EBMT) are well-known members of this class.
Statistical machine translation (SMT) is built on statistical models whose parameters are derived from the analysis of bilingual corpora. The original SMT model is based on Bayes' theorem and takes the view that every sentence in one language is a possible translation of any sentence in the other, the most appropriate being the translation assigned the highest probability by the system. Translation is based on the probability p(e|f): the probability of translating a sentence f in the source language F (for example, English) into a sentence e in the target language E (for example, Hadiyyisa) [2].
Using such a probability function, SMT finds the most probable target text for the given source sentences. Since this approach requires no prior linguistic customization to conduct a translation task, it has the eminent advantage of avoiding the innumerable problems of rule-based approaches.
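The search for the most probable translation can be sketched in a few lines. The following toy noisy-channel decoder is an illustration only: the word translation probabilities, language model scores and candidate lists are invented values, not drawn from any real corpus or from this study's data.

```python
from itertools import product

# Toy noisy-channel SMT decoder: score every candidate translation e of a
# source sentence f by p(e) * p(f|e) and keep the argmax. All probabilities
# below are invented illustrative values, not estimated from a corpus.

# Translation model p(f_word | e_word): a toy word-for-word table.
t_model = {
    ("buna", "coffee"): 0.9, ("buna", "tea"): 0.1,
    ("ise", "she"): 0.8, ("ise", "her"): 0.2,
}
# Language model p(e) over candidate target strings (toy values).
lm = {"she coffee": 0.05, "she tea": 0.01,
      "her coffee": 0.02, "her tea": 0.005}

def decode(src_words, options):
    """Enumerate candidate translations and return the highest-scoring one."""
    best, best_score = None, 0.0
    for cand in product(*(options[w] for w in src_words)):
        e = " ".join(cand)
        score = lm.get(e, 0.0)                      # p(e)
        for f_w, e_w in zip(src_words, cand):
            score *= t_model.get((f_w, e_w), 0.0)   # p(f|e), word by word
        if score > best_score:
            best, best_score = e, score
    return best

opts = {"ise": ["she", "her"], "buna": ["coffee", "tea"]}
print(decode(["ise", "buna"], opts))   # she coffee
```

Real decoders such as Moses do not enumerate all candidates but search the space with dynamic programming and pruning; the scoring principle, however, is the same product of translation-model and language-model probabilities.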
For a translation task, SMT needs source and target texts for the translation model, and target-language (monolingual) texts for the language model. These, together with the source text, enter the decoder, which carries out the translation and finally outputs the target sentences. Before all of this, the input data must be processed: the data are tokenized, truecased and cleaned. After translation comes evaluation, either automatic, for example with the Bilingual Evaluation Understudy (BLEU) metric, which correlates closely with human judgement, or by human evaluators, which is resource-intensive.
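The preprocessing steps just mentioned (tokenization, truecasing, cleaning) can be illustrated with a minimal sketch. Real SMT pipelines use the Moses scripts for this; the function names and rules below are simplified stand-ins, not the actual Moses implementations.

```python
import re

# Minimal sketch of SMT preprocessing: tokenization, truecasing and
# cleaning. Moses performs these with tokenizer.perl, truecase.perl and
# clean-corpus-n.perl; this toy version only illustrates the idea.

def tokenize(line):
    """Separate punctuation marks from words."""
    return re.findall(r"\w+|[^\w\s]", line)

def truecase(tokens, lowercase_vocab):
    """Lowercase a sentence-initial word if it usually occurs lowercased."""
    out = list(tokens)
    if out and out[0].lower() in lowercase_vocab:
        out[0] = out[0].lower()
    return out

def clean(pairs, max_len=80):
    """Drop empty sentence pairs and pairs with overlong sentences."""
    return [(s, t) for s, t in pairs
            if 0 < len(s) <= max_len and 0 < len(t) <= max_len]

src = tokenize("The spirit is willing, but the flesh is weak.")
src = truecase(src, {"the"})
pairs = clean([(src, src)])   # toy parallel pair survives cleaning
print(src[:3])                # ['the', 'spirit', 'is']
```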
3. Data dilution: statistical anomalies unique to a subset of natural language, which affect machine translation when developing MT for commercial use.
5. Word order: word order differs from language to language; for example, one language may use Subject-Verb-Object (SVO) order while another uses SOV.
EBMT, also known as "memory-based translation", translates source sentences and generates the equivalent target-language translation by point-to-point mapping. It uses examples to check whether similar sentences have been translated before and to verify the correctness of the translation. It works well with small data sets, which is its main advantage, and it is mainly used to translate between two very different languages. However, detailed linguistic analysis is impossible [9]. Even though EBMT is a good approach for avoiding manual rules in machine translation, it needs analysis and generation modules to create the dependency tree used for the translation database and for sentence analysis [2].
2.4.3 Neural machine translation (NMT)
Unlike the machine translation approaches discussed so far, the newest MT approach is neural machine translation. Just as statistical machine translation replaced traditional rule-based machine translation, the neural approach is replacing statistical machine translation.
Neural machine translation uses word-vector representations and large neural networks to predict the probability of a word sequence. It translates a sentence as a whole, something the traditional machine translation systems could not do. In NMT the input sentence passes through an encoder, which produces a representation of the meaning of the input sentence known as a "thought vector" or sentence vector; this passes through a decoder, which processes it to produce a translation. This model is known as the encoder-decoder architecture [25], as shown in Figure 2.4.
Figure 2.4: The encoder-decoder architecture (input text → encoder → decoder → translated text).
h_t = f(x_t, h_{t−1})   (1)
c = q({h_1, ..., h_{T_x}})
where h_t ∈ R^n is the hidden state at time t, and c is a vector generated from the sequence of hidden states; f and q are nonlinear functions. For instance, a Long Short-Term Memory (LSTM) network can be used as f, with q({h_1, ..., h_T}) = h_T.
The decoder is trained to predict the next word y_t given the context vector c and all the previously predicted words {y_1, ..., y_{t−1}}. In other words, the decoder defines a probability over the translation y by decomposing it into ordered conditionals:

p(y) = ∏_{t=1}^{T} p(y_t | {y_1, ..., y_{t−1}}, c)   (2)

where each conditional is modelled with a nonlinear, potentially multi-layered function g that outputs the probability of y_t, and s_t is the hidden state of the recurrent neural network (RNN).
It should be noted that, unlike the basic encoder-decoder approach of Eq. (2) above, the attention mechanism conditions the probability on a distinct context vector c_i for each target word y_i.
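As an illustration of the encoder recurrence of Eq. (1), the following sketch runs a plain tanh RNN cell over a sequence of dummy word vectors and takes the last hidden state as the context vector, i.e. q({h_1, ..., h_T}) = h_T. The dimensions, random weights and dummy inputs are arbitrary illustrative choices, not the parameters of any trained model.

```python
import numpy as np

# Minimal sketch of the encoder recurrence h_t = f(x_t, h_{t-1}), with f a
# simple tanh RNN cell. Sizes and weights are illustrative only.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                      # embedding and hidden sizes (assumed)
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def encode(xs):
    """Run the RNN over the input word vectors; return all hidden states."""
    h = np.zeros(d_hid)
    states = []
    for x in xs:                        # one step per source word
        h = np.tanh(W_x @ x + W_h @ h)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    return states

sentence = [rng.normal(size=d_in) for _ in range(5)]  # 5 dummy word vectors
hs = encode(sentence)
c = hs[-1]                              # q({h_1, ..., h_T}) = h_T
print(c.shape)                          # (8,)
```

An attention-based decoder would instead compute a weighted combination of all the states in `hs` at every output step, rather than using only the final one.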
NMT has recently been shifting to newly emerged techniques, from Convolutional Neural Networks (CNN) and Long Short-Term Memory networks to the Transformer. The RNN approaches have their own drawbacks during training, consuming much time and memory. The newer Transformer architecture with attention is increasingly playing a role in NMT, surpassing the traditional and RNN-based NMT approaches, largely because it supports parallelization.
The hybrid approach shares the characteristics of rule-based machine translation and statistical machine translation. Hybrid models differ in various ways [9]:
2. Statistics guided by rules: rules are used to preprocess the input data, which guides the statistical tool well, and also to post-process the statistical output.
The first widely used automatic MT evaluation metric was proposed by Papineni et al. [27] and remains the most widely accepted. It is calculated from the n-gram precision of the machine translation against reference translations [22].
BLEU = BP · exp(∑_n w_n · log p_n)

where p_n is the number of n-grams of the machine translation that also appear in one or more reference translations, divided by the total number of n-grams in the machine translation; w_n are positive weights; and BP is the brevity penalty, which penalizes translations for being "too short". The brevity penalty is computed over the entire corpus and was chosen to be a decaying exponential in r/c, where c is the length of the candidate translation and r is the effective length of the reference translation.
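The BLEU computation can be sketched as follows. This is a simplified sentence-level version using only unigrams and bigrams with the standard brevity penalty; production evaluation (e.g. multi-bleu.perl) works at corpus level and typically applies smoothing. The example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: clipped n-gram precision + brevity penalty."""
    weights = [1.0 / max_n] * max_n
    log_p = 0.0
    for n, w in enumerate(weights, start=1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0
        log_p += w * math.log(overlap / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_p)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(cand, ref), 3))   # 0.707
```

Here the unigram precision is 5/6 and the bigram precision 3/5, so the score is the weighted geometric mean √(5/6 × 3/5) ≈ 0.707, with no length penalty since candidate and reference have equal length.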
To conduct this research, sample data were collected from different sources. Corresponding sentences, phrases and words in both languages were identified, and the data were prepared and arranged into sentences for testing. Simple sentences were prepared manually and complex sentences were also collected to test applicability: 1,020 simple sentences were prepared manually and 1,951 complex sentences were collected, of which 414 came from the Public Procurement Directive and 1,537 from the Bible. Only a small amount of data was collected, because finding data specific to everyday life is not easy; most of the available resources came from specific domains such as the Bible and law.
The research was implemented using the SMT approach with two corpus categories, one of simple sentences and the other of complex sentences, creating two language models to support bidirectional translation: one for English and one for Amharic.
The data were divided into training and testing sets. The experimental records show a BLEU score of 82.22% for English-to-Amharic and 90.59% for Amharic-to-English, while the manual questionnaire method gave 91% accuracy for English-to-Amharic and 97% for Amharic-to-English; these figures are for simple sentences. For the complex sentence pairs, the BLEU scores were approximately 73.38% for English-to-Amharic and 84.12% for Amharic-to-English, and the questionnaire method gave 87% for English-to-Amharic and 89% for Amharic-to-English. To achieve these results, the preprocessing steps of data cleaning, tokenization and truecasing were applied.
To conduct this research, the statistical machine translation tools were installed in a Linux terminal: GIZA++, which incorporates two tools, GIZA++ itself and MKCLS, a tool used to train word classes, together with the IRSTLM toolkit. GIZA++ and IRSTLM were integrated into Moses to complete the installation.
Finally, the experimental results show how good human evaluation is, though it is time-consuming and expensive. Another observation is that translation from English to Amharic recorded lower accuracy, which Eleni attributes to the morphological richness of the Amharic language.
The objective of the English-Afan Oromo machine translation work [28] was to develop a prototype English-Afan Oromo machine translation system. The thesis covers the background of both languages, a review of selected literature and related work in machine translation, data collection and preparation, and experimentation and evaluation.
The statistical machine translation approach was applied to data collected from the Constitution of the FDRE (Federal Democratic Republic of Ethiopia), the Universal Declaration of Human Rights, proclamations of the Council of the Oromia Regional State, religious documents, and other documents.
The collected documents were divided into training and testing sets with a 90/10 split: nine-tenths for training and the remaining one-tenth for testing. 62,300 monolingual sentences were used to train the language modeling subsystem, and with a limited bilingual corpus of about 20,000 sentences a translation accuracy of 17.74% was achieved. To arrive at this result, the data were preprocessed by tokenization, truecasing and cleaning, and the SMT tools for model creation, sentence alignment and translation were installed on Linux. The data were trained and tested separately by domain; the evaluation results for the test data from the three domains were 13.69%, 1.97% and 21.72% for the medical, legal and religious domains respectively. From this it can be concluded that the BLEU score of the system depends strongly on the size of the training and testing data. Finally, Sisay recommends additional bilingual data and further experimentation in order to improve the accuracy of the system.
The objective of the bidirectional English-Afan Oromo machine translation work [7] was to develop English-Afan Oromo machine translation using a hybrid approach, which shares the benefits of both statistical and rule-based machine translation.
For this work, data were collected from the Holy Bible, the Constitution of the FDRE, the Criminal Code of the FDRE, international conventions, Megeleta Oromia and a bulletin from the Oromia health bureau. Monolingual Afaan Oromo and English corpora were also collected from certain web sites. The collected data came in many different formats not readily usable for machine translation, so they were preprocessed by tokenization, truecasing and cleaning.
A total of 3,000 English-Afaan Oromo parallel sentences were used to train and test the model: 2,900 parallel sentences for training and the remainder for testing. The statistical machine translation tools for model training and word alignment were installed on Linux. Based on the evaluation results of the experiments, Jabessa concludes that the hybrid approach performs better than the statistical approach when Oromifa is the source language and English the target.
The objective of the next study was to experiment with Amharic-Tigrigna sentence translation using the statistical machine translation approach [3]. The Bible was the only data source because no other Amharic-Tigrigna parallel corpora were available.
These data were preprocessed before the experiment, because data taken from the web cannot be used as-is for statistical machine translation; normalization, typing errors and noise in particular were handled with Perl and Python. The resulting parallel corpora contained 25,470 Amharic sentences with 3,555,993 tokens (64,259 types), averaging 14 words per sentence, and 25,470 Tigrigna sentences with 396,565 tokens (61,175 types), averaging 16 words per sentence. These data sets were divided into training, development and test sets, and were represented at morpheme, word and sentence levels. Additionally, the language model data consisted of 36,989 sentences (697,716 tokens of 112,511 types) for Amharic and 62,335 sentences (1,089,435 tokens of 109,988 types) for Tigrigna at the word level.
Since these languages are morphologically rich, conducting the experiment at both word and morpheme level is worthwhile. After training on the training, development and test splits, eight models were created, and 1,000 sentences from the same domain were used for word- and morpheme-based evaluation. The results show word-unit BLEU scores of 6.65 for Tigrigna-Amharic and 8.25 for Amharic-Tigrigna; using the morpheme as the unit, Amharic-Tigrigna scored 13.49 BLEU while Tigrigna-Amharic scored 12.93.
The objective of the next study was to develop Amharic-Arabic neural machine translation [12]. The study used data collected from the Holy Quran: a 13,501-sentence Amharic-Arabic parallel corpus was constructed by manually splitting the verses of the Quran into separate sentences, with Amharic as the source language and Arabic as the target. The data were split 80/20: 80% for training and the remaining 20% for validation and testing.
Neural machine translation needs a bilingual corpus aligned at the sentence level. It uses a neural network to train on the data, and NMT is the preferred MT approach compared with the statistical and rule-based ones.
The preprocessed data were trained using LSTM and GRU recurrent neural network (RNN) models, with the maximum sentence length set to 44, batch sizes of 80 and 40, and a learning rate of 0.001 with Adam optimization; other parameters were left at their defaults. The trained models were saved every 10,000 steps, reporting the validation accuracy and perplexity, which show how well the model is performing: the smaller the perplexity, the better the translations.
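The perplexity reported during validation can be illustrated with a small computation: it is the exponential of the average negative log-probability that the model assigns to the reference tokens. The probability values below are invented purely for illustration.

```python
import math

# Perplexity from per-token model probabilities; lower perplexity means
# the model is less "surprised" by the reference translation.
def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.95, 0.85]   # model assigns high probabilities
uncertain = [0.2, 0.1, 0.25, 0.15]
assert perplexity(confident) < perplexity(uncertain)
print(round(perplexity(confident), 2))
```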
The evaluation results of the two models were 12% for LSTM and 11% for GRU; compared against Google Translate, which scored 6% on the same data, this shows a promising future for developing machine translation for local languages with the neural approach, alongside the SMT method.
The objective of the next study was to develop Arabic-Chinese neural machine translation [29]. Both are among the most widely spoken languages in the world, with approximately 319 million Arabic and 918 million Chinese speakers, yet studies concerning this language pair are scarce. Developing an NMT system for these languages may therefore have economic, cultural and social impact.
The lack of machine translation work for these languages led the researchers to build this system to bridge that gap. The approach comprises five components: 1) data filtering, 2) morphological segmentation, 3) Arabic romanization, 4) data-driven subword units, and 5) linguistic feature integration.
Since NMT, like SMT, depends on clean data, finding suitable corpora was not an easy task; the Lewis and Zipporah approaches were used for filtering and selecting high-quality data. These approaches clean out the most problematic sentences, which increases the quality of the data in both language pairs. To implement this study, 2 million sentences were selected from the UN corpus, and filtering was applied for data ambiguity (removing diacritic marks in Arabic), bad encoding (corrupt symbols), duplicate parallel sentence pairs, repeating sentence pairs, incorrect language, and bad alignment.
The filtering removed 109,484 bad and noisy sentence pairs, leaving 1,726,170 cleaned and filtered sentence pairs with 42,592,208 Arabic and 41,884,593 Chinese words. A further 4,000 sentence pairs were set aside for validation and test data.
The experiments were made using the TensorFlow framework and GRU units, tuning some hyperparameters: word-embedding dimension 500, RNN hidden-layer size 1024, mini-batches of size 80, and a maximum sentence length of 50. The vocabulary size for input and output was set to 40K, decoding was performed with a beam size of 12, and the default hidden and embedding dropout were applied. Models were trained with the Adam optimizer, reshuffling the training corpus between epochs, with validation every 10,000 steps and checkpoints every 30,000 steps. The models were trained for approximately one month on 8 Tesla P100 GPUs, and from the saved models the four most recent were selected for the report. Finally, BLEU (multi-bleu.perl) was used to evaluate the accuracy of the models.
The objective of the final thesis reviewed here was to design bidirectional Tigrigna-English machine translation using the statistical approach [30], aiming to solve existing problems such as lack of information and the communication gap for Tigrigna speakers.
To meet this objective, data were collected from various sources and classified into five categories, and three experiments were conducted: a baseline (phrase-based machine translation) system, a morph-based system (based on morphemes obtained by an unsupervised method), and a post-processed segmented system (based on morphemes obtained by post-processing the output of the unsupervised segmenter).
The statistical machine translation approach was applied, and four models were created to handle bidirectional translation: one for English and three for Tigrigna, covering the baseline, morph-based and post-processed experiments. BLEU evaluation was conducted, and on corpus 2 the results were a BLEU score of 53.35% for Tigrigna-English and 22.46% for English-Tigrigna translation.
2.6.9 Summary
The review of related machine translation work between English and Ethiopian languages, and between Ethiopian languages themselves, shows that researchers have relied mostly on the statistical approach with limited data sets. The evaluation results of these works indicate that the accuracy of a translation system depends on the quality and size of the input data. The statistical machine translation tools Moses, IRSTLM and GIZA++ are frequently used, and the BLEU score is the standard measure of translation performance.
However, the translation approach chosen must also be taken into account, since it has its own role in achieving a better BLEU score. For instance, neural machine translation with attention is a newer approach that outperforms statistical machine translation: it works with an encoder-decoder deep neural network architecture and an attention mechanism that builds a simple word-alignment model used for source-language vector representation (encoding) and target-language decoding. The Amharic-Arabic neural machine translation work, using LSTM and GRU RNNs on a very small data set, shows the new state of the art among local NMT works, while the Chinese-Arabic neural machine translation work, using TensorFlow and GRU, does so from an international perspective. In the present study, both SMT and NMT approaches are applied to parallel corpora collected from two different sources. Both methods were chosen for their speed and cost-effectiveness in MT system development, despite their poor translation performance on data outside the domain the system was trained on. Using both approaches distinguishes this English-Hadiyyisa machine translation study from prior work between English and Ethiopian languages and between Ethiopian languages. To the best of the researcher's knowledge, this study differs from others not only in using two MT approaches but also in the language pair: no English-Hadiyyisa or Hadiyyisa-English machine translation has been studied before.
CHAPTER THREE
HADIYYISA LANGUAGE
3.1 Introduction
In this chapter we discuss general information about the Hadiyyisa language. The following subsections cover the language family Hadiyyisa belongs to, its phonology, the alphabet used to write it, its nouns, pronouns, conjunctions and adjectives, and the challenges facing the language.
There are more than 83 distinct languages spoken by the nations and nationalities of Ethiopia [6]. Hadiyyisa is one of these, spoken by the Hadiyya people in the Southern Nations regional state [13]. The majority of its speakers are found in the Hadiyya zone, located west of the Rift Valley, with a considerable number of speakers also in the Dawro, Bale and Arsi zones [16]. The World Factbook estimate of March 2018 puts the number of Hadiyyisa speakers at 1.8 million [14], speaking four dialects: Leemo, Sooro, Badawwaachcho and Shaashoogo. The language belongs to the Cushitic subfamily, which also includes Oromo, Somali, Sidamo, Afar, Agew, Gedeo and Beja, to name a few [17].
Hadiyyisa has been used as a medium of instruction in grades 1-4 and is taught as a subject in junior and senior secondary grades. It is also offered as a subject at the Teachers Training College in Hosanna, and since 2014/2015 G.C. a Hadiyyisa language department has been running at Wachamo University [16].
To give a better understanding of the Hadiyyisa language, this chapter is organized as follows. Section 3.2 discusses the orthography of Hadiyyisa, that is, how the language is written. Section 3.3 covers the phonology, its consonant and vowel sounds. Section 3.4 presents the Hadiyyisa alphabet: 23 consonants, five vowels, one glottal symbol, and seven digraphs. Section 3.5 discusses Hadiyyisa nouns, which end with the sounds <a, o, e>, including feminine and masculine gender marking. Section 3.6 discusses Hadiyyisa pronouns, which distinctly mark plural and singular, male and female. Section 3.7 covers dative pronouns such as "to/for me", from the first person singular and plural to the third person singular and plural. Section 3.8 discusses possessive pronouns such as "my, your, his" with their corresponding Hadiyyisa translations. Section 3.9 discusses Hadiyyisa verbs, which show how subjects and objects are connected, including the subject, verb and object order in both languages. Section 3.10 discusses conjunctions, and Section 3.11 the noun modifiers, adjectives. Section 3.12 briefly discusses Hadiyyisa punctuation, which is adopted from English. Section 3.13 presents some Hadiyyisa ordinal and cardinal numerals, and lastly Section 3.14 addresses the challenges of the Hadiyyisa language.
The orthography of Hadiyyisa is based on the Latin script also used to write English, and is alphabetic, with graphemes representing phonemes. The writing and pronunciation of Hadiyyisa and English are, however, not identical [13]. It is important first to discuss the phonology of Hadiyyisa, because phonology is the basis of orthography [16].
3.3 Phonology
Hadiyyisa has 23 identified consonant phonemes and five vowel phonemes. Symbols in parentheses represent consonants used in loanwords [13]. "There are six plosives /b/, /t/, /d/, /k/, /g/ and /ʔ/; five fricatives /f/, /s/, /z/, /ʃ/ and /h/; two affricates /tʃ/ and /dʒ/; four ejectives /p'/, /t'/, /tʃ'/ and /k'/; two nasals /m/ and /n/; the lateral approximant /l/ and trill /r/; and two approximants /w/ and /j/" [16]. These are shown in the following figure.
Like other Highland East Cushitic languages, Hadiyya has five vowel phonemes /a, u, i, e, o/. Each vowel has a long counterpart, and the difference in length changes meaning; vowel length is therefore phonemic in the language [16]. Table 3.1 below shows the vowel phonemes in Hadiyyisa. When these vowels occur in the medial position between consonants, their length gives different meanings [31]. Vowel length and consonant gemination are written by doubling the graphemes: for vowel length, dasa 'slower' versus daasa 'the tent' and dira 'dirty' versus diira 'fattened' are some examples.
3.4 Alphabet
To represent the consonant and vowel phonemes above, Hadiyyisa uses 33 graphemes, among them 23 consonants and five vowels. Twenty-six of the graphemes are identical to the English alphabet, and there are seven additional Hadiyyisa graphemes: the six digraphs <CH NY PH SH TS ZH> and the representation of the glottal symbol, the apostrophe {'} [13].
The Hadiyyisa alphabet has capital and small letters, like English. Some of these letters, for example "X", are not read as in English but as in the other Ethiopian languages that use the Latin script, such as Afan Oromo, Sidaama, Kambata and Gede'o. In these languages the glottal sound is likewise represented in word construction by an apostrophe {'} [13][31]. The full alphabet is shown in Appendix III.
Hadiyyisa nouns end with vowels, specifically <a, o, e>, and never with <i, u> [31]. Qashsha, Buyya, Sanna and Qooqa end with 'a'; Bu'o, Buuro, Laso, Godabo, Baado and Wi'llo end with 'o'; and Afare, Sane, Aade, Xibbe, Miine, Baalle and Biqe end with 'e'. See Table 3.2 below.
Hadiyyisa nouns also show masculine and feminine gender. For example, gannichcho, saayya, ada, are, amaayya, aayya and amaya are nouns indicating feminine gender, whereas aro'o, ambula, bula, abbaayyo, annabbaayyo, sanga and eeshimma indicate masculine. Their English meanings and Hadiyyisa forms are shown in Table 3.3.
Hadiyyisa pronouns, like English pronouns, affect the meaning of the language. A pronoun is a word used in place of a noun, and pronouns are very common in day-to-day conversation [32]. In the following, M, F and PL represent male, female and plural respectively.
English   Hadiyyisa
I         Ane / Ani
We        Neese
She       Ise
He        Ixxo
They      Issuwwi / Ixxuwa
Let us see some sentence constructions with these pronouns, inserting positive and negative forms of a given tense, in the following example with the past continuous tense.
This example shows pronouns, the past continuous tense in negative and positive forms, and plural and singular. For instance, in "An itummuya hee'ummo.", 'An' represents 'I', 'itummuya' represents 'eating', and 'hee'ummo' represents 'was'; to negate it, the suffix -yyo is added to the verb hee'ummo.
There are also dative pronouns, formed by adding the ending -ina to the pronoun or subject form; these are in addition to the pronouns described in Table 3.4. In the original dissertation everything is given in phonemic representation [31]; the English meanings and phonemic representations are shown in the table below.
3PL Ixxuwwina (?ittu-uww-ina ) ‘for/to them’
The first person singular (1SG) dative pronoun is i-ina (iina), meaning 'for/to me'. The second person singular (2SG) dative pronoun is ki-ina (kiina), 'for/to you'; ki'nena is the honorific second person singular, used for aged people or in respectful address, rendered in Amharic as "le-erso". There are also ixxena 'to him', isena 'to her', niina 'to us', issena 'to him' (honoring an elder or respected person, third person singular), and ixxuwwina 'to/for them'. Example: the Hadiyyisa sentence "Ha'i iina ki koboorta kaballina aggiisda'e" is translated as "Borrow me your coat today, please", as shown in the figure below.
Here "iina" represents "me" and "ki" represents "your", used as the dative pronouns discussed in Table 3.5 above.
Figure 3.2: Dative pronouns
Such pronouns are used as modifiers and usually occur preceding the head noun in the formation of a genitive NP. The following paradigm shows the genitive pronouns along with their attributive functions. The head noun involved in the phrase is lókko 'leg' [31].
I lokko My leg
In addition to these pronoun types, there are also ablative pronouns, such as iiniinse 'from me', kiinniinse 'from you', ki'nnuwwwiinse 'from you (plural)', and iseense 'from her'.
3.9 Verbs
Verbs express the action relating subjects and objects, and through them we know when an action occurred, that is, the tense. Table 3.8 gives examples of Hadiyyisa sentences with their English meanings, varying the verb through present, past and future tenses. Example 1 in Section 3.6 above showed the verb ite in combination with subject pronouns; this simple word changes its form based on the subject that precedes it and the tense it expresses. Most Hadiyyisa verb roots end in -e, although some words ending in -e represent nouns instead. The sample verbs shown here are in the second person singular command form only.
Afe Reach
Waare Come
Mare Go
Fire Go out
Hine Dig
Table 3.8. Hadiyyisa verbs with simple past tense and subjects.
The subject, verb and object order of Hadiyyisa is not like that of English. The table above provides simple word order examples: each Hadiyyisa sentence follows S+O+V order, whereas English uses S+V+O order.
Figure 3.3. Hadiyyisa word order
In the Hadiyyisa sentence "Ise buna aggo'o", ise (she) is the subject, buna (coffee) is the object, and aggo'o (drank) is the verb, following S+O+V order.
3.10 Conjunctions
In Hadiyya, the coordinating conjunction 'and' is marked by vowel length attached to all words in the enumeration [16]. A conjunction (CNJ) connects what comes before it with what comes after it. In the table, in adoo buuroo the ending -oo attached to each noun connects ado and buuro, meaning 'milk and butter'; in annii beetii, -ii is another conjunction, meaning 'father and son'; similarly, isee ixxoo means 'her and him'.
3.11 Adjectives
Adjectives form a class of words for which gender marking is not obligatory. Hadiyyisa can form an inestimable number of adjectives. Every adjective has a noun and a verbal counterpart with which it shares its basic phonological structure and semantic content [31]. In the figure below, in "biijaal lanchcho", biijalli is an adjective that qualifies the known noun landichcho (girl).
Figure 3.4: Hadiyyisa adjectives.
3.12 Punctuation
Punctuation in Hadiyyisa sentences is adopted from English, and punctuation marks are used in sentences just as in English, as shown in the table below.
; shiqqeen giphit mare'e semicolon
Example: -
3.13 Numerals
Hadiyyisa uses the same numerals as English. Some cardinal numerals are given in the table below.
Hadiyyisa shares the challenges that all Ethiopian languages face, particularly morphological richness and complexity of words. Because of the various inflections, compounding, clipping and blending of words, there can be plenty of meaning differences among words. For instance, by adding numerous suffixes to the simple word waare we can form many meaningful morphemes, as shown in the example below.
Even though a Latin-based writing system is currently in use, writing such words with several added morphemes is problematic for many users, because no spell checker exists. If a single letter or a glottal mark is missed, the word loses its sense entirely. Unlike for native speakers, this complexity could hinder non-natives from learning Hadiyyisa. In addition to the morphological complexity, the inaccessibility of digital resources is another severe challenge that makes Hadiyyisa a resource-scarce language.
CHAPTER FOUR
4.1 Introduction
The objective of this study is to develop a bidirectional English-Hadiyyisa machine translation system. To achieve this objective, appropriate approaches were studied and chosen, and data were collected and prepared for use in MT. In addition, suitable tools and techniques were employed.
This chapter briefly discusses how the study was carried out, divided into subsections. Section 4.2 gives a general explanation of the English-Hadiyyisa MT system architectures, with detailed explanations in its subsections. Data collection and data preparation for the experiments are discussed in sections 4.3 and 4.4, respectively. Sections 4.5 and 4.6 then discuss the experimental setups of the SMT and NMT approaches: SMT comprises language modeling, translation model training, tuning and evaluation, whereas NMT comprises preprocessing, training, translation and evaluation.
4.2 System Architecture of English-Hadiyyisa Machine Translation
The English-Hadiyyisa machine translation architecture in figure 4.1 below shows the general architecture of SMT and NMT combined. The statistical part has three main components: language modeling, translation model creation and decoding. The neural part has four interrelated components: 1) preprocessing, which differs from the general data preprocessing; 2) training on the preprocessed data; 3) translation; and 4) testing. In the diagram, input and output texts are shown as piles of sheets, the models created by each process as storage symbols, the processes as rectangles, and the flow of the directions as arrows.
For both approaches there must be source and target data sets that are preprocessed. The preprocessed data are split into training, validation and test sets. With SMT, language modeling, translation modeling and decoding then take place on these data. With NMT, the same data pass through preprocessing, which creates three PyTorch files; training, which creates the model from the PyTorch files; the translation phase, which predicts the target texts using the model together with the source test set; and finally evaluation.
This figure gives only a general overview of the two models. The detailed English-Hadiyyisa SMT and NMT architectures are discussed in sections 4.2.1 and 4.2.2 below.
To create a translated output, the SMT approach depends on bilingual text corpora [9]; no customization of linguistic rules is needed, since the system learns through statistical analysis of the text corpora. Translation is based on the probability p(e|f), the probability of translating a sentence f in the source language (SL) F (for example, English) into a sentence e in the target language (TL) E (for example, Hadiyyisa) [2].
This probability distribution is based on Bayes' theorem: if p(f|e) and p(e) denote the translation model and the language model, respectively, then p(e|f) ∝ p(f|e)p(e). The translation model p(f|e) is the probability that the source sentence is the translation of the target sentence, i.e. the way sentences in E get converted to sentences in F. The language model p(e) is the probability of seeing that TL string, i.e. the kind of sentences that are likely in the language E.
This decomposition is attractive because it splits the problem into two subproblems. The best translation is found by picking the one that gives the highest probability:
P(e|f) = P(e) × P(f|e) / P(f)    (1)
Using Bayes' rule, we can rewrite the expression for the most likely translation:
ê = argmax_e P(e|f) = argmax_e P(e) × P(f|e)
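This decision rule can be illustrated with a small Python sketch. It is a toy example only: the candidate sentences and probability values below are invented for illustration and do not come from the trained models described in this thesis.

```python
import math

# Toy illustration of the noisy-channel rule: e_best = argmax_e P(e) * P(f|e).
# All sentences and probabilities below are invented for the example.

# Language model P(e): how likely each Hadiyyisa sentence is on its own.
p_e = {
    "xumma gattaa": 0.04,
    "xumma waagalata": 0.03,
    "gattaa xumma": 0.001,   # bad word order, so a low LM probability
}

# Translation model P(f|e): how likely the English source "good morning"
# is, given each candidate Hadiyyisa sentence.
p_f_given_e = {
    "xumma gattaa": 0.5,
    "xumma waagalata": 0.4,
    "gattaa xumma": 0.5,     # same words as the first, so the same TM score
}

def best_translation(candidates):
    """Pick the candidate maximizing log P(e) + log P(f|e)."""
    return max(candidates,
               key=lambda e: math.log(p_e[e]) + math.log(p_f_given_e[e]))

print(best_translation(p_e))  # the language model breaks the tie
```

Note how the language model separates the two candidates that the translation model scores equally: this is exactly the fluency role of p(e) described above.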
Figure 4.2 below shows the architecture of the model, with the subcomponents of statistical machine translation and how it works.
Figure 4.2. English Hadiyyisa machine translation SMT architecture
As outlined in the general architecture above, this figure shows the SMT model architecture for English-Hadiyyisa machine translation. With SMT, both the bilingual and the monolingual data must be preprocessed before training, validation and testing. The preprocessed monolingual data is used for target language modeling, and the bilingual corpus is used for training. Some of the bilingual corpus is held out as a source test set used for translation; some of the target-side test set is used as a reference for evaluating the system. The translation model, together with the language model, is then used by the decoding algorithm. Decoding searches for the best sequence of transformations that translates source input texts into the corresponding target output sentences, in this case Hadiyyisa or English sentences, since this study is bidirectional. The translated target texts are checked against a reference test set to measure system performance.
Neural machine translation is the state of the art in machine translation, surpassing all existing machine translation methods. It uses word vector representations and large neural networks to predict the probability of a word sequence, and it translates whole sentences at once, which other translation approaches could not do. In NMT the input sentence passes through an encoder that captures the meaning of the input sentence, known as the "thought vector" or sentence vector, which then passes through a decoder that processes it to produce a translation. This model is known as an encoder-decoder architecture [25].
Just as other translation approaches have changed successively over time, there have also been promising improvements in neural machine translation. It was initially implemented with sequence-to-sequence RNN methods such as LSTM and GRU, and with CNNs. Later, an attention mechanism was added. Now it has been brought forward with major improvements as the "Transformer with attention". In addition to the SMT approach, this study has also been implemented with the Transformer attention mechanism.
The Transformer was first presented in the paper "Attention is all you need" [34], in 2017. It was developed to solve transduction problems (parallelization, training time, accuracy, etc.) of the existing NMT models based on LSTM, GRU and CNN. Although, like LSTM-based systems, it is an architecture for transforming one sequence into another with the help of two parts (encoder and decoder), it is completely different from sequence-to-sequence models [35]. Allowing more parallelization during computation has made it the best performing new model architecture.
This figure shows the Transformer model architecture: the left side is the encoder and the right side the decoder. Neural sequence transduction models have an encoder-decoder structure, and the Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder [34].
This model architecture fits any language pair. In this experiment English can be taken as input and Hadiyyisa as output, or vice versa, depending on which language is made the source and which the target: each language's training set passes through the corresponding embedding, i.e. the English training set through the input embedding and the Hadiyyisa training set through the output embedding.

Consider first the encoder block (section 1 below). The whole input sequence passes through at once: every word of every sentence is processed simultaneously and the input embeddings are generated at once. This is not possible in RNNs, where each word passes through in time sequence and word embeddings are created one at a time. Input embeddings represent the inputs in a computer-understandable format such as vectors, arranging similar words or vectors meaningfully in the embedding space, where each word is represented by a vector. A word in the embedding space can have different meanings depending on its position in the sentence; this is where positional embedding comes into action. Positional embeddings are vectors that add position information to the vectors in the input embedding space, so that the context of each word can be identified from its position. Sine and cosine functions are applied for this purpose, as described in section 3 below. At this point the word vector is created (input embedding + positional embedding). The word vector then passes through the encoder block with its multi-head attention and feed-forward layers. The idea of attention is described in section 2 below; it can be self-attention or multi-head attention. In self-attention, the relevance of each word to the other words of the sentence is checked. The attention layer adds attention vectors to the word vectors and passes them to the feed-forward block, which transforms them into a form usable by the next encoder or decoder layer.
Decoder block (section 1): similarly, this block accepts the output-language (Hadiyyisa) sentence at once and creates its input embeddings. A positional embedding is added and the result is forwarded to the masked multi-head attention block of the decoder, where self-attention checks how each word relates to every other word in the same sentence of the output language.
The attention vectors from the encoder block and the decoder block are then forwarded to another multi-head attention layer, the encoder-decoder attention, where the English and Hadiyyisa sentences meet and source-to-target word mapping happens. At this stage Q (the query, from the encoding of the target sentence) and K and V (the key and value, from the encoding of the source sentence) are forwarded to the multi-head attention block. The mapped vectors (every English word paired with Hadiyyisa words) are then forwarded to a feed-forward block that prepares the data for the next decoder block and for the linear layer. The linear layer is another feed-forward layer in which the vectors produced by the feed-forward block are expanded into the number of words in the target language, Hadiyyisa. Finally, the softmax layer changes the linear layer's output into a human-interpretable format: a probability distribution over the words.
Each encoder and decoder applies normalization, both batch normalization (over training samples) and layer normalization (over feature dimensions). These can be tuned before training starts, and the accuracy of the translation also depends on how we tune the training parameters.
The following sections are taken from Vaswani et al., "Attention is all you need" [34].
1. Encoder and decoder:
Encoder: The encoder is composed of a stack of N = 6 identical layers, each with two sub-layers: the first is a multi-head self-attention mechanism and the second a simple, position-wise fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization.
Decoder: The decoder is similar and is also composed of a stack of N = 6 identical layers. In addition to the encoder's two sub-layers, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
2. Attention:
Attention is a function that maps a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
So attention gives special focus only to the relevant information, rather than taking in all the information at once. Self-attention is represented by the formula Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q is the matrix that contains the queries, K represents all keys, and V the values; each symbol represents the vector representations of all words in the sequence [35].
3. Positional Encoding:
The positional encodings have the same dimension d_model as the embeddings, so the two can be summed. Sine and cosine functions of different frequencies are used for this purpose:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
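The two formulas translate directly into code. The sketch below is illustrative only, using a small d_model of 4 for readability (the actual Transformer default is 512).

```python
import math

def positional_encoding(max_len, d_model):
    """Build the sinusoidal position table:
    PE(pos, 2i)   = sin(pos / 10000**(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i/d_model))"""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=4)
# Position 0 encodes as sin(0)=0 in even slots and cos(0)=1 in odd slots.
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Each row of the table is added element-wise to the corresponding word's input embedding, which is why the two must share the dimension d_model.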
Attention (multi-head and scaled dot-product attention), softmax and embeddings, positional embedding, the position-wise feed-forward network, the optimizer, and regularization (dropout and label smoothing) are the crucial components of the Transformer model with attention.
In this section the detailed Transformer architecture, and how it works with the source and target language in each encoder and decoder block, has been addressed. In addition, the overall NMT architecture for English-Hadiyyisa built on the Transformer model is summarized in figure 4.4 below.
Figure 4.4 English Hadiyyisa NMT Architecture
During training, the chosen model architecture is applied. At the end of training there are a number of saved models with which translation can take place; training is a time-consuming activity, like the tuning phase in statistical machine translation. The model created here, together with the source test data set, predicts the corresponding target translations, and the predicted target translation is tested against the target test data set (the reference) to check the performance of the models (the system).
Once the preprocessed data were ready, they were categorized and named appropriately. All documents in the English and Hadiyyisa domains first had special characters removed using Linux commands and Moses scripts, and were then truecased; no empty lines were left. The data preprocessing activities were tokenization (inserting white space between words and punctuation), truecasing (lowercasing the data, since a single word written in both lower and upper case would otherwise be treated as two different words) and cleaning (removing long sentences and empty lines). Figures 4.5 and 4.6 show the preprocessed data.
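The three preprocessing steps can be sketched as follows. This is a simplified stand-in for the Moses tokenizer, truecaser and cleaning scripts, not their actual implementation; the maximum length of 80 tokens is an assumption for illustration.

```python
import re

def tokenize(line):
    """Insert white space between words and punctuation marks."""
    return re.sub(r"([.,!?;:])", r" \1 ", line).split()

def truecase(tokens):
    """Lowercase everything so 'Waare' and 'waare' count as one word.
    (Moses truecasing is statistics-based; this is a simplification.)"""
    return [t.lower() for t in tokens]

def clean(pairs, max_len=80):
    """Drop empty lines and overly long sentence pairs."""
    return [(s, t) for s, t in pairs
            if 0 < len(s) <= max_len and 0 < len(t) <= max_len]

src = truecase(tokenize("Good morning!"))
tgt = truecase(tokenize("Xumma gattaa."))
corpus = clean([(src, tgt)])
print(corpus)  # [(['good', 'morning', '!'], ['xumma', 'gattaa', '.'])]
```

A real pipeline would also protect the Hadiyyisa glottal apostrophe from being split off as punctuation, as described in the next paragraph.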
After preprocessing, the corpus is entirely lowercase, tokenized, and stripped of punctuation and special characters. Apostrophes were removed from the English sentences, but the apostrophe used in Hadiyyisa to mark glottal sounds was not removed.
The distribution of sample words in the corpus is shown in table 4.1 below, listing English and Hadiyyisa words with their counts in the corpus they belong to.
English word   Count    Hadiyyisa word   Count
field          273      keenim           223
turned         272      hiraagaanchi     223
same           272
The data after preprocessing are shown in the table below, for the religious domain, the educational domain, and all domains together. The religious and educational data were split 80/20: 80% of the data was used for training and 20% for validation and testing [12].
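A minimal sketch of the 80/20 split described above. Dividing the held-out 20% equally between validation and test is an assumption for illustration; the sketch also assumes the parallel sentence pairs are already shuffled.

```python
def split_corpus(pairs, train_frac=0.8):
    """Split parallel sentence pairs into train (80%), validation and test.
    The even 10/10 division of the held-out 20% is an illustrative choice."""
    n_train = int(len(pairs) * train_frac)
    held_out = pairs[n_train:]
    n_valid = len(held_out) // 2
    return pairs[:n_train], held_out[:n_valid], held_out[n_valid:]

# Hypothetical placeholder pairs standing in for the real corpus lines.
pairs = [(f"en {i}", f"hd {i}") for i in range(100)]
train, valid, test = split_corpus(pairs)
print(len(train), len(valid), len(test))  # 80 10 10
```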
4.5 SMT Experimental setup
4.5.1 Language Modeling For SMT
Language modeling is used to ensure fluent output and is modeled on the target language [6]. Since this thesis builds a bidirectional machine translation system, both languages are taken as targets interchangeably, so a language model was built for each. The model helps the translation system select words or phrases appropriate to the local context and combine them in a sequence with better word order [20]. For this study, 3-gram language models were built for both languages, removing singletons, applying smoothing, and adding sentence boundary symbols. The likelihood of a word obtained from the n-gram model can be a unigram, bigram, trigram or higher-order n-gram probability. Consider the following Hadiyyisa sentences:
Here 3 is the number of times the word waa'i occurs and 18 is the total number of words in the (sample) corpus.
For the bigram, 2 is the number of times the words waa'i and iina occur together in the corpus and 3 is the number of times the word waa'i occurs in the corpus.
And the trigram probability becomes:
where 1 is the number of times the words waa'i, iina and woda occur together and 2 is the number of times the words waa'i and iina occur together.
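The unigram, bigram and trigram estimates above are maximum-likelihood count ratios. A minimal sketch, using a small toy corpus in place of the 18-word sample and omitting the smoothing and singleton removal that SRILM applies:

```python
from collections import Counter

def ngram_probability(tokens, history, word):
    """MLE estimate P(word | history) = count(history + word) / count(history).
    An empty history gives the unigram probability count(word) / total."""
    n = len(history)
    if n == 0:
        return tokens.count(word) / len(tokens)
    ngrams = Counter(tuple(tokens[i:i + n + 1]) for i in range(len(tokens) - n))
    hists = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return ngrams[tuple(history) + (word,)] / hists[tuple(history)]

# Toy corpus standing in for the Hadiyyisa sample sentences (8 words).
corpus = "waa'i iina woda waa'i iina keena waa'i anna".split()
p_uni = ngram_probability(corpus, [], "waa'i")                 # 3 / 8
p_bi  = ngram_probability(corpus, ["waa'i"], "iina")           # 2 / 3
p_tri = ngram_probability(corpus, ["waa'i", "iina"], "woda")   # 1 / 2
```

The bigram and trigram ratios here (2/3 and 1/2) match the counts worked through in the text; the unigram denominator differs only because this toy corpus has 8 words rather than 18.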
The model is trained on monolingual data, and there are free tools for this purpose; in this research work SRILM is used. It does not need much memory and supports three output formats: ARPA, qARPA and iARPA [20]. To speed up loading of the model, the file is binarised (.blm). The model created here also checks whether given inputs are covered by the model, calculates perplexity, reports out-of-vocabulary (OOV) words and displays the number of tokens.
Because of data shortage, separate monolingual Hadiyyisa data could not be used; instead, data from the bilingual parallel corpus was used to build the language model.
This is the main part of translation modeling. At this step, word alignment (using GIZA++), phrase extraction and scoring, the creation of lexicalized reordering tables, and the Moses configuration file (moses.ini) with weights for translation accuracy optimization are carried out. Model training is also logged in the train.out file.
Tuning is the slowest process. In this experimentation, tuning took from one to eight hours, depending on the data domain. The religious domain constitutes more than 83% of the total corpus, so it was a very slow step, as was the all-domain data set; only the educational domain was faster.
To tune the models, a small parallel validation set of source and target data is needed; it must be tokenized, cleaned and truecased. The moses.ini file together with the validation set is given to mert-moses.pl, and during this process further versions of moses.ini are created. Finally, a filtered moses.ini with the phrase table, input information and reordering table is created. So, to get a translation model with better weights, the system must be tuned.
During translation, the validation data is translated into the appropriate target language; see the screenshot of the predictions.
Figure 4.7. During tuning source validation tests being translated to target
The input sentence is English and the target sentence is Hadiyyisa. The search took 0.178 s, the decision to display the translation 0.00 s, additional reporting 0.008 s, and translation 0.0187 s.
During tuning, when predictions are made, one sentence can have many best-fit target predictions. Unknown vocabulary items are handled as UNK. The step after tuning is testing; we will see it in chapter 5.
Now the model is ready to translate the inputs we give into the target language, here from English to Hadiyyisa. To run the model and see the translation, we run the command in a terminal; the terminal then displays an interface to enter the input and shows the translation, as in the figure below. Example: for the input sentence "good morning", the expected output is "Xumma gattaa" or "Xumma waagalata".
For the input Hadiyyisa sentence "wa'im lama'l balla usheexxukko", the expected output is "God rested on the seventh day" or something similar.
The output "the seventh day he rested" is displayed. Here translation took less time than English to Hadiyyisa. It is shown in appendix I.
For SMT, Linux (Ubuntu 16.04.4) on a laptop with 4 GB of RAM and a Core i5 processor was used. The statistical machine translation tools GIZA++ (word alignment) and SRILM (model estimation) were installed on Ubuntu, with Moses used for decoding. LibreOffice Calc was used for recording data in Linux, a text editor to arrange, prepare and view data, MS Excel 2013 for parallel data preparation, and LibreOffice Writer to record the experimental results.
Data collection, cleaning, tokenization and truecasing were covered in sections 4.2 and 4.3. The preprocessed data are trained and tested with the neural translation approach. To do this, files with appropriate names and the .txt extension were prepared. Since neural machine translation needs parallel aligned data, with each sentence pair on one line, the data were aligned and checked manually. In NMT the data sets are classified as source train, source validation, source test, target train, target validation and target test, and the same 80%/20% corpus distribution was applied.
4.6.1 Preprocessing
From the given data, this step also creates some features, such as the source and target vocabularies. The vocabularies created for all data were 50,004 entries for Hadiyyisa and 15,897 for English. Finally, the shards, the divided smaller parts of the data, are also created (only one shard in this study).
To carry out this and the following steps, the data were uploaded to Google Drive. A Google Colab notebook (Jupyter) must also be ready, and OpenNMT-py with its PyTorch backend must be installed in it.
4.6.2 Training
This NMT step is where we select the model architecture that fits; for this experiment the Transformer model with attention was selected. Code for the Transformer model is available online5. For the training step some parameters should be customized, such as the learning rate, the batch size (the number of samples from the training set taken per training step), the number of GPUs, the validation steps, etc. In this experiment the data were trained with batch sizes of 2048 (baseline) and 4096, the default batch sizes for the Transformer model [36], a single GPU (-gpu_rank 0), and learning rate 2 (the default for the Transformer model; its value fluctuates during training through the optimizer settings "-optim adam -adam_beta2 0.998 -decay_method noam"). Training ran for 100,000 steps, with validation every 1,000 steps and a checkpoint saved every 3,000 steps.
With those parameters and the others associated with them, the scripts run, taking the inputs created during preprocessing (the PyTorch train, vocabulary and validation files). During training the system calculates validation accuracy and validation perplexity, which show how accurate the models being created are: the higher the accuracy, the better the model being created. Perplexity indicates how well the model predicts the sample; the smaller the perplexity, the better the model performs. The system saves a checkpoint every 3,000 steps; nearly 10 different checkpoints were saved and tested in this study. During training, a report is displayed every 50 steps with information on how well the models are performing.
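The reported validation perplexity relates to the model's per-token probabilities as perplexity = exp(average negative log-probability). A minimal sketch with invented probabilities, just to show why a lower value means a better model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp( -(1/N) * sum(log p_i) ) over the tokens' probabilities.
    The better the model predicts each token, the lower the perplexity."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = [0.9, 0.8, 0.95, 0.9]   # model assigns high probability
uncertain = [0.1, 0.2, 0.05, 0.1]   # model is essentially guessing
print(perplexity(confident))  # close to 1: good model
print(perplexity(uncertain))  # large: poor model
```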
The important foundations on which a Transformer model is built (input and output embeddings, MultiHeadedAttention, linear keys, softmax with its dimensions and dropouts, the PositionwiseFeedForward network, LayerNorm, etc.) were observed during training. These are dealt with in figure 4.3 above and in the paper "Attention is all you need" [29]. See the training step screenshots in appendix II.
5 https://opennmt.net/OpenNMTpy/.
Training is a very slow process, like the tuning step in SMT. Training the small educational data set took a minimum of three hours; the larger bible domain and the all-in-one data set took 9 hours. Training from English to Hadiyyisa was very slow, while Hadiyyisa to English was the opposite. Models were created every 3,000 steps and were so large that Google Drive could not store more than 12 of them, since Google Drive provides only 15 GB; the size of the models depends on the batch size we tune and the amount of data being trained. Google Colab also grants access for only 12 hours, sometimes less: when the parallel tests were being checked against the models, Google Colab limited the access and disconnected the graphics processing units (GPU) and the 76 GB of runtime storage. According to observations made during this time, training the same data with the same parameters at different times results in different BLEU scores, so training was repeated again and again and the best result was taken.
4.6.3 Translation
After the training step is over at some checkpoint, the succeeding step is translation, which predicts the best possible translations by taking a model (for instance model_step_16000) together with the source test data (src-test.txt). The best predicted translation is then used to evaluate the performance of that model.
This translation attempt comes somewhat close to the Hadiyyisa translation but lacks exactness. This is because of the data size: the smaller the data, the poorer the result.
NMT basically operates with an input vocabulary of 30,000 to 50,000 words [29], but the input vocabulary used here for English was only 15,897.
4.6.4 Testing
After the translation of each checkpoint is over, we test the model's performance by comparing the target test data (tgt-test.txt) with pred.txt, the output of the translation step. The test results of each checkpoint on all data sets are presented in chapter 5. One of the specific objectives of this study was to show a prototype of the proposed model, so the prototype with some sentences is shown below, e.g. the Hadiyyisa sentence "qooccanchchi aganna" to English: "story of creation".
Example 1: Input
Output:
Example: input
Output:
The observations show that translation from Hadiyyisa to English is faster and performs better than English to Hadiyyisa. The prediction perplexity of English to Hadiyyisa is greater than that of Hadiyyisa to English; the prediction perplexity (a measure of how well our model is predicting) should be small. The average prediction score is also very small for English to Hadiyyisa.
For neural machine translation, OpenNMT-py, an open-source neural machine translation platform with a PyTorch backend, was used. Google Colab, a free online application, provides a Jupyter notebook interface in which OpenNMT-py and its instruments were installed; it also provides a free GPU, runtime storage and Python 3.6.
CHAPTER FIVE
5.1 Introduction
In this chapter the performance evaluation results and the reasons behind them are discussed. Section 5.2 discusses the system performance evaluation and section 5.3 explains the corpus used for training, testing and evaluation. The evaluation results of both approaches are discussed in sections 5.4 and 5.5, and the reasons behind the performance of the system in section 5.7.
System performance was evaluated with the bilingual evaluation understudy (BLEU) score, which is inexpensive and quick [27]. The BLEU score was used in this study for both the statistical and the neural machine translation systems in both directions. The experiments with each approach on each data set were tested and recorded separately.
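BLEU works by matching n-grams between the candidate translation and the reference. The following simplified single-reference, sentence-level sketch illustrates the idea; the multi-bleu.perl script used in this study computes a corpus-level score and differs in detail.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty; returns 0.0 if any precision is zero."""
    cand, ref = candidate.split(), reference.split()
    max_n = min(max_n, len(cand), len(ref))
    precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(cand, n) & ngrams(ref, n)).values())  # clipped
        if overlap == 0:
            return 0.0
        precisions.append(overlap / sum(ngrams(cand, n).values()))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# An exact match scores 1.0; sharing only "the seventh day" yields a low,
# non-zero score once trigrams (rather than 4-grams) are the longest unit.
assert bleu("xumma gattaa", "xumma gattaa") == 1.0
score = bleu("the seventh day he rested", "god rested on the seventh day", max_n=3)
```

This zero-on-no-match behaviour is why sentence-level BLEU on sparse data collapses to 0.00, the effect noted later in this chapter.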
5.3 Corpus
The corpus collected contains two categories. As already discussed in chapter four, the first, from the religious domain, comprised 33,880 bilingual parallel sentences with vocabulary sizes of 50,002 for Hadiyyisa and 12,803 for English. The second category, the educational domain, comprised 6,610 bilingual parallel sentences aligned one per line, with vocabulary sizes of 13,442 for Hadiyyisa and 5,057 for English. The experiments were conducted in three phases: first each domain was evaluated separately, and finally all data were merged into one and the experiment repeated; this all-in-one category has 50,004 Hadiyyisa and 15,897 English vocabulary entries. In each case 80% of the data was used for training and 20% for validation and testing [12], for both approaches, i.e. SMT and NMT.
Table 5.1. Corpus distribution
Data in the religious domain constitutes above 83% of the whole data; it was divided into training and test sets as mentioned in section 5.3 above. The results for English to Hadiyyisa and Hadiyyisa to English, based on this data distribution, are shown in table 5.2 below. The BLEU report for the religious domain shows 3.58% for Hadiyyisa to English (in multi-bleu.perl) and 2.16% for English to Hadiyyisa on the same data. The educational domain evaluation gives 0.68% for Hadiyyisa to English and 0.33% for English to Hadiyyisa, and the all-in-one evaluation 3.29% for Hadiyyisa to English and 2.07% for English to Hadiyyisa. See appendix I.
The evaluation result also depends on the amount of corpus, though the all-in-one and religious domain results could not confirm this; it is not possible to conclude this confidently from such a small amount of data. The religious data constitutes 83% of the total data, which is why there is not much difference. There was also no separate monolingual data for language modeling; instead, only the 80% training portion of each bilingual data set was used. Since BLEU counts n-grams (unigrams, bigrams and trigrams) when comparing the translated output with the test set, if there is no matching n-gram between the test set and the translated output, the BLEU score will be 0.00.
As discussed in the previous chapter, conducting neural machine translation requires tuning
several hyperparameters, such as the batch size, the learning rate and optimizer, and the RNN
size. The Transformer approach comes with well-chosen default parameters, but to look for
better results the Transformer model was run with its default parameters while varying the
batch size between the recommended defaults of 2048 and 4096 [36]. The data were
categorized as explained in the SMT evaluation section. The test results of the trained
models are shown in the tables below.
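For illustration, the batch-size experiments above could be driven by an OpenNMT-py YAML configuration along these lines (a hypothetical sketch: all file paths and values are placeholders, not the exact settings used in this study):

```yaml
# Hypothetical OpenNMT-py configuration sketch for a Transformer run
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
  corpus_1:
    path_src: data/train.en
    path_tgt: data/train.had
  valid:
    path_src: data/valid.en
    path_tgt: data/valid.had
encoder_type: transformer
decoder_type: transformer
batch_type: tokens
batch_size: 4096        # the study compared 2048 vs. 4096 token batches
optim: adam
learning_rate: 2.0
decay_method: noam
warmup_steps: 8000
```

Changing `batch_size` between 2048 and 4096 reproduces the two experimental conditions reported in Tables 5.3 and 5.4.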
Table 5.3. Neural machine translation English to Hadiyyisa, batch size 4096 (all domain)
The models created achieve an accuracy of roughly 99% and above, which shows the
training epochs are functioning well. Their prediction perplexity (a measure of how well the
model predicts a sample) is very small, and the validation accuracy measures performance
on data unseen during training.
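For intuition, perplexity can be computed from the probabilities a model assigns to each reference token (a generic illustration of the metric, not OpenNMT's exact reporting code):

```python
import math

def perplexity(token_probs):
    """Perplexity over a sequence: exp of the average negative log-probability
    the model assigns to each reference token. Lower is better; a perfect
    model (all probabilities 1.0) scores exactly 1.0."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([1.0, 1.0, 1.0]))      # 1.0 -> perfect prediction
print(perplexity([0.5, 0.25, 0.125]))   # 4.0 -> the model is more "surprised"
```

A perplexity near 1.0, as reported in the tables, therefore means the model assigns high probability to the held-out tokens.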
Table 5.4. Neural machine translation English to Hadiyyisa, batch size 2048 (all data)
The BLEU score depends on factors such as hyperparameter optimization (batch size and
learning rate, for instance), data size, model size, and training time [36]. The Transformer
model tuned with these parameters was applied to this translation task, and the BLEU scores
for both batch sizes shown in these tables are broadly consistent with this. With the data size
used in this study, however, it is not easy to reach a firm conclusion, since neural machine
translation performs well at larger data sizes [36] [29]. Experimenting on this limited corpus
while varying the batch size, the maximum BLEU score recorded is 1.47, as computed by
multi-bleu.perl.
Table 5.5. Neural machine translation Hadiyyisa-English by batch size 4096 (all data)
Table 5.6. Neural machine translation Hadiyyisa-English by batch size 2048 (all data)
The same data distribution used for English to Hadiyyisa was used here as well. The models
seem to perform better; in particular, the validation accuracy, calculated at the end of every
1000 steps, looks fine. The BLEU scores recorded for the two batch sizes show a slight
variance. We can conclude that Hadiyyisa to English performs better than English to
Hadiyyisa, because Hadiyyisa is morphologically rich, as was observed when evaluating the
same data with the statistical approach in the previous section. As a result, a maximum
BLEU score of 3.81 was recorded.
In summary, the evaluation results for each data domain and approach (BLEU, SMT / NMT;
see Fig 5.3) are:

Domain        English-Hadiyyisa    Hadiyyisa-English
All data      2.07% / 1.47%        3.29% / 3.81%
Religious     2.16% / 1.47%        3.58% / 3.59%
Educational   0.33% / 0.31%        0.68% / 0.83%
5.7 Discussion
This section discusses the reasons behind the system's performance; most of the points were
already raised in the results section above. The evaluation results vary across the data
domains, and they vary not only with the domain but also with the approach and the
direction of translation. The results of both approaches on the smaller data set (educational)
versus the larger one (religious) show that the performance of the system depends on data
size: the larger the data, the better the evaluation result. However, in the SMT tests the
all-data results are slightly lower than the bible-domain results, while the NMT tests show
no such difference. This is because combining domains confuses the system, and the
religious data makes up the larger portion of all training data. Similarly, Hadiyyisa to
English performs better than English to Hadiyyisa because of the morphological richness of
the Hadiyyisa language. Finally, because of the small data size, NMT could not outperform
SMT, and the lack of a larger corpus strongly affected the BLEU scores in this study.
CHAPTER SIX
6.1 Conclusion
The purpose of this study was to develop a bidirectional English-Hadiyyisa machine
translation system. To do so, a small amount of bilingual data was collected from different
sources and categorized into two main domains, bible (religious) and educational. Using two
machine translation approaches, statistical and neural, bidirectional experimentation was
conducted on the preprocessed corpora.
At the beginning of this study, the Hadiyyisa language was discussed: its background,
phonology, orthography, nouns, pronouns, adjectives, conjunctions, numerals, and the word
order of Hadiyyisa and English sentences.
Additionally, the literature was reviewed, including related local and international studies.
Language models for both languages were created for both translation approaches; a
language model can be created during corpus training. The IRSTLM tool was used to create
the language model for SMT, whereas the well-tuned Transformer model was used in NMT.
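As a rough illustration of the kind of n-gram model a tool like IRSTLM estimates from the training split, here is a minimal bigram language model with add-one smoothing (a toy sketch, not the IRSTLM implementation):

```python
from collections import Counter

class BigramLM:
    """Minimal add-one-smoothed bigram language model: a toy stand-in for
    the n-gram models IRSTLM builds from the training data."""
    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        self.vocab = set()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]  # sentence boundary markers
            self.vocab.update(toks)
            self.unigrams.update(toks[:-1])        # history counts
            self.bigrams.update(zip(toks, toks[1:]))

    def prob(self, prev, word):
        # add-one (Laplace) smoothing gives unseen bigrams nonzero mass
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))

lm = BigramLM(["the cat sat", "the dog sat"])
print(lm.prob("the", "cat") > lm.prob("the", "sat"))  # True: "the cat" was seen
```

Real toolkits use higher-order n-grams and stronger smoothing (e.g. Kneser-Ney), but the principle of scoring fluency from counted word sequences is the same.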
The models created were eventually evaluated with BLEU. The recorded results per domain
were as follows. For the statistical approach, the English to Hadiyyisa scores were 2.07%,
2.16%, and 0.33% on the all-data, bible, and educational domains respectively, and the
Hadiyyisa to English scores were 3.29%, 3.58%, and 0.68% on the same domains. For the
neural approach, the English to Hadiyyisa scores were 1.47%, 1.47%, and 0.31%, and the
Hadiyyisa to English scores were 3.81%, 3.59%, and 0.83%, again for the all-data, bible,
and educational domains respectively. These results show that Hadiyyisa to English
performs somewhat better than English to Hadiyyisa in every domain and under both
approaches. They also show that the BLEU score depends on corpus size, although the
all-data and religious-domain scores could not confirm this, being nearly identical; this is
because more than 83% of the data came from the bible domain, and combining two
different data sources confuses the system. With the small data set of this study it is not easy
to conclude that NMT performs better than SMT: NMT is certainly a strong approach, but it
needs a large amount of data.
The system prototypes were run to observe whether the trained models translate user inputs.
The system translates sentences exactly when the input is similar to the training data;
otherwise it produces a translation that approximates the intended meaning, and when no
matching words exist at all it produces output that does not match. This answers the research
question "To what extent will Hadiyyisa-English machine translation work?". As for the
research question "How will an English-Hadiyyisa machine translation system contribute a
vital role for the under-resourced language Hadiyyisa?", this study has taken the first step;
adding data from various additional sources and conducting further studies will answer it
fully.
6.2 Recommendation
The main reason the evaluation results fell below expectations was the data size. With a very
small amount of data, an MT system of good quality cannot be expected, so a larger, clean,
good-quality corpus should be prepared. Based on the observations of this study, some
additional recommendations are also raised:
- During experimentation, a single Hadiyyisa word was sometimes counted as two or
more vocabulary entries because of spelling errors. It would therefore be better to
develop a spell checker in future research.
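A minimal building block for such a spell checker could be edit distance against the corpus vocabulary (an illustrative sketch; the vocabulary and function names shown are hypothetical):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b (rolling-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete from a
                                     dp[j - 1] + 1,    # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def suggest(word, vocabulary, max_dist=2):
    """Return in-vocabulary words within max_dist edits of `word`."""
    return sorted(v for v in vocabulary if edit_distance(word, v) <= max_dist)

vocab = {"hadiyyisa", "english", "translation"}
print(suggest("hadiyisa", vocab))  # ['hadiyyisa'] -- one letter missing
```

Normalizing misspelled tokens to a single vocabulary entry this way would reduce the artificial vocabulary inflation observed during experimentation.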
- The duration of training affects the evaluation result. Training the data on a local
machine with OpenNMT-py installed, running for more than 12 hours, is therefore
recommended.
References
[1] K. Ela, Natural Language Processing, New Delhi: I.K. International Publishing House,
2011.
[3] M. Michael and M. Million, "Experimenting Statistical Machine Translation for Ethiopic
Semitic Languages: The Case of Amharic-Tigrigna," Bahir Dar and Addis Ababa,
Ethiopia, 2017.
[4] M. Melese, T. Solomon, Y. Martha and M. Million, "Parallel Corpora for bi-lingual
English-Ethiopian Languages Statistical Machine Translation," in Proceedings of the
First Workshop on Linguistic Resources for Natural Language Processing, Santa Fe;
New Mexico; USA, 2018.
[7] D. Jabesa, "Bidirectional English – Afaan Oromo Machine Translation Using Hybrid
Approach," 2013.
[8] H. W.John, Concise history of the language sciences: from the Sumerians to the
cognitivists., Pergamon Press, 1995, pp. 431-445.
[9] A. Ballabh and C. Umesh, "Study of Machine Translation Methods and their
Challenges," International Journal of Advanced Research in Science and Engineering,
2015.
[10] F. Sisay and A. Saba, "Machine Translation for Amharic: Where We Are," pp. 47-49,
2006.
[11] G. Björn and A. Lars, "Experiences with Developing Language Processing Tools and
Corpora for Amharic," 2010.
[13] S. Biniyam and J. Janne, Multilingual Ethiopia: Linguistic Challenges and Capacity
Building Efforts, Oslo, 2016.
[18] N. Satoshi, "Overcoming the Language Barrier with Speech Translation Technology,"
QUARTERLY REVIEW, pp. 35-47, 2009.
[19] F. Sisay and A. Saba, "Machine Translation for Amharic: Where We Are," pp. 47-49,
2006.
[22] C. Mohamed Amine, "Theoretical Overview of Machine translation," Proceedings
ICWIT 2012, 2012.
[24] L. Schwartz, "The History and Promise of Machine Translation," Creative Commons
Attribution 4.0 License, pp. 13-15, 2016.
[25] B. Pranjali, H. Hemant and K. Shivani, "Survey on Neural Machine Translation for
multilingual," in Proceedings of the Third International Conference on Computing
Methodologies and Communication (ICCMC 2019), 2019.
[26] D. Bahdanau, K. Cho and Y. Bengio, "Neural Machine Translation by Jointly Learning
to Align and Translate," pp. 1-11, 2014.
[27] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, "BLEU: a Method for Automatic
Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL), Philadelphia, 2002.
[33] K. Knight, A Statistical MT Tutorial Workbook, prepared in connection with the JHU
summer workshop, 1998.
[34] A. Vaswani, N. Shazeer, N. Parmar and J. Uszkoreit, "Attention Is All You Need," in
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach,
CA, USA, 2017.
[36] M. Popel and O. Bojar, "Training Tips for the Transformer Model,"
arXiv:1804.00247v2, vol. 2, pp. 1-24, 2 May 2018.
2. Test result
Figure VI. Educational Hadiyyisa to English
Figure VII. All in one Hadiyyisa to English
Figure VIII. All in one English to Hadiyyisa
Figure IX. Neural preprocessing
2) Training. The second step, the training phase, comes after the features are created in the
preprocessing phase. The training model information is displayed here: the embeddings,
RNN level, encoders and decoders, optimization functions, positional embeddings, etc.
Figure XI. English to Hadiyyisa
4) Prototype, by sending input sentences to the created model.
5) Evaluation
Figure XVII. Bible domain English-Hadiyyisa
Figure XVIII. Neural Evaluation all data English-Hadiyyisa
LETTER (LARGE/SMALL)   NAME/SOUND
Aa [a]
Bb [ba]
Cc [t∫’a]
CH ch [t∫a]
Dd [da]
Ee [e]
Ff [fa]
Gg [ga]
Hh [ha]
Ii [i]
Jj [dʒa]
Kk [ka]
Ll [la]
Mm [ma]
Nn [na]
NY ny [ɲa]
Oo [o]
Pp [pa]
PH ph [p’a]
Qq [k’a]
Rr [ra]
Ss [sa]
SH sh [∫a]
Tt [ta]
TS ts [s’a]
Uu [u]
Vv [va]
Ww [wa]
Xx [t’a]
Yy [ya]
Zz [za]
ZH zh [ʒa]
‘ (no allograph) [ʔa]
Appendix IV: sample corpus
10 God called the dry ground "land" and the gathered waters he called "seas." And God
saw that it was good.
10 Waa'im wo'i bee'i beyyo «Uulla» yaa weeshukko; odim wixxaa yoo wo'om
«Dambalaqa» yaa weeshukko; oo issu luwwi hundim eran ihukkisa moo'ukko.