Postgraduate Office
MSc Program
By Tarekegn Yohannes
July 2020
Declaration
I, Tarekegn Yohannes, the undersigned, declare that this thesis entitled “Bi-directional
English-Hadiyyisa Machine Translation” is my original work. I have undertaken the research
independently with the guidance and support of my research advisor. This study has not
been submitted for any degree or diploma program in this or any other institution, and all
sources of materials used for the thesis have been acknowledged.
Declared by
Name_________________________________
Signature: _____________________________
Department ____________________________
Date _________________________________
Certificate of Approval of Thesis
Admas University
This is to certify that the thesis prepared by Tarekegn Yohannes, entitled “Bi-directional
English-Hadiyyisa Machine Translation” and submitted in partial fulfilment of the requirements
for the Degree of Master of Science in Computer Science (MSc), complies with the regulations
of the University and meets the accepted standards with respect to originality and quality.
Table of Contents
Abbreviations...............................................................................................................................viii
Acknowledgement..........................................................................................................................ix
Abstract............................................................................................................................................x
CHAPTER ONE..............................................................................................................................1
INTRODUCTION...........................................................................................................................1
Introduction..................................................................................................................................1
1.1 Background............................................................................................................................1
1.3 Objective................................................................................................................................5
CHAPTER TWO.............................................................................................................................9
LITERATURE REVIEW...................................................................................................9
2.1 Introduction............................................................................................................9
2.4.2 Corpus-Based Machine Translation Approach (CBMT)..............................................16
2.4.3 Neural Machine Translation (NMT)................................................................17
2.4.4 Hybrid Machine Translation (HMT).............................................................................20
2.5 Evaluation of Machine Translation Systems.......................................................................20
CHAPTER THREE.......................................................................................................................29
HADIYYISA LANGUAGE..........................................................................................................29
3.1 Introduction..........................................................................................................................29
3.3 Phonology............................................................................................................................30
3.4 Alphabets.............................................................................................................................31
3.9 Verbs....................................................................................................................................37
3.11 Adjectives...........................................................................................................................39
3.12 Punctuations.......................................................................................................................40
3.13 Numerals............................................................................................................................41
CHAPTER FOUR.........................................................................................................................43
SYSTEM DESIGN AND DEVELOPMENT................................................................................43
4.1 Introduction..........................................................................................................................43
4.2.2 System Architecture of English-Hadiyyisa (NMT).......................................................47
1. Encoder and Decoder Stacks...........................................................................................50
2. Attention:........................................................................................................................50
4.3 Data collection.....................................................................................................................53
4.6.1 Preprocessing.................................................................................................................60
4.6.2 Training.........................................................................................................................60
4.6.3 Translation.....................................................................................................................61
4.6.4 Testing...........................................................................................................................62
4.6.5 Prototype of the System................................................................................................62
4.6.6 Tools for NMT..............................................................................................................63
CHAPTER FIVE...........................................................................................................................64
DISCUSSION OF THE RESULT.................................................................................................64
5.1 Introduction..........................................................................................................................64
5.3 Corpus...................................................................................................................64
5.7 Discussion............................................................................................................................72
CHAPTER SIX..............................................................................................................................73
CONCLUSION AND RECOMMENDATION............................................................................73
6.1 Conclusion...........................................................................................................................73
6.2 Recommendation.................................................................................................................74
References..................................................................................................................................75
Appendix: III..............................................................................................................................82
List of Tables
List of Figures
Abbreviations
Acknowledgement
Many people contributed, directly and indirectly, unforgettable assistance to this research
work. I could not have finalized this study without their help.
Firstly, my advisor Dr. Michael Melesse relentlessly guided me from the beginning to the
finalization of this work. I cannot forget his sincere help whenever I faced confusion. Indeed,
he created in me the enthusiasm that I can do it.
I would like to thank my family for their constant encouragement. Their prayer and patience
for my success are beyond my words. May the Lord bless them all!
My God, Lord of heaven and earth, who is my shield and all my strength, exalted be His name.
He answered my prayer. He gave me full health. He led me through ways that were too difficult
for me. Worthy is He to receive glory, thanks, power, blessings and every praise. Amen!
Abstract
Machine translation is one of the application areas of natural language processing (NLP), in
which computers are used to translate from a source language to a target language. Because of
the need to share information available in resource-rich languages, there is a great demand for
translation. The purpose of this study is therefore to develop a bi-directional English-Hadiyyisa
machine translation system.
To implement this study, statistical and neural machine translation approaches were applied.
The data were collected from two different sources and organized into two categories: a
religious and an educational domain. These data were preprocessed into a format suitable for
machine translation. During experimentation, both approaches were applied to the two data
sets separately; a third experiment was conducted on the combined data set. For statistical
machine translation (SMT), the Moses decoder was used. Given a source-language sentence,
Moses searches for the highest-probability target-language sentence, based on the parallel
corpus. The same parallel corpora were then used with a neural machine translation (NMT)
method, the Transformer model with attention, which enhances parallelization during training.
The model uses self-attention and feed-forward networks in each encoder and decoder layer,
taking the source and target corpora as input. All experiments were conducted in both
directions because the translation is bidirectional. The models produced during training were
evaluated with the bilingual evaluation understudy (BLEU) score.
The experimental results of each approach were recorded separately. On the combined data
set, the English-to-Hadiyyisa BLEU scores are 2.07% for SMT and 1.47% for NMT, and the
Hadiyyisa-to-English scores are 3.29% and 3.81% respectively. On the Bible domain, English
to Hadiyyisa scores 2.16% (SMT) and 1.47% (NMT), and Hadiyyisa to English scores 3.58%
(SMT) and 3.59% (NMT). On the educational domain, English to Hadiyyisa scores 0.33%
(SMT) and 0.31% (NMT), and Hadiyyisa to English scores 0.68% (SMT) and 0.83% (NMT).
According to these results, Hadiyyisa to English performs slightly better than English to
Hadiyyisa. Finally, future experimentation with a larger data set is recommended to improve
these scores.
Keywords: Machine Translation, Neural Machine Translation, Statistical Machine Translation,
Moses, Transformer model with attention.
CHAPTER ONE
INTRODUCTION
Introduction
This chapter gives general information about the Bi-directional English-Hadiyyisa machine
translation study. The background of the study, which highlights natural language processing,
machine translation (MT) and the Hadiyyisa language, is discussed first. This is followed by
the statement of the problem on which the study is based, the objectives of the study, the
methodology followed, the scope and limitations of the study, the significance of the study,
and the organization of the paper.
1.1 Background
Since the emergence of machine translation with Warren Weaver's influential work [1],
Natural Language Processing (NLP) has been an area of huge demand for information sharing
[2]. The ability of computers (machines) to understand human language, together with the
advancement of ICT and the internet, has contributed to the ever-increasing development of
NLP [3] [4]. NLP is a subfield of artificial intelligence and linguistics whose goal is to achieve
human-like language processing by analyzing and understanding texts, based on theories and
technologies for communication [5].
With this in mind, natural language processing has several applications: text simplification,
text-to-speech synthesis, word sense disambiguation, sign language processing, automatic
speech recognition, speech-to-text automation, text categorization, sentence boundary
detection, optical character recognition (OCR), natural language generation (NLG), question
answering, etc. [1] [6] [7]. Machine translation is one of the areas where NLP is most widely
applied.
Machine translation (MT) refers to computer-aided systems that produce a natural language
translation from one language (the source language) into another (the target language) with
minimal or no human intervention, so as to create the best possible translation [8] [4]. It is
more than word-for-word substitution: the source text must be translated while preserving the
grammar, syntax and semantics of the target language [3]. Translations can be speech-to-text,
text-to-speech, text-to-text or of other types. Even though machines are capable of translating
human languages, no machine is like a human [9]. So linguists sometimes have to intervene to
edit anomalies in machine-created translations.
To accomplish the machine translation task, researchers use different types of translation
approaches. These are corpus-based machine translation, rule-based machine translation
(RBMT), and hybrid approaches [2]. RBMT builds linguistic rules based on morphological,
syntactic and semantic information related to the source and target languages [9]. It includes
interlingua, dictionary-based and transfer-based approaches. The corpus-based approach
generates translations from bilingual text corpora; statistical, neural and example-based
machine translation are categorized as corpus-based machine translation. The hybrid method
[2] is an advanced method that combines features of the corpus-based and rule-based
approaches.
There are around 7,097 living languages in the world [3]. Understanding these languages
without a translator is not easy, and human translation is resource-intensive, so machine
translation is needed in order to share scientific, educational, legal, web and other resources
found in specific languages such as English. About 71% of web pages are in English, and
German, Japanese, French and Spanish are among the other most common languages [10].
The continent of Africa contributes 30% of these 7,097 languages, which are resource-scarce,
including the Ethiopian languages [3]. As a result of language barriers, most people in the
world are not able to use resources on the internet in their own language [11]. Using the
resources that reside in the most common languages, like English, through human translation
is not an easy task: human experts who know each element of a language, and how the words
affect the meaning of the translation, are costly. So there is a need for machine translation
using different types of machine learning approaches [12] [7], especially for developing
countries like Ethiopia, with more than 80 different languages.
Hadiyyisa is one of Ethiopia's more than 80 languages, spoken by the Hadiyya people in the
Hadiyya zone of the Southern Nations, Nationalities and Peoples Region (SNNPR) [13].
According to the central statistical census report, the population of the Hadiyya people was
1.6 million [14] [15]. However, as of March 2018 the CIA estimated the Hadiyya population
at 1.8 million, that is, 1.7% of the 105,350,020 Ethiopian population [14].
Hadiyyisa belongs to the Afro-Asiatic language family, which is widely spoken in West
Asia/the Middle East, the Horn of Africa, North Africa, and parts of the Sahel. It thus belongs
to the same family as Arabic, Oromo, Hausa, Amharic, Somali, Hebrew, Tigrinya, Kabyle,
Tamazight, and many more [14] [16]. Within this family it is categorized in the Cushitic
subfamily, which also includes Oromo, Somali, Sidamo, Afar, Agew, Gedeo, Saho, and Beja
[17].
Hadiyyisa was one of the 15 Ethiopian languages in which a literacy campaign was conducted
using the Saba script in the 1970s and 1980s. That script is essentially a syllabary, each of
whose characters represents a consonant and a vowel. In 1994, the Latin script was adopted
for Hadiyyisa, as for some other Cushitic languages of Ethiopia. Currently, the Latin-based
orthography is in use, especially at school, college and university level [13].
Like other Ethiopian languages, Hadiyyisa is a severely resource-scarce language. Around the
world, machine translation is currently being used to address such problems. That is what this
research study, bidirectional English-Hadiyyisa machine translation, attempts to do using
statistical and neural machine translation approaches.
1.2 Statement of the Problems
Nowadays, the borderless economy demands communication between people to facilitate their
day-to-day business [2] [18]. In addition, people want to communicate with others in their own
language, or want to learn a new language, which may require translation from one language
to another. Besides this, the internet is a source of information available only in certain
well-resourced, technologically supported languages such as English and other European and
Asian languages [19] [20]. In contrast, under-resourced Ethiopian languages suffer from a lack
of digital resources and cannot take advantage of the technology, as discussed by different
researchers [3]. Ethiopia has 83 registered languages; among these, Hadiyyisa is one of the
languages that require digital resources to facilitate communication between different people.
Currently, Hadiyyisa is given as a subject at primary, secondary and tertiary levels of
education [13]. However, resource availability is very small. This unavailability of resources
is also leading Hadiyyisa to lose ground among native speakers, and it is not functionally used
in zonal administrative offices [14] [16]. Therefore, to prevent language endangerment, written
documents encompassing new and traditional contents of the language should be translated to
and from other languages [21]. Compared to other languages, English is technologically
supported and well-resourced, and its content needs translating to the under-resourced one.
The driving factor for developing English-Hadiyyisa machine translation is the absence of an
existing automatic machine translation tool that translates documents between English and
Hadiyyisa in both directions. Translating languages manually is impractical due to time, cost,
accuracy and other factors; automated translation helps minimize the time and cost of
translation by making resources available in the respective language. Therefore, there must be
an automated system that bridges this language barrier: a tool that translates from one
language to the other.
Thus, the main objective of this research study is to design and develop bi-directional
English-Hadiyyisa machine translation. In an attempt to solve the above problems, the
following research questions were formulated:
How can a parallel corpus for an English-Hadiyyisa machine translation system be
developed?
How can an English-Hadiyyisa machine translation system contribute a vital role for the
under-resourced language Hadiyyisa?
1.3 Objective
1.3.1 General Objective
The general objective of this study is to design and develop a bidirectional English-Hadiyyisa
machine translation system to facilitate communication between English and Hadiyyisa
language speakers.
To review the state-of-the-art literature on machine translation between resourced and
under-resourced languages.
1.4 Methodology
To achieve the proposed bidirectional English-Hadiyyisa machine translation, different types
of literature and related works from national and international perspectives have been studied,
and different tools and techniques were applied. Data collection, data preparation,
experimentation and translation performance evaluation were carried out.
In this research study, both statistical (SMT) and neural (NMT) translation approaches were
applied. The software tools used for machine translation, evaluation and data preparation in
each approach were as follows:
For the statistical method, Linux (Ubuntu 16.04.4) was used on a laptop with 4 GB of RAM
and a Core i5 processor. Working with statistical machine translation requires tools such as
Moses, MGIZA and SRILM to be installed on Linux. Moses is an SMT system that trains a
translation model for any language pair; MGiza++ was used for word-aligning the parallel
corpus; and SRILM was used for language modeling. The bilingual evaluation understudy
(BLEU) metric was used to evaluate the performance of the translation models. A Linux text
editor was used to prepare data in the format usable for machine translation. During data
preparation, the Nova Aligner app was used to prepare the parallel corpus from the original
sources: it shows the data side by side and allows larger sentences to be broken into smaller
ones. Microsoft Excel 2013 was also used for data preparation, and Microsoft Office Word
2013 was used to write the report.
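The cleaning and alignment steps described above can be sketched in Python. The
normalization rules and the length limit below are illustrative assumptions, not the exact
steps used in this study:

```python
import re

def normalize(sentence):
    """Lowercase, fix spacing before punctuation, and collapse whitespace.
    These rules are illustrative; the study's exact cleaning may differ."""
    sentence = sentence.lower().strip()
    sentence = re.sub(r"\s+([.,!?;:])", r"\1", sentence)  # no space before punctuation
    sentence = re.sub(r"\s+", " ", sentence)              # collapse runs of whitespace
    return sentence

def clean_parallel(src_lines, tgt_lines, max_len=80):
    """Keep only sentence pairs where both sides are non-empty and short
    enough, mirroring the usual corpus-cleaning step before training."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = normalize(src), normalize(tgt)
        if src and tgt and len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            pairs.append((src, tgt))
    return pairs
```

Each surviving pair would then be written out one sentence per line, the format that both
Moses and OpenNMT-py expect.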
Using these same data, OpenNMT-py, an open-source neural machine translation platform
with a PyTorch backend, was used. It was installed in a Google Colab Jupyter notebook,
which was used for code editing and running. Google Colab is a free online application that
provides access to a graphics processing unit (GPU) for data processing during training, and
it also provides storage for the running data. The BLEU score, MS Office and the data
preparation tools were common to the NMT experiments.
As with other African and Ethiopian languages, resources for Hadiyyisa are scarce; as a
result, only a small amount of data was collected, from different sources. Websites, Wachamo
University, Hosanna TTC College and the Bible Society of Ethiopia (BSE) are some of the
sources the data were collected from. A total of 40,492 English-Hadiyyisa parallel sentences
were collected in two categories: religious and educational. The religious category contributes
around 83% of all the data; in this domain, after data extraction and cleaning, 33,880
preprocessed bilingual parallel sentences were prepared. The second category, from
educational resources, consists of 6,610 preprocessed bilingual parallel sentences prepared
from teaching and learning materials for the Hadiyyisa language.
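To illustrate how such a collection is typically divided for experimentation, the following
sketch splits the parallel sentence pairs into training, development and test portions. The
90/5/5 proportions and the fixed seed are assumptions for illustration; the study's actual
split is not restated here:

```python
import random

def split_corpus(pairs, dev_frac=0.05, test_frac=0.05, seed=13):
    """Shuffle sentence pairs and split them into train/dev/test portions.
    The proportions and seed are illustrative assumptions."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    n = len(pairs)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    dev = pairs[:n_dev]
    test = pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test
```

A fixed split like this lets the SMT and NMT systems be evaluated on exactly the same
held-out sentences.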
1.4.4 Evaluation
The performance of the system in this study was evaluated with the BLEU score, the most
widely used metric for machine translation performance evaluation. It is inexpensive and
automatic, and its results correlate reasonably well with human judgment.
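As an illustration of how BLEU works, the following is a simplified single-reference,
sentence-level BLEU: the geometric mean of modified n-gram precisions (up to 4-grams)
multiplied by a brevity penalty. Real toolkits add corpus-level pooling and smoothing, so
this sketch will not reproduce their scores exactly:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # "modified" precision: clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, a zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate sharing no words scores 0.0.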
To obtain a well-performing machine translation system, a large bilingual corpus is needed.
In this study, however, only a limited bilingual corpus from specific religious and educational
domains was available, and several attempts to obtain a monolingual corpus for Hadiyyisa
failed. Resource unavailability, and preparing this small data set from Bible and educational
sources, was thus the most challenging part of the work. Besides all this, the lack of machines
for training neural machine translation models was another problem: Google Colab was used,
but while training the neural machine translation it limits access and closes for hours, or at
most for a whole day. Because of these and other constraints, human evaluation of the system,
model evaluation for SMT and corpus validation by linguists were not included in this study.
and neural machine translation. The test results of the experimentation and the corpus used
are briefly explained. Finally, chapter six presents the conclusion and recommendations.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
In this chapter, the literature in the field of machine translation is reviewed. Section 2.2
discusses the concept of machine translation, followed by the historical development of
machine translation in section 2.3. Section 2.4 then discusses different approaches to
machine translation, including rule-based, corpus-based, hybrid and neural machine
translation systems. Finally, related work on translation from resourced to under-resourced
languages and between under-resourced languages is reviewed in section 2.5, followed by a
discussion of the challenges of machine translation in section 2.6.
Because human language is a medium for communication, there is a need for translation:
language carries different types of information to be shared, and the need for translation
among languages is currently increasing [2]. Today we live in an information age with
numerous resources that have to be translated, such as newspapers, legal documents, working
manuals and instructions, different types of books, scientific publications and others. Because
of the abundance and variety of these resources, professional translators cannot fulfil the
increasing demand of users. Therefore, there must be an automatic tool (MT) that easily
translates these resources from resource-rich languages.
To address translation problems, there are different types of machine translation [22]:
machine translation for watchers (MT-W), machine translation for revisers (MT-R), machine
translation for translators (MT-T) and machine translation for authors (MT-A). MT-W is for
readers who need to get information written in a foreign language; it came into use when the
pioneers of machine translation translated written military and technological documents.
MT-R produces an automatic raw translation comparable to an original human-translated
draft; since it frees professional translators from the difficult and time-consuming work that
would otherwise be passed to revisers, it is taken as a brush-up. MT-T is used by human
translators and belongs to the category of personal computer (PC) based translation tools that
run on standard platforms, integrated with numerous text processing tools; it provides online
dictionaries and translation memories for professional translators. The last one, MT-A, helps
authors translate their texts into one or many other languages; the author agrees to assist the
machine wherever there could be ambiguity, in order to get a satisfying translation without
revision.
Machine translation has several advantages besides its limitations [6]. Some of the
advantages:
Quick translation: compared with human translation, which takes much time, MT is
preferable because it saves time and effort when translating large amounts of text.
Low price: hiring a professional translator means paying whatever he or she requires,
which can be too costly; using machine translation instead is reliable and cost-effective.
Universality: machines translate any text into the required language regardless of the
domain it belongs to, whereas human translators are limited to the specific fields in
which they specialized.
Online and web page content translation: today there are many online machine
translation tools that help translate information from one language into the intended
language, including tools that translate search engine queries and web page contents.
However, machine translation has some disadvantages that only human translation can
solve, such as lower exactness and inferior translation quality when texts contain ambiguous
words and sentences.
By 1952 there had been few examples of machine translation, and only the few who were
interested were invited to the first machine translation conference [23]. Bar-Hillel's paper,
together with Weaver's 1949 memorandum, was circulated as foundational reading to all
members of this first Conference on Mechanical Translation, hosted by Bar-Hillel at MIT on
17-20 June 1952, and the summaries of the conference were published in the newly created
journal Mechanical Translation [24]. At the close of the conference there was a discussion of
what the next step should be, and it was agreed that sources of funding had to be investigated.
Accordingly, Duncan Harkin of the US Department of Defense suggested that his department,
and likely other US organizations, would be forthcoming with funds for projects, and Jerome
Wiesner added that funding and help might also be forthcoming from the Research Laboratory
of Electronics at MIT [23].
At that time Leon Dostert, who had been Eisenhower's personal interpreter during the war
(1944-1945) and liaison officer to the French commander, was invited for his experience with
mechanical aids for translation. Dostert contacted Watson at IBM, and they agreed to
collaborate; International Business Machines (IBM) donated the equipment for the activities.
After a constellation of linguists and engineers confirmed that the work was “feasible”, he
could not get a computer mathematician for the processing side; instead he decided that the
demonstration should translate Russian into English, because of the politics of the time: the
enemy then was Russia, not Germany. The demonstration was made with a vocabulary of 250
lexemes on 7 January 1954 at the New York headquarters of IBM [23].
Much of the early work in MT was disappointing because of the meaning problems generated
and the ambiguities left unhandled. The translation system made at Georgetown University,
when given the English sentence
“The spirit is willing, but the flesh is weak”, translated it as “the vodka
is good but the meat is rotten” [1].
The studies continued: in 1962 the Association for Machine Translation and Computational
Linguistics was established in the U.S., and many researchers joined it. The Automatic
Language Processing Advisory Committee (ALPAC) followed in 1964. The 1966 ALPAC
report concluded that machine translation could not deliver the expected results, so funding
for the field was reduced and its development slowed. However, web applications of MT
started in 1996 with SYSTRAN, which translated small texts. In 2003 and 2007 came further
innovations, such as Moses, an open-source SMT tool. Later, text- and SMS-translating
mobile phones appeared in 2008, and speech-to-speech translating mobile devices for
English, Japanese and Chinese were introduced in Japan in 2009. In 2012 Google announced
Google Translate (GT) [2].
features there is also another approach, called hybrid [2]. The rule-based approach works
with built-in linguistic rules and bilingual dictionaries for both languages, while corpus-based
approaches such as statistical machine translation (SMT) translate from source to target
language based on a given corpus, taking it as example. The following sections discuss these
approaches in detail, and the Vauquois triangle in figure 2.1 shows the MT approaches: the
direct approach, interlingua, and transfer-based translation, arranged according to the depth
of source language analysis and target language generation, from the zero level, where text
is just strings of characters, up to higher, more understandable levels [24].
1. Direct Machine Translation (DMT)
2. Transfer-Based Machine Translation
3. Interlingua
In the direct approach, source-language content is translated without passing through an internal representation. Words are translated much as a dictionary would translate them, often without much relationship of meaning between them. Dictionary lookup may be done with or without morphological analysis. This translation approach is bilingual and unidirectional and requires little syntactic and semantic analysis [9]. There is no analysis of the source data; source phrases are translated directly into their matching target-language translations through encoded dictionaries [24]. Figure 2.1 presents direct machine translation in the Vauquois triangle.
Example: let us consider the language divergence in the following sentence pairs in Hadiyyisa and English.
Because of the drawbacks observed in DMT, the transfer-based RBMT approach was developed [2]. It explicitly arranges the machine translation process in three stages. First, source analysis produces an intermediate representation of the source language, such as a syntax tree or a semantic representation (SL representation). Second, the source-language intermediate representation is transferred to the corresponding target-language intermediate representation (TL representation). Finally, the target-language text is generated from its intermediate representation [24]. Figure 2.1 above in Section 2.4, with its horizontal, downward and upward arrows, represents transfer-based machine translation.
This paradigm also has certain limitations, like DMT: rules must be defined at every stage of translation, modules are not reusable, and the transfer modules are not simple [2].
2.4.1.3 Interlingua approach
The aim of this approach is linguistic homogeneity: the source language is translated into an intermediate representation that does not depend on any particular language, and the target language is generated from this intermediate representation [2]. It is the most suitable approach for multilingual systems. It works in two phases: analysis (from the SL to the Interlingua) and generation (from the Interlingua to the TL) [22].
The corpus-based approach is also called the data-driven approach, and it overcomes the anomalies of RBMT. CBMT uses a bilingual parallel corpus to translate the source language into the target language: it learns the required knowledge from the corpus, and a large amount of data is therefore needed [2][22]. Statistical machine translation (SMT) and example-based machine translation (EBMT) are well-known members of this class.
Statistical machine translation (SMT) is built on statistical models whose parameters are derived from the analysis of bilingual corpora. The original SMT model is based on Bayes' theorem and takes the view that every sentence in one language is a possible translation of any sentence in the other, the most appropriate being the translation assigned the highest probability by the system. Translation is based on the probability p(e|f): the probability of translating a sentence f in the source language F (for example, English) into a sentence e in the target language E (for example, Hadiyyisa) [2].
Using such a probability function, SMT finds the most probable target text for the given source sentences. Since this approach requires no prior linguistic customization to conduct a translation task, it has the eminent advantage of avoiding the innumerable problems of rule-based approaches.
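The search for the most probable translation can be sketched in a few lines. The following toy noisy-channel decoder is an illustration only: the word translation probabilities, language model scores and candidate lists are invented values, not drawn from any real corpus or from this study's data.

```python
from itertools import product

# Toy noisy-channel SMT decoder: score every candidate translation e of a
# source sentence f by p(e) * p(f|e) and keep the argmax. All probabilities
# below are invented illustrative values, not estimated from a corpus.

# Translation model p(f_word | e_word): a toy word-for-word table.
t_model = {
    ("buna", "coffee"): 0.9, ("buna", "tea"): 0.1,
    ("ise", "she"): 0.8, ("ise", "her"): 0.2,
}
# Language model p(e) over candidate target strings (toy values).
lm = {"she coffee": 0.05, "she tea": 0.01,
      "her coffee": 0.02, "her tea": 0.005}

def decode(src_words, options):
    """Enumerate candidate translations and return the highest-scoring one."""
    best, best_score = None, 0.0
    for cand in product(*(options[w] for w in src_words)):
        e = " ".join(cand)
        score = lm.get(e, 0.0)                      # p(e)
        for f_w, e_w in zip(src_words, cand):
            score *= t_model.get((f_w, e_w), 0.0)   # p(f|e), word by word
        if score > best_score:
            best, best_score = e, score
    return best

opts = {"ise": ["she", "her"], "buna": ["coffee", "tea"]}
print(decode(["ise", "buna"], opts))   # she coffee
```

Real decoders such as Moses do not enumerate all candidates but search the space with dynamic programming and pruning; the scoring principle, however, is the same product of translation-model and language-model probabilities.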
For a translation task, SMT needs source and target texts for the translation model, and target-language (monolingual) texts for the language model. These, together with the source text, enter the decoder, which carries out the translation and finally outputs the target sentences. Before all of this, the input data must be processed: the data are tokenized, truecased and cleaned. After translation comes evaluation, either automatic, for example with the Bilingual Evaluation Understudy (BLEU) metric, which correlates closely with human judgement, or by human evaluators, which is resource-intensive.
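The preprocessing steps just mentioned (tokenization, truecasing, cleaning) can be illustrated with a minimal sketch. Real SMT pipelines use the Moses scripts for this; the function names and rules below are simplified stand-ins, not the actual Moses implementations.

```python
import re

# Minimal sketch of SMT preprocessing: tokenization, truecasing and
# cleaning. Moses performs these with tokenizer.perl, truecase.perl and
# clean-corpus-n.perl; this toy version only illustrates the idea.

def tokenize(line):
    """Separate punctuation marks from words."""
    return re.findall(r"\w+|[^\w\s]", line)

def truecase(tokens, lowercase_vocab):
    """Lowercase a sentence-initial word if it usually occurs lowercased."""
    out = list(tokens)
    if out and out[0].lower() in lowercase_vocab:
        out[0] = out[0].lower()
    return out

def clean(pairs, max_len=80):
    """Drop empty sentence pairs and pairs with overlong sentences."""
    return [(s, t) for s, t in pairs
            if 0 < len(s) <= max_len and 0 < len(t) <= max_len]

src = tokenize("The spirit is willing, but the flesh is weak.")
src = truecase(src, {"the"})
pairs = clean([(src, src)])   # toy parallel pair survives cleaning
print(src[:3])                # ['the', 'spirit', 'is']
```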
3. Data dilution: statistical anomalies unique to a subset of natural language, which affect machine translation when developing MT for commercial use.
5. Word order: word order differs from language to language; for example, one language may use Subject-Verb-Object (SVO) order while another uses SOV.
EBMT, also known as "memory-based translation", translates source sentences and generates the equivalent target-language translation by point-to-point mapping. It uses examples to check whether similar sentences have been translated before and to verify the correctness of the translation. It works well with small data sets, which is its main advantage, and it is mainly used to translate between two very different languages. However, detailed linguistic analysis is impossible [9]. Even though EBMT is a good approach for avoiding manual rules in machine translation, it needs analysis and generation modules to create the dependency tree used for the translation database and for sentence analysis [2].
2.4.3 Neural machine translation (NMT)
Unlike the machine translation approaches discussed so far, the newest MT approach is neural machine translation. Just as statistical machine translation replaced traditional rule-based machine translation, the neural approach is replacing statistical machine translation.
Neural machine translation uses word-vector representations and large neural networks to predict the probability of a word sequence. It translates a sentence as a whole, something the traditional machine translation systems could not do. In NMT the input sentence passes through an encoder, which produces a representation of the meaning of the input sentence known as a "thought vector" or sentence vector; this passes through a decoder, which processes it to produce a translation. This model is known as the encoder-decoder architecture [25], as shown in Figure 2.4.
Figure 2.4: The encoder-decoder architecture (input text → encoder → decoder → translated text).
h_t = f(x_t, h_{t−1})   (1)
c = q({h_1, ..., h_{T_x}})
where h_t ∈ R^n is the hidden state at time t, and c is a vector generated from the sequence of hidden states; f and q are nonlinear functions. For instance, a Long Short-Term Memory (LSTM) network can be used as f, with q({h_1, ..., h_T}) = h_T.
The decoder is trained to predict the next word y_t given the context vector c and all the previously predicted words {y_1, ..., y_{t−1}}. In other words, the decoder defines a probability over the translation y by decomposing it into ordered conditionals:

p(y) = ∏_{t=1}^{T} p(y_t | {y_1, ..., y_{t−1}}, c)   (2)

where each conditional is modelled with a nonlinear, potentially multi-layered function g that outputs the probability of y_t, and s_t is the hidden state of the recurrent neural network (RNN).
It should be noted that, unlike the basic encoder-decoder approach of Eq. (2) above, the attention mechanism conditions the probability on a distinct context vector c_i for each target word y_i.
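As an illustration of the encoder recurrence of Eq. (1), the following sketch runs a plain tanh RNN cell over a sequence of dummy word vectors and takes the last hidden state as the context vector, i.e. q({h_1, ..., h_T}) = h_T. The dimensions, random weights and dummy inputs are arbitrary illustrative choices, not the parameters of any trained model.

```python
import numpy as np

# Minimal sketch of the encoder recurrence h_t = f(x_t, h_{t-1}), with f a
# simple tanh RNN cell. Sizes and weights are illustrative only.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                      # embedding and hidden sizes (assumed)
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def encode(xs):
    """Run the RNN over the input word vectors; return all hidden states."""
    h = np.zeros(d_hid)
    states = []
    for x in xs:                        # one step per source word
        h = np.tanh(W_x @ x + W_h @ h)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    return states

sentence = [rng.normal(size=d_in) for _ in range(5)]  # 5 dummy word vectors
hs = encode(sentence)
c = hs[-1]                              # q({h_1, ..., h_T}) = h_T
print(c.shape)                          # (8,)
```

An attention-based decoder would instead compute a weighted combination of all the states in `hs` at every output step, rather than using only the final one.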
NMT has recently been shifting to newly emerged techniques, from Convolutional Neural Networks (CNN) and Long Short-Term Memory networks to the Transformer. The RNN approaches have their own drawbacks during training, consuming much time and memory. The newer Transformer architecture with attention is increasingly playing a role in NMT, surpassing the traditional and RNN-based NMT approaches, largely because it supports parallelization.
The hybrid approach shares the characteristics of rule-based machine translation and statistical machine translation. Hybrid models differ in various ways [9]:
2. Statistics guided by rules: rules are used to preprocess the input data, which guides the statistical tool well, and also to post-process the statistical output.
The first widely used automatic MT evaluation metric was proposed by Papineni et al. [27] and remains the most widely accepted. It is calculated from the n-gram precision of the machine translation against reference translations [22].
BLEU = BP · exp(∑_n w_n · log p_n)

where p_n is the number of n-grams of the machine translation that also appear in one or more reference translations, divided by the total number of n-grams in the machine translation; w_n are positive weights; and BP is the brevity penalty, which penalizes translations for being "too short". The brevity penalty is computed over the entire corpus and was chosen to be a decaying exponential in r/c, where c is the length of the candidate translation and r is the effective length of the reference translation.
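The BLEU computation can be sketched as follows. This is a simplified sentence-level version using only unigrams and bigrams with the standard brevity penalty; production evaluation (e.g. multi-bleu.perl) works at corpus level and typically applies smoothing. The example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: clipped n-gram precision + brevity penalty."""
    weights = [1.0 / max_n] * max_n
    log_p = 0.0
    for n, w in enumerate(weights, start=1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0
        log_p += w * math.log(overlap / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_p)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(cand, ref), 3))   # 0.707
```

Here the unigram precision is 5/6 and the bigram precision 3/5, so the score is the weighted geometric mean √(5/6 × 3/5) ≈ 0.707, with no length penalty since candidate and reference have equal length.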
To conduct this research, sample data were collected from different sources. Corresponding sentences, phrases and words in both languages were identified, and the data were prepared and arranged into sentences for testing. Simple sentences were prepared manually and complex sentences were also collected to test applicability: 1,020 simple sentences were prepared manually and 1,951 complex sentences were collected, of which 414 came from the Public Procurement Directive and 1,537 from the Bible. Only a small amount of data was collected, because finding data specific to everyday life is not easy; most of the available resources came from specific domains such as the Bible and law.
The research was implemented using the SMT approach with two corpus categories, one of simple sentences and the other of complex sentences, creating two language models to support bidirectional translation: one for English and one for Amharic.
The data were divided into training and testing sets. The experimental records show a BLEU score of 82.22% for English-to-Amharic and 90.59% for Amharic-to-English, while the manual questionnaire method gave 91% accuracy for English-to-Amharic and 97% for Amharic-to-English; these figures are for simple sentences. For the complex sentence pairs, the BLEU scores were approximately 73.38% for English-to-Amharic and 84.12% for Amharic-to-English, and the questionnaire method gave 87% for English-to-Amharic and 89% for Amharic-to-English. To achieve these results, the preprocessing steps of data cleaning, tokenization and truecasing were applied.
To conduct this research, the statistical machine translation tools were installed in a Linux terminal: GIZA++, which incorporates two tools, GIZA++ itself and MKCLS, a tool used to train word classes, together with the IRSTLM toolkit. GIZA++ and IRSTLM were integrated into Moses to complete the installation.
Finally, the experimental results show how good human evaluation is, though it is time-consuming and expensive. Another observation is that translation from English to Amharic recorded lower accuracy, which Eleni attributes to the morphological richness of the Amharic language.
The objective of the English-Afan Oromo machine translation work [28] was to develop a prototype English-Afan Oromo machine translation system. The thesis covers the background of both languages, a review of selected literature and related work in machine translation, data collection and preparation, and experimentation and evaluation.
The statistical machine translation approach was applied to data collected from the Constitution of the FDRE (Federal Democratic Republic of Ethiopia), the Universal Declaration of Human Rights, proclamations of the Council of the Oromia Regional State, religious documents, and other documents.
The collected documents were divided into training and testing sets with a 90/10 split: nine-tenths for training and the remaining one-tenth for testing. 62,300 monolingual sentences were used to train the language modeling subsystem, and with a limited bilingual corpus of about 20,000 sentences a translation accuracy of 17.74% was achieved. To arrive at this result, the data were preprocessed by tokenization, truecasing and cleaning, and the SMT tools for model creation, sentence alignment and translation were installed on Linux. The data were trained and tested separately by domain; the evaluation results for the test data from the three domains were 13.69%, 1.97% and 21.72% for the medical, legal and religious domains respectively. From this it can be concluded that the BLEU score of the system depends strongly on the size of the training and testing data. Finally, Sisay recommends additional bilingual data and further experimentation in order to improve the accuracy of the system.
The objective of the bidirectional English-Afan Oromo machine translation work [7] was to develop English-Afan Oromo machine translation using a hybrid approach, which shares the benefits of both statistical and rule-based machine translation.
For this work, data were collected from the Holy Bible, the Constitution of the FDRE, the Criminal Code of the FDRE, international conventions, Megeleta Oromia and a bulletin from the Oromia health bureau. Monolingual Afaan Oromo and English corpora were also collected from certain web sites. The collected data came in many different formats not readily usable for machine translation, so they were preprocessed by tokenization, truecasing and cleaning.
A total of 3,000 English-Afaan Oromo parallel sentences were used to train and test the model: 2,900 parallel sentences for training and the remainder for testing. The statistical machine translation tools for model training and word alignment were installed on Linux. Based on the evaluation results of the experiments, Jabessa concludes that the hybrid approach performs better than the statistical approach when Oromifa is the source language and English the target.
The objective of the next study was to experiment with Amharic-Tigrigna sentence translation using the statistical machine translation approach [3]. The Bible was the only data source because no other Amharic-Tigrigna parallel corpora were available.
These data were preprocessed before the experiment, because data taken from the web cannot be used as-is for statistical machine translation; normalization, typing errors and noise in particular were handled with Perl and Python. The resulting parallel corpora contained 25,470 Amharic sentences with 3,555,993 tokens (64,259 types), averaging 14 words per sentence, and 25,470 Tigrigna sentences with 396,565 tokens (61,175 types), averaging 16 words per sentence. These data sets were divided into training, development and test sets, and were represented at morpheme, word and sentence levels. Additionally, the language model data consisted of 36,989 sentences (697,716 tokens of 112,511 types) for Amharic and 62,335 sentences (1,089,435 tokens of 109,988 types) for Tigrigna at the word level.
Since these languages are morphologically rich, conducting the experiment at both word and morpheme level is worthwhile. After training on the training, development and test splits, eight models were created, and 1,000 sentences from the same domain were used for word- and morpheme-based evaluation. The results show word-unit BLEU scores of 6.65 for Tigrigna-Amharic and 8.25 for Amharic-Tigrigna; using the morpheme as the unit, Amharic-Tigrigna scored 13.49 BLEU while Tigrigna-Amharic scored 12.93.
The objective of the next study was to develop Amharic-Arabic neural machine translation [12]. The study used data collected from the Holy Quran: a 13,501-sentence Amharic-Arabic parallel corpus was constructed by manually splitting the verses of the Quran into separate sentences, with Amharic as the source language and Arabic as the target. The data were split 80/20: 80% for training and the remaining 20% for validation and testing.
Neural machine translation needs a bilingual corpus aligned at the sentence level. It uses a neural network to train on the data, and NMT is the preferred MT approach compared with the statistical and rule-based ones.
The preprocessed data were trained using LSTM and GRU recurrent neural network (RNN) models, with the maximum sentence length set to 44, batch sizes of 80 and 40, and a learning rate of 0.001 with Adam optimization; other parameters were left at their defaults. The trained models were saved every 10,000 steps, reporting the validation accuracy and perplexity, which show how well the model is performing: the smaller the perplexity, the better the translations.
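The perplexity reported during validation can be illustrated with a small computation: it is the exponential of the average negative log-probability that the model assigns to the reference tokens. The probability values below are invented purely for illustration.

```python
import math

# Perplexity from per-token model probabilities; lower perplexity means
# the model is less "surprised" by the reference translation.
def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.95, 0.85]   # model assigns high probabilities
uncertain = [0.2, 0.1, 0.25, 0.15]
assert perplexity(confident) < perplexity(uncertain)
print(round(perplexity(confident), 2))
```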
The evaluation results of the two models were 12% for LSTM and 11% for GRU; compared against Google Translate, which scored 6% on the same data, this shows a promising future for developing machine translation for local languages with the neural approach, alongside the SMT method.
The objective of the next study was to develop Arabic-Chinese neural machine translation [29]. Both are among the most widely spoken languages in the world, with approximately 319 million Arabic and 918 million Chinese speakers, yet studies concerning this language pair are scarce. Developing an NMT system for these languages may therefore have economic, cultural and social impact.
The lack of machine translation work for these languages led the researchers to build this system to bridge that gap. The approach comprises five components: 1) data filtering, 2) morphological segmentation, 3) Arabic romanization, 4) data-driven subword units, and 5) linguistic feature integration.
Since NMT, like SMT, depends on clean data, finding suitable corpora was not an easy task; the Lewis and Zipporah approaches were used for filtering and selecting high-quality data. These approaches clean out the most problematic sentences, which increases the quality of the data in both language pairs. To implement this study, 2 million sentences were selected from the UN corpus, and filtering was applied for data ambiguity (removing diacritic marks in Arabic), bad encoding (corrupt symbols), duplicate parallel sentence pairs, repeating sentence pairs, incorrect language, and bad alignment.
The filtering removed 109,484 bad and noisy sentence pairs, leaving 1,726,170 cleaned and filtered sentence pairs with 42,592,208 Arabic and 41,884,593 Chinese words. A further 4,000 sentence pairs were set aside for validation and test data.
The experiments were made using the TensorFlow framework and GRU units, tuning some hyperparameters: word-embedding dimension 500, RNN hidden-layer size 1024, mini-batches of size 80, and a maximum sentence length of 50. The vocabulary size for input and output was set to 40K, decoding was performed with a beam size of 12, and the default hidden and embedding dropout were applied. Models were trained with the Adam optimizer, reshuffling the training corpus between epochs, with validation every 10,000 steps and checkpoints every 30,000 steps. The models were trained for approximately one month on 8 Tesla P100 GPUs, and from the saved models the four most recent were selected for the report. Finally, BLEU (multi-bleu.perl) was used to evaluate the accuracy of the models.
The objective of the final thesis reviewed here was to design bidirectional Tigrigna-English machine translation using the statistical approach [30], aiming to solve existing problems such as lack of information and the communication gap for Tigrigna speakers.
To meet this objective, data were collected from various sources and classified into five categories, and three experiments were conducted: a baseline (phrase-based machine translation) system, a morph-based system (based on morphemes obtained by an unsupervised method), and a post-processed segmented system (based on morphemes obtained by post-processing the output of the unsupervised segmenter).
The statistical machine translation approach was applied, and four models were created to handle bidirectional translation: one for English and three for Tigrigna, covering the baseline, morph-based and post-processed experiments. BLEU evaluation was conducted, and on corpus 2 the results were a BLEU score of 53.35% for Tigrigna-English and 22.46% for English-Tigrigna translation.
2.6.9 Summary
The review of related machine translation work between English and Ethiopian languages, and between Ethiopian languages themselves, shows that researchers have relied mostly on the statistical approach with limited data sets. The evaluation results of these works indicate that the accuracy of a translation system depends on the quality and size of the input data. The statistical machine translation tools Moses, IRSTLM and GIZA++ are frequently used, and the BLEU score is the standard measure of translation performance.
However, the translation approach chosen must also be taken into account, since it has its own role in achieving a better BLEU score. For instance, neural machine translation with attention is a newer approach that outperforms statistical machine translation: it works with an encoder-decoder deep neural network architecture and an attention mechanism that builds a simple word-alignment model used for source-language vector representation (encoding) and target-language decoding. The Amharic-Arabic neural machine translation work, using LSTM and GRU RNNs on a very small data set, shows the new state of the art among local NMT works, while the Chinese-Arabic neural machine translation work, using TensorFlow and GRU, does so from an international perspective. In the present study, both SMT and NMT approaches are applied to parallel corpora collected from two different sources. Both methods were chosen for their speed and cost-effectiveness in MT system development, despite their poor translation performance on data outside the domain the system was trained on. Using both approaches distinguishes this English-Hadiyyisa machine translation study from prior work between English and Ethiopian languages and between Ethiopian languages. To the best of the researcher's knowledge, this study differs from others not only in using two MT approaches but also in the language pair: no English-Hadiyyisa or Hadiyyisa-English machine translation has been studied before.
CHAPTER THREE
HADIYYISA LANGUAGE
3.1 Introduction
In this chapter we discuss general information about the Hadiyyisa language. The following subsections cover the language family Hadiyyisa belongs to, its phonology, the alphabet used to write it, its nouns, pronouns, conjunctions and adjectives, and the challenges facing the language.
There are more than 83 distinct languages spoken by the nations and nationalities of Ethiopia [6]. Hadiyyisa is one of these, spoken by the Hadiyya people in the Southern Nations regional state [13]. The majority of its speakers are found in the Hadiyya zone, located west of the Rift Valley, with a considerable number of speakers also in the Dawro, Bale and Arsi zones [16]. The World Factbook estimate of March 2018 puts the number of Hadiyyisa speakers at 1.8 million [14], speaking four dialects: Leemo, Sooro, Badawwaachcho and Shaashoogo. The language belongs to the Cushitic subfamily, which also includes Oromo, Somali, Sidamo, Afar, Agew, Gedeo and Beja, to name a few [17].
Hadiyyisa has been used as a medium of instruction in grades 1-4 and is taught as a subject in junior and senior secondary grades. It is also offered as a subject at the Teachers Training College in Hosanna, and since 2014/2015 G.C. a Hadiyyisa language department has been running at Wachamo University [16].
To give a better understanding of the Hadiyyisa language, this chapter is organized as follows. Section 3.2 discusses the orthography of Hadiyyisa, that is, how the language is written. Section 3.3 covers the phonology, its consonant and vowel sounds. Section 3.4 presents the Hadiyyisa alphabet: 23 consonants, five vowels, one glottal symbol, and seven digraphs. Section 3.5 discusses Hadiyyisa nouns, which end with the sounds <a, o, e>, including feminine and masculine gender marking. Section 3.6 discusses Hadiyyisa pronouns, which distinctly mark plural and singular, male and female. Section 3.7 covers dative pronouns such as "to/for me", from the first person singular and plural to the third person singular and plural. Section 3.8 discusses possessive pronouns such as "my, your, his" with their corresponding Hadiyyisa translations. Section 3.9 discusses Hadiyyisa verbs, which show how subjects and objects are connected, including the subject, verb and object order in both languages. Section 3.10 discusses conjunctions, and Section 3.11 the noun modifiers, adjectives. Section 3.12 briefly discusses Hadiyyisa punctuation, which is adopted from English. Section 3.13 presents some Hadiyyisa ordinal and cardinal numerals, and lastly Section 3.14 addresses the challenges of the Hadiyyisa language.
The orthography of Hadiyyisa is based on the Latin script also used to write English, and is alphabetic, with graphemes representing phonemes. The writing and pronunciation of Hadiyyisa and English are, however, not identical [13]. It is important first to discuss the phonology of Hadiyyisa, because phonology is the basis of orthography [16].
3.3 Phonology
Hadiyyisa has 23 identified consonant phonemes and five vowel phonemes. Symbols in parentheses represent consonants used in loanwords [13]. "There are six plosives /b/, /t/, /d/, /k/, /g/ and /ʔ/; five fricatives /f/, /s/, /z/, /ʃ/ and /h/; two affricates /tʃ/ and /dʒ/; four ejectives /p'/, /t'/, /tʃ'/ and /k'/; two nasals /m/ and /n/; the lateral approximant /l/ and trill /r/; and two approximants /w/ and /j/" [16]. These are shown in the following figure.
Like other Highland East Cushitic languages, Hadiyya has five vowel phonemes /a, u, i, e, o/. Each vowel has a long counterpart, and the difference in length changes meaning; vowel length is therefore phonemic in the language [16]. Table 3.1 below shows the vowel phonemes in Hadiyyisa. When these vowels occur in the medial position between consonants, their length gives different meanings [31]. Vowel length and consonant gemination are written by doubling the graphemes: for vowel length, dasa 'slower' versus daasa 'the tent' and dira 'dirty' versus diira 'fattened' are some examples.
3.4 Alphabet
To represent the consonant and vowel phonemes above, Hadiyyisa uses 33 graphemes, among them 23 consonants and five vowels. Twenty-six of the graphemes are identical to the English alphabet, and there are seven additional Hadiyyisa graphemes: the six digraphs <CH NY PH SH TS ZH> and the representation of the glottal symbol, the apostrophe {'} [13].
The Hadiyyisa alphabet has capital and small letters, like English. Some of these letters, for example "X", are not read as in English but as in the other Ethiopian languages that use the Latin script, such as Afan Oromo, Sidaama, Kambata and Gede'o. In these languages the glottal sound is likewise represented in word construction by an apostrophe {'} [13][31]. The full alphabet is shown in Appendix III.
Hadiyyisa nouns end with vowels, specifically <a, o, e>, and never with <i, u> [31]. Qashsha, Buyya, Sanna and Qooqa end with 'a'; Bu'o, Buuro, Laso, Godabo, Baado and Wi'llo end with 'o'; and Afare, Sane, Aade, Xibbe, Miine, Baalle and Biqe end with 'e'. See Table 3.2 below.
Hadiyyisa nouns also show masculine and feminine gender. For example, gannichcho, saayya, ada, are, amaayya, aayya and amaya are nouns indicating feminine gender, whereas aro'o, ambula, bula, abbaayyo, annabbaayyo, sanga and eeshimma indicate masculine. Their English meanings and Hadiyyisa forms are shown in Table 3.3.
Hadiyyisa pronouns, like English pronouns, affect the meaning of the language. A pronoun is a word used in place of a noun, and pronouns are very common in day-to-day conversation [32]. In the following, M, F and PL represent male, female and plural respectively.
English   Hadiyyisa
I         Ane / Ani
We        Neese
She       Ise
He        Ixxo
They      Issuwwi / Ixxuwa
Let us see some sentence constructions with these pronouns, inserting positive and negative forms of a given tense, in the following example with the past continuous tense.
This example shows pronouns, the past continuous tense in negative and positive forms, and plural and singular. For instance, in "An itummuya hee'ummo.", 'An' represents 'I', 'itummuya' represents 'eating', and 'hee'ummo' represents 'was'; to negate it, the suffix -yyo is added to the verb hee'ummo.
There are also dative pronouns, formed by adding the ending -ina to the pronoun or subject form; these are in addition to the pronouns described in Table 3.4. In the original dissertation everything is given in phonemic representation [31]; the English meanings and phonemic representations are shown in the table below.
3PL Ixxuwwina (?ittu-uww-ina ) ‘for/to them’
The first person singular (1SG) dative pronoun is i-ina (iina), meaning 'for/to me'. The second person singular (2SG) dative pronoun is ki-ina (kiina), 'for/to you'; ki'nena is the honorific second person singular, used for aged people or in respectful address, rendered in Amharic as "le-erso". There are also ixxena 'to him', isena 'to her', niina 'to us', issena 'to him' (honoring an elder or respected person, third person singular), and ixxuwwina 'to/for them'. Example: the Hadiyyisa sentence "Ha'i iina ki koboorta kaballina aggiisda'e" is translated as "Borrow me your coat today, please", as shown in the figure below.
Here "iina" represents "me" and "ki" represents "your", used as the dative pronouns discussed in Table 3.5 above.
Figure 3.2: Dative pronouns
Such pronouns are used as modifiers and usually occur preceding the head noun in the formation of a genitive NP. The following paradigm shows the genitive pronouns along with their attributive functions. The head noun involved in the phrase is lókko 'leg' [31].
I lokko My leg
In addition to these pronoun types, there are also ablative pronouns, such as iiniinse 'from me', kiinniinse 'from you', ki'nnuwwwiinse 'from you (plural)', and iseense 'from her'.
3.9 Verbs
Verbs express the action relating subjects and objects, and through them we know when an action occurred, that is, the tense. Table 3.8 gives examples of Hadiyyisa sentences with their English meanings, varying the verb through present, past and future tenses. Example 1 in Section 3.6 above showed the verb ite in combination with subject pronouns; this simple word changes its form based on the subject that precedes it and the tense it expresses. Most Hadiyyisa verb roots end in -e, although some words ending in -e represent nouns instead. The sample verbs shown here are in the second person singular command form only.
Afe Reach
Waare Come
Mare Go
Fire Go out
Hine Dig
Table 3.8. Hadiyyisa verbs with simple past tense and subjects.
The subject, verb and object order of Hadiyyisa is not like that of English. The table above provides simple word order examples: each Hadiyyisa sentence follows S+O+V order, whereas English uses S+V+O order.
Figure 3.3. Hadiyyisa word order
In the Hadiyyisa sentence "Ise buna aggo'o", ise (she) is the subject, buna (coffee) is the object, and aggo'o (drank) is the verb, following S+O+V order.
3.10 Conjunctions
In Hadiyya, the coordinating conjunction 'and' is marked by vowel length attached to all words in the enumeration [16]. A conjunction (CNJ) connects what comes before it with what comes after it. In the table, in adoo buuroo the ending -oo attached to each noun connects ado and buuro, meaning 'milk and butter'; in annii beetii, -ii is another conjunction, meaning 'father and son'; similarly, isee ixxoo means 'her and him'.
3.11 Adjectives
Adjectives form a class of words for which gender marking is not obligatory. Hadiyyisa can form an inestimable number of adjectives. Every adjective has a noun and a verbal counterpart with which it shares its basic phonological structure and semantic content [31]. In the figure below, in "biijaal lanchcho", biijalli is an adjective that qualifies the known noun landichcho (girl).
Figure 3.4: Hadiyyisa adjectives.
3.12 Punctuation
Punctuation in Hadiyyisa sentences is adopted from English, and punctuation marks are used in sentences just as in English, as shown in the table below.
; shiqqeen giphit mare'e semicolon
Example: -
3.13 Numerals
Hadiyyisa uses the same numerals as English. Some cardinal numerals are given in the table below.
Hadiyyisa shares the challenges that all Ethiopian languages face, particularly morphological richness and complexity of words. Because of the various inflections, compounding, clipping and blending of words, there can be plenty of meaning differences among words. For instance, by adding numerous suffixes to the simple word waare we can form many meaningful morphemes, as shown in the example below.
Even though a Latin-based writing system is currently in use, writing such words with several added morphemes is problematic for many users, because no spell checker exists. If a single letter or a glottal mark is missed, the word loses its sense entirely. Unlike for native speakers, this complexity could hinder non-natives from learning Hadiyyisa. In addition to the morphological complexity, the inaccessibility of digital resources is another severe challenge that makes Hadiyyisa a resource-scarce language.
CHAPTER FOUR
4.1 Introduction
The objective of this study is to develop a bidirectional English-Hadiyyisa machine translation system. To achieve this objective, appropriate approaches were studied and chosen, and data were collected and prepared for use in MT. In addition, suitable tools and techniques were employed.
This chapter briefly discusses how the study was carried out, divided into subsections. Section 4.2 gives a general explanation of the English-Hadiyyisa MT system architectures, with detailed explanations in its subsections. Data collection and data preparation for the experiments are discussed in sections 4.3 and 4.4, respectively. Sections 4.5 and 4.6 then discuss the experimental setups of the SMT and NMT approaches: SMT comprises language modeling, translation model training, tuning and evaluation, whereas NMT comprises preprocessing, training, translation and evaluation.
4.2 System Architecture of English-Hadiyyisa Machine Translation
The English-Hadiyyisa machine translation architecture in figure 4.1 below shows the general architecture of SMT and NMT combined. The statistical part has three main components: language modeling, translation model creation and decoding. The neural part has four interrelated components: 1) preprocessing, which differs from the general data preprocessing; 2) training on the preprocessed data; 3) translation; and 4) testing. In the diagram, input and output texts are shown as piles of sheets, the models created by each process as storage symbols, the processes as rectangles, and the flow of the directions as arrows.
For both approaches there must be source and target data sets that are preprocessed. The preprocessed data are split into training, validation and test sets. With SMT, language modeling, translation modeling and decoding then take place on these data. With NMT, the same data pass through preprocessing, which creates three PyTorch files; training, which creates the model from the PyTorch files; the translation phase, which predicts the target texts using the model together with the source test set; and finally evaluation.
This figure gives only a general overview of the two models. The detailed English-Hadiyyisa SMT and NMT architectures are discussed in sections 4.2.1 and 4.2.2 below.
To create a translated output, the SMT approach depends on bilingual text corpora [9]; no customization of linguistic rules is needed, since the system learns through statistical analysis of the text corpora. Translation is based on the probability p(e|f), the probability of translating a sentence f in the source language (SL) F (for example, English) into a sentence e in the target language (TL) E (for example, Hadiyyisa) [2].
This probability distribution is based on Bayes' theorem: if p(f|e) and p(e) denote the translation model and the language model, respectively, then p(e|f) ∝ p(f|e)p(e). The translation model p(f|e) is the probability that the source sentence is the translation of the target sentence, i.e. the way sentences in E get converted to sentences in F. The language model p(e) is the probability of seeing that TL string, i.e. the kind of sentences that are likely in the language E.
This decomposition is attractive because it splits the problem into two subproblems. The best translation is found by picking the one that gives the highest probability:
P(e|f) = P(e) × P(f|e) / P(f)    (1)
Using Bayes' rule, we can rewrite the expression for the most likely translation:
ê = argmax_e P(e|f) = argmax_e P(e) × P(f|e)
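This decision rule can be illustrated with a small Python sketch. It is a toy example only: the candidate sentences and probability values below are invented for illustration and do not come from the trained models described in this thesis.

```python
import math

# Toy illustration of the noisy-channel rule: e_best = argmax_e P(e) * P(f|e).
# All sentences and probabilities below are invented for the example.

# Language model P(e): how likely each Hadiyyisa sentence is on its own.
p_e = {
    "xumma gattaa": 0.04,
    "xumma waagalata": 0.03,
    "gattaa xumma": 0.001,   # bad word order, so a low LM probability
}

# Translation model P(f|e): how likely the English source "good morning"
# is, given each candidate Hadiyyisa sentence.
p_f_given_e = {
    "xumma gattaa": 0.5,
    "xumma waagalata": 0.4,
    "gattaa xumma": 0.5,     # same words as the first, so the same TM score
}

def best_translation(candidates):
    """Pick the candidate maximizing log P(e) + log P(f|e)."""
    return max(candidates,
               key=lambda e: math.log(p_e[e]) + math.log(p_f_given_e[e]))

print(best_translation(p_e))  # the language model breaks the tie
```

Note how the language model separates the two candidates that the translation model scores equally: this is exactly the fluency role of p(e) described above.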
Figure 4.2 below shows the architecture of the model, with the subcomponents of statistical machine translation and how it works.
Figure 4.2. English Hadiyyisa machine translation SMT architecture
As outlined in the general architecture above, this figure shows the SMT model architecture for English-Hadiyyisa machine translation. With SMT, both the bilingual and the monolingual data must be preprocessed before training, validation and testing. The preprocessed monolingual data is used for target language modeling, and the bilingual corpus is used for training. Some of the bilingual corpus is held out as a source test set used for translation; some of the target-side test set is used as a reference for evaluating the system. The translation model, together with the language model, is then used by the decoding algorithm. Decoding searches for the best sequence of transformations that translates source input texts into the corresponding target output sentences, in this case Hadiyyisa or English sentences, since this study is bidirectional. The translated target texts are checked against a reference test set to measure system performance.
Neural machine translation is the state of the art in machine translation, surpassing all existing machine translation methods. It uses word vector representations and large neural networks to predict the probability of a word sequence, and it translates whole sentences at once, which other translation approaches could not do. In NMT the input sentence passes through an encoder that captures the meaning of the input sentence, known as the "thought vector" or sentence vector, which then passes through a decoder that processes it to produce a translation. This model is known as an encoder-decoder architecture [25].
Just as other translation approaches have changed successively over time, there have also been promising improvements in neural machine translation. It was initially implemented with sequence-to-sequence RNN methods such as LSTM and GRU, and with CNNs. Later, an attention mechanism was added. Now it has been brought forward with major improvements as the "Transformer with attention". In addition to the SMT approach, this study has also been implemented with the Transformer attention mechanism.
The Transformer was first presented in the paper "Attention is all you need" [34], in 2017. It was developed to solve transduction problems (parallelization, training time, accuracy, etc.) of the existing NMT models based on LSTM, GRU and CNN. Although, like LSTM-based systems, it is an architecture for transforming one sequence into another with the help of two parts (encoder and decoder), it is completely different from sequence-to-sequence models [35]. Allowing more parallelization during computation has made it the best performing new model architecture.
This figure shows the Transformer model architecture: the left side is the encoder and the right side the decoder. Neural sequence transduction models have an encoder-decoder structure, and the Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder [34].
This model architecture fits any language pair. In this experiment English can be taken as input and Hadiyyisa as output, or vice versa, depending on which language is made the source and which the target: each language's training set passes through the corresponding embedding, i.e. the English training set through the input embedding and the Hadiyyisa training set through the output embedding.

Consider first the encoder block (section 1 below). The whole input sequence passes through at once: every word of every sentence is processed simultaneously and the input embeddings are generated at once. This is not possible in RNNs, where each word passes through in time sequence and word embeddings are created one at a time. Input embeddings represent the inputs in a computer-understandable format such as vectors, arranging similar words or vectors meaningfully in the embedding space, where each word is represented by a vector. A word in the embedding space can have different meanings depending on its position in the sentence; this is where positional embedding comes into action. Positional embeddings are vectors that add position information to the vectors in the input embedding space, so that the context of each word can be identified from its position. Sine and cosine functions are applied for this purpose, as described in section 3 below. At this point the word vector is created (input embedding + positional embedding). The word vector then passes through the encoder block with its multi-head attention and feed-forward layers. The idea of attention is described in section 2 below; it can be self-attention or multi-head attention. In self-attention, the relevance of each word to the other words of the sentence is checked. The attention layer adds attention vectors to the word vectors and passes them to the feed-forward block, which transforms them into a form usable by the next encoder or decoder layer.
Decoder block (section 1): similarly, this block accepts the output-language (Hadiyyisa) sentence at once and creates its input embeddings. A positional embedding is added and the result is forwarded to the masked multi-head attention block of the decoder, where self-attention checks how each word relates to every other word in the same sentence of the output language.
The attention vectors from the encoder block and the decoder block are then forwarded to another multi-head attention layer, the encoder-decoder attention, where the English and Hadiyyisa sentences meet and source-to-target word mapping happens. At this stage Q (the query, from the encoding of the target sentence) and K and V (the key and value, from the encoding of the source sentence) are forwarded to the multi-head attention block. The mapped vectors (every English word paired with Hadiyyisa words) are then forwarded to a feed-forward block that prepares the data for the next decoder block and for the linear layer. The linear layer is another feed-forward layer in which the vectors produced by the feed-forward block are expanded into the number of words in the target language, Hadiyyisa. Finally, the softmax layer changes the linear layer's output into a human-interpretable format: a probability distribution over the words.
Each encoder and decoder applies normalization, both batch normalization (over training samples) and layer normalization (over feature dimensions). These can be tuned before training starts, and the accuracy of the translation also depends on how we tune the training parameters.
The following sections are taken from Vaswani et al., "Attention is all you need" [34].
1. Encoder and decoder:
Encoder: The encoder is composed of a stack of N = 6 identical layers, each with two sub-layers: the first is a multi-head self-attention mechanism and the second a simple, position-wise fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization.
Decoder: The decoder is similar and is also composed of a stack of N = 6 identical layers. In addition to the encoder's two sub-layers, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
2. Attention:
Attention is a function that maps a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
So attention gives special focus only to the relevant information, rather than taking in all the information at once. Self-attention is represented by the formula Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q is the matrix that contains the queries, K represents all keys, and V the values; each symbol represents the vector representations of all words in the sequence [35].
3. Positional Encoding:
The positional encodings have the same dimension d_model as the embeddings, so the two can be summed. Sine and cosine functions of different frequencies are used for this purpose:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
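The two formulas translate directly into code. The sketch below is illustrative only, using a small d_model of 4 for readability (the actual Transformer default is 512).

```python
import math

def positional_encoding(max_len, d_model):
    """Build the sinusoidal position table:
    PE(pos, 2i)   = sin(pos / 10000**(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i/d_model))"""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=4)
# Position 0 encodes as sin(0)=0 in even slots and cos(0)=1 in odd slots.
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Each row of the table is added element-wise to the corresponding word's input embedding, which is why the two must share the dimension d_model.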
Attention (multi-head and scaled dot-product attention), softmax and embeddings, positional embedding, the position-wise feed-forward network, the optimizer, and regularization (dropout and label smoothing) are the crucial components of the Transformer model with attention.
In this section the detailed Transformer architecture, and how it works with the source and target language in each encoder and decoder block, has been addressed. In addition, the overall NMT architecture for English-Hadiyyisa built on the Transformer model is summarized in figure 4.4 below.
Figure 4.4 English Hadiyyisa NMT Architecture
During training, the chosen model architecture is applied. At the end of training there are a number of saved models with which translation can take place; training is a time-consuming activity, like the tuning phase in statistical machine translation. The model created here, together with the source test data set, predicts the corresponding target translations, and the predicted target translation is tested against the target test data set (the reference) to check the performance of the models (the system).
Once the preprocessed data were ready, they were categorized and named appropriately. All documents in the English and Hadiyyisa domains first had special characters removed using Linux commands and Moses scripts, and were then truecased; no empty lines were left. The data preprocessing activities were tokenization (inserting white space between words and punctuation), truecasing (lowercasing the data, since a single word written in both lower and upper case would otherwise be treated as two different words) and cleaning (removing long sentences and empty lines). Figures 4.5 and 4.6 show the preprocessed data.
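The three preprocessing steps can be sketched as follows. This is a simplified stand-in for the Moses tokenizer, truecaser and cleaning scripts, not their actual implementation; the maximum length of 80 tokens is an assumption for illustration.

```python
import re

def tokenize(line):
    """Insert white space between words and punctuation marks."""
    return re.sub(r"([.,!?;:])", r" \1 ", line).split()

def truecase(tokens):
    """Lowercase everything so 'Waare' and 'waare' count as one word.
    (Moses truecasing is statistics-based; this is a simplification.)"""
    return [t.lower() for t in tokens]

def clean(pairs, max_len=80):
    """Drop empty lines and overly long sentence pairs."""
    return [(s, t) for s, t in pairs
            if 0 < len(s) <= max_len and 0 < len(t) <= max_len]

src = truecase(tokenize("Good morning!"))
tgt = truecase(tokenize("Xumma gattaa."))
corpus = clean([(src, tgt)])
print(corpus)  # [(['good', 'morning', '!'], ['xumma', 'gattaa', '.'])]
```

A real pipeline would also protect the Hadiyyisa glottal apostrophe from being split off as punctuation, as described in the next paragraph.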
After preprocessing, the corpus is entirely lowercase, tokenized, and stripped of punctuation and special characters. Apostrophes were removed from the English sentences, but the apostrophe used in Hadiyyisa to mark glottal sounds was not removed.
The distribution of sample words in the corpus is shown in table 4.1 below, listing English and Hadiyyisa words with their counts in the corpus they belong to.
English word   Count    Hadiyyisa word   Count
field          273      keenim           223
turned         272      hiraagaanchi     223
same           272
The data after preprocessing are shown in the table below, for the religious domain, the educational domain, and all domains together. The religious and educational data were split 80/20: 80% of the data was used for training and 20% for validation and testing [12].
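A minimal sketch of the 80/20 split described above. Dividing the held-out 20% equally between validation and test is an assumption for illustration; the sketch also assumes the parallel sentence pairs are already shuffled.

```python
def split_corpus(pairs, train_frac=0.8):
    """Split parallel sentence pairs into train (80%), validation and test.
    The even 10/10 division of the held-out 20% is an illustrative choice."""
    n_train = int(len(pairs) * train_frac)
    held_out = pairs[n_train:]
    n_valid = len(held_out) // 2
    return pairs[:n_train], held_out[:n_valid], held_out[n_valid:]

# Hypothetical placeholder pairs standing in for the real corpus lines.
pairs = [(f"en {i}", f"hd {i}") for i in range(100)]
train, valid, test = split_corpus(pairs)
print(len(train), len(valid), len(test))  # 80 10 10
```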
4.5 SMT Experimental setup
4.5.1 Language Modeling For SMT
Language modeling is used to ensure fluent output and is modeled on the target language [6]. Since this thesis builds a bidirectional machine translation system, both languages are taken as targets interchangeably, so a language model was built for each. The model helps the translation system select words or phrases appropriate to the local context and combine them in a sequence with better word order [20]. For this study, 3-gram language models were built for both languages, removing singletons, applying smoothing, and adding sentence boundary symbols. The likelihood of a word obtained from the n-gram model can be a unigram, bigram, trigram or higher-order n-gram probability. Consider the following Hadiyyisa sentences:
Here 3 is the number of times the word waa'i occurs and 18 is the total number of words in the (sample) corpus.
For the bigram, 2 is the number of times the words waa'i and iina occur together in the corpus and 3 is the number of times the word waa'i occurs in the corpus.
And the trigram probability becomes:
where 1 is the number of times the words waa'i, iina and woda occur together and 2 is the number of times the words waa'i and iina occur together.
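The unigram, bigram and trigram estimates above are maximum-likelihood count ratios. A minimal sketch, using a small toy corpus in place of the 18-word sample and omitting the smoothing and singleton removal that SRILM applies:

```python
from collections import Counter

def ngram_probability(tokens, history, word):
    """MLE estimate P(word | history) = count(history + word) / count(history).
    An empty history gives the unigram probability count(word) / total."""
    n = len(history)
    if n == 0:
        return tokens.count(word) / len(tokens)
    ngrams = Counter(tuple(tokens[i:i + n + 1]) for i in range(len(tokens) - n))
    hists = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return ngrams[tuple(history) + (word,)] / hists[tuple(history)]

# Toy corpus standing in for the Hadiyyisa sample sentences (8 words).
corpus = "waa'i iina woda waa'i iina keena waa'i anna".split()
p_uni = ngram_probability(corpus, [], "waa'i")                 # 3 / 8
p_bi  = ngram_probability(corpus, ["waa'i"], "iina")           # 2 / 3
p_tri = ngram_probability(corpus, ["waa'i", "iina"], "woda")   # 1 / 2
```

The bigram and trigram ratios here (2/3 and 1/2) match the counts worked through in the text; the unigram denominator differs only because this toy corpus has 8 words rather than 18.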
The model is trained on monolingual data, and there are free tools for this purpose; in this research work SRILM is used. It does not need much memory and supports three output formats: ARPA, qARPA and iARPA [20]. To speed up loading of the model, the file is binarised (.blm). The model created here also checks whether given inputs are covered by the model, calculates perplexity, reports out-of-vocabulary (OOV) words and displays the number of tokens.
Because of data shortage, separate monolingual Hadiyyisa data could not be used; instead, data from the bilingual parallel corpus was used to build the language model.
This is the main part of translation modeling. At this step, word alignment (using GIZA++), phrase extraction and scoring, the creation of lexicalized reordering tables, and the Moses configuration file (moses.ini) with weights for translation accuracy optimization are carried out. Model training is also logged in the train.out file.
Tuning is the slowest process. In this experimentation, tuning took from one to eight hours, depending on the data domain. The religious domain constitutes more than 83% of the total corpus, so it was a very slow step, as was the all-domain data set; only the educational domain was faster.
To tune the models, a small parallel validation set of source and target data is needed; it must be tokenized, cleaned and truecased. The moses.ini file together with the validation set is given to mert-moses.pl, and during this process further versions of moses.ini are created. Finally, a filtered moses.ini with the phrase table, input information and reordering table is created. So, to get a translation model with better weights, the system must be tuned.
During translation, the validation data is translated into the appropriate target language; see the screenshot of the predictions.
Figure 4.7. During tuning source validation tests being translated to target
The input sentence is English and the target sentence is Hadiyyisa. The search took 0.178 s, the decision to display the translation 0.00 s, additional reporting 0.008 s, and translation 0.0187 s.
During tuning, when predictions are made, one sentence can have many best-fit target predictions. Unknown vocabulary items are handled as UNK. The step after tuning is testing; we will see it in chapter 5.
Now the model is ready to translate the inputs we give into the target language, here from English to Hadiyyisa. To run the model and see the translation, we run the command in a terminal; the terminal then displays an interface to enter the input and shows the translation, as in the figure below. Example: for the input sentence "good morning", the expected output is "Xumma gattaa" or "Xumma waagalata".
For the input Hadiyyisa sentence "wa'im lama'l balla usheexxukko", the expected output is "God rested on the seventh day" or something similar.
The output "the seventh day he rested" is displayed. Here translation took less time than English to Hadiyyisa. It is shown in appendix I.
For SMT, Linux (Ubuntu 16.04.4) on a laptop with 4 GB of RAM and a Core i5 processor was used. The statistical machine translation tools GIZA++ (word alignment) and SRILM (model estimation) were installed on Ubuntu, with Moses used for decoding. LibreOffice Calc was used for recording data in Linux, a text editor to arrange, prepare and view data, MS Excel 2013 for parallel data preparation, and LibreOffice Writer to record the experimental results.
Data collection, cleaning, tokenization and truecasing were covered in sections 4.2 and 4.3. The preprocessed data are trained and tested with the neural translation approach. To do this, files with appropriate names and the .txt extension were prepared. Since neural machine translation needs parallel aligned data, with each sentence pair on one line, the data were aligned and checked manually. In NMT the data sets are classified as source train, source validation, source test, target train, target validation and target test, and the same 80%/20% corpus distribution was applied.
4.6.1 Preprocessing
From the given data, this step also creates some features, such as the source and target vocabularies. The vocabularies created for all data were 50,004 entries for Hadiyyisa and 15,897 for English. Finally, the shards, the divided smaller parts of the data, are also created (only one shard in this study).
To carry out this and the following steps, the data were uploaded to Google Drive. A Google Colab notebook (Jupyter) must also be ready, and OpenNMT-py with its PyTorch backend must be installed in it.
4.6.2 Training
This NMT step is where we select the model architecture that fits; for this experiment the Transformer model with attention was selected. Code for the Transformer model is available online5. For the training step some parameters should be customized, such as the learning rate, the batch size (the number of samples from the training set taken per training step), the number of GPUs, the validation steps, etc. In this experiment the data were trained with batch sizes of 2048 (baseline) and 4096, the default batch sizes for the Transformer model [36], a single GPU (-gpu_rank 0), and learning rate 2 (the default for the Transformer model; its value fluctuates during training through the optimizer settings "-optim adam -adam_beta2 0.998 -decay_method noam"). Training ran for 100,000 steps, with validation every 1,000 steps and a checkpoint saved every 3,000 steps.
With those parameters and the others associated with them, the scripts run, taking the inputs created during preprocessing (the PyTorch train, vocabulary and validation files). During training the system calculates validation accuracy and validation perplexity, which show how accurate the models being created are: the higher the accuracy, the better the model being created. Perplexity indicates how well the model predicts the sample; the smaller the perplexity, the better the model performs. The system saves a checkpoint every 3,000 steps; nearly 10 different checkpoints were saved and tested in this study. During training, a report is displayed every 50 steps with information on how well the models are performing.
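The reported validation perplexity relates to the model's per-token probabilities as perplexity = exp(average negative log-probability). A minimal sketch with invented probabilities, just to show why a lower value means a better model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp( -(1/N) * sum(log p_i) ) over the tokens' probabilities.
    The better the model predicts each token, the lower the perplexity."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = [0.9, 0.8, 0.95, 0.9]   # model assigns high probability
uncertain = [0.1, 0.2, 0.05, 0.1]   # model is essentially guessing
print(perplexity(confident))  # close to 1: good model
print(perplexity(uncertain))  # large: poor model
```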
The important foundations on which a Transformer model is built (input and output embeddings, MultiHeadedAttention, linear keys, softmax with its dimensions and dropouts, the PositionwiseFeedForward network, LayerNorm, etc.) were observed during training. These are dealt with in figure 4.3 above and in the paper "Attention is all you need" [29]. See the training step screenshots in appendix II.
5 https://opennmt.net/OpenNMTpy/.
Training is a very slow process, like the tuning step in SMT. Training the small educational data set took a minimum of three hours; the larger bible domain and the all-in-one data set took 9 hours. Training from English to Hadiyyisa was very slow, while Hadiyyisa to English was the opposite. Models were created every 3,000 steps and were so large that Google Drive could not store more than 12 of them, since Google Drive provides only 15 GB; the size of the models depends on the batch size we tune and the amount of data being trained. Google Colab also grants access for only 12 hours, sometimes less: when the parallel tests were being checked against the models, Google Colab limited the access and disconnected the graphics processing units (GPU) and the 76 GB of runtime storage. According to observations made during this time, training the same data with the same parameters at different times results in different BLEU scores, so training was repeated again and again and the best result was taken.
4.6.3 Translation
After the training step is over at some checkpoint, the succeeding step is translation, which predicts the best possible translations by taking a model (for instance model_step_16000) together with the source test data (src-test.txt). The best predicted translation is then used to evaluate the performance of that model.
This translation attempt comes somewhat close to the Hadiyyisa translation but lacks exactness. This is because of the data size: the smaller the data, the poorer the result.
NMT basically operates with an input vocabulary of 30,000 to 50,000 words [29], but the input vocabulary used here for English was only 15,897.
4.6.4 Testing
After the translation of each checkpoint is over, we test the model's performance by comparing the target test data (tgt-test.txt) with pred.txt, the output of the translation step. The test results of each checkpoint on all data sets are presented in chapter 5. One of the specific objectives of this study was to show a prototype of the proposed model, so the prototype with some sentences is shown below, e.g. the Hadiyyisa sentence "qooccanchchi aganna" to English: "story of creation".
Example 1: Input
Output:
Example: input
Output:
The observations show that translation from Hadiyyisa to English is faster and performs better than English to Hadiyyisa. The prediction perplexity of English to Hadiyyisa is greater than that of Hadiyyisa to English; the prediction perplexity (a measure of how well our model is predicting) should be small. The average prediction score is also very small for English to Hadiyyisa.
For neural machine translation, OpenNMT-py, an open-source neural machine translation platform with a PyTorch backend, was used. Google Colab, a free online application, provides a Jupyter notebook interface in which OpenNMT-py and its instruments were installed; it also provides a free GPU, runtime storage and Python 3.6.
CHAPTER FIVE
5.1 Introduction
In this chapter the performance evaluation results and the reasons behind them are discussed. Section 5.2 discusses the system performance evaluation and section 5.3 explains the corpus used for training, testing and evaluation. The evaluation results of both approaches are discussed in sections 5.4 and 5.5, and the reasons behind the performance of the system in section 5.7.
System performance was evaluated with the bilingual evaluation understudy (BLEU) score, which is inexpensive and quick [27]. The BLEU score was used in this study for both the statistical and the neural machine translation systems in both directions. The experiments with each approach on each data set were tested and recorded separately.
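BLEU works by matching n-grams between the candidate translation and the reference. The following simplified single-reference, sentence-level sketch illustrates the idea; the multi-bleu.perl script used in this study computes a corpus-level score and differs in detail.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty; returns 0.0 if any precision is zero."""
    cand, ref = candidate.split(), reference.split()
    max_n = min(max_n, len(cand), len(ref))
    precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(cand, n) & ngrams(ref, n)).values())  # clipped
        if overlap == 0:
            return 0.0
        precisions.append(overlap / sum(ngrams(cand, n).values()))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# An exact match scores 1.0; sharing only "the seventh day" yields a low,
# non-zero score once trigrams (rather than 4-grams) are the longest unit.
assert bleu("xumma gattaa", "xumma gattaa") == 1.0
score = bleu("the seventh day he rested", "god rested on the seventh day", max_n=3)
```

This zero-on-no-match behaviour is why sentence-level BLEU on sparse data collapses to 0.00, the effect noted later in this chapter.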
5.3 Corpus
The corpus collected contains two categories. As already discussed in chapter four, the first, from the religious domain, comprised 33,880 bilingual parallel sentences with vocabulary sizes of 50,002 for Hadiyyisa and 12,803 for English. The second category, the educational domain, comprised 6,610 bilingual parallel sentences aligned one per line, with vocabulary sizes of 13,442 for Hadiyyisa and 5,057 for English. The experiments were conducted in three phases: first each domain was evaluated separately, and finally all data were merged into one and the experiment repeated; this all-in-one category has 50,004 Hadiyyisa and 15,897 English vocabulary entries. In each case 80% of the data was used for training and 20% for validation and testing [12], for both approaches, i.e. SMT and NMT.
Table 5.1. Corpus distribution
Data in the religious domain constitutes above 83% of the whole data; it was divided into training and test sets as mentioned in section 5.3 above. The results for English to Hadiyyisa and Hadiyyisa to English, based on this data distribution, are shown in table 5.2 below. The BLEU report for the religious domain shows 3.58% for Hadiyyisa to English (in multi-bleu.perl) and 2.16% for English to Hadiyyisa on the same data. The educational domain evaluation gives 0.68% for Hadiyyisa to English and 0.33% for English to Hadiyyisa, and the all-in-one evaluation 3.29% for Hadiyyisa to English and 2.07% for English to Hadiyyisa. See appendix I.
The evaluation result also depends on the amount of corpus, though the all-in-one and religious domain results could not confirm this; it is not possible to conclude this confidently from such a small amount of data. The religious data constitutes 83% of the total data, which is why there is not much difference. There was also no separate monolingual data for language modeling; instead, only the 80% training portion of each bilingual data set was used. Since BLEU counts n-grams (unigrams, bigrams and trigrams) when comparing the translated output with the test set, if there is no matching n-gram between the test set and the translated output, the BLEU score will be 0.00.
As discussed in the previous chapter, conducting neural machine translation requires tuning
several hyperparameters, such as the batch size, the learning rate and optimizer, and the RNN
size. The Transformer approach comes with well-chosen default parameters, but to look for
better results the Transformer model was run with its default parameters while varying the
batch size between the recommended defaults of 2048 and 4096 [36]. The data were
categorized as explained in the SMT evaluation section. The test results of the trained
models are shown in the tables below.
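For illustration, the batch-size experiments above could be driven by an OpenNMT-py YAML configuration along these lines (a hypothetical sketch: all file paths and values are placeholders, not the exact settings used in this study):

```yaml
# Hypothetical OpenNMT-py configuration sketch for a Transformer run
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
  corpus_1:
    path_src: data/train.en
    path_tgt: data/train.had
  valid:
    path_src: data/valid.en
    path_tgt: data/valid.had
encoder_type: transformer
decoder_type: transformer
batch_type: tokens
batch_size: 4096        # the study compared 2048 vs. 4096 token batches
optim: adam
learning_rate: 2.0
decay_method: noam
warmup_steps: 8000
```

Changing `batch_size` between 2048 and 4096 reproduces the two experimental conditions reported in Tables 5.3 and 5.4.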
Table 5.3. Neural machine translation English to Hadiyyisa, batch size 4096 (all domain)
The models created achieve an accuracy of roughly 99% and above, which shows the
training epochs are functioning well. Their prediction perplexity (a measure of how well the
model predicts a sample) is very small, and the validation accuracy measures performance
on data unseen during training.
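For intuition, perplexity can be computed from the probabilities a model assigns to each reference token (a generic illustration of the metric, not OpenNMT's exact reporting code):

```python
import math

def perplexity(token_probs):
    """Perplexity over a sequence: exp of the average negative log-probability
    the model assigns to each reference token. Lower is better; a perfect
    model (all probabilities 1.0) scores exactly 1.0."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([1.0, 1.0, 1.0]))      # 1.0 -> perfect prediction
print(perplexity([0.5, 0.25, 0.125]))   # 4.0 -> the model is more "surprised"
```

A perplexity near 1.0, as reported in the tables, therefore means the model assigns high probability to the held-out tokens.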
Table 5.4. Neural machine translation English to Hadiyyisa, batch size 2048 (all data)
The BLEU score depends on factors such as hyperparameter optimization (batch size and
learning rate, for instance), data size, model size, and training time [36]. The Transformer
model tuned with these parameters was applied to this translation task, and the BLEU scores
for both batch sizes shown in these tables are broadly consistent with this. With the data size
used in this study, however, it is not easy to reach a firm conclusion, since neural machine
translation performs well at larger data sizes [36] [29]. Experimenting on this limited corpus
while varying the batch size, the maximum BLEU score recorded is 1.47, as computed by
multi-bleu.perl.
Table 5.5. Neural machine translation Hadiyyisa-English by batch size 4096 (all data)
Table 5.6. Neural machine translation Hadiyyisa-English by batch size 2048 (all data)
The same data distribution used for English to Hadiyyisa was used here as well. The models
seem to perform better; in particular, the validation accuracy, calculated at the end of every
1000 steps, looks fine. The BLEU scores recorded for the two batch sizes show a slight
variance. We can conclude that Hadiyyisa to English performs better than English to
Hadiyyisa, because Hadiyyisa is morphologically rich, as was observed when evaluating the
same data with the statistical approach in the previous section. As a result, a maximum
BLEU score of 3.81 was recorded.
In summary, the evaluation results for each data domain and approach (BLEU, SMT / NMT;
see Fig 5.3) are:

Domain        English-Hadiyyisa    Hadiyyisa-English
All data      2.07% / 1.47%        3.29% / 3.81%
Religious     2.16% / 1.47%        3.58% / 3.59%
Educational   0.33% / 0.31%        0.68% / 0.83%
5.7 Discussion
This section discusses the reasons behind the system's performance; most of the points were
already raised in the results section above. The evaluation results vary across the data
domains, and they vary not only with the domain but also with the approach and the
direction of translation. The results of both approaches on the smaller data set (educational)
versus the larger one (religious) show that the performance of the system depends on data
size: the larger the data, the better the evaluation result. However, in the SMT tests the
all-data results are slightly lower than the bible-domain results, while the NMT tests show
no such difference. This is because combining domains confuses the system, and the
religious data makes up the larger portion of all training data. Similarly, Hadiyyisa to
English performs better than English to Hadiyyisa because of the morphological richness of
the Hadiyyisa language. Finally, because of the small data size, NMT could not outperform
SMT, and the lack of a larger corpus strongly affected the BLEU scores in this study.
CHAPTER SIX
6.1 Conclusion
The purpose of this study was to develop a bidirectional English-Hadiyyisa machine
translation system. To do so, a small amount of bilingual data was collected from different
sources and categorized into two main domains, bible (religious) and educational. Using two
machine translation approaches, statistical and neural, bidirectional experimentation was
conducted on the preprocessed corpora.
At the beginning of this study, the Hadiyyisa language was discussed: its background,
phonology, orthography, nouns, pronouns, adjectives, conjunctions, numerals, and the word
order of Hadiyyisa and English sentences.
Additionally, the literature was reviewed, including related local and international studies.
Language models for both languages were created for both translation approaches; a
language model can be created during corpus training. The IRSTLM tool was used to create
the language model for SMT, whereas the well-tuned Transformer model was used in NMT.
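As a rough illustration of the kind of n-gram model a tool like IRSTLM estimates from the training split, here is a minimal bigram language model with add-one smoothing (a toy sketch, not the IRSTLM implementation):

```python
from collections import Counter

class BigramLM:
    """Minimal add-one-smoothed bigram language model: a toy stand-in for
    the n-gram models IRSTLM builds from the training data."""
    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        self.vocab = set()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]  # sentence boundary markers
            self.vocab.update(toks)
            self.unigrams.update(toks[:-1])        # history counts
            self.bigrams.update(zip(toks, toks[1:]))

    def prob(self, prev, word):
        # add-one (Laplace) smoothing gives unseen bigrams nonzero mass
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))

lm = BigramLM(["the cat sat", "the dog sat"])
print(lm.prob("the", "cat") > lm.prob("the", "sat"))  # True: "the cat" was seen
```

Real toolkits use higher-order n-grams and stronger smoothing (e.g. Kneser-Ney), but the principle of scoring fluency from counted word sequences is the same.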
The models created were eventually evaluated with BLEU. The recorded results per domain
were as follows. For the statistical approach, the English to Hadiyyisa scores were 2.07%,
2.16%, and 0.33% on the all-data, bible, and educational domains respectively, and the
Hadiyyisa to English scores were 3.29%, 3.58%, and 0.68% on the same domains. For the
neural approach, the English to Hadiyyisa scores were 1.47%, 1.47%, and 0.31%, and the
Hadiyyisa to English scores were 3.81%, 3.59%, and 0.83%, again for the all-data, bible,
and educational domains respectively. These results show that Hadiyyisa to English
performs somewhat better than English to Hadiyyisa in every domain and under both
approaches. They also show that the BLEU score depends on corpus size, although the
all-data and religious-domain scores could not confirm this, being nearly identical; this is
because more than 83% of the data came from the bible domain, and combining two
different data sources confuses the system. With the small data set of this study it is not easy
to conclude that NMT performs better than SMT: NMT is certainly a strong approach, but it
needs a large amount of data.
The system prototypes were run to observe whether the trained models translate user inputs.
The system translates sentences exactly when the input is similar to the training data;
otherwise it produces a translation that approximates the intended meaning, and when no
matching words exist at all it produces output that does not match. This answers the research
question "To what extent will Hadiyyisa-English machine translation work?". As for the
research question "How will an English-Hadiyyisa machine translation system contribute a
vital role for the under-resourced language Hadiyyisa?", this study has taken the first step;
adding data from various additional sources and conducting further studies will answer it
fully.
6.2 Recommendation
The main reason the evaluation results fell below expectations was the data size. With a very
small amount of data, an MT system of good quality cannot be expected, so a larger, clean,
good-quality corpus should be prepared. Based on the observations of this study, some
additional recommendations are also raised:
- During experimentation, a single Hadiyyisa word was sometimes counted as two or
more vocabulary entries because of spelling errors. It would therefore be better to
develop a spell checker in future research.
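A minimal building block for such a spell checker could be edit distance against the corpus vocabulary (an illustrative sketch; the vocabulary and function names shown are hypothetical):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b (rolling-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete from a
                                     dp[j - 1] + 1,    # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def suggest(word, vocabulary, max_dist=2):
    """Return in-vocabulary words within max_dist edits of `word`."""
    return sorted(v for v in vocabulary if edit_distance(word, v) <= max_dist)

vocab = {"hadiyyisa", "english", "translation"}
print(suggest("hadiyisa", vocab))  # ['hadiyyisa'] -- one letter missing
```

Normalizing misspelled tokens to a single vocabulary entry this way would reduce the artificial vocabulary inflation observed during experimentation.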
- The duration of training affects the evaluation result. Training the data on a local
machine with OpenNMT-py installed, running for more than 12 hours, is therefore
recommended.
References
[1] K. Ela, Natural Language Processing, New Delhi: I.K. International Publishing House,
2011.
[3] M. Michael and M. Million, "Experimenting Statistical Machine Translation for Ethiopic
Semitic Languages: The Case of Amharic-Tigrigna," Bahir Dar and Addis Ababa,
Ethiopia, 2017.
[4] M. Melese, T. Solomon, Y. Martha and M. Million, "Parallel Corpora for bi-lingual
English-Ethiopian Languages Statistical Machine Translation," in Proceedings of the
First Workshop on Linguistic Resources for Natural Language Processing, Santa Fe;
New Mexico; USA, 2018.
[7] D. Jabesa, "Bidirectional English – Afaan Oromo Machine Translation Using Hybrid
Approach," 2013.
[8] H. W.John, Concise history of the language sciences: from the Sumerians to the
cognitivists., Pergamon Press, 1995, pp. 431-445.
[9] A. Ballabh and C. Umesh, "Study of Machine Translation Methods and their
Challenges," International Journal of Advanced Research in Science and Engineering,
2015.
[10] F. Sisay and A. Saba, "Machine Translation for Amharic: Where We Are," pp. 47-49,
2006.
[11] G. Björn and A. Lars, "Experiences with Developing Language Processing Tools and
Corpora for Amharic," 2010.
[13] S. Biniyam and J. Janne, Multilingual Ethiopia: Linguistic Challenges and Capacity
Building Efforts, Oslo, 2016.
[18] N. Satoshi, "Overcoming the Language Barrier with Speech Translation Technology,"
QUARTERLY REVIEW, pp. 35-47, 2009.
[19] F. Sisay and A. Saba, "Machine Translation for Amharic: Where We Are," pp. 47-49,
2006.
[22] C. Mohamed Amine, "Theoretical Overview of Machine translation," Proceedings
ICWIT 2012, 2012.
[24] L. Schwartz, "The History and Promise of Machine Translation," Creative Commons
Attribution 4.0 License, pp. 13-15, 2016.
[25] B. Pranjali, H. Hemant and K. Shivani, "Survey on Neural Machine Translation for
multilingual," in Proceedings of the Third International Conference on Computing
Methodologies and Communication (ICCMC 2019), 2019.
[26] D. Bahdanau, K. Cho and Y. Bengio, "Neural Machine Translation by Jointly Learning
to Align and Translate," pp. 1-11, 2014.
[27] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, "BLEU: a Method for Automatic
Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL), Philadelphia, 2002.
[33] K. Knight, A Statistical MT Tutorial Workbook, prepared in connection with the JHU
summer workshop, 1998.
[34] A. Vaswani, N. Shazeer, N. Parmar and J. Uszkoreit, "Attention Is All You Need," in
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach,
CA, USA, 2017.
[36] M. Popel and O. Bojar, "Training Tips for the Transformer Model,"
arXiv:1804.00247v2, vol. 2, pp. 1-24, 2 May 2018.
2. Test result
Figure VI. Educational Hadiyyisa to English
Figure VII. All in one Hadiyyisa to English
Figure VIII. All in one English to Hadiyyisa
Figure IX. Neural preprocessing
2) Training. The second step, the training phase, comes after the features are created in the
preprocessing phase. The training model information is displayed here: the embeddings,
RNN level, encoders and decoders, optimization functions, positional embeddings, etc.
Figure XI. English to Hadiyyisa
4) Prototype, by sending input sentences to the created model.
5) Evaluation
Figure XVII. Bible domain English-Hadiyyisa
Figure XVIII. Neural Evaluation all data English-Hadiyyisa
LETTER (LARGE/SMALL)   NAME/SOUND
Aa [a]
Bb [ba]
Cc [t∫’a]
CH ch [t∫a]
Dd [da]
Ee [e]
Ff [fa]
Gg [ga]
Hh [ha]
Ii [i]
Jj [dʒa]
Kk [ka]
Ll [la]
Mm [ma]
Nn [na]
NY ny [ɲa]
Oo [o]
Pp [pa]
PH ph [p’a]
Qq [k’a]
Rr [ra]
Ss [sa]
SH sh [∫a]
Tt [ta]
TS ts [s’a]
Uu [u]
Vv [va]
Ww [wa]
Xx [t’a]
Yy [ya]
Zz [za]
ZH zh [ʒa]
‘ (no allograph) [ʔa]
Appendix IV: sample corpus
10 God called the dry ground "land" and the gathered waters he called "seas." And God
saw that it was good.
10 Waa'im wo'i bee'i beyyo «Uulla» yaa weeshukko; odim wixxaa yoo wo'om
«Dambalaqa» yaa weeshukko; oo issu luwwi hundim eran ihukkisa moo'ukko.