
DIRE DAWA UNIVERSITY

INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTING

DEPARTMENT OF COMPUTER SCIENCE

Design and Develop Word Sequence Prediction for Afan Oromo Using Deep
Learning Technique

By:
Muaz Hassen

Advisor: Gaddisa Olani (PhD)

A Thesis Submitted to Dire Dawa Institute of Technology, Dire Dawa


University in Partial fulfillment of the requirements for the Degree of
Master of Science in Computer Science

Dire Dawa, Ethiopia

June, 2022
INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTING

DEPARTMENT OF COMPUTER SCIENCE

06/06/2022
Muaz Hassen Abdella _______________ _____________
Student Full Name Signature Date

________________________ _______________ ______________


Thesis Advisor Signature Date

Examining Board Approval

1. ___________________ _______________ ____________

External Examiner Signature Date

2. ___________________ _______________ ____________

Internal Examiner Signature Date

3. ___________________ _______________ ____________


Chair Person Signature Date

_____________________ _______________ _____________


Chair head Signature Date

______________________ _______________ _____________


School Postgraduate Coordinator Signature Date

______________________ _______________ _____________


School Dean Signature Date

_____________________ _______________ ____________


PG Director Signature Date
Declaration

I hereby declare that the work which is being presented in this thesis entitled “Design and
Develop Word Sequence Prediction for Afan Oromo Using Deep Learning Technique” is
original work of my own, has not been presented for a degree of any other university and all the
resources used for the thesis have been duly acknowledged. I understand that non-adherence to
the principles of academic honesty and integrity, misrepresentation/fabrication of any idea, data,
fact and source will constitute sufficient ground for disciplinary action by the university and can
also evoke penal action from the sources which have been properly cited or acknowledged.

Muaz Hassen Abdella ------------------------------- 06/06/2022

(Candidate) Signature Date


Abstract

Word prediction is one of the most widely used techniques to enhance communication rate in
augmentative and alternative communication. Next word prediction involves guessing the word
that follows a given context. A number of word sequence prediction models exist for different
languages to assist users in their text entry. Given a sequence of words drawn from a corpus, the
task is to guess the next word with the highest probability of occurrence. It is therefore a predictive
modeling problem for languages, also known as language modeling. Word sequence prediction helps
physically disabled individuals who have typing difficulties, increases typing speed by decreasing
keystrokes, assists in spelling and error detection, and also supports speech recognition and
handwriting recognition.

Although Afaan Oromo is one of the major languages widely spoken and written in Ethiopia, no
significant research has been conducted in the area of word sequence prediction. In this study, a word
sequence prediction model is designed and developed. To achieve the objectives, corpus
data was collected from different sources and divided into training and testing sets, with
80% of the dataset used for training and 20% for testing. The Afaan Oromo word sequence
prediction model was designed and developed using a deep learning technique, namely the RNN
approach. Eight RNN models were implemented with various techniques: GRU, LSTM,
Bidirectional GRU, Bidirectional LSTM, GRU with attention, LSTM with attention, Bidirectional
LSTM with attention, and Bidirectional GRU with attention. Three systems were implemented:
the first uses a word-based statistical approach as a baseline, the second uses a recurrent neural
network approach as a competitive model, and the third uses recurrent neural networks with
attention for Afaan Oromo word sequence prediction.

The developed models are evaluated using the perplexity score as the performance metric.
According to the evaluation, the models achieved the following performance: LSTM 83.63%, GRU 84.87%,
BiLSTM 82.94%, BiGRU 88.68%, LSTM with attention 86.58%, GRU with attention 86.71%,
BiLSTM with attention 89%, and BiGRU with attention 90%. The BiGRU model therefore performs
quite well, and BiGRU with attention shows the best performance.

Keywords: Word Sequence Prediction, Afaan Oromo Word Sequence Prediction, Recurrent
Neural Network.

Acknowledgement

First and foremost, I would like to praise Allah the Almighty, the Most Gracious, and the Most
Merciful for His blessing, granting me the strength, courage, knowledge, patience and inspiration
during my study and in completing this thesis. May Allah's blessings be upon His Final
Prophet Muhammad (peace be upon him), his family and his companions. I would like to express
my gratitude and sincere thanks to my advisor, Dr. Gaddisa Olani, for his valuable guidance, advice
and encouragement, and for having the patience and time to supervise my thesis so that I could
complete it on time. His pieces of advice, corrections and encouragement contributed to the
success of this thesis work. I would also like to express my special gratitude to my dear wife,
Mawardi Abdela; without her continued support and guidance this paper would not have
been a reality. I want to sincerely thank her for her unwavering support, unconditional love and
for bearing with me through all this. I know I cannot thank her enough for all her support. I would also
like to express my special gratitude to Daral Fiker Ethiopia, my classmates, my staff members
and all my friends who supported me and gave me the motivation to complete this thesis. Finally, I would
also like to thank my mother and my family for their support and motivation throughout my life.

Table of Contents
Chapter One: Introduction ...........................................................................................................1
1.1. Introduction ......................................................................................................................1
1.2. Statement of the Problem ..................................................................................................3
1.3. Motivation ........................................................................................................................4
1.4. Objectives ........................................................................................................................4
1.4.1. General Objective ......................................................................................................4
1.4.2. Specific Objectives ....................................................................................................4
1.5. Scope and limitation .........................................................................................................5
1.6. Methods ...........................................................................................................................5
1.7. Application of Word Sequence Prediction ........................................................................6
1.8. Organization of the Rest of the Thesis ..............................................................................7
Chapter Two: Literature Review .................................................................................................8
Overview .................................................................................................................................8
2.1. Introduction of Machine Learning and Word Prediction ................................................8
2.2. Approaches of word sequence prediction .................................................................... 10
2.2.1. Statistical Modeling ................................................................................................. 10
2.2.2. Knowledge Based Modeling .................................................................................... 12
2.3. Deep Learning for Word Sequence Processing ............................................................ 14
2.3.1. Word Embeddings .................................................................................................. 14
2.3.2. Convolutional Neural Networks for Sequence Modeling ......................................... 15
2.3.3. Recurrent Neural Networks for Sequence Modeling ................................................ 15
2.3.4. Gated Recurrent Unit (GRU) Approach ................................................................... 17
2.3.5. Long Short-Term Memory Approach ...................................................................... 17
2.3.6. Bidirectional RNN ................................................................................................... 18
2.3.7. Recursive neural networks for sequence modeling ................................................... 19
2.4. Evaluation Techniques for Word Prediction ................................................................ 20
2.5. Related Work .............................................................................................................. 20
2.5.1. Related work on Foreign Language ......................................................................... 20
2.5.2. Word prediction for Local Language ....................................................................... 23
2.6. Summary .................................................................................................................... 26
Chapter Three: Nature of Afaan Oromo Language ..................................................................... 27
3.1. Introduction ................................................................................................................ 27
3.2. Grammatical Structure of Afaan Oromo ...................................................................... 27
3.3. Summary .................................................................................................................... 29
Chapter 4: Methodology............................................................................................................ 30
4.1. Introduction ................................................................................................................ 30
4.2. Model Designing......................................................................................................... 30
4.3. Components of the Proposed Model .............................................................................. 31
4.3.1. Corpus Collection .................................................................................................... 31
4.3.2. Data Preparation and Preprocessing ......................................................................... 32
4.3.3. Converting sentence to N-gram Tokens Sequence ................................................... 33
4.3.4. Tokenization............................................................................................................ 33
4.3.5. Pad Sequence .......................................................................................................... 33
4.4. Proposed Model Design and Architecture .................................................................... 34
4.4.1. LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit) ..................... 35

4.4.2. Model Layer Description ......................................................................................... 38
4.5. Tune Hyper parameters for proposed models............................................................... 39
4.6. The Evaluation ............................................................................................................ 40
Chapter Five: Experimentation .................................................................................................. 41
5.1. Introduction ................................................................................................................ 41
5.2. Experimental Environment and Parameter Settings ..................................................... 41
5.3. Experiment procedure ................................................................................................. 41
5.4. Description of Proposed Model ................................................................................... 43
5.5. Training the Model ..................................................................................................... 44
5.6. Proposed model Training Result.................................................................................. 45
5.7. Test Results of Proposed Model .................................................................................. 46
5.7.1. Model Evaluation Result ......................................................................................... 47
5.8. Prototype .................................................................................................................... 48
5.9. Error Analysis for unigram data points ........................................................................ 49
5.10. Discussion................................................................................................................... 50
Chapter 6: Conclusion and Future Work.................................................................................... 55
6. Conclusion ..................................................................................................................... 55
6.1. Contribution of the Thesis ........................................................................................... 56
6.2. Future work ................................................................................................................ 57
REFERENCES ......................................................................................................................... 58

List of tables

Table 3. 1 Simple and complex sentences in Afaan Oromo
Table 4. 1 Detail of corpus length
Table 4. 2 Parameters of the proposed model
Table 5. 1 Result of training model
Table 5. 2 Result of testing model

List of figures

figure 2. 1 Approaches of word sequence prediction
figure 2. 2 RNN basic architecture
figure 2. 3 Architecture of GRU vs LSTM
figure 2. 4 Bidirectional RNN (Feng et al. [2017])
figure 2. 5 Recursive neural network for syntactic parsing [63]
figure 4. 1 The architecture of the proposed Afaan Oromo word sequence prediction
figure 4. 2 Description of the length of the corpus
figure 4. 4 Proposed model architecture
figure 4. 5 Example word sequence prediction of the model ...................................................... 36
figure 5. 1 Proposed RNN models
figure 5. 2 Result of training model
figure 5. 3 Accuracy and loss of testing result sorted according to their performance
figure 5. 4 Predicting two word input and outputs one word
figure 5. 5 Take one input and predict one or more output
figure 5. 6 Error analysis of unigram ........................................................................................ 49
figure 5. 7 Error analysis of trigram .......................................................................................... 50
figure 5. 8 LSTM with attention
figure 5. 9 LSTM model loss
figure 5. 10 Training result LSTM model
figure 5. 11 BILSTM with attention
figure 5. 12 BI LSTM with attention
figure 5. 13 Training result BIGRU model
figure 5. 14 Training result of GRU model with attention
figure 5. 15 BI GRU with attention
figure 5. 16 GRU model

LIST OF ACRONYMS

ATT Attention
BIGRU Bidirectional Gated Recurrent Unit
BILSTM Bidirectional Long Short-Term Memory
BLEU Bi-Lingual Evaluation Understudy
CNN Convolutional Neural Network
CPU Central Processing Unit
FDRE Federal Democratic Republic of Ethiopia
GLU Gated Linear Unit
GRU Gated Recurrent Unit
GPU Graphics Processing Unit
LSTM Long Short-Term Memory
NLP Natural Language Processing
RAM Random Access Memory
RNN Recurrent Neural Network

Chapter One: Introduction

1.1. Introduction

Natural Language Processing (NLP) is an interdisciplinary research area at the border between linguistics and
artificial intelligence (AI) aiming at developing computer programs capable of human-like activities associated with
understanding or producing texts or speech in a natural language [1]. It is an area of research and application that explores how
computers can be used to understand and manipulate natural language text or speech to do useful things.

NLP researchers aim to gather knowledge on how human beings understand and use language so that appropriate
tools and techniques can be developed to make computer systems understand and manipulate natural languages
to perform the desired tasks [2].

Applications of NLP include a number of fields of study, such as machine translation, morphology, syntax, named entity recognition,
natural language text processing and summarization, multilingual and cross-language information retrieval (CLIR), speech
recognition, information retrieval and text clustering, and so on [2]. Data entry is a core aspect of human-computer
interaction. Images, documents, music, and video data are entered into computers in order to be
processed. There are a number of data entry techniques that include speech, chorded keyboards, handwriting
recognition, various gloved techniques [1], scanners, microphones, and digital cameras [2]. Keyboards and pointing
devices are the most commonly used devices during human-computer interaction [3]. Because of its ease of
implementation, higher speed, and lower error rate, the keyboard dominates text entry systems [4]. However, one must
master the computer keyboard in order to realize its advantages.

Word prediction provides better data entry performance by improving the writing mainly for people with
disabilities [5, 6]. Word prediction helps disabled people with typing, increases typing speed by decreasing
keystrokes, helps in spelling and error detection, and also helps in speech recognition and handwriting
recognition. Auto-completion decreases misspelling of words. Word completion and word prediction also help
students to spell any word correctly and to type anything with fewer errors [7].

In general, word prediction is the process of guessing the next word in a sentence as the sentence is being entered,
and updating this prediction as the word is typed [8]. Currently, word prediction implies both “word completion
and word prediction” [8]. Word completion is defined as offering the user a list of words after a letter
has been typed, while word prediction is defined as offering the user a list of probable words after a word
has been typed or selected, based on previous words rather than on the basis of the letter. The word completion
problem is easier to solve since the knowledge of some letter(s) gives the predictor a chance to eliminate
many irrelevant words [8, 9]. The task of predicting the most likely word based on properties of its
surrounding context is the archetypical prediction problem in Natural Language Processing (NLP) [8]. In many NLP
tasks, it is necessary to determine the most likely word, part-of-speech (POS) tag or any other token, given its
history or context. Examples include part-of-speech tagging, word-sense disambiguation, speech recognition,
accent restoration, context-sensitive spelling correction, and identifying discourse markers [9]. Currently, word
prediction is used in many real-life applications such as augmentative communication devices [10].

Afaan Oromo is one of the major languages that is widely spoken and used in Ethiopia [11]. Currently, it is
an official language of Oromia regional state. It is used by the Oromo people, who are the largest ethnic group in
Ethiopia, amounting to 34.5% of the total population according to the 2008 census [11, 12]. In addition, the
language is also spoken in Kenya [11]. With regard to the writing system, Qubee (a Latin-based alphabet) has
been adopted and has been the official script of Afaan Oromo since 1991 [11, 12, 13]. Besides being an
official working language of Oromia regional state, Afaan Oromo is the instructional medium for primary and
junior secondary schools throughout the region and its administrative zones. Thus, the language has a well-
established and standardized writing and speaking system [12, 13].

Very few research attempts have been made so far to use computers for understanding and manipulating the
Afaan Oromo language. These attempts include a spell checker [13], a text-to-speech system [14], a sentence parser [15],
a morphological analyzer [16], and a part-of-speech tagger [17]. These studies help us understand the characteristics
of the language and provide hints on how to design the system.


Thus, this study designed and developed word sequence prediction for Afaan Oromo. In order to design and
develop the word sequence prediction model, an Afaan Oromo corpus was first collected and prepared. Second, the study
proposed four RNN models (GRU, LSTM, BiGRU, and BiLSTM) and four RNN models with an attention
mechanism (GRU, LSTM, BiGRU, and BiLSTM with attention).

1.2. Statement of the Problem

Next word prediction involves predicting the word that follows a given context. Given a sequence of words drawn from the
corpus, the task is to predict the next word with the highest probability of occurrence. Thus, it is a
predictive modeling problem for languages, also known as language modeling. We can also approach this
problem in another way: each candidate next word can be considered as a class, so the task can be treated
as a multiclass classification problem.

Word prediction is one of the most widely used techniques to enhance communication rate in augmentative and
alternative communication [18]. A number of word prediction software packages exist for different languages
to assist users in their text entry. Amharic [2, 19, 20], Swedish [21, 22], English [23], Italian [24, 25], Persian
[26], and Bangla [18] are some of the word prediction studies conducted recently. These studies contribute to reducing the
time and effort needed to write a text for slow typists, or for people who are not able to use a conventional keyboard.

Like a number of other African and Ethiopian languages, Afaan Oromo has a very complex morphology. It has
the basic features of agglutinative languages where all bound forms (morphemes) are affixes. In agglutinative
languages like Afaan Oromo most of the grammatical information is conveyed through affixes (prefixes, infixes
and suffixes) attached to the roots or stems. Both Afaan Oromo nouns and adjectives are highly inflected for

number and gender. Afaan Oromo verbs are also highly inflected for gender, person, number and tenses.
Moreover, possessions, cases and article markers are often indicated through affixes in Afaan Oromo. Since
Afaan Oromo is morphologically very productive, derivations and word formations in the language involve a
number of different linguistic features including affixation, reduplication and compounding [27]. Furthermore,
the grammatical structure of Afaan Oromo is unique. Hence, this makes word sequence prediction unique to the
language.

Currently, word prediction is used in many real-life applications such as augmentative communication devices
[10] and different social media platforms like Facebook, WhatsApp, Instagram, Twitter, Imo, and Messenger. One of
the benefits of automated word prediction is saving time while chatting or writing a sentence by suggesting the next
word. For example, given the incomplete sentence ‘’Caalaan bishaan______________’’, next word
prediction completes it by providing the correct next word, which in this example is
“dhugee’’. The purpose of this thesis is to design and develop a word sequence prediction model
for Afaan Oromo using a deep learning technique, which has shown promising performance in current research.

1.3. Motivation

Deep learning has emerged as an influential technique for solving a multitude of problems in the domains of
computer vision, topic modeling, natural language processing, speech recognition, social media analytics, etc.
[14]. Inspired by this, deep learning-based language modeling and translation have achieved great popularity.

Afaan Oromo uses Qubee (a Latin-based script) as its writing system. People who use Qubee have difficulties in
typing. For instance, Qubee words often require many characters compared to other languages, which slows down the typing
process. To the best of the researcher’s knowledge, there is no single attempt to study word sequence prediction
for Afaan Oromo using deep learning. Hence, this motivated us to carry out the present study on word
sequence prediction using a deep learning technique.
1.4. Objectives

1.4.1. General Objective

The general objective of this study is to design and develop a word sequence prediction system for Afaan Oromo
using a deep learning approach.

1.4.2. Specific Objectives

The study specifically attempts to:


 Review the nature of the Afaan Oromo language and approaches to word sequence prediction,
 Collect and prepare a corpus for training and testing the model,
 Design and develop a word sequence prediction model for Afaan Oromo,
 Train the developed model, and
 Evaluate the performance of the word sequence prediction model using the collected test data.

1.5. Scope and limitation

This thesis aims to design and develop a word sequence prediction model for Afaan Oromo using
a deep learning approach. The study used only RNN models and the attention mechanism to design the proposed deep
learning approach for Afaan Oromo next word sequence prediction. The experiments cover RNN
models (GRU, LSTM, BiGRU, BiLSTM) and the same models with attention (GRU, LSTM, BiGRU, and BiLSTM
with attention), which is a total of 8 models.

1.6. Methods

The study pursued an experimental research design [24] to achieve the thesis objectives. The different stages of
this thesis are: discussion with NLP experts, data collection and preparation, literature review, analyzing
written documents, selecting the approach, preparing the corpus dataset, training, validating, and testing
the proposed model, developing a prototype, and evaluating the proposed model. The following steps are used
to address the thesis objectives.

i. Data Collection and Preparation: To capture the morphological behavior of a language, a well-collected,
adequately sized, and well-defined text dataset is required. The collection of text data is, therefore, an essential component in
developing an RNN model for next word sequence prediction of Afaan Oromo. Thus, the corpus used in this
research was collected from online documents, different social media, the Federal Democratic Republic of Ethiopia
(FDRE) constitution, the FDRE criminal code, the Council of the Oromia regional state, and Afaan Oromo language
education materials. The Afaan Oromo dataset prepared for the study contains 2,872,073 sentences, 9,983,441 words,
and a total of 102,528 unique words in the corpus, excluding stop words. Dataset preparation was needed
because there is no well-organized, standard dataset available for the proposed model. Additionally, some of the
previously prepared datasets lack corpus clarity. The prepared dataset is used to experiment with the model on
Afaan Oromo word sequence prediction in the model training and testing phases.
ii. Model selection: This section describes the model training, validation, and testing data preparation (pre-processing)
steps. One difficulty in word sequence prediction is that it requires a very large corpus. It
is challenging to train, validate, and test the RNN (encoder-decoder) model for word sequence
prediction in an under-resourced language like Afaan Oromo. To overcome the problem of preparing massive amounts of
corpus data, the researcher used zero-resource [25] machine learning strategies to train and
test the model [26]. From the prepared dataset, 80% was used for model training and 20% for model testing.
Once the Afaan Oromo datasets are ready, they pass through the training phase, the tuning phase, and finally the
testing step.
iii. Tools and techniques: Before choosing a toolkit, a researcher needs to gain a general idea of the
various open-source toolkits that are available at the time of writing. The system model uses the open-source
TensorFlow [27] toolkit; TensorFlow is a large-scale, general-purpose, open-source machine learning toolkit
for the implementation of deep artificial neural networks [28]. TensorFlow is not language-dependent
and is a preferable toolkit for morphologically rich languages like Afaan Oromo and for under-resourced
datasets, following state-of-the-art approaches [29].
iv. System Prototyping Tool: the proposed system model prototype is developed using the Python programming
language. The system used the Python Anaconda environment for model experimentation and
implementation, and the proposed model used Python libraries and packages [27]. The Python programming
language includes libraries and packages for scientific and technical computing. Also, through
Python, the system can easily import TensorFlow and its open-source application programming interface (API)
functions.
v. Evaluation Metrics: Word sequence prediction systems are evaluated using human evaluation methods or
automatic evaluation methods.
Accuracy (Acc): The percentage of words that have been successfully completed by the program before the
user reached the end of the word. A good completion program is one that successfully completes words in the
early stages of typing [62].

Perplexity (PP) is the average number of possible choices or words that may occur after a string of words, and it
can be measured with the cross entropy calculated on a test set with N words (Cavalieri et al. [63]); a small sketch
of this computation is given below.
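
For illustration only (a minimal sketch, not the evaluation script used in this thesis), perplexity can be computed
from the cross entropy of the probabilities a model assigns to the true next words of a test set; the numbers below
are hypothetical:

    import math

    def perplexity(true_word_probs):
        # true_word_probs: P(actual next word | history) for each of the N test positions.
        # Cross entropy H = -(1/N) * sum(log2 p); perplexity PP = 2 ** H.
        n = len(true_word_probs)
        cross_entropy = -sum(math.log2(p) for p in true_word_probs) / n
        return 2 ** cross_entropy

    # Hypothetical probabilities assigned to the correct words of a 4-word test set.
    print(perplexity([0.25, 0.10, 0.50, 0.05]))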

1.7. Application of Word Sequence Prediction

Word prediction is useful in many domains and used in many applications (Ghayoomi and Momtazi [9],
Aliprandi et al. [8], Väyrynen et al. [7]).
 Text production proper where we generate texts by predicting the next words.
 Writing assistance systems and assistive communication systems such as Augmentative and Alternative
Communication (AAC) devices, where the system predicts the next word or character that the user
wants to write in order to help them and reduce the effort needed to write.

 Speech recognition, where in the case of different pronunciations of words from one person to another,
we can predict those words based on what the user previously said, and we can improve the results of
speech recognition by correcting the resulting words through prediction.
 Spelling correction and error detection, where we predict correct words based on typed characters and
words.
 Word-sense disambiguation, where we can determine the exact meaning of a word based on its predecessors, or
predict a synonym for that word, which makes the meaning clearer.
 Statistical machine translation, where, when translating from one language to another, we may make
mistakes due to the differences between languages, so we can use word prediction to minimize and correct
those errors.
 Handwriting recognition and optical character recognition, where many wrong words can be obtained
due to differences in writing style from one person to another; here we can use word prediction to
reduce these errors and letter prediction to make optical character recognition more accurate.
 It can also be used in text-entry interfaces for messaging on mobile phones and typing on handheld and
ubiquitous devices, or in combination with assistive devices such as keyboards, virtual keyboards,
touchpads and pointing devices.

1.8. Organization of the Rest of the Thesis

The rest of this thesis is organized as follows. Chapter 2, the literature review, briefly states the fundamental concepts
of word prediction, approaches to word prediction, and the studies conducted by different scholars on word sequence
prediction, their approaches, and findings. Chapter 3 presents the nature of the Afaan Oromo language, including its
structure and grammatical rules. In Chapter 4, the architecture of the proposed word sequence prediction model, its approach, and
related concepts are explained. Experiments and results are presented in Chapter 5. Finally, conclusions and future
work are stated in Chapter 6.

Chapter Two: Literature Review

2.1. Overview

This chapter presents the machine learning background underlying the whole thesis and introduces the notation
used in this work. We introduce the neural network formalism, including feed-forward, convolutional and
recursive neural networks, as well as a review of the natural language processing literature using deep neural
networks. Note that the related work specific to each task tackled in this thesis is discussed in the corresponding
chapters. This chapter also presents related work on word or text prediction, together with the approaches used and
the results obtained. Word prediction for the Amharic, Russian, English, Persian, and Hebrew
languages are some of the studies conducted in the area that we reviewed exhaustively for this work in order to
understand and identify appropriate approaches for Afaan Oromo.
2.2. Introduction of Machine Learning and Word Prediction

Machine learning is a field in computer science that explores how machines can learn to solve problems from
experimental data rather than being explicitly programmed. The behavior of most machine learning algorithms
is conditioned by a set of parameters that define a model. The goal of machine learning is to estimate the
parameters of this model so that it learns regular patterns from data observations while avoiding simply memorizing the training
samples. In practice, given a database of training samples, an algorithm is expected to learn how to solve a
specific task.

Note that non-parametric approaches do memorize training examples by nature while still generalizing well to
unseen examples. Natural Language Processing (NLP) is an interdisciplinary research area at the border
between linguistics and artificial intelligence aiming at developing computer programs capable of human-like
activities related to understanding or producing texts or speech in a natural language [1]. It is an area of research
and application that explores how computers can be used to understand and manipulate natural language text or
speech to do useful things.

NLP researchers aim to gather knowledge on how human beings understand and use language so that appropriate
tools and techniques can be developed to make computer systems understand and manipulate natural languages
to perform the desired tasks [2]. Applications of NLP include a number of fields of study, such as machine
translation, morphology, syntax, named entity recognition, natural language text processing and summarization,
multilingual and cross language information retrieval (CLIR), speech recognition, information retrieval and text
clustering, and so on [2].

The number of people with physical disabilities rose dramatically after the Second World War [39]. In order
to assist them to interact with the outside world, assistive technology such as word prediction was used.
Researchers dedicated themselves to developing systems that could compensate for users' disabilities and
augment their abilities. Prediction systems have been in use since the early 1980s [39].

Word prediction is about estimating what word the user is going to write for the purpose of facilitating the text
production process [18, 39]. Sometimes a distinction is made between systems that require the initial letters of
an upcoming word to make a prediction and systems that may predict a word regardless of whether the word has
been initialized or not [39, 40]. The former systems are said to perform word completion while the latter perform
proper word prediction.

Prediction refers to those systems that figure out which letters, words, or phrases are likely to follow in a given
segment of a text. Such systems are very useful for users, mainly those with writing disabilities. The systems
usually run by displaying a list of the most probable letters, words, or phrases for the current position of the
sentence being typed by the user.

As the user continues to enter letters of the required word, the system displays a list of the most probable words
that could appear in that position. Then, the system updates the list according to the sequence of the so-far
entered letters. Next, a list of the most common words or phrases that could come after the selected word would
appear. The process continues until the text is completed [39].

Whenever the user types a letter or confirms a prediction, the system updates its guesses taking the extended
context into account. The size and nature of the context on which the predictions are based vary among
different systems. While the simplest systems only take single word form frequencies into account, thus not
making use of the previous context at all, more complex systems may consider the previous one or two word forms
and/or the grammatical categories. Yet more complex systems combine these methods with other strategies such
as topic guidance, recency promotion and grammatical structure.

The goal of all writing assistance systems is to increase the Key Stroke Saving (KSS), which is the percentage
of keystrokes that the user saves by using the word prediction system. A higher value of KSS implies better
performance, as it decreases the user's effort to type a text. In other words, the amount of text to be
typed needs to be as short as possible for the user, requiring the least effort. Perplexity is another important
standard performance metric used to evaluate prediction systems [39, 41, 42]. A small worked example of KSS
is shown below.
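
As a clarifying example (the formula is the standard definition of KSS; the numbers are hypothetical), keystroke
saving can be computed as the relative reduction in keystrokes:

    def keystroke_saving(keys_without_prediction, keys_with_prediction):
        # KSS = percentage of keystrokes saved thanks to the prediction system.
        saved = keys_without_prediction - keys_with_prediction
        return 100.0 * saved / keys_without_prediction

    # Hypothetical example: a text needing 200 keystrokes is entered with 130
    # keystrokes when prediction is enabled, giving a KSS of 35%.
    print(keystroke_saving(200, 130))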

2.3. Approaches of word sequence prediction

A number of prediction systems have been developed, and are being developed, with different methods for
different languages.

In this section, three major approaches are described: statistical modeling, knowledge-based modeling, and
heuristic modeling (adaptive).

figure 2. 1 Approaches of word sequence prediction (statistical modeling, knowledge-based modeling, heuristic
(adaptive) modeling, and artificial neural networks)

2.3.1. Statistical Modeling

Traditionally, predicting words has solely been based on statistical modeling of the language. In statistical
modeling, the choice of words is based on the probability that a string may appear in a text. Consequently, a
natural language could be considered as a stochastic system. Such a modeling is also named probabilistic
modeling. The task of predicting the next word can be stated as attempting to estimate the probability function
PR:

PR (next word | input words) = PR (input words | next word) PR (next word) / PR (input words)

A Markov model: - is an effective way of describing a stochastic chain of events, for example a string of words.
Such a model consists of a set of states and probabilities of transitions between them. The transition probabilities
represent the conditional probabilities for the next word given the previous words. For example, the probability
of a transition from state AA to state AB represents the probability that B is written when the two previous words
were AA.

Sequences of words extracted from the training texts are called n-grams. In this example, 1, 2, and 3-grams are
named uni-, bi-, and tri-grams, respectively. The probabilities for the transitions in a second order Markov model
can be estimated simply by counting the number of bi-grams and tri-grams in the training text, and by using the
relative frequency as an estimate. Thus

P(Wn | Wn-2, Wn-1) = C(Wn-2, Wn-1, Wn) / C(Wn-2, Wn-1)

where C is the count of n-grams, Wn-2, Wn-1, and Wn are words, P(Wn | Wn-2, Wn-1) is the probability of a word
Wn given the previous words Wn-2, Wn-1, C(Wn-2, Wn-1, Wn) is the frequency of the word sequence Wn-2, Wn-1, Wn in
a corpus, and C(Wn-2, Wn-1) is the frequency of Wn-2, Wn-1 in the corpus [25].

The probabilities for the transitions in a first order Markov model can be estimated simply by counting the
number of uni-grams and bi-grams in the training text, and by using the relative frequency as an estimate. Bi-
gram probability is computed using (Eq.4) [25].

P(Wn | Wn-1) = C(Wn-1, Wn) / C(Wn-1)

where, C is the count of n-grams, Wn-1, Wn are words, P(Wn|Wn-1) is probability of a word Wn given Wn-1, C
(Wn-1, Wn) is frequency of word sequence Wn-1 Wn in a corpus, C(Wn-1) is frequency of Wn-1 in a corpus. The
mth order Markov model requires (m+1)-grams to be extracted from the training texts in order to calculate the
transition probabilities.

In such a stochastic problem, we use the previous word(s), the history, to predict the next word. To give
reasonable predictions for words that appear together, we use the Markov assumption that only the
last few words affect the next word [25]. So if we construct a model where all histories restrict the word that
would appear in the next position, we will then have an (n-1)th order Markov model or an n-gram word model
[25]. The statistical information and its distribution can be used for predicting letters, words, phrases, and
sentences [39]. Statistical language modeling is broadly used in these systems. The Markov assumption, in which
only the last n-1 words of the history have an effect on the next word, is used as a baseline for statistical word
prediction [47]. Thus, the model can be named an n-gram Markov model. A minimal code sketch of this
counting-based estimation is given below.
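
For illustration only (a minimal sketch with a hypothetical toy corpus, not the implementation used in this work),
the bigram probabilities described above can be estimated from raw counts as follows:

    from collections import Counter

    corpus = "caalaan bishaan dhugee caalaan buna dhugee".split()  # hypothetical toy corpus

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(prev_word, word):
        # P(Wn | Wn-1) = C(Wn-1, Wn) / C(Wn-1)
        if unigram_counts[prev_word] == 0:
            return 0.0
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

    # The most probable next word after "caalaan" according to the counts.
    candidates = {w: bigram_prob("caalaan", w) for w in unigram_counts}
    print(max(candidates, key=candidates.get))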

Word frequency and word sequence frequency: - are methods that are frequently used in word prediction
systems [48], particularly in the ones that are established commercially. Constructing a dictionary that contains
words and their corresponding relative frequency of occurrence is the most common and simplest word
prediction method. When the user types a string, it offers the most frequent words beginning with this string, in the
order they are stored in the system. This method may require some improvement by a user in order to handle
inflected words correctly, since context information is not considered. In addition, this method uses a unigram
model with a fixed lexicon, and it provides the same proposals for similar sequences of letters. To increase
word prediction accuracy, information about the recency of use of each word may be included in the lexicon. In this
way, the prediction system is able to offer the most recently used words among the most likely words. This
method allows adaptation to each user's vocabulary by updating the frequency and recency
of each word used [6, 39]. The most likely words that start with the same characters are offered when a user has
typed the beginning of a word. If the required word is not available among the options presented by the system, the
user may continue typing; otherwise, the required word is selected from the given list, and the system may
automatically adapt to the user's lexicon by simply updating the frequencies of the words used and assigning an
initial frequency to new words added to the system. In order to improve the results of this approach, a recency
value is stored in the dictionary for each word, together with its frequency information. The results obtained with
recency- and frequency-based methods are better than the ones based on frequency alone. However, this method
requires storage of more information and increases computational complexity [6, 49].
N-Gram Language Model: - The n-gram language model is a probabilistic language model based on the Markov
assumption. Its beginnings were in 1913 with Markov [49], who proposed this technique, later called
Markov chains, to predict from a novel whether the next letter would be a vowel or a consonant; for more of the
history, see Jurafsky and Martin [50].
This method has been developed to overcome the limitation of the previous method. It takes into account the
previous context, where the previous words are used to predict the next word. When using only the previous
word, the model is called a bigram model, and when using the previous two words it is called a trigram model, and so
on (in general, when using the previous n − 1 words to predict the n-th word, it is called the n-gram model). This
method provides smart suggestions and saves time by moving away from grammar rules.
2.3.2. Knowledge Based Modeling

The systems that are only dedicated to using statistical modeling for prediction often present words that are
syntactically, semantically, or pragmatically inappropriate. They therefore impose a heavy cognitive load on the user
to choose the proposed word and, as a result, decrease the writing speed [50, 51]. If the system removes
improper words from the prediction list, it will provide more comfort and confidence to the user. The linguistic
knowledge that can be used in prediction systems is syntactic, semantic, and pragmatic.

Syntactic prediction is a method that tries to present words that are syntactically appropriate for a particular
position within the sentence [25]. This means that knowledge from the syntactic structure of the language is
used. In syntactic prediction, the part-of-speech (POS) tags of all words are identified in a corpus and the
system has to incorporate this syntactic knowledge for prediction. Statistical syntax and rule-based grammar are
two general syntactic prediction methods that will be presented in more detail [39]. This approach includes various
types of probabilistic and parsing methods such as Markov models and artificial neural networks.

Statistical Syntax: The sequence of syntactic categories and POS tags is used for predictions in this approach.
In this method, the appearance of a word is based upon the correct usage of syntactic categories; that is, the
Markov assumption about n-gram word tags is used. In the simplest method, the POS tags are adequate for
prediction. Therefore, a probability is allocated to each candidate word by estimating the probability of
having this word with its tag in the current position, given the most probable tags for the previous word(s)
[39, 42]. The most frequent tag for a particular word is used when producing surface words. Bi-gram and
tri-gram probabilities for the next tags are computed using (Eq. 5) and (Eq. 6), respectively [25].

P(tn | tn-1) = P(tn-1, tn) / P(tn-1)

where tn-1 and tn are tags in a given corpus, P(tn-1, tn) is the probability of the tag sequence tn-1, tn in the corpus,
P(tn-1) is the probability of tag tn-1, and P(tn | tn-1) is the probability of tag tn after tag tn-1.

P(tn | tn-2, tn-1) = P(tn-2, tn-1, tn) / P(tn-2, tn-1)

where tn-2, tn-1, and tn are tags in a given corpus, P(tn-2, tn-1, tn) is the probability of the tag sequence tn-2, tn-1, tn in
the corpus, P(tn-2, tn-1) is the probability of the tag sequence tn-2, tn-1, and P(tn | tn-2, tn-1) is the probability of tag tn
after tags tn-2 and tn-1.

In another approach, the system attempts to estimate the probability of each candidate word according to the
previous word and its POS tag, and the POS tag of its preceding word(s). In addition, the system uses word
bigram and POS trigram model [39]. A linear combination model of POS tags tries to estimate the probability
of POS tag for the current position according to the two previous POS tags. Then, it attempts to find words that
have the highest probability of being in the current position according to the predicted POS tag. Then, it
combines this probability with the probability of the word given the previous word. So, there are two predictors:
one predicts the current tag according to the two previous POS tags, and the other uses the bigram probability to
find the most likely word [39, 43].

2.3.3. Heuristic Modeling

Predictions become more appropriate for specific users when adaptation methods are used. In this
approach, the system adapts to every individual user [41, 42, 43]. Short-term learning and long-term learning
are the two general methods that make the system adapt to its users. In the short-term learning approach, the
system adapts to the user based on the current text being typed by that individual user. Recency promotion,
topic guidance, trigger and target, and n-gram cache are the methods that a system can use to adapt itself to a
user within a single text. These methods are commonly used in prediction systems [54].

2.4. Deep Learning for Word Sequence Processing

This section introduces different approaches for sequence modeling using continuous representations, including
convolutional neural networks, recurrent neural networks and recursive neural networks.

Modeling natural language sequences using neural networks and continuous vector representations has a long
history. Early work on distributed representations includes Hinton et al. [86] and Elman [87]. More recently,
Bengio et al. [85] were able to outperform n-gram language models in terms of perplexity by training a neural
network using continuous word vectors as inputs.

This idea was then taken up in Collobert and Weston [85] to learn word embeddings in an unsupervised
manner. They showed that jointly learning these embeddings, and taking advantage of the large amount of
unlabeled data in a multitask framework, improved the generalization on all the considered tasks, obtaining state-
of-the-art results. Word embeddings obtained by predicting words given their context tend to capture semantic
and syntactic regularities. They have been shown to preserve semantic proximity in the embedded space, leading
to better generalization for unseen words in supervised tasks. Such word embeddings have been reported to
improve performance on many NLP tasks. The study in Turian et al. [75] used unsupervised word representations as
extra word features to further improve system accuracy.

2.4.1. Word Embeddings

Natural language must deal with a large number of words that span a high dimensional and sparse space of
possibilities. However, as discussed in Harris [39], Firth [39] and Wittgenstein [39], words that occur in similar
contexts tend to have similar meanings. This suggests that the underlying structure of such high dimensional
space can be represented in a more compact way. One of the first approaches to capture linguistic knowledge in
a low dimensional space is the Brown clustering algorithm [79], grouping words into clusters assumed to be
semantically related. A word is then represented by a low-dimensional binary vector representing a path in a
binary tree.

2.4.2. Convolutional Neural Networks for Sequence Modeling

The order of the words of a sentence is essential for its comprehension. For NLP tasks such as sentiment
analysis which consists in identifying and extracting subjective information from pieces of text, taking the word
order into account is critical. Classical NLP features such as bag-of-words do not conserve this information and
would assign the sentences “it was not good, it was actually quite bad” and “it was not bad, it was actually quite
good” the exact same representation.

Convolutional neural networks (CNN) were first introduced in the computer vision literature [89]; they are used for the
extraction of contextual information and focus on relevant information regardless of its position in the input
sequence. In NLP, CNNs were first introduced by the pioneering work of Collobert et al. [65] for the task of
semantic role labeling. In this task, the tag of a word depends on a verb (or, more correctly, predicate) chosen
beforehand in the sentence. The tagging of a word requires the consideration of the whole sentence. The authors
introduced an architecture that extracts local feature vectors using a convolutional layer. These features are then
combined using a pooling operation in order to obtain a global feature vector. The pooling operation is a simple
max-over-time operation which forces the network to capture the most useful local features produced by the
convolutional layer. This procedure results in a fixed-size representation independent of the sentence length, so
that subsequent linear layers can be applied.
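
A minimal sketch of this convolution-plus-pooling idea (illustrative only, with hypothetical sizes; it is not the
architecture of Collobert et al.):

    import numpy as np
    from tensorflow.keras import layers, models

    # Hypothetical sizes: vocabulary of 10,000 words and 50-dimensional embeddings.
    model = models.Sequential([
        layers.Embedding(input_dim=10000, output_dim=50),
        layers.Conv1D(filters=64, kernel_size=3, activation="relu"),  # local feature extraction
        layers.GlobalMaxPooling1D(),                # max-over-time pooling -> fixed-size vector
        layers.Dense(2, activation="softmax"),      # e.g. two sentiment classes
    ])

    # Two padded token-id sequences of different lengths map to outputs of the same size.
    batch = np.array([[4, 8, 15, 16, 23, 42], [7, 1, 0, 0, 0, 0]])
    print(model(batch).shape)  # (2, 2)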

2.4.3. Recurrent Neural Networks for Sequence Modeling

Convolutional networks encode a sequence into a fixed-size vector. However, order sensitivity is constrained to
mostly local patterns, while the order of these patterns is disregarded. Recurrent neural networks, on the other hand,
allow representing arbitrarily sized, linearly structured inputs as a fixed-size vector, while taking the structural
properties of the input into account.

We can identify three different recurrent neural network variants: simple recurrent neural networks (SRNN), long
short-term memory (LSTM) and gated recurrent units (GRU).

Recurrent Neural Networks (RNN) are designed to work with sequential data. Sequential data (which can be time series) can be in the form of text, audio, video, etc. An RNN uses the previous information in the sequence to produce the current output. To understand this better, consider the following example sentence:

“Doctor Gadisa is my Advisor.”

At time (T0), the first step is to feed the word “Doctor” into the network, and the RNN produces an output. At time (T1), the next step, we feed the word “Gadisa” together with the activation value from the previous step; now the RNN has information about both words, “Doctor” and “Gadisa”. This process continues until all words in the sentence have been given as input, and Figure 2.2 visualizes it. At the last step, the RNN has information about all the previous words. In an RNN, the weights and biases of all the nodes in the layer are the same; each step takes the output from the previous step and the current input. Here tanh is the activation function, but other activation functions can be used as well.
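For reference, one standard formulation of this recurrence (a general statement, not specific to this thesis) is:

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y

where the weight matrices W and the biases b are shared across all time steps.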

figure 2. 2 RNN basic architecture

Recurrent neural networks suffer from short-term memory. If a sequence is long enough, they have a hard time carrying information from earlier time steps to later ones, so when processing a paragraph of text to make predictions, an RNN may leave out important information from the beginning.

During backpropagation, recurrent neural networks suffer from the vanishing gradient problem. Gradients are the values used to update a neural network's weights; the vanishing gradient problem occurs when the gradient shrinks as it is propagated back through time. If a gradient value becomes extremely small, it does not contribute much to learning.

2.4.4. Gated Recurrent Unit (GRU) Approach

GRUs became the state of the art for machine translation starting from 2014, when Cho et al. used them to overcome a limitation of the LSTM, namely its costly cell memory. The workflow of a GRU is the same as that of an RNN, but the difference lies in the operations inside the GRU unit. The GRU is a newer generation of recurrent neural network and is quite similar to an LSTM; GRUs got rid of the cell state and use the hidden state to transfer information.
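For reference, the standard GRU formulation introduced by Cho et al. (stated here in its common textbook form, not taken from the thesis code) uses an update gate z_t and a reset gate r_t:

z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

The update gate controls how much of the previous hidden state is kept, and the reset gate controls how much of it is used when computing the candidate state \tilde{h}_t.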

2.4.5. Long Short-Term Memory Approach

The LSTM was developed by Hochreiter and Schmidhuber in 1997 as a novel method for sequential modeling. An LSTM has a control flow similar to that of a recurrent neural network: it processes data, passing on information as it propagates forward. The differences are the operations within the LSTM's cells. LSTMs are quite similar to GRUs and are also intended to solve the vanishing gradient problem; compared to the GRU, the LSTM has additional gates and a separate cell state.

LSTMs and GRUs are used in state-of-the-art deep learning applications such as sequence-to-sequence prediction, speech synthesis and natural language understanding. LSTMs and GRUs were created as a solution to short-term memory: they have internal mechanisms called gates that can regulate the flow of information.

figure 2. 3 Architecture of GRU Vs LSTM

These gates can learn which data in a sequence is important to keep or to throw away. By doing so, they can pass relevant information down the long chain of sequences to make predictions. Almost all state-of-the-art results based on recurrent neural networks are achieved with these two networks; LSTMs and GRUs can be found in speech recognition, speech synthesis and text generation.

LSTMs were shown to be surprisingly effective for machine translation by Sutskever et al. in 2014. While the LSTM architecture is very effective, it is also complex and computationally intensive, making it hard to analyze [71]. The gated recurrent unit (GRU) was introduced by Cho et al. as an alternative to the LSTM; it was shown to perform comparably to the LSTM on several tasks [65] and to be effective for machine translation [65].

2.4.6. Bidirectional RNN

As discussed earlier, an RNN has knowledge of previous entries, but only up to the current position, and it has no information about future inputs. In a bidirectional recurrent network, there are separate hidden states h(t) and g(t) for the forward and backward directions, as shown in Figure 2.4. The forward states interact in the forward direction, while the backward states interact in the backward direction. Both h(t) and g(t) receive input from the same vector x(t), and they interact with the same output vector o(t) (Aggarwal [2018a]).

figure 2. 4 Bidirectional RNN (Feng et al. [2017]).

2.4.7. Recursive neural networks for sequence modeling

While recurrent neural networks are useful for modeling sequences, natural language often requires taking tree structures into account. For example, the syntactic structure of a sentence can be represented as a tree of syntactic relations between sub-constituents. The recursive neural network abstraction introduced in Pollack [63] is a generalization of recurrent neural networks that can deal with arbitrary data structures. In particular, recursive networks were popularized in NLP by the work of Socher et al. [63] for syntactic parsing. In this work, the authors learn syntactic-semantic vector representations of tree nodes by recursively applying a compositional operation, following the parse tree. As illustrated in Figure 2.5, the leaves correspond to the sentence words and are assigned a continuous vector representation. Node representations are computed in a bottom-up manner from the leaves to the top tree node. These representations are trained to discriminate the correct parse tree from trees coming from a generative parser. The system is then used to re-rank the 200-best output of a generative syntactic parser by computing a global score for each tree candidate.

figure 2. 5 Recursive neural network for syntactic parsing [63].
Recursive models were successfully applied to structure prediction tasks such as constituency parse re-ranking [63], dependency parsing [64], discourse parsing [65], semantic relation classification, political ideology detection based on parse trees, sentiment and target-dependent sentiment classification, and question answering.

From the above review, we can generalize that traditional sequence prediction approaches:

 Skipped hundreds of important details.
 Required a lot of human feature engineering.
 Resulted in very complex systems.
 Decomposed the task into many different, independent machine learning problems.

Recent neural sequence models, in contrast, handle variable-sized input, and the current state of the art uses stacks of attention layers instead of recurrent or convolutional layers.

2.5. Evaluation Techniques for Word Prediction

There are four standard performance metrics for evaluating a word prediction system: keystroke savings (KSS), hit rate (HR), keystrokes until completion (KuC) and accuracy (Acc).

Keystroke savings (KSS): the percentage of keystrokes that the user saves by using the word prediction system. It is calculated by comparing two measures: the total number of keystrokes (KT) needed to type the text without the help of word prediction, and the effective number of keystrokes (KE) entered when word prediction is used.
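Under the common interpretation in which KE denotes the keystrokes actually entered when prediction is used (exact definitions vary slightly between studies), keystroke savings can be written as:

KSS = \frac{KT - KE}{KT} \times 100\%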

Therefore, the number of keystrokes needed to type texts taken from the test data, with and without the word sequence prediction program, will be counted in order to calculate keystroke savings. The obtained KSS will be compared for the word-based and POS-based models. A higher value of keystroke savings implies better performance [26].

Hit rate (HR): The percentage of times that the intended word appears in the suggestion list. A higher hit rate
implies a better performance [62]. Keystrokes until completion (KuC): The average number of keystrokes that
the user enters for each word, before it appears in the suggestion list. The lower the value of this measure the
better the algorithm [62].

Accuracy (Acc): The percentage of words that have been successfully completed by the program before the
user reached the end of the word. A good completion program is one that successfully completes words in the
early stages of typing [62].

Perplexity (PP): the average number of possible choices or words that may occur after a string of words; it can be measured via the cross entropy calculated on a test set with N words (Cavalieri et al. [65]).
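Under this definition, perplexity is commonly computed from the cross entropy H of the model on a test set of N words (a standard formulation, stated here for reference):

H = -\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i \mid w_1,\dots,w_{i-1}), \qquad PP = 2^{H}

A lower perplexity means the model is, on average, less uncertain about the next word.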

2.6. Related Work

This section discusses work related to this thesis, on both foreign and local languages.

2.6.1. Related work on Foreign Language

First, we discuss related work done on foreign languages.

Word Prediction for English

Antal van den Bosch [58] proposed a classification-based word prediction model based on IGTree, a decision-tree induction algorithm with favorable scaling abilities. Token prediction accuracy, token prediction speed, number of nodes and discrete perplexity are the evaluation metrics used in this work. Through a first series of experiments, they demonstrate that the system exhibits log-linear increases in prediction accuracy and decreases in discrete perplexity, a new evaluation metric, with increasing numbers of training examples. The induced trees grow linearly with the number of training examples. Trained on 30 million words of newswire text, prediction accuracies reach 42.2% on the same type of text. In a second series of experiments, they show that this generic approach to word prediction can be specialized to confusable prediction, yielding high accuracies on nine example confusable sets in all genres of text. The confusable-specific approach outperforms the generic word-prediction approach, but with more data the difference decreases.

Agarwal and Arora [59] proposed a context-based word prediction system for SMS messaging in which context is used to predict the most appropriate word for a given code. The growth of wireless technology has provided alternative ways of communication such as the Short Message Service (SMS), and with the tremendous increase in mobile text messaging there is a need for an efficient text input system. With limited keys on the mobile phone, multiple letters are mapped to the same number (8 keys, 2 to 9, for 26 letters).

For example, for code '63', two possible words are 'me' and 'of'. Based on a frequency list where 'of' is more likely than 'me', the T9 system will always predict 'of' for code '63'. So, for a sentence like 'Give me a box of chocolate', the prediction would be 'Give of a box of chocolate'. The sentence itself, however, gives us information about what the correct word for a given code should be; consider the above sentence with blanks, 'Give _ a box _ chocolate'. Current systems for word prediction in text messaging predict the word for a code based on its frequency obtained from a huge corpus. However, the word at a particular position in a sentence depends on its context, and this intuition motivated them to use machine learning algorithms to predict a word based on its context. The system also takes into consideration the proper English words for the codes corresponding to words in informal language. The proposed method uses machine learning algorithms to predict the current word given its code and the previous word's part of speech (POS). The training was done on about 19,000 emails and the testing on about 1,900 emails, with each email consisting of 300 words on average. The results show about a 31% improvement over traditional frequency-based word estimation.

Trunk [60] conducted research on adaptive language modeling for word prediction in AAC. AAC devices are highly specialized keyboards with speech synthesis, typically providing single-button input for common words or phrases but requiring the user to type letter-by-letter for other words, called the fringe vocabulary. Word prediction helps speed up the AAC communication rate. Previous research by different scholars used n-gram models; at best, modern devices utilize a trigram model and very basic recency promotion. However, one of the lamented weaknesses of n-gram models is their sensitivity to the training data. The objective of this work was to develop and integrate style adaptations, drawing on the experience of topic models, so that the model dynamically adapts both topically and stylistically. They address the problem of balancing training size and similarity by dynamically adapting the language model to the most topically relevant portions of the training data.
The inclusion of all the training data, as well as the usage of frequencies, addresses the problem of sparse data in an adaptive model. They demonstrated that topic modeling can significantly increase keystroke savings for traditional testing as well as for testing on text from other domains. They also addressed the problem of annotated topics through fine-grained modeling and found that this, too, gives a significant improvement over a baseline n-gram model.

Word Prediction for Persian Language

Masood Ghayoomi and Seyyed Mostafa Assi [61] studied word prediction for the Persian language. In this study, they designed and developed a system based on statistical language modeling. The corpus contained approximately 8 million tokens and was divided into three sections: the training corpus, which contained 6,258,000 tokens (72,494 of them unique); the development corpus, which contained 872,450 tokens; and the test corpus, which contained 11,960 tokens. As the user enters each letter of the required word, the system displays a list of the most probable words that could appear in that position. Three standard performance metrics were used to evaluate the system, keystroke saving being the most important one. The system achieved a 57.57% saving in keystrokes; using such a system saved a great number of keystrokes and reduced the user's effort.

Ghayoomi and Daroodi [26] studied word prediction for the Persian language using three approaches. Persian is a member of the Indo-European language family and has many features in common with its other members in terms of morphology, syntax, phonology and lexicon. This work is based on bi-gram, tri-gram and 4-gram models and utilized around 10 million tokens in the collected corpus. Using keystroke saving (KSS) as the most important metric to evaluate the systems' performance, the primary word-based statistical system achieved 37% KSS, the second system, which used only the main syntactic categories together with word statistics, achieved 38.95% KSS, and their last system, which used all of the information available to the words, achieved the best result with 42.45% KSS.

Word Prediction for Russian Language

Hunnicutt et al. [56] performed research on Russian word prediction with morphological support as a cooperative project between two research groups in Tbilisi and Stockholm. This work is an extension of a word predictor developed by the Swedish partner for other languages, adapted to make it suitable for the Russian language. Inclusion of a morphological component was found necessary since Russian is much richer in morphological forms. In order to develop the Russian language database, an extensive text corpus containing 2.3 million tokens was collected. It provides inflectional categories and the resulting inflections for verbs, nouns and adjectives. With this, the correct word forms can be presented in a consistent manner, which allows a user to easily choose the desired word form. The researchers introduced special operations for constructing word forms from a word's morphological components. Verbs are the most complex word class, and an algorithm for expanding the root form of verbs into their inflected forms was developed. The system supports successful completion of verbs as well as of the other inflectable words.

Word Prediction for Hebrew Language

Netzer et al. [57] were probably the first to present results of experiments in word prediction for Hebrew. They developed an NLP-based system for Augmentative and Alternative Communication (AAC) in Hebrew. They used three general kinds of methods: (1) statistical methods, based on word frequencies and repetition of previous words in the text; these methods can be implemented using language models (LMs) such as Markov models and unigram/bigram/trigram prediction; (2) syntactic knowledge, i.e. part-of-speech tags (e.g. nouns, adjectives, verbs and adverbs) and phrase structures; syntactic knowledge can be statistical or based on hand-coded rules; and (3) semantic knowledge, i.e. assigning categories to words and finding a set of rules that constrain the possible candidates for the next word. They used three corpora of varying length (1M, 10M and 27M words) to train their system. The best results were achieved when training a language model (a hidden Markov model) on the 27M-word corpus. They applied their model to various genres, including personal writing in blogs and open forums on the Internet. Contrary to what they expected, the use of morpho-syntactic information such as part-of-speech tags did not improve the results; on the contrary, it decreased the prediction results. The best results were obtained using statistical data on the Hebrew language with its rich morphology. They report keystroke savings of up to 29% with nine word proposals, 34% for seven proposals and 54% for a single proposal.

Word Sequence Prediction for the Hindi Language

Two deep learning techniques, namely Long Short-Term Memory (LSTM) and Bi-LSTM, have been explored for the task of predicting the next word, and accuracies of 59.46% and 81.07% were observed for LSTM and Bi-LSTM respectively [55].

2.6.2. Word prediction for Local Language

We now discuss the related work done on local languages such as Amharic, Tigrigna and Afaan Oromo. For the local languages, different works have been done using traditional approaches, but no previous work has used deep learning techniques.

Word Prediction for Amharic

Alemebante Mulu and Goyal [19] performed research on an Amharic text prediction system for mobile phones. In this work, they designed a text prediction model for the Amharic language using a corpus of 1,193,719 Amharic words, 242,383 Amharic lexicon entries and a list of names of persons and places with a total size of 20,170. To show the validity of the word prediction model and the algorithm designed, a prototype was developed. The Amharic text prediction system describes the data entry techniques that are used to enter data into mobile devices, such as a smartphone. Data entry can be either predictive or non-predictive; in the predictive mode, the first two characters are written and all predicted words are listed, based on the frequency of the word and, if the frequencies are the same, on alphabetical order. The experiment was tested on the database (lexicon) of Alemebante Mulu and was also conducted to measure the accuracy of the Amharic text prediction engine; finally, a prediction accuracy of 91.79% was achieved.

The research conducted by Tigist Tensou [20] addresses word sequence prediction for Amharic. In this work, an Amharic word sequence prediction model is developed using statistical methods and linguistic rules. Statistical models are constructed for roots/stems, and morphological properties of words such as aspect, voice, tense and affixes are modeled using the training corpus. Consequently, morphological features such as gender, number and person are captured from a user's input to ensure grammatical agreement among words. Initially, root or stem words are suggested using root or stem statistical models. Then, morphological features for the suggested root/stem words are predicted using voice, tense, aspect and affix statistical information and the grammatical agreement rules of the language. Predicting morphological features is important in Amharic due to its high morphological complexity; this step is not required in less inflected languages, since there it is possible to store all word forms in a dictionary. Finally, surface words are generated based on the proposed root or stem words and morphological features. Word sequence prediction using a hybrid of bi-gram and tri-gram models offers better keystroke savings in all scenarios of their experiment. For example, when using test data disjoint from the training corpus, 20.5%, 17.4% and 13.1% keystroke savings are obtained with the hybrid, tri-gram and bi-gram models respectively. Evaluation of the model is performed using a developed prototype and keystroke savings (KSS) as a metric. According to their experiment, the prediction result using a hybrid of bi-gram and tri-gram models has higher KSS and is better than the bi-gram and tri-gram models alone. Therefore, statistical and linguistic rules have quite good potential for word sequence prediction for the Amharic language.

The research conducted by Nesredin Suleiman [2] addresses a word prediction model for Amharic online handwriting recognition. In this work, he designs the model using a corpus of 131,399 Amharic words, prepared to extract the statistical information that is used to determine the value of N for the N-gram model, where the value two (2) is chosen as a result of the analyses made. A combination of an Amharic dictionary (lexicon) and a list of names of persons and places with a total size of 17,137 has been used. To show the validity of the word prediction model and the algorithm designed, a prototype was developed. An experiment was also conducted to measure the accuracy of the word prediction engine, and a prediction accuracy of 81.39% was achieved. Analyses were carried out on the prepared corpus.

These analyses were used to obtain information such as the average word length of the Amharic language, the most frequently used Amharic word length and the like. This information was used to decide the core element of the word prediction engine, which is N for the N-gram model, where N is the number of characters after which the prediction process starts. Based on the analyses, the value of N was decided to be two (N=2).


Word Prediction for Tigrigna

According to Senait [68], research was conducted to design and develop a word sequence prediction model for the Tigrigna language. This was done using n-gram statistical models based on two Markov language models, one for tags and the other for words, developed using a manually tagged corpus and grammatical rules of the language. The designed model is evaluated with a precision metric used to assess the performance of the system. According to their evaluation, on average 85% of words are correctly predicted using a sequence of two tags and 81.5% of words are correctly predicted using a sequence of three tags. According to their results, word prediction using a sequence of two tags provides better performance than a sequence of three tags.

Word Prediction for Afaan Oromo

Ashenafi Bekele [20] conducted research on the design and implementation of word sequence prediction for Afaan Oromo using the bi- and tri-word statistics, and the bi- and tri-POS-tag statistics, of the language. The work also compares a system that solely uses word statistics with the designed systems that use word statistics as well as POS tag information. In testing with case one, 20.5%, 17.4% and 13.1% keystroke savings are obtained in the hybrid, tri-gram and bi-gram models respectively, the hybrid of bi-gram and tri-gram being the highest. In case two, 22.4%, 19.4% and 13.1% keystroke savings are obtained in the hybrid, tri-gram and bi-gram models respectively. Word sequence prediction using a hybrid of bi-gram and tri-gram in case two provides higher keystroke savings than in case one; the hybrid of bi-gram and tri-gram in case two is the highest overall.

2.7. Summary

In this section, we have discussed different approaches and works related to word sequence prediction for different languages. We understand that languages have their own linguistic characteristics requiring specific approaches to word prediction; hence, research conducted on one language cannot be directly applied to other languages. Therefore, the aim of this study is to design and develop a word sequence prediction model for Afaan Oromo using a deep learning technique, taking the unique features of the language into consideration. Neural sequence prediction is an advanced successor of statistical word sequence prediction. It makes use of a large artificial neural network that predicts the likely sequence of long phrases and sentences. Unlike statistical approaches, neural word sequence prediction uses less memory, since the models are trained jointly to maximize the quality of the prediction. Recently, state-of-the-art sequence models have handled variable-sized input using stacks of attention layers instead of statistical approaches.

Chapter Three: The Nature of the Afaan Oromo Language

3.1. Introduction

This chapter concentrates on the major concepts of word prediction and ideas associated with the linguistic characteristics of Afaan Oromo. Statistical, knowledge-based and heuristic prediction methods are presented in order to understand the basic concepts of the research area. Since the aim of this study is to design and develop a word sequence prediction model for Afaan Oromo, the structure of Afaan Oromo, such as its morphological characteristics, grammatical properties and parts of speech, is discussed in the respective sections of this chapter.

3.2. Grammatical Structure of Afaan Oromo

Grammar is a set of structural rules governing the composition of sentences, clauses, phrases and words in a given natural language. These rules guide how words should be put together to make sentences. Word order and morphological agreement are the basic issues considered in Afaan Oromo grammar and are used as part of our word sequence prediction study. A sentence is a group of words that expresses a complete thought. Sentences are formed from a verb phrase and a noun phrase and can be classified as simple or complex. A phrase is a small group of words that stands as a conceptual unit. Simple sentences are formed from one verb phrase and one noun phrase, whereas a complex sentence contains one or more subordinate verbs other than the main verb, where subordinate verbs are verbs that are integrated with conjunctions. A sentence is said to be complex because it has the capability to contain other sentences within it [37].
Table 3. 1 Examples of simple and complex sentences in Afaan Oromo

Simple sentence: Gammaachu kalessa dhufe — “Gemechu came yesterday”
Complex sentence: Gammaachu kalessaa dhufe fi Kitaaboota isaa fudhe — “Gemechu came yesterday and he took his books”

A subject is the part of a sentence or utterance, usually a noun, noun phrase, pronoun or equivalent, that the rest of the sentence asserts something about and that agrees with the verb. It usually expresses the performer of the action of the verb. In an Afaan Oromo sentence, subjects most often occur at the beginning of the sentence. The subject of a sentence should be in accordance with the verb in gender, number and person.

In the sentence Roobaa intala isaa waammee “Roobaa called his daughter”, the subject Roobaa carries person, gender and number information, namely third person, masculine and singular respectively. This morphological property is reflected on the verb, waammee “called”.

Therefore, in order to predict words with the proper morphological information, the morphological properties of the subject of a sentence should be captured and properly applied to the verb while providing word suggestions.
Object and Verb Agreement
In Afaan Oromo, the object of a sentence has no grammatical agreement with the subject and verb of the sentence [37, 38].

Examples:
1. Isheen isa jaalatti. “She likes him.”

2. Inni ishee jaalata. “He likes her.”

3. Nuti isa binna. “We buy it.”


Adjective and Noun Agreement
Adjectives are very important in Afaan Oromo because they are used in everyday conversation. Oromo adjectives are words that describe or modify another person or thing in the sentence [27, 37]. Afaan Oromo adjectives should agree in number and gender with the noun they modify. Afaan Oromo adjectives may mark the number (singular or plural) and gender (feminine, masculine or neutral) of the noun they qualify, and hence should agree with the number and gender of that noun [27, 36].

For example, in the noun phrase namoota beekoo “knowledgeable men”, the word beekoo is an adjective that modifies the noun namoota “men”. It is marked for plural number, and this is reflected on the noun. It is inappropriate to write the phrase as nama beekoo “knowledgeable man”, since it shows number disagreement between the adjective and the noun.
To correct this, either the adjective should be marked with singular number, nama beekaa “knowledgeable man”, or the noun should be marked with plural number. In the noun phrase namicha furdaa “the fat man”, the word furdaa “fat” is an adjective that modifies the noun namicha “the man”. It is marked with masculine gender and is in agreement with the noun. However, if we take the phrase namicha furdoo “the fat man”, the adjective is marked with feminine gender while the noun it modifies is masculine.
Therefore, the adjective and noun are in disagreement, and to avoid this kind of inconsistency either the adjective should be marked masculine or the noun should be marked with feminine gender. For this particular example, an appropriate phrase is either namicha furdaa “the fat man” or namiti furdoo “the fat woman”, where there is agreement in number and gender between the adjective and the noun.
Adverb and Verb Agreement
Oromo adverbs are a part of speech. Generally, they are words that modify any part of language other than a noun. Adverbs can modify verbs, adjectives (including numbers), clauses, sentences and other adverbs [27].
For example, in the sentence Guta boru dhufa “Guta will come tomorrow”, the word boru “tomorrow” is an adverb that modifies the verb dhufa “will come”. The adverb and verb are in agreement, taking the imperfective tense form.

3.3. Summary

In this chapter, we have reviewed linguistic characteristics of Afaan Oromo such as parts of speech, morphology and grammar. We understand that Afaan Oromo nouns are inflected for number, gender and case, verbs are inflected for number, gender, tense, voice and aspect, and adjectives are inflected for number and gender.

Chapter 4: Methodology

4.1. Introduction

This chapter presents details of the methodology followed for designing and developing the Afaan Oromo word sequence prediction system, including corpus collection, corpus preparation and the design of the architecture of the proposed word sequence prediction model.

4.2. Model Designing

The architecture of the word sequence prediction system for Afaan Oromo text shows the overall workflow of the proposed model. This architecture works through different stages, namely the preprocessing, data splitting, embedding, model building and evaluation phases. First, the corpus goes through preprocessing, which includes cleaning, normalization, padding, tokenization and other preprocessing tasks. Then the preprocessed corpus is divided into two core sets, the training set and the test set. Afterwards, embedding is applied to make the corpus readable to the model: words are mapped to a vocabulary and then converted to vectors of continuous real-valued numbers.

Moreover, the encoder encodes the input sequence into an internal representation called the 'context vector', which is used by the decoder to generate the output sequence; the decoder decodes the encoded input sequence into the sentence to be predicted in the output. The output embedding and the output positional encoding are also applied at the end, just like the corresponding stages on the input side. Finally, the model is evaluated, and if the prediction quality is poor, the process goes back to data splitting for another experiment.

In system design, most previous authors used approaches other than deep learning for this task [20] [31]. In recent times, however, the neural seq2seq prediction approach, or encoder-decoder end-to-end language modeling, has become a more attractive form of language modeling for seq2seq prediction tasks [6]. This language modeling is based on deep learning algorithms, which perform better than the other approaches.
Language modelling is one of the most significant parts of present-day NLP. It underlies tasks such as text summarization, machine translation, text generation and speech-to-text, and text generation in particular is a noteworthy application of language modelling. A well-trained language model acquires knowledge of the probability of the occurrence of a word depending on the previous sequence of words. This thesis discusses word sequence prediction modelling with Afaan Oromo word embeddings for text generation and builds bidirectional recurrent neural networks for training the model. Figure 4.1 shows the workflow of this study.

[Figure 4.1 depicts the workflow: collecting data, data preprocessing, corpus building, tokenizing, converting to sequences of tokens, splitting into X and Y, pre-padding X with zeros, splitting into train and test sets, embedding, building parameters, an encoder RNN (GRU/LSTM/BiGRU/BiLSTM), attention, a decoder and the output.]

figure 4. 1 The architecture of proposed Afaan Oromo word sequence prediction

4.3. Components of the Proposed Model

4.3.1. Corpus Collection

Corpus collection was one of the challenging parts of our work. Since Afaan Oromo is an under-resourced language, the study simplified the collection process by developing a web scraping script using the Python library BeautifulSoup. Additionally, Afaan Oromo word sequence prediction requires statistical information such as the frequency of occurrence of words and their corresponding POS tags, which can be obtained from a corpus. Since Afaan Oromo corpora are not easily available, the study prepared the corpus from various sources, including newspapers (Bariisaa, Bakkalcha Oromiyaa and Oromiyaa), journals, the criminal code, books, social media such as Facebook, and web pages written by different authors on different issues such as politics, religion, history, fiction and love. The data collected are:

 6k data collected from Bariisa
 10k data collected from BBC
 5k data collected from the Minnesota Health Ministry site
 5k data collected from the jw.org site
 1k data collected from Gullalle
 7k data collected from the Ethiopia Criminal Code

figure 4. 2 description of the length of the corpus

The study prepared a total of 2,872,073 sentences and 9,983,441 words, with 102,528 unique words in the corpus excluding stop words. All files were converted to txt format in order to make them readable by the Python tools. The collected text is referred to as the “Muaz Hassen Word Sequence Prediction Corpus” and is used throughout this work. The datasets were collected from the different domains using a Python scraping script.
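As an illustration of the collection step, a minimal scraping sketch is given below. It is not the exact script used in this work; the URL and the choice of paragraph tags are illustrative assumptions.

import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url):
    # download the page and keep only the visible paragraph text
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

sentences = scrape_paragraphs("https://www.bbc.com/afaanoromoo")   # hypothetical example page
with open("afaan_oromo_corpus.txt", "a", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")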

Table 4. 1 Detail of corpus length

Corpus status      Length
Sentences          2,872,073
Words              9,983,441
Vocabulary size    102,528

4.3.2. Data Preparation and Preprocessing

The goal of data preprocessing is to achieve the highest quality prediction output with the available data. The data preprocessing steps are described in the following subsections.

4.3.3. Converting sentence to N-gram Tokens Sequence

A text generation language model requires sequences of tokens from which it can predict the likelihood of the next word or sequence, so the words need to be tokenized. The study used the Keras built-in tokenizer, which maps each word in the corpus to an index number. After this, all text is converted into sequences of tokens. In the n-gram setting, each sequence contains integer tokens generated from the input text corpus, where each integer represents the index of a word in the text vocabulary. The study used word embeddings over these vocabulary indices, so that when generating n-gram sequences each word is represented by a vector in the embedding matrix.

4.3.4. Tokenization

The study used the Keras tokenizer in order to convert text to integer sequences and vice versa. Tokenization is used to give each unique word in the corpus an integer representation.

For example, converting words to integers:

akkam nagaa jirtuu: 128 162 1774
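A minimal sketch of these two steps with the Keras tokenizer is shown below (not the exact thesis code; corpus is an illustrative list of sentences):

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["akkam nagaa jirtuu", "nagaa galaata rabbi", "atii akkam jirtaa"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)               # assign every unique word an integer index
vocab_size = len(tokenizer.word_index) + 1   # index 0 is reserved for padding

# build n-gram prefix sequences: [w1, w2], [w1, w2, w3], ...
sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(token_list) + 1):
        sequences.append(token_list[:i])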

4.3.5. Pad Sequence

Every sequence has a different length, so the study needs to pad the sequences to make their lengths equal. For this purpose, the study uses the Keras pad_sequences function. As input to the learning models the study uses the n-gram sequence as the given words and the predicted word as the following word.
akkam nagaa jirtuu: 0 0 128 162 1774
nagaa galaata rabbi: 0 0 162 6769 3908
atii akkam jirtaa: 0 0 12640 128 523
eessatti baddee: 0 0 0 12641 5400
Example of pre-padded sequences with sequence length 5.
The study converted each sentence into a sequence of integers of the maximum sentence length in the training corpus, pre-padding shorter sentences with zeros at the beginning to make them equal to the longest one.
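A minimal sketch of the pre-padding and the input/target split, assuming the n-gram sequences built in the previous step:

from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="pre")   # zeros are added at the front

X = padded[:, :-1]   # all tokens except the last one form the input
y = padded[:, -1]    # the last token is the word to be predicted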

4.4. Proposed Model Design and Architecture

As shown in Figure 4.3, the overall proposed model consists of six parts: an input and embedding layer, an encoder (LSTM, GRU, bidirectional), attention, a decoder layer (LSTM, GRU, bidirectional), an activation and dense layer with dropout, and an output layer.

[Figure 4.3 depicts the proposed model: input, embedding, an RNN model (GRU, LSTM, BiGRU or BiLSTM), attention, a rectified linear unit and dense/softmax output, compiled with the Adam optimizer, sparse categorical cross entropy loss and accuracy as the metric.]

figure 4. 3 proposed model architecture

In this work we experimented with different RNN models: single-direction models, bidirectional models (one direction forward and the other backward), and models with attention for both the single-direction and bidirectional cases. The output layer is a dense layer which gets its information from a single GRU/LSTM layer. For the bidirectional layers, past information is provided by the backward direction and the next or predicted sequence is provided by the forward direction. In the proposed model, the study used the weight (w) of the text sequence as input at time (t). An LSTM cell can store the previous input state and then work with the current state; when working on the current state it can remember the previous one, and using the activation function it can predict the next word or sequence. Since the study also used bidirectional RNNs, the previous input is remembered by the backward direction, while the forward direction helps the prediction of the future word or sequence.

4.4.1. LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit)

The RNN model is a way of using recurrent neural networks for sequence-to-sequence prediction problems. It was initially developed for machine translation, although it has proven successful at related sequence-to-sequence prediction problems such as next word prediction, text summarization and question answering [33]. The approach involves two recurrent neural networks, one to encode the input sequence, called the encoder, and a second to decode the encoded input sequence into the target sequence, called the decoder. The architectures used are RNN (GRU and LSTM) based, which outperform the other, non-neural architectures [5] [33]. Therefore, the design of the encoder part considers the adjustment of the number of units needed to receive and process the data, the choice of the number of layers needed to read all the words of a source sentence, the choice of layer types, and the choice of the techniques required for network optimization, such as selecting an appropriate formulation algorithm, a proper learning rate and an appropriate activation function [6] [34].

The encoder accepts a single element of the input sequence at each time step, processes it, collects information for that element and propagates it forward. The encoder is basically an LSTM/GRU cell [33]. An encoder takes the input sequence and encapsulates the information as internal state vectors. The encoder of the RNN-based system is designed on GRU and LSTM neural network architectures. The gated recurrent unit (GRU) uses gate units to control the flow of information; it also uses the current input and its previous output, which can be considered the current internal state of the network, to give the current output [34]. Long Short-Term Memory, in short LSTM, is a special kind of RNN capable of learning long-term sequences. It is explicitly designed to avoid long-term dependency problems; remembering long sequences for a long period of time is its way of working. In an LSTM, the cell state acts as a transport highway that transfers relevant information all the way down the sequence chain [35] [36]. The cell state, in theory, can carry relevant information throughout the processing of the sequence. The encoder takes a list of token IDs, looks up an embedding vector for each token and processes the embeddings into a new sequence [34]. Therefore, each element of a vector from the word embedding is read by a unit of the network.

The encoder reads the input sequence and summarizes the information in what are called the internal state vectors or context vector (in the case of the LSTM, these are called the hidden state and cell state vectors). The outputs of the encoder are discarded and only the internal states are preserved. This context vector aims to encapsulate the information of all input elements in order to help the decoder make accurate predictions [6] [36].

figure 4. 4 example word sequence prediction of the model

Our proposed model takes the embedded vector representations of the input sequence or sentence, whose varying word lengths are later equalized by zero padding into a fixed-size vector representation. To pass the input sequence to the encoder LSTM or GRU layer, it must first be embedded and then padded, after which it is passed to the encoder layer. The encoder layers (LSTM and GRU) read the input sentence or sequence with a specific vector length, namely the specified vocabulary length, before the input sequence is given to the designed model [37].

The gated recurrent unit (GRU) architecture uses gate units to control the flow of information. It uses the current input and its previous output, which can be considered the current internal state of the network, to give the current output [36]. The LSTM uses a cell that carries important information throughout the sequence chain of the architecture. As the information flows, the gates decide which information is relevant to keep or forget during training of the model; the LSTM also uses gates to regulate the flow of information within the LSTM cell [31] [34].

These LSTM and GRU cells are used for both the encoding and the decoding of information in this research. The encoder takes in the input vectors of the words and formulates a connection among them (encoding). This encoded data is then fed to the decoder together with the desired output. The decoder then forms the connection between the decoder's last output and the desired output. This is the training phase, in which the neurons within these cells are taught the relation between the input words and the final predicted output.

In order to implement the encoder with this architecture, the study set the number of internal units required to read the input sequence from the word embedding layer equal to the size of the word embedding. Therefore, each element of a vector from the word embedding is read by one unit of the network.

To capture the structure of our input language, our encoder layer accepts input words through a word embedding layer, which is the vector representation of the input sentence. After deriving the new vector, which represents the contextual relationship between the words of the sentence, by generating one word at a time, it feeds the context vector into the attention mechanism layer.

Therefore, the embedding layer is at the top, embedding the given input sentence; the encoder layers are in the middle; and the attention mechanism is at the bottom. The attention mechanism interconnects the encoder layers with the decoder layers of the proposed system.

4.4.2. Model Layer Description

Our proposed model is a Keras sequential model, and our proposed RNN contains five layers for taking the input, performing the calculations and producing a decision.

Input layer: after the raw corpus is preprocessed, it needs to be converted into a form of word vectors that the hidden layer can receive and process. Traditional machine learning methods mainly used the classic one-hot representation; this representation is very simple, but it gives no way to measure semantic relations between words. Using an embedding layer to train text word vectors allows the semantic information contained in the words to be learned better in a low-dimensional space.
Hidden layer: the input of the hidden layer is the text word vectors from the layer above. This part uses GRU, LSTM, BiLSTM or BiGRU to ensure that the information between the text contexts is fully learned while the model training time is greatly reduced. In addition, in order to highlight the importance of the word to predict, the model introduces the attention mechanism. By calculating and assigning the corresponding probability weights of different word vectors, the key information of the text can be further highlighted, which helps to extract the deep features of the text.
Output layer: the classification result of the output layer is calculated by the softmax function on the output of the BiGRU. The specific formula is

    y_i = softmax(w_i x_i + b_i)        (1)

where w_i represents the probability weight assigned by the attention mechanism, x_i represents the hidden layer vector to be classified, b_i represents the corresponding bias, and y_i is the predicted label of the output.
An input embedding layer was used as the initial layer of the neural network. Then comes the hidden layer, which is the main LSTM or GRU layer with 50 units; the network layers with attention also use 50 units. We also add bidirectional and attention variants with LSTM and GRU cells. A dense layer whose size equals the vocabulary size is added with the softmax activation function. We compiled the model with sparse categorical cross entropy, since the targets are numeric values, and we used the Adam optimization function. Finally, we fit the defined model on the input and output sequences with verbose output. The final, output layer is the dense layer, to which the softmax activation function is applied. Softmax calculates the probability distribution over n events; this function calculates the probability of each target class across all possible target classes. We tried to select the optimal loss function between sparse and categorical cross entropy based on our dataset; from our experiments with both loss functions we chose sparse categorical cross entropy, because categorical cross entropy caused memory errors. As this problem is treated as a multiclass classification problem, the loss function is sparse categorical cross entropy.
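A minimal sketch of one such configuration is given below (an assumed illustration, not the exact thesis code): an embedding layer, a single LSTM layer with 50 units, dropout and a dense softmax layer over the vocabulary, compiled as described above. A GRU layer, or a Bidirectional wrapper, can be substituted for the LSTM.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

embed_dim = 50
units = 50

model = Sequential([
    Embedding(vocab_size, embed_dim),           # input/embedding layer
    LSTM(units),                                # hidden recurrent layer
    Dropout(0.5),                               # regularization against overfitting
    Dense(vocab_size, activation="softmax"),    # probability distribution over the vocabulary
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])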

4.5. Tune Hyper parameters for proposed models

We had to choose a number of hyperparameters for defining and training the model. We relied on intuition, examples and best-practice recommendations. Our first choice of hyperparameter values, however, may not yield the best results; it only gives us a good starting point for training. Every problem is different, and tuning these hyperparameters helps refine the model to better represent the particularities of the problem at hand. Let us take a look at some of the hyperparameters the study used and what it means to tune them:

Number of layers in the model: the number of layers in a neural network is an indicator of its complexity, so the study must be careful in choosing this value. Too many layers will allow the model to learn too much information about the training data, causing overfitting; too few layers can limit the model's learning ability, causing underfitting. For text classification datasets, the study experimented with one-, two- and three-layer MLPs; models with two layers performed well, and in some cases better than three-layer models [68]. Similarly, the study tried RNNs with four and six layers, and the four-layer models performed well.

Number of units per layer: The units in a layer must hold the information for the transformation
that a layer performs. For the first layer, this is driven by the number of features. In subsequent
layers, the number of units depends on the choice of expanding or contracting the representation
from the previous layer. Try to minimize the information loss between layers. The study tried unit
values in the range [8, 16, 32, 64], and 32/64 units worked well.

Dropout rate: Dropout layers are used in the model for regularization. They define the fraction of
input to drop as a precaution for overfitting. Recommended range: 0.2–0.5.

Learning rate: This is the rate at which the neural network weights change between iterations. A
large learning rate may cause large swings in the weights, and the study may never find their
optimal values. A low learning rate is good, but the model will take more iterations to converge.
It is a good idea to start low, say at 1e-4. If the training is very slow, increase this value. If your
model is not learning, try decreasing learning rate.

Table 4. 2 Parameters of the proposed models

No  Model                           Parameters [batch, unit, epoch]   Number of layers
1   LSTM (single layer)             128, 50, 105                      5
2   LSTM (single with attention)    128, 50, 105                      6
3   LSTM (bi-layer)                 128, 50, 105                      5
4   GRU (single layer)              128, 50, 105                      5
5   GRU (single with attention)     128, 50, 105                      6
6   GRU (bi-layer)                  128, 50, 105                      5
7   BiGRU + attention               128, 50, 105                      6
8   BiLSTM + attention              128, 50, 105                      6

4.6. The Evaluation

After the candidate models (RNNs) have been trained with the available corpus, they will be evaluated based on the perplexity score metric. The type of experiment may differ from one model to another through the hyperparameters; the corpus size and the ratio of the dataset split may also differ. The main experiment used for the comparison and evaluation of all models is conducted with the 2,872,073-sentence corpus.

Generally, when designing next word prediction for Afaan Oromo, the study has to have training and testing phases so that the system can achieve its goal. The training phase comes first and the testing phase comes last.

Chapter Five: Experimentation

5.1. Introduction

This chapter describes the experimental results of this thesis by presenting the experimental setups and the performance testing results of the experimental systems, using accuracy and loss metrics and training time, and it compares the results of the different models. Additionally, this chapter presents the tools and development environments used to implement next word prediction for Afaan Oromo.

5.2. Experimental Environment and Parameter Settings

The experiments were carried out under the Windows 10 operating system on a CPU-based development environment. We used Anaconda (version 3.8), which supports the different Python libraries needed, and we used Keras with TensorFlow as the backend. The generated word vector dimension is 120, and the number of iterations is 100. The parameters of each neural network in the experiment are: hidden layer size 100, batch size 128, cross entropy as the loss function, and the RMSprop algorithm as the initialization method.

5.3. Experiment procedure

In this experiment, eight neural networks were set up for comparison: LSTM, LSTM + attention, GRU, GRU + attention, BiLSTM, BiGRU, BiLSTM + attention and BiGRU + attention. The experimental steps are as follows: (1) after the corresponding dataset is preprocessed, the corresponding word vectors are obtained as the input of each model; (2) after receiving the input matrix, the hidden layer outputs the accuracy and loss of the training set every 100 steps, and after each iteration is completed, the accuracy and loss of the corresponding training and test sets are output; (3) the attention mechanism is introduced to assign a corresponding weight to the text vector output at each iteration and apply it to the classification function; (4) after 100 iterations, the model evaluates its performance on the test set and outputs the classification accuracy, loss and time cost of the model. Each of the proposed models was run with comparative experiments on the same dataset and in the same experimental environment to ensure the validity of the experiment.

Data Selection Methodology: As data is one of the most important parts of training and supporting any neural network, the study selected datasets based on word length and separated the corpus by the number of words in the sentences. The data vector in this work is a sequence of words. In order to convert these numbers back into the corresponding words, an idx2word dictionary is used to map the distinct numbers onto a unique set of words. The data vector, in this case, is referred to as "text to sequence", which is created by converting a running sequence of words into a number sequence. Before this text-to-sequence conversion can be used, the dataset undergoes a great deal of preprocessing. First, all the data from a document is collected and serialized with the help of the Python library BeautifulSoup. After this, the vocabulary of the dataset and the idx2word and word2idx dictionaries are formed, and with the help of these dictionaries and the vocabulary a sequence of words is converted into a numerical sequence.
Text to Sequence Conversion: A sequence-to-sequence model requires integer sequences, so the study converted both the input and the output words into integer sequences of fixed length.
Data Splitting into Train and Test: Once a numerical sequence is generated with the help of the TensorFlow Keras tokenization library, this array is converted into sequences and the dataset is split into two parts, the training set and the test set; shuffling and random selection are applied to the dataset to add some variance and to boost model performance. We divided the total dataset into training and test sets for model training and evaluation, following the rule of 80% and 20% respectively.
The study set the first words of each Afaan Oromo sentence, excluding the last word, as the input sequence and the last word of the sentence as the target sequence. This has to be done for both the training and test datasets.
Selecting data for model training based on sentence length
The study selected training data based on sentence length from the total collected dataset, using the
following steps (a code sketch of these lines is given after the list).
• In the first line, the dataset is read from the directory into the Python environment using the
pandas library.
• In the second line, the data frame is converted into a list containing all the sentences.
• In the third line, the list of sentences is converted into words.
• In the 4th and 5th lines, the dataset is tokenized and fitted in order to obtain the unique words
in the corpus.
• In the 6th and 7th lines, the words are converted into numbers and sentences with length
greater than 2 are selected.
• In the 8th line, the training dataset is pre-padded with zeros.
• In the 9th line, the corpus vocabulary size is computed.
• In the 10th and 11th lines, the training dataset is split into inputs and labels, respectively.
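A minimal sketch of these eleven lines is given below. The file name, the column name and the exact filtering threshold are assumptions made for illustration; the real corpus files and columns may differ.

    import numpy as np
    import pandas as pd
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    df = pd.read_csv("afaan_oromo_corpus.csv")           # 1: read the dataset with pandas (assumed file name)
    sentences = df["text"].astype(str).tolist()          # 2: data frame -> list of all sentences (assumed column)
    words = [s.split() for s in sentences]               # 3: sentences -> words

    tokenizer = Tokenizer()                               # 4-5: tokenize and fit to get the unique words
    tokenizer.fit_on_texts(sentences)

    sequences = tokenizer.texts_to_sequences(sentences)  # 6: words -> numbers
    sequences = [s for s in sequences if len(s) > 2]     # 7: keep sentences longer than two words

    padded = pad_sequences(sequences, padding="pre")     # 8: pre-padding with zeros
    vocab_size = len(tokenizer.word_index) + 1           # 9: corpus vocabulary size (+1 for the padding index 0)

    X = padded[:, :-1]                                   # 10: inputs = all words except the last
    y = padded[:, -1]                                    # 11: labels = the last word of each sentence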

5.4. Description of Proposed Model

1) Input layer, which takes a sequence of words as X and the next word as Y.

2) Embedding layer, composed of 50 units, used to map every distinct vocabulary entry into these 50
units. Given a sentence of N words, w1, w2, ..., wN, each word wn ∈ W is first embedded in a
D-dimensional vector space by applying the lookup-table operation LTW(wn) = Wwn, where the matrix
W ∈ R^(D×|W|) holds the parameters to be trained in this lookup layer. Each column Wn ∈ R^D
corresponds to the vector embedding of the nth word in our dictionary W. In practice, it is common to
give several features as input to the network; this can easily be done by adding a different lookup table
for each discrete feature.

3) A recurrent layer drawn from the pool of LSTM, GRU and bidirectional RNN, in either a single-layer
or a two-layer configuration.

4) Dropout layer, used to help the neural models generalize during training, learn the sequences
efficiently and prevent overfitting.

5) Dense layer, which connects all the neurons from the recurrent layers and produces the desired
output as required by the user. In this case, the output is analogous to the input of the embedding
layer: a vector of the dimension of the vocabulary set.

6) Adam optimizer, which is similar to the gradient descent optimizer [22] and is used to optimize the
steps during model training so that the losses converge at a faster rate.

After training, the neural models are ready to generate a new sequence of words. To ensure better
prediction and more diverse output sequences, an annotated dataset is used. The goal was to expose
the model to a diverse dataset, which would lead to better tuning of the model. The text file format
was used to extract the dataset, and the text files were annotated, which played an important role in
determining the speaker and the correct sequence of the dialogues.

The model was compiled using Moon et al. [19] as a suggested guide for dropouts. A dropout of 0.5
was applied to each of the neural layers, i.e., LSTM, GRU and Bidirectional LSTM. The optimizer
selected was Adam [20], with a learning rate of 1e-3 for model parameter optimization.
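Putting the six layers above together, one plausible Keras realization is sketched below. The single LSTM layer stands in for whichever recurrent variant is being trained (GRU, bidirectional, with or without attention), and the 100 recurrent units are an assumed value; vocab_size and input_length come from the preprocessing sketch above.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

    def build_model(vocab_size, input_length, rnn_units=100):
        """Sketch of one variant: embedding -> recurrent layer -> dropout -> dense softmax output."""
        model = Sequential([
            Embedding(vocab_size, 50, input_length=input_length),  # 50-unit embedding (lookup table)
            LSTM(rnn_units),            # swap in GRU(...) or Bidirectional(LSTM(...)) for the other variants
            Dropout(0.5),               # dropout of 0.5 on the neural layer
            Dense(vocab_size, activation="softmax"),               # output vector of vocabulary dimension
        ])
        return model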

5.5. Training the Model

In training the model, train, validation and test datasets are created for the prediction of words. The
training dataset is used for the primary training, the validation dataset is used to check the validation
accuracy during training, and the test dataset is used for the final accuracy testing.

After defining the model, the study optimized the model loss function by experimenting with sparse
categorical cross-entropy. The study trained 8 RNN models using sparse categorical cross-entropy.
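A minimal sketch of compiling and training one of these models with sparse categorical cross-entropy is shown below, reusing X, y, vocab_size and build_model from the earlier sketches. The scikit-learn split, the validation fraction and the epoch and batch-size values are illustrative assumptions, not the exact settings of this study.

    from sklearn.model_selection import train_test_split
    from tensorflow.keras.optimizers import Adam

    # 80%/20% train/test split with shuffling, as described above.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

    model = build_model(vocab_size, input_length=X.shape[1])
    model.compile(loss="sparse_categorical_crossentropy",   # targets are integer word indices, not one-hot vectors
                  optimizer=Adam(learning_rate=1e-3),
                  metrics=["accuracy"])

    history = model.fit(X_train, y_train,
                        validation_split=0.1,               # illustrative validation fraction
                        epochs=100, batch_size=128)         # illustrative training settings

    test_loss, test_acc = model.evaluate(X_test, y_test)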

[Figure: diagram of the eight proposed RNN models (LSTM, GRU, BiLSTM, BiGRU, LSTM with attention,
GRU with attention, BiLSTM with attention and BiGRU with attention).]

figure 5. 1 proposed RNN models


The model with the categorical cross-entropy loss function suffered from delays while training and
from memory overflow.

Constraints

• Memory constraint: the corpus size may be too large, which might cause a memory error.

• Latency constraint: this is a low-latency problem, as the entire system is designed to enable
fast typing.

• OOV words: it is important to take care of out-of-vocabulary words, because not all words may
be present in the corpus, yet the model should still be able to handle them.

• Divide x and y into train and test datasets for each sequence length.

• Train different LSTM models for different sequence lengths: a separate model is used for each
input length (a sketch of this scheme follows the list).
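A hedged sketch of this per-length scheme follows, reusing sequences, vocab_size and build_model from the earlier sketches; grouping by exact sentence length and the training settings are illustrative assumptions.

    import numpy as np

    models_by_length = {}

    # Group sentences by their length and train a separate model for each group.
    for length in sorted({len(s) for s in sequences}):
        group = np.array([s for s in sequences if len(s) == length])
        X_len, y_len = group[:, :-1], group[:, -1]       # inputs = all but the last word, label = the last word

        model = build_model(vocab_size, input_length=length - 1)
        model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
        model.fit(X_len, y_len, epochs=50, batch_size=64, verbose=0)   # illustrative settings

        models_by_length[length - 1] = model             # keyed by input length for lookup at prediction time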

Based on the model testing results, the study also observed that the model was limited to a small
vocabulary size out of our total dataset. Finally, the study selected sparse categorical cross-entropy for
this thesis.

5.6. Proposed model Training Result

Table 5. 1 result of training model

Model            Training accuracy    Training loss

LSTM             83.63%               0.6
GRU              84.87%               0.6
BILSTM           82.94%               0.6
BIGRU            88.68%               0.3
LSTM + Att       86.58%               0.5
GRU + Att        86.71%               0.47
BIGRU + Att      90%                  0.1
BILSTM + Att     89%                  0.2

[Figure: bar chart titled "Model Performance" comparing training accuracy and training loss across the
LSTM, GRU, BILSTM, BIGRU, LSTM+att, GRU+att, BILSTM+att and BIGRU+att models.]

figure 5. 2 result of training model

Generally, during training, the bidirectional GRU with attention model performed best, with a training
accuracy of 90% and a loss of 0.1.

5.7. Test Results of Proposed Model


Due to hardware limitations, the study trained on only one week's worth of corpus data. Finally, tests
were carried out with different Afaan Oromo words, and the model generated text according to the
preceding text.

The study used different tools and development environments in order to implement the algorithms
and to carry out the necessary experiments on the system.

Our system generates word sequence predictions in a manner that closely matches Afaan Oromo
typing systems. The user types text in a text box (input), and when the space bar or one of the
delimiters is pressed, the system predicts the 5 most likely next words and shows them in an
equal-sized suggestion list box, with the most likely suggestion at the top of the list. The user then
clicks his or her preferred word from the given list of options instead of typing each character, and
clicks the add button. However, if the required word is not in the given options, the user continues
typing in the usual way. The figure below shows the user interface of the word sequence prediction.
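A minimal sketch of how the prototype's five suggestions can be produced from a trained model is shown below: the typed text is converted to a padded integer sequence, the softmax output is sorted, and the five most probable words are mapped back through idx2word. The function and variable names are illustrative, not taken from the prototype's source code.

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def suggest_next_words(text, model, tokenizer, idx2word, max_len, top_k=5):
        """Return the top_k most likely next words for the text typed so far."""
        seq = tokenizer.texts_to_sequences([text])
        seq = pad_sequences(seq, maxlen=max_len, padding="pre")
        probs = model.predict(seq, verbose=0)[0]              # softmax distribution over the vocabulary
        best = np.argsort(probs)[::-1][:top_k]                # indices of the top_k probabilities
        return [idx2word[i] for i in best if i in idx2word]   # most likely suggestion first

    # Example: called whenever the space bar or another delimiter is pressed in the text box.
    # suggestions = suggest_next_words("akkam jirta", model, tokenizer, idx2word, max_len=X.shape[1])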

5.7.1. Model Evaluation Result

Table 5. 2 Result of testing

Model                 Accuracy    Perplexity

LSTM                  77.39%      0.9
GRU                   78.97%      0.8
BILSTM                77.93%      0.8
BIGRU                 81.95%      0.6
LSTM + attention      80.04%      0.7
GRU + attention       80.33%      0.7
BILSTM + attention    81%         0.4
BIGRU + attention     82%         0.3

[Figure: bar chart of test accuracy and perplexity for the eight models, sorted by performance.]

figure 5. 3 accuracy and perplexity of the testing results, sorted according to their performance
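For reference, perplexity is conventionally obtained as the exponential of the average cross-entropy loss on the test set; the sketch below shows that conventional computation, reusing model, X_test and y_test from the training sketch (this is stated as an assumption about how such scores can be derived, not necessarily how the figures in Table 5.2 were produced).

    import numpy as np

    # Conventional perplexity: exp of the average cross-entropy on the held-out test set.
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    perplexity = np.exp(test_loss)
    print(f"test accuracy: {test_acc:.4f}, perplexity: {perplexity:.2f}")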

In order to analyze the results from different aspects, the study also conducted experiments on
training-set convergence performance. It should be noted that, for a clearer presentation of the
comparison results, the study retained 6 contrast models on our datasets. The loss values of all models
were smoothed to avoid problems possibly caused by fluctuating loss. The number of iterations is 125,
and in order to present the convergence changes clearly, the study recorded one point per epoch. As
shown in Figure 5.3, the best-performing model converges quickly with a notably decreasing slope,
while LSTM shows a slightly weaker convergence speed. Matching its disappointing accuracy, its
convergence speed is quite poor, which further suggests that attention-based Bi-LSTM is not well
suited for short text classification. BILSTM shows unstable convergence across the 8 experiments.
Combined with its unstable classification accuracy, the study considers that this simple model easily
leads to an over-fitting problem. BIGRU, by contrast, shows a relatively good convergence speed,
which further proves the effectiveness of adding a recurrent network. From the above two
perspectives, the BIGRU model is not only highly accurate on the testing set, but also converges
quickly on the training set.

5.8. Prototype

The prototype of Afaan Oromo word sequence prediction was developed using a Python framework.
The main aim of developing the prototype is to demonstrate and test the developed word sequence
prediction model.

figure 5. 4 predicting one output word from a two-word input

figure 5. 5 taking one input word and predicting one or more output words

5.9. Error Analysis for unigram data points

figure 5. 6 Error analysis of unigram

figure 5. 7 Error analysis of trigram
From the EDA of the above 3 cases, the study concluded that stop words mainly overlap between the
best and worst prediction data points. The rest are distinct.

5.10. Discussion

A comparative study was carried out between all the variants of the neural layers used to train and
generate text sequences for the scripts, together with the average time required to train each of them
on CPU. While training the different models, it was observed that LSTM-based neural networks took
the least time to execute a training epoch, bidirectional RNN took the most time, and GRU took slightly
more time than LSTM.

Figures 5.8-5.16 represent the overall performance of each model architecture.

figure 5. 8 LSTM with attention

figure 5. 9 LSTM model loss

figure 5. 10 training Result LSTM model

figure 5. 11 BILSTM with attention

figure 5. 12 BI LSTM with attention

figure 5. 13 Training Result BIGRU model

figure 5. 14 Training Result of GRU model with attention

figure 5. 15 BI GRU with attention

figure 5. 16 GRU model

Chapter 6: Conclusion and Future Work

6. Conclusion

Many approaches to word sequence prediction have been designed over time. The main objectives are
to speed up typing, reduce the effort and time spent composing a text message, and boost the
communication rate. These approaches were mainly produced to help people with disabilities or
dyslexia, but people without disabilities can also use next-word prediction systems to correct spelling
mistakes and type their desired words with less effort while composing text messages; in other words,
less typing is needed. Various evaluations of next-word prediction models have also been presented,
especially from the perspectives of time saved, performance, text input rate, and accuracy.

Various techniques have been proposed to enhance next-word prediction systems, such as statistical
and deep learning approaches, and this thesis has studied these approaches. From our review, we see
that deep learning is one of the most popular techniques used for language modeling, unlike the
statistical technique. RNN with attention is a recent state-of-the-art approach to the sequence
prediction problem, and it has gained more popularity than plain RNN by addressing the problem of
long sequence dependency.

In addition, this thesis work discussed the design and development of word sequence prediction for
Afaan Oromo using RNN both with and without attention (LSTM/GRU/BIGRU/BILSTM), which are deep
learning models. The study has proposed a decent technique for building automatic Afaan Oromo
next-word sequence prediction using a bidirectional RNN. Although no model gives a perfectly precise
outcome, our model gives better output and most of its output is exact. Using the proposed model,
the study has effectively generated fixed-length and meaningful Afaan Oromo text. The performance
of the models was further analyzed to reach the conclusion that LSTM generates text in the most
efficient way, followed by GRU and then bidirectional RNN, while the loss is lowest in bidirectional
RNN, followed by LSTM, and highest in GRU. The LSTM model takes the least time for text generation,
GRU takes slightly more time and bidirectional RNN takes the most time. The addition of attention also
increases performance over plain RNN, following bidirectional RNN. From the comparison of all 8
models in our experiments, bidirectional GRU with attention performs best.

Generally, in order to achieve the objectives, corpus data was collected from different sources and
divided into training and testing sets, with 80% of the total dataset used for training and 20% for
testing. The Afan Oromo word sequence prediction model was designed and developed using a deep
learning technique, namely the RNN approach. Eight RNN models were implemented with various
techniques: GRU, LSTM, Bidirectional GRU, Bidirectional LSTM, GRU with attention, LSTM with
attention, Bidirectional LSTM with attention and Bidirectional GRU with attention. Three systems were
implemented: the first uses a word-based statistical approach as a baseline, the second uses a
recurrent neural network approach as a competitive model, and the third uses recurrent neural
networks with attention for Afan Oromo word sequence prediction.

The designed model was evaluated, using accuracy and perplexity score to measure model
performance. According to the evaluation, we obtained 83.63% for LSTM, 84.87% for GRU, 82.94% for
BILSTM, 88.68% for BIGRU, 86.58% for LSTM with attention, 86.71% for GRU with attention, 89% for
BILSTM with attention and 90% for BIGRU with attention. Therefore, the BIGRU model performs quite
well, and BIGRU with attention shows the best performance.

6.1. Contribution of the Thesis

The contributions of this thesis work are summarized as follows:

• We proposed an RNN architecture for Afaan Oromo word sequence prediction.

• We designed and developed an Afaan Oromo word sequence prediction system.

• The study identified the bidirectional GRU model as the word prediction approach best suited
to Afaan Oromo word sequence prediction, with a training accuracy of 88% and a testing
accuracy of 81%.

• The study obtained an individual and cumulative 1-gram perplexity score of 83%.

• The study obtained individual and cumulative 2-gram perplexity scores of 67% and 75%,
respectively.

• The study obtained individual and cumulative 3-gram perplexity scores of 56% and 68%,
respectively.

• The study obtained individual and cumulative 4-gram perplexity scores of 47% and 62%,
respectively.

6.2. Future work

There are a number of areas for improvement and modification in word sequence prediction for
Afaan Oromo. Below are some of the recommendations the study proposes for future work. There are
a few imperfections in our proposed system; for example, it cannot generate arbitrary-length content,
so the study has to define the length of the generated content. Another shortcoming is that the study
has to define a padding token for predicting next words. In future work, the study will build an
automatic Afaan Oromo text generator which produces arbitrary-length Afaan Oromo text without
using any such token or fixed sequence length. Also, in this work the study used limited data due to
hardware limitations; the study will later enhance the dataset. In the future, the study will also
improve the model to achieve multi-task sequence-to-sequence text generation.

REFERENCES

1. Barry McCaul and Alistair Sutherland, “Predictive Text Entry in Immersive Environments”,
Proceedings of the IEEE Virtual Reality 2004 (VR'04), p. 241, 2004.

2. Nesredin Suleiman, “Word Prediction for Amharic Online Handwriting Recognition”,
Unpublished MSc Thesis, Addis Ababa University, 2008.

3. Kumiko Tanaka-Ishii, “Word-Based Predictive Text Entry Using Adaptive Language Models”,
Natural Language Engineering 13(1): 51–74, Cambridge University Press, 15 February 2006.

4. Nicola Carmignani, “Predicting words and sentences using statistical models”, Language and
Intelligence Reading Group, July 5, 2006.

5. Garay-Vitoria Nestor and Julio Abascal, “Text Prediction Systems: A Survey”, Universal Access
in the Information Society, 4(3): 188-203, 2006.

6. Lesher G., Moulton B., and Higginbotham D., “Effects of N-gram Order and Training Text Size
on Word Prediction,” in Proceedings of the (RESNA’99) Annual Conference, Arlington, VA,
pp. 52-54, 1999.

7. Even-Zohar Y. and Roth D., “A Classification Approach to Word Prediction”, in Proceedings of
the 1st North American Conference on Computational Linguistics (NAACL 2000), pp. 124-131,
2000.

8. Koester H. and Levine S., “Modeling the Speed of Text Entry with a Word Prediction
Interface”, IEEE Trans. on Rehabilitation Engineering, vol.2, no. 3, pp. 177-187,
September 1994.

9. Tesfaye Guta Debela, “Afaan Oromo Search Engine", Unpublished MSc Thesis, Addis
Ababa University, 2010.

10. Debela Tesfaye", Designing a Stemmer for Afaan Oromo Text: A Hybrid Approach",
unpublished MSc Thesis, Addis Ababa University ,2010

11. Gaddisa Olani Ganfure and Dida Midekso, “Design and Implementation of Morphology
Based Spell Checker”, International Journal of Scientific & Technology Research,
December 2014 pp118-125

12. Morka Mekonnen, “Text to speech system for Afaan Oromo”, Unpublished MSc Thesis,
Addis Ababa University, 2001.

13. Diriba Magarsa, “An automatic sentence parser for Oromo language”, Unpublished MSc
Thesis, Addis Ababa University, 2000.

14. Assefa W/Mariam, “Developing morphological analysis for Afaan Oromo text”,
Unpublished MSc Thesis, Addis Ababa University, 2000.

15. Abraham Tesso Nedjo and Degen Huang, “Automatic Part-of-speech Tagging for Oromo
Language Using Maximum Entropy Markov Model (MEMM)”, Journal of Information &
Computational Science pp. 3319–3334, July 1, 2014

16. Md. Masudul Haque and Md. Tarek Habib, “Automated Word Prediction in Bangla
Language Using Stochastic Language Models”, International Journal in Foundations of
Computer Science & Technology (IJFCST) Vol.5, No.6, pp 67-75, November 2015

17. Alemebante Mulu and Vishal Goyal, “Amharic Text Predict System for Mobile Phone”,
International Journal of Computer Science Trends and Technology (IJCST) –Volume 3
Issue 4, Jul-Aug 2015.

18. Tigist Tensou, “Word Sequence Prediction for Amharic Language”, Unpublished MSc
Thesis, Addis Ababa University, 2014.

19. Johannes Matiasek, Marco Baroni, and HaraldTrost, “FASTYA multi-lingual approach to
text prediction”, In Computers Helping People with Special Needs, pp. 243-250. Springer
Berlin Heidelberg, 2002.

20. Alice Carlberger, Sheri Hunnicutt, John Carlberger, Gunnar Stromstedt, and Henrik
Wachtmeister, “Constructing a database for a new Word Prediction System,” TMH-
QPSR 37(2): 101-104, 1996.

21. Sachin Agarwal and Shilpa Arora, “Context based word prediction for texting language”,
In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pp. 360-
368, 2007.

22. Carlo Aliprandi, Nicola Carmignani, NedjmaDeha, Paolo Mancarella, and Michele
Rubino, “Advances in NLP applied to Word Prediction”, University of Pisa, Italy
February, 2008.

23. Aliprandi Carlo, Nicola Carmignani, and Paolo Mancarella, “An Inflected-Sensitive
Letter and Word Prediction System”, International Journal of Computing and Information
Sciences, 5(2): 79-852007

24. Masood Ghayoomi and Ehsan Daroodi, “A POS-based word prediction system for the
Persian language”, In Advances in Natural Language Processing, pp. 138-147, Springer
Berlin Heidelberg, 2008.

25. G. Q. A. Oromo, “Caasluga Afaan Oromo Jildi I”, Komishinii Aadaaf Turizmii
Oromiyaa, Finfinnee, Ethiopia, pp. 105-220 (1995).

26. Keith Trnka, John McCaw, Debra Yarrington and Kathleen F. McCoy, “Word Prediction
and Communication Rate in AAC”, IASTED international conference Tele health and
assistive technology, April 16-18 2007, Maryland USA

27. Getachew Mamo Wegari and Million Meshesha, “Parts of Speech Tagging for Afaan
Oromo”, (IJACSA) International Journal of Advanced Computer Science and
Applications, Special Issue on Artificial Intelligence.

28. Getachew Emiru, "Development of Part of Speech Tagger Using Hybrid Approach"
Unpublished MSc Thesis, Addis Ababa University ,2016

29. Mohammed Hussen Abubeker, "Part-Of-Speech Tagging for Afaan Oromo Language
Using Transformational Error Driven Learning (Tel) Approach", Unpublished MSc
Thesis, Addis Ababa University, 2010.

30. Aberra Nefa, “Oromo verb inflection”, Unpublished MA Thesis, Addis Ababa
University,2000.

31. Baye Yimam, “The Phrase Structure of Ethiopian Oromo”, Unpublished Doctoral Thesis,
University of London, 1986.

32. Addunyaa Barkeessaa, “Sanyii Jechaa fi caasaa Isaa (Word and Its structure)”,
Alem,2004.

33. Michael Gasser, Hornmorph User's Guide, 2012.

34. Wakweya Olani, “Inflectional Morphology in Oromo,”2012

35. Debela Tesfaye, “A rule-based Afan Oromo Grammar Checker”, (IJACSA) International Journal
of Advanced Computer Science and Applications, Vol. 2, No. 8, 2011.

36. C. G. Mewis, “A Grammatical Sketch of Written Oromo”, Germany: Koln, pp. 25-99, 2001.

37. Masood Ghayoomi and Saeedeh Momtazi, “An overview on the existing language
models for prediction systems as writing assistant tools”, In Systems, Man and
Cybernetics, 2009. SMC 2009. IEEE International Conference on, pp. 5083 5087, IEEE,
2009

38. Klund, J. and Novak, M. (2001), “If word prediction can help, which program do you choose?”
Available at: http://trace.wisc.edu/docs/wordprediction2001/index.htm

39. M. E. J. Woods, “Syntactic Pre-Processing in Single-Word Prediction for Disabled People”,
Unpublished Doctoral Dissertation, University of Bristol, Bristol, 1996.

40. Fazly, “The Use of Syntax in Word Completion Utilities,” Unpublished MSc, University
of Toronto, Canada, 2002

41. E. Gustavii and E. Pettersson, “A Swedish Grammar for Word Prediction”, Unpublished
MSc, Uppsala University, Stockholm, 2003

42. J. Hasselgren, E. Montnemery, P. Nugues, and M. Svensson, “HSM: A predictive text entry
method using bigrams”, in Proceedings of the Workshop on Language Modeling for Text Entry
Methods, 10th Conference of the EACL, Budapest, Hungary, pp. 59-99, 2003.

43. Sharma, Radhika; Goel, Nishtha; Aggarwal, Nishita; Kaur, Prajyot; Prakash, Chandra
“Next Word Prediction in Hindi Using Deep Learning Techniques. (2019)”. [IEEE 2019
International Conference on Data Science and Engineering (ICDSE) - Patna, India
(2019.9.26-2019.9.28)] 2019 International Conference on Data Science and Engineering
(ICDSE) -

44. C. L. James, and K. M. Reischel, “Text input for mobile devices: Comparing model
prediction to actual performance”, In Proceedings of CHI-2001, ACM, New York, pp.
365-371, 2001

45. Zi Corporation, eZiText, Technical report, 2002. http://www.zicorp.com

46. Lexicus Division, iTap, Technical report, Motorola, 2002. http://www.motorola.com/lexicu

47. D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to

48. Garay-Vitoria Nestor, and Julio Abascal, “Word prediction for inflected languages”,
Application to Basque language, 1997.

49. R. Rosenfeld, “Adaptive Statistical Language Modeling: A Maximum Entropy Approach”,
Unpublished PhD dissertation, Carnegie Mellon University, Pittsburgh, 1994.

50. S. Hunnicutt and J. Carlberger, “Improving word prediction using markov models and
heuristic methods”, Augmentative and Alternative Communication, vol. 17, pp. 255-264,
2001

51. Ferran Pla and Antonio Molina, “Improving part of speech tagging using lexicalized HMMs”,
Natural Language Engineering, Cambridge University Press, United Kingdom, 2004.

52. http://www.gusinc.com/wordprediction.html

53. S. Hunnicutt and J. Carlberger, “Improving word prediction using markov models and
heuristic methods”, Augmentative and Alternative Communication, vol. 17, pp.

54. SENAIT KIROS BERHE, “Word Sequence Prediction Model for Tigrigna Language”
,2020, Addis Ababa University

55. Sheri Hunnicutt, Lela Nozadze, and George Chikoidze, “Russian word prediction with
morphological support”, In 5th International symposium on language, logic and
computation, Tbilisi, Georgia, 2003.

56. Yael Netzer, Meni Adler, and Micheal Elhadad, “Word Prediction in Hebrew:
Preliminary and Surprising Results”, ISAAC, 2008.

57. Antal van den Bosch, “Scalable classification-based word prediction and confusable
correction”, TAL. Volume 46 – n° 2/2005.

58. Sachin Agarwal and ShilpaArora, “Context based word prediction for texting language,”
Conference RIAO, 2007.

59. Keith Trnka, “Adaptive Language Modeling for Word Prediction,” Proceedings of the
ACL-08: HLT Student Research Workshop (Companion Volume), pages 61–66,
Columbus, June 2008.

60. Masood Ghayoomi and Seyyed Mostafa,” Word prediction in Running Text: A Statistical
Language Modeling for the Persian Language”, Proceeding of the Australasian Language
Technology Workshop 2005, pages 57-63 Sydney, Australia December 2005.

61. Afsaneh Fazly and Graeme Hirst, “Testing the Efficacy of Part-of-Speech Information in
Word Completion”, Proceedings of EACL 2003 Workshop on Language Modeling for
Text Entry Method.

62. Keith Trnka and Kathleen F. McCoy,” Evaluating Word Prediction: Framing
Keystroke Savings”, Proceedings of ACL-08: HLT, Short Papers (Companion Volume),
pages 261–264, Columbus, Ohio, USA, June 2008. c 2008.

63. S. Mangal, “LSTM vs. GRU vs. Bidirectional RNN for script generation”, arXiv
(https://arxiv.org), 2019.

64. Radhika Sharma, Nishtha Goel, Nishita Aggarwal, Prajyot Kaur and Chandra Prakash, “Next
Word Prediction in Hindi Using Deep Learning Techniques”, 2019.

65. S. Yu et al., “Attention-based LSTM, GRU and CNN for short text classification”, 2019.

66. Liang Zhou and Xiaoyong Bian ,”Improved text sentiment classification method based on
Bi GRU Attention “,2019 J. Phys.: Conf. Ser. 1345 032097
