
Wachemo University

College of Post Graduate Studies

Parts-of-Speech Tagging for Hadiyyisa Using Deep Learning

MSc Thesis Proposal

By
Wadola Habte

Aug, 2023

Hossana, Ethiopia
PARTS-OF-SPEECH TAGGING FOR HADIYYISA USING
DEEP LEARNING

By

WADOLA HABTE

ADVISOR: SOFONIAS YITAGESU (Ph.D.)

A THESIS PROPOSAL SUBMITTED TO THE DEPARTMENT OF INFORMATION
TECHNOLOGY, COLLEGE OF ENGINEERING AND TECHNOLOGY, SCHOOL OF
GRADUATE STUDIES, WACHEMO UNIVERSITY IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN
INFORMATION TECHNOLOGY

Aug, 2023

WACHEMO UNIVERSITY
Wachemo University
School of Graduate Studies

This approval sheet certifies that the master's thesis proposal titled
"Parts-of-Speech Tagging for Hadiyyisa Using Deep Learning", submitted in
partial fulfillment of the requirements for the degree of Master of Science
in Information Technology, complies with the regulations of the university
and meets the accepted standards with respect to originality and quality.

Approved by:

Name Signature Date

(Advisor)

(Program Chairman)

(Examiner)

(Department Head)

(Post Graduate Studies)


Abbreviations
AI: Artificial Intelligence

ANN: Artificial Neural Network

BiLSTM: Bi-directional Long Short-Term Memory

BiGRU: Bi-directional Gated Recurrent Unit

CRF: Conditional Random Fields

DL: Deep Learning

DNN: Deep Neural Network

FFNN: Feed Forward Neural Network

GRU: Gated Recurrent Unit

HMM: Hidden Markov Model

LSTM: Long Short-Term Memory

ML: Machine Learning

NLP: Natural Language Processing

POS: Parts-of-Speech

RNN: Recurrent Neural Network

SVM: Support Vector Machine

TnT: Trigrams'n'Tags

Contents

Abbreviations

1 Introduction
1.1 Background of the Study
1.2 Statement of the Problem
1.3 Research Question
1.4 Objectives
1.4.1 General Objective
1.4.2 Specific Objectives
1.5 Significance of the Study
1.6 Scope and Limitations of the Study
1.7 Organization of the Proposal

2 Literature Review
2.1 Approaches to POS Tagging
2.1.1 Rule-based Approach
2.1.2 Stochastic Approach
2.1.3 Deep Learning
2.1.4 Hybrid Approach
2.2 Related Works

3 Materials and Methods
3.1 Study Design
3.2 Data Collection
3.3 Annotation
3.4 Data Pre-processing
3.5 Model Design and Development
3.5.1 Development Tools

4 Timeline

5 Budget

List of Figures
1 Snapshot from Stanford POS Tagger
2 The standard and unfolded RNN
3 The LSTM repeating module
4 The Bi-LSTM Architecture
5 The Proposed Hadiyyisa POS Tagging Model Architecture
6 Research Timeline

1 Introduction

1.1 Background of the Study

Computers cannot natively understand the meaning and syntax of human
languages. At the same time, as the amount of data in each natural language
grows, it becomes more challenging for humans to analyze it manually and
extract the necessary components. To handle the vast amount of data already
available, we require the assistance of computers. Natural Language
Processing (NLP) is the process of creating software tools that enable
computers to understand human languages[8]. NLP tasks can be performed at
various levels, such as the word, phrase, or sentence level. This need for
computer assistance has made NLP a fascinating field of computer science[5].

NLP can alternatively be described as a method for computers to process and
recognize natural language. As a result, NLP improves communication between
humans, and between humans and machines, through the processing of speech
and text[8]. NLP has a variety of applications, including parts-of-speech
(POS) tagging, named entity recognition (NER), machine translation (MT),
information retrieval (IR), information extraction (IE), morphological
analysis, syntactic parsing, and stemming[5, 40].

POS tagging is the task of assigning POS or lexical tags (e.g., verb, noun)
to the tokens in a sentence. The input is a sequence x1, x2, x3, ..., xn of
tokens together with a tagset, and the output is a sequence y1, y2, y3, ...,
yn of tags, where each output yi corresponds to exactly one input xi[38]. In
addition, POS tagging aims to resolve the ambiguity of words in a language.
A word can have different meanings and POS depending on the context in which
it is used. For example, the English sentence "I dish the dish" is tagged by
the Stanford POS tagger as "I PRP dish VBP the DT dish NN", where "PRP"
denotes a personal pronoun, "VBP" a verb (non-third-person singular
present), "DT" a determiner, and "NN" a noun. In this sentence, the word
"dish" appears in two contexts: first as a verb and then as a noun. POS
tagging is a prerequisite for the majority of NLP tasks, such as machine
translation, syntactic parsing, information extraction, named entity
recognition, spelling and grammar checking, and stemming. It aims to
identify the lexical categories of tokens according to the context in which
they appear in the text[5].

The problem of POS tagging has been investigated using three basic
approaches: rule-based[14], corpus-based[5, 8, 25, 13, 1], and hybrid[18].
The rule-based approach uses a set of hand-crafted rules developed by
linguistic professionals. Developing a rule-based POS tagger is harder when
the language is morphologically rich[5]. Linguistic experts may require
years to design the hundreds or thousands of rules needed for a single
language. Even then, the resulting tagger is not easily applicable to other
languages with a different morphology; the rules have to be developed again
from scratch. Corpus-based POS taggers, on the other hand, build models from
a training dataset using one or more algorithms. They include statistical
methods such as the Hidden Markov Model (HMM), as well as Machine Learning
(ML) and Deep Learning (DL) methods. The hybrid approach combines the
rule-based and corpus-based approaches[5, 9].

NLP tools have been developed for international languages such as English,
and the Stanford POS tagger is among them. It has a tagging performance of
97% and is the state of the art for English. To see how the Stanford POS
tagger tags tokens in Hadiyyisa sentences, I gave the sample Hadiyyisa
sentence "Mat manchina bee’ukki luwwa liqaayyixxanchinne awwaaxxookko." as
input to the tagger and obtained the result shown in Figure 1 below. The
appropriate POS tags for the tokens in the sentence are: "Mat Determiner
manchina Noun bee’ukki Adjective luwwa Noun liqaayyixxanchinne Adverb
awwaaxxookko Verb . Punctuation".

Figure 1: Snapshot from Stanford POS Tagger

As seen in Figure 1, the Stanford POS tagger tagged every token of the given
sentence, except the full stop, as a noun. By coincidence, three of the
seven tokens are tagged correctly: the two nouns and the full stop. The only
token the tagger genuinely recognizes is the punctuation mark, because
English and Hadiyyisa use similar punctuation. This indicates that currently
available NLP tools, especially POS taggers, barely understand Hadiyyisa
texts.
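
The comparison above can be checked with a few lines of Python. The sketch below is only illustrative: the gold tags are the ones listed in the text, the predicted tags follow the description of Figure 1 (every token tagged as a noun except the full stop), and the function name `tagging_accuracy` is an assumption introduced here.

```python
# Gold tags are taken from the text above; predicted tags follow the
# description of Figure 1 (all tokens tagged as nouns except the full stop).
tokens = ["Mat", "manchina", "bee’ukki", "luwwa",
          "liqaayyixxanchinne", "awwaaxxookko", "."]
gold = ["Determiner", "Noun", "Adjective", "Noun",
        "Adverb", "Verb", "Punctuation"]
predicted = ["Noun"] * 6 + ["Punctuation"]


def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.3f}")  # 3 of 7 tokens correct -> 0.429
```

Token-level accuracy of this kind is the same measure reported by most of the related works reviewed in Section 2.2.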

POS tagging has previously been studied for Ethiopian languages as well as
for languages in the rest of the world. Some of the works published in
recent years are reviewed and discussed in the literature review. Most of
them employed the deep learning approach in their POS tagging models. For
instance, Alebachew et al.[8], in their systematic review, analyzed POS
tagging research works published between 2017 and 2022. According to their
study, about 68% of the POS tagging works employed the DL approach and the
remaining 32% used ML and hybrid approaches. This result indicates that DL
is the state-of-the-art and most widely adopted method for POS tagging
problems.

The POS tagging of Hadiyyisa has not yet been well investigated. The
existing state-of-the-art tagger[14] used rule-based and Trigrams'n'Tags
(TnT) approaches, and the model was trained on a small set of manually
tagged Hadiyyisa sentences. That investigation addressed the POS tagging
problem for Hadiyyisa only partially. Hence, further investigation is needed
in order to obtain a better POS tagging application for Hadiyyisa. In this
study, the researcher will try to address the POS tagging problem for
Hadiyyisa by using the DL approach, the state-of-the-art technique[9], and
will compare the result with the previous work.

1.2 Statement of the Problem

POS tagging, the assignment of POS tags (e.g., verb, noun, preposition) to
tokens in a text, can be the first step in supporting subsequent
applications such as sentence parsing. POS tagging methods have been studied
for English[44] and other languages, with English taggers reaching 97%
accuracy. Ethiopian languages, including Amharic[8, 1, 3, 36], Afaan
Oromo[43, 6], Wolaita[34], Tigrigna[39, 37], Ge’ez[18], Hadiyyisa[14],
Gamo[19], Sidama[41], Awngi[7], and Sheko[46], have yet to be well studied.

In contrast, the Hadiyyisa language has a word morphology that current NLP
tools do not recognize, as the example in Figure 1 shows. Because of these
unique characteristics, existing natural language tools cannot identify
words specific to Hadiyyisa documents, since they were not trained on any
language-specific data. Machine learning techniques can learn the
distinctive features of the language, but only limited training data is
available, and human annotation for model training becomes time-consuming
and costly as datasets grow.

The main goal of POS tagging is to resolve ambiguity in a language, because
words can have different POS and meanings depending on the context of use.
For instance, the Hadiyyisa word "Mato" can have two different POS: it can
appear as a noun or as an adjective. As a noun it means "one", indicating a
number, whereas as an adjective it means "similar", indicating a similarity
between things. The previous POS tagging work for Hadiyyisa[14] used
rule-based and TnT approaches with a small set of manually tagged Hadiyyisa
sentences. Rule-based methods recognize predefined patterns in text using
exhaustive rules and hand-crafted features, but creating comprehensive rules
for morphologically rich languages such as Hadiyyisa is challenging[9].
Furthermore, the rule-based approach requires linguistic professionals to
design the hand-crafted rules.

This study will investigate the POS tagging problem for Hadiyyisa by
applying the deep neural network (DNN) approach to more labeled data and
comparing the result with the previous studies. Among DNNs, recurrent neural
networks (RNNs) with long short-term memory will be used to address the
problem of Hadiyyisa POS tagging.

1.3 Research Question

RQ1: Which DL algorithm is best for the development of a POS tagger for the
Hadiyyisa language?

RQ2: What performance level can the selected DL algorithms yield on a
limited-size Hadiyyisa corpus?

1.4 Objectives
1.4.1 General Objective

The general objective of this study is to design and develop a POS tagger
for the Hadiyyisa language.

1.4.2 Specific Objectives

This study aims to achieve its general objective through the following
specific objectives:

• To study the structure of Hadiyyisa sentences

• To identify word categories and a tagset for Hadiyyisa

• To collect and clean a Hadiyyisa language corpus for training and testing

• To design and develop a prototype Hadiyyisa POS tagger

• To test the performance of deep learning algorithms on Hadiyyisa POS
tagging

1.5 Significance of the Study

POS tagging is used in many NLP systems and acts as one of their most
important components. This study will therefore be significant for the
development of the language and for future researchers. For instance, the
corpus that will be prepared for this study can serve researchers who want
to conduct related studies. In addition, the output of this research will
serve as a baseline for future research in POS tagging. Furthermore, this
research can contribute to other NLP research topics, specifically machine
translation, named entity recognition, text-to-speech synthesis, syntactic
parsing, information retrieval, question answering, and grammatical
analysis.

1.6 Scope and Limitations of the Study

This study will focus on investigating a Hadiyyisa POS tagger with DL
methods, in particular the Long Short-Term Memory (LSTM), Bidirectional Long
Short-Term Memory (Bi-LSTM), and Gated Recurrent Unit (GRU) models.

1.7 Organization of the Proposal

The rest of this thesis proposal is organized as follows. Section 2 presents
different POS tagging approaches and a summary of related works. Section 3
describes the materials and methods, including the study design and data
collection. Sections 4 and 5 present the timeline and the budget.

2 Literature Review

2.1 Approaches to POS Tagging

Numerous approaches have been proposed for tagging tokens in a text. Most
POS tagging techniques fall under the rule-based, stochastic, ML, and hybrid
approaches. Generally, the stochastic and ML approaches are categorized as
corpus-based approaches[5]. The rule-based approach considers different
grammatical and morphological features when tagging words. In the stochastic
approach, probabilities are calculated to select the most likely lexical
category for a word in a sentence[29]. ML approaches, on the other hand, use
a training corpus and rely on features that are either hand-crafted or
learned automatically to select the appropriate tag for each word. Finally,
the hybrid approach combines the strengths of two or more approaches[9].
These approaches are discussed in more detail below.

2.1.1 Rule-based Approach

Rule-based POS tagging is one of the earliest techniques and typically
relies on a set of rules that encode morphological, syntactic, and lexical
information. The rules (known as context frame rules) are written by
linguistic professionals and assign POS tags to words by using contextual
information[5]. An example of such a rule is that a word must be a noun if
the preceding word is an article[12]. A set of rules hand-crafted for one
language is mostly specific to that language, in that it cannot be used
directly for the same task in another language. The quality of the rules
also depends on the skills of the linguistic professionals[5]. The rules
used for tagging can be constructed manually by language experts or learned
automatically by machine learning algorithms[9]. The manual way costs a lot
of effort and time, and it is vulnerable to error as well. The automatic way
learns rules from an annotated input corpus without the need for linguistic
professionals[5, 9]. This approach is called Transformation-Based Learning
(TBL); it was proposed by Eric Brill, and the resulting tagger is also
called the Brill tagger[29]. The rule-based approach has both positive
aspects and drawbacks. The positive aspect is that it does not depend on the
size of the corpus, so it can work well on small data. The drawback is that
rule construction is a time-consuming and tedious task. In addition,
developing a rule-based POS tagger for languages with complex morphology is
harder[5].
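
To make the idea of a context frame rule concrete, the following minimal sketch applies the single rule mentioned above (a word preceded by an article is a noun) on top of a dictionary lookup. The tiny lexicon, the tag names, and the function `rule_based_tag` are hypothetical and serve only as an illustration; a real rule-based tagger would use many more rules written by linguists.

```python
# Hypothetical mini-lexicon: each word maps to its candidate tags.
LEXICON = {
    "the": ["ART"],
    "a": ["ART"],
    "i": ["PRON"],
    "dish": ["VERB", "NOUN"],
}


def rule_based_tag(tokens):
    """Tag tokens with a lexicon lookup plus one context frame rule."""
    tags = []
    for token in tokens:
        # Unknown words default to NOUN in this sketch.
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        previous_tag = tags[-1] if tags else None
        if previous_tag == "ART" and "NOUN" in candidates:
            # Context frame rule: a word preceded by an article is a noun.
            tag = "NOUN"
        else:
            # Otherwise fall back to the first candidate in the lexicon.
            tag = candidates[0]
        tags.append(tag)
    return tags


tags = rule_based_tag(["I", "dish", "the", "dish"])
print(tags)  # ['PRON', 'VERB', 'ART', 'NOUN']
```

Even in this toy form, the rule is tied to English word order and articles, which illustrates why rules written for one language rarely transfer to another.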

2.1.2 Stochastic Approach

Stochastic, also called statistical, POS tagging is based on frequency or
probability calculations. In its simplest form, it finds the tag most
frequently used for a specific word in the labeled training data and uses
this information to assign a part of speech to that word in unlabeled text.
Statistical taggers generally solve the tagging problem by using a tagged
corpus[5]. N-gram, maximum-likelihood, and HMM models are examples of the
probabilistic POS tagging approach[29]. Among these, the HMM is the most
widely implemented POS tagging model[9]. It is based on the Markov model, in
which the system being modeled is assumed to move from one state to another,
but with the states hidden. In the HMM, the state, i.e., the tag, is not
directly visible to the observer, but the output that depends on the hidden
state, i.e., the word, is visible[5]. The basic idea of the HMM is that,
given a sequence of words, each with its potential tags, it is possible to
choose the most probable tag sequence. Words can have one or more candidate
tags, and the sequence of tags with the highest probability is chosen. In
this process, the sequence of words is directly visible to the observer and
the sequence of tags is hidden, since it is only estimated[46].
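
Choosing the tag sequence with the highest probability is usually done with the Viterbi algorithm. The sketch below is a minimal illustration on the English example from Section 1.1; all start, transition, and emission probabilities are hypothetical toy values, not estimates from any corpus, and the names `START`, `TRANS`, `EMIT`, and `viterbi` are introduced here for illustration only.

```python
# All probabilities below are hypothetical toy values for illustration.
TAGS = ["PRON", "VERB", "DET", "NOUN"]
START = {"PRON": 0.6, "VERB": 0.1, "DET": 0.2, "NOUN": 0.1}
TRANS = {  # TRANS[prev][cur] = P(cur | prev)
    "PRON": {"PRON": 0.05, "VERB": 0.8, "DET": 0.05, "NOUN": 0.1},
    "VERB": {"PRON": 0.1, "VERB": 0.05, "DET": 0.6, "NOUN": 0.25},
    "DET": {"PRON": 0.02, "VERB": 0.03, "DET": 0.05, "NOUN": 0.9},
    "NOUN": {"PRON": 0.1, "VERB": 0.5, "DET": 0.2, "NOUN": 0.2},
}
EMIT = {  # EMIT[tag][word] = P(word | tag); unseen words get probability 0
    "PRON": {"i": 1.0},
    "VERB": {"dish": 1.0},
    "DET": {"the": 1.0},
    "NOUN": {"dish": 1.0},
}


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[t][tag] = (probability of the best path ending in tag at t, previous tag)
    best = [{tag: (START[tag] * EMIT[tag].get(words[0], 0.0), None)
             for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for cur in TAGS:
            column[cur] = max(
                (best[-1][prev][0] * TRANS[prev][cur] * EMIT[cur].get(word, 0.0), prev)
                for prev in TAGS
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["i", "dish", "the", "dish"]))  # ['PRON', 'VERB', 'DET', 'NOUN']
```

The words are the visible observations and the tags are the hidden states; the ambiguous word "dish" receives a different tag at each position because the transition probabilities favor a verb after a pronoun and a noun after a determiner.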

2.1.3 Deep Learning

Machine learning is a subset of Artificial Intelligence (AI) that enables
machines to learn by themselves, using mathematical approaches to extract
knowledge from a given dataset[9]. ML algorithms are mostly classified as
supervised or unsupervised. Supervised algorithms are trained on labeled
data. In contrast, unsupervised algorithms are trained on, and extract
features from, unlabeled data. To select appropriate lexical tags for the
words in a sentence, ML algorithms use a training corpus and learn the word
features[5]. The most common ML algorithms used for POS tagging are Naïve
Bayes, Support Vector Machine (SVM), Conditional Random Field (CRF), Brill,
Trigrams'n'Tags (TnT), and Artificial Neural Networks (ANN)[9, 19]. These
algorithms are categorized as traditional or shallow ML algorithms. The
other, more recent type of ML is deep learning, a subset of ML[24].

Neural networks, also called artificial neural networks, are computer
programs designed to process information by mimicking the biological human
brain[5]. ANNs are composed of interconnected nodes, also called neurons,
each of which takes some input data and produces an output[19]. The
strengths of the connections between neurons are called weights[27]. Most
ANN architectures consist of three kinds of layers: the input layer, the
hidden layer, and the output layer. The input layer is the first layer; it
takes preprocessed information as the input of the network. The hidden layer
lies between the input and output layers. The output layer produces the
final output of the network[5]. Each neuron in the hidden and output layers
computes a weighted sum of the values it receives from the neurons of the
preceding layer and generates its output by applying a transformation
function, called an activation function, to that sum. There are different
activation functions, such as the Rectified Linear Unit (ReLU), Sigmoid, and
Tanh[19].
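
The computation performed by a single neuron can be sketched in a few lines. The inputs, weights, and bias below are hypothetical values chosen so that the weighted sum is exactly 1.0; the helper names `neuron`, `relu`, and `sigmoid` are introduced here for illustration.

```python
import math


def relu(z):
    """Rectified Linear Unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical inputs, weights, and bias chosen only for illustration.
inputs = [1.0, 2.0, -1.0]
weights = [0.5, 0.5, 1.0]
bias = 0.5  # weighted sum: 0.5 + 1.0 - 1.0 + 0.5 = 1.0

print(neuron(inputs, weights, bias, relu))       # 1.0
print(neuron(inputs, weights, bias, math.tanh))  # about 0.762
print(neuron(inputs, weights, bias, sigmoid))    # about 0.731
```

The same weighted sum yields different outputs depending on the activation function, which is why the choice of activation is part of the model design.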

Deep learning is a set of ML algorithms that use neural networks as the
backbone for processing information, specifically networks with multiple
hidden layers[19]. The term "deep" refers to the depth of the artificial
neural network architecture. DL is an extension of ML that has recently
proven to be highly effective in domains such as NLP and computer vision[5].
It has substantial applications in language translation, speech and image
processing, and so on. Traditional ML algorithms rely on feature
engineering, i.e., humans have to manually select appropriate features to
train the model. DL models, however, can automatically extract features from
complex datasets[27]. These days, DL approaches have become the most
commonly used method in machine learning for automatically extracting
complex data features[8]. The DL approach requires more data but can yield
better results than classical ML algorithms. The most common deep learning
methods are the feed-forward neural network (FFNN), recurrent neural
networks (RNN) and their descendants such as LSTM, GRU, Bi-LSTM, and Bi-GRU,
the convolutional neural network (CNN), and the multi-layer perceptron
(MLP)[25].

2.1.3.1 Recurrent Neural Network (RNN)

RNNs are regarded as neural networks with memory. They work well for
applications that require sequence labeling: they receive a sequence of
vectors and output another sequence[19]. RNNs can model time-varying
(sequential) patterns that are otherwise hard to capture with standard FFNN
architectures[27]. They are preferred for sequence labeling because of their
capacity to link earlier data with the current task. An RNN recursively
applies the same computation to each element of the input sequence,
depending on the previously computed results[20]. An RNN can be thought of
as numerous identical networks organized in a particular order such that
each network passes information to the one after it[5]. Although RNNs have
been a solution for sequence labeling tasks, they are weak at linking
relevant information across the input because of the vanishing and exploding
gradient problems. These problems arise especially when the input sequence
is long[42]. To overcome them, Hochreiter et al.[21] proposed RNNs with long
short-term memory, called LSTM. Among the RNN models, this study will adopt
the LSTM and Bi-LSTM models for Hadiyyisa POS tagging. A standard RNN has a
loop on its hidden layer[15]; when unfolded, it consists of the same
structure copied multiple times, with each copy of the three layers
corresponding to one time step. At each step it has two inputs: one is the
input from the data and the other is the output from the previous step[39].
The standard and unfolded architecture of the RNN is presented in Figure 2
below.

Figure 2: The standard and unfolded RNN

The notations x, h, and o represent the input, hidden, and output layers,
respectively, and U, W, and V are the weight matrices of the
input-to-hidden, hidden-to-hidden, and hidden-to-output connections,
respectively.
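
Using this notation, the unfolded computation can be written as h_t = tanh(U x_t + W h_{t-1}) and o_t = V h_t. The sketch below implements this recurrence in plain Python as an illustration only: biases are omitted, the tanh activation is an assumption (the text does not name one), and all sizes and weight values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(xs, U, W, V, h0):
    """Run a simple RNN over the sequence xs; return outputs and hidden states."""
    h = h0
    outputs, hiddens = [], []
    for x in xs:
        # h_t = tanh(U x_t + W h_{t-1}): the same weights are reused at every step.
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre]
        # o_t = V h_t (for tagging, a softmax over the tagset would follow).
        outputs.append(matvec(V, h))
        hiddens.append(h)
    return outputs, hiddens


# Hypothetical sizes and weights: 2-dimensional inputs, 2 hidden units,
# and 3 output scores per time step.
U = [[0.5, -0.5], [0.25, 0.75]]
W = [[0.1, 0.0], [0.0, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, hiddens = rnn_forward(xs, U, W, V, h0=[0.0, 0.0])
print(len(outputs), len(outputs[0]))  # 3 3
```

Because the hidden state h is carried from one step to the next, the output at each position depends on all earlier inputs, which is the memory property described above.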

2.1.3.2 Long Short-Term Memory Networks

As discussed in Section 2.1.3.1, LSTM networks are designed to solve the
vanishing gradient problem that the standard RNN suffers from on long data
sequences, where it can remember only the latest inputs because of its short
memory[39]. LSTMs, a special kind of RNN, are better than standard RNNs at
handling long data sequences. They are designed so that learning long-term
dependencies is their default behavior[5]. An LSTM takes a sequence of n
vectors as input and produces a sequence of hidden state vectors as
output[20]. LSTM networks have the same chain-like structure as RNNs, but
the repeating module has a different structure. Unlike that of the standard
RNN, the repeating module of an LSTM has several component layers that
interact in a special way. These components are the cell state, hidden
state, forget gate, input gate, input modulation gate, and output gate[19].
The LSTM architecture is presented in Figure 3 below.
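
The components listed above are commonly formalized as follows. This is the standard textbook formulation rather than anything specific to this proposal. Following the RNN notation used earlier, U denotes input-to-hidden weights and W hidden-to-hidden weights; the bias vectors b, the sigmoid function σ, and the element-wise product ⊙ are additional symbols introduced here. Note that o_t below denotes the output gate, not the output layer.

```latex
\begin{aligned}
f_t &= \sigma(U_f x_t + W_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(U_i x_t + W_i h_{t-1} + b_i) && \text{(input gate)} \\
g_t &= \tanh(U_g x_t + W_g h_{t-1} + b_g) && \text{(input modulation gate)} \\
o_t &= \sigma(U_o x_t + W_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive update of c_t is what lets information, and gradients during training, flow across many time steps without being repeatedly squashed.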

Figure 3: The LSTM repeating module

The top horizontal line of the LSTM repeating module in Figure 3 above is
called the cell state, the key component of the LSTM. Cell state

2.1.3.3 Bidirectional Long Short-Term Memory Networks

BiLSTM is an extended version of the traditional LSTM. Both the LSTM and the
standard RNN take only the past context into account[42]. However, for tasks
like POS tagging, both past and future words are important for selecting the
appropriate POS tag for a given word. BiLSTM therefore takes both past and
future sequences into consideration and trains in both the left-to-right and
right-to-left directions. This capability is vital to sequence tagging,
since a word may have a different meaning depending on the words located
before or after it[19]. As can be seen from the architecture diagram of a
BiLSTM in Figure 4 below, the output of a BiLSTM at each position is the
hidden state ht of the left-to-right LSTM concatenated with the hidden state
ht of the right-to-left LSTM. Here the forward A is a left-to-right LSTM,
the backward A is a right-to-left LSTM, and xt is an input.

Figure 4: The Bi-LSTM Architecture
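
The concatenation of the two directions can be illustrated with a short sketch. To keep the example small, a one-unit tanh recurrent cell is used as a stand-in for the LSTM cell, so this is not an LSTM implementation; only the handling of the two directions is the point. The function names and the scalar weights are hypothetical.

```python
import math


def run_direction(xs, w_in, w_rec):
    """Run a one-unit tanh recurrent cell over xs; return the state at each step.

    This simple cell is a stand-in for an LSTM cell.
    """
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(xs, fwd_weights, bwd_weights):
    """Concatenate left-to-right and right-to-left states position by position."""
    forward = run_direction(xs, *fwd_weights)
    # Process the reversed sequence, then reverse the states back so that
    # backward[t] summarizes positions t..end of the original sequence.
    backward = run_direction(xs[::-1], *bwd_weights)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]


# Hypothetical scalar inputs and weights, for illustration only.
xs = [1.0, 0.0, -1.0]
states = bidirectional(xs, fwd_weights=(0.5, 0.5), bwd_weights=(0.5, 0.5))
for t, (f, b) in enumerate(states):
    print(t, round(f, 3), round(b, 3))
```

At the first position the forward state has seen only the first input, while the backward state has already seen the whole sequence; the concatenated pair gives the tagger both the left and the right context of each word.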

2.1.4 Hybrid Approach

The hybrid approach combines two approaches, taking the positive aspects of
both[9]. For example, the strengths of the stochastic approach may be
combined with those of the rule-based approach to obtain a better POS
tagger. Likewise, the strengths of the artificial neural network approach
and the rule-based approach can be combined to obtain a better-performing
POS tagger. Such combinations are known as hybrid approaches[5].

2.2 Related Works

Over the last decades, natural language research has worked to address the
problem of POS tagging for natural language documents using machine learning
and deep learning approaches, for example for English[44], Nepali[30],
Malayalam[4], Marathi[13], Odia[11], Tamil[31], Bengali[23], Kannada[35],
Persian[33], and Indonesian[31]. One limitation of traditional machine
learning, which deep learning largely avoids, is that it depends heavily on
human expertise for feature selection. Deep learning can be tailored to
minimize this reliance on manual feature selection[24].

For international languages such as English, POS tagging has been studied
extensively and has achieved a tagging accuracy of 97%. POS taggers for
other foreign languages have also been studied using different methods. For
instance, a POS tagger for the Malayalam language investigated with a DL
approach[4] obtained a maximum F1-score of 98.33%. Another example is a POS
tagger for Nepali based on BiLSTM and BiGRU[30], which achieved a maximum
accuracy of 96.1%. Similarly, a POS tagger for Marathi based on BiLSTM[13]
obtained a tagging accuracy of 97%. As a further example, a POS tagger for
the Odia language based on DL and CRF[11] achieved an accuracy of 92.08%
with CRF and 94.48% with BiLSTM. A POS tagger for the Persian language using
a hybrid of HMM and DL[33] obtained a tagging accuracy of 97.29%.

When it comes to Ethiopian languages, there has recently been growing
interest in developing POS tagging tools, including for Amharic[5, 8, 17],
Afaan Oromo[43, 6], Ge’ez[25, 18], Wolaita[34], Tigrigna[39, 37],
Hadiyyisa[14], Gamo[19], Sidama[41], Awngi[7], Somali[28], and
Shekki’noono[46]. For example, Ge’ez POS taggers based on deep learning[25]
and on a hybrid approach[18] obtained accuracies of 86.7% and 94.32%,
respectively. As another example, Amharic POS taggers based on LSTM and
Bi-LSTM[5], on Bi-LSTM and CRF[8], and on CRF, Brill, and TnT[17] achieved
an F1-score of 93.6%, an accuracy of 91.12%, and a tagging performance of
95.87%, respectively. Additionally, Tigrigna POS taggers have been
investigated with Bi-LSTM[39] and CRF[37], achieving tagging accuracies of
91.8% and 90.8%, respectively.

Furthermore, a Shekki’noono POS tagger was studied with the HMM method[46]
and achieved an accuracy of 91.28%. Similarly, an Awngi POS tagger was
investigated with an HMM-based approach[7] and obtained a maximum tagging
accuracy of 94.77%. A Somali POS tagger that used a hybrid approach[28]
obtained an average tagging accuracy of 87.51%. A Wolaita POS tagger was
investigated with the Brill method[34] and achieved a tagging accuracy of
92.96%. As a further example, a Gamo POS tagger[19] investigated with
BiLSTM, LSTM, GRU, and HMM obtained a maximum tagging accuracy of 85%. A POS
tagger for the Sidama language that used the TBL approach[41] achieved a
tagging accuracy of 93.64%. Finally, POS taggers for Afaan Oromo studied
with TBL[22, 6] and HMM[43] achieved accuracies of 80.08%, 95.6%, and
91.97%, respectively.

However, POS tagging for the Hadiyyisa language has not been well
investigated. The few works that employed machine learning and rule-based
techniques for the POS tagging of Hadiyyisa can be summarized as follows:

Desta et al.[14] proposed a POS tagger for the Hadiyyisa language using
rule-based and TnT approaches. The authors used 1280 manually tagged
Hadiyyisa sentences for training and testing the POS tagger. They tested
various techniques for maximizing tagging performance on a limited corpus,
including TnT with an affix tagger, rule-based with an affix initial-state
tagger, and TnT unknown-word handling with backoff. Their experiments showed
a maximum tagging accuracy of 73.06% with 32 derived tags and 89.03% with 10
basic tags. Hence, POS tagging for Hadiyyisa should be investigated further,
and in this research it will be studied by employing the DL approach.

3 Materials and Methods

3.1 Study Design

The most prominent research methods adopted in computer science disciplines
are theoretical, experimental, and simulation-based methods[16]. As the
literature[8, 1, 3, 36, 43, 6, 39, 37, 18] indicates, many NLP research
works, especially on POS tagging, have used an experimental research design.
This study will also use the experimental research methodology.

The general tasks in the research will be:

– Corpus collection
– Pre-processing
– Annotation
– Model design and development
– Model training
– Model evaluation
– Comparative analysis with the baseline model

3.2 Data Collection

Hadiyyisa is one of the most under-resourced local languages. Unlike
resource-rich languages such as English, there is no manually tagged,
publicly available text dataset for the Hadiyyisa language[40]. Hence, data
sources have to be identified and data needs to be collected for this
research. The sources of data for this research will be the Hadiya Zone
Administrative Office, the Hadiya Culture and Tourism Department, the Hadiya
Zone Education Department, Hadiya Television, the Hadiya Development
Association, the Hadiyyisa Language and Literature Department, and any other
relevant text documents written in the Hadiyyisa language. Collecting data
from diverse sources will help to make the data balanced and representative.

3.3 Annotation

The proposed Hadiyyisa POS tagging model needs a labeled text corpus for
training. The labeling task requires domain knowledge, so linguistic experts
in the Hadiyyisa language will be hired for this research.

3.4 Data Pre-processing

Data needs to be pre-processed before it is fed to the respective model.
Data pre-processing is a crucial task in deep learning. The major
pre-processing tasks will include corpus cleaning, tokenization, corpus
splitting, and word vector generation.
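
As an illustration of the tokenization step, the sketch below splits the sample sentence from Section 1.1 into tokens. The regular expression is an assumption made for this example: it keeps apostrophes inside a word (as in bee’ukki) and separates punctuation into its own tokens. The actual tokenization rules for Hadiyyisa will be decided during the study.

```python
import re

# Hypothetical token pattern: a word is a run of letters that may contain
# internal apostrophes (straight or curly); digit runs are kept together;
# any other non-space, non-word character becomes its own token.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*|\d+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = tokenize("Mat manchina bee’ukki luwwa liqaayyixxanchinne awwaaxxookko.")
print(len(tokens))  # 7
print(tokens[2], tokens[-1])  # bee’ukki .
```

Keeping the word-internal apostrophe matters because splitting on it would break a single Hadiyyisa word into two tokens and misalign the tokens with their tags.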

3.4.0.1 Corpus Cleaning

The texts that will be collected from the sources mentioned earlier may not
be as organized and clean as expected. The researcher will therefore try to
handle any inconsistencies that may reduce the corpus quality. Corpus
cleaning will be done before the corpus is given to a linguistic expert for
annotation.

3.4.0.2 Splitting the corpus

For the sake of performance evaluation, the corpus prepared for the model
will be split into training and testing sets.
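
A reproducible split can be sketched as follows. The 80/20 ratio, the fixed random seed, and the function name `split_corpus` are assumptions made for this example; the proposal does not fix the split ratio.

```python
import random


def split_corpus(sentences, test_fraction=0.2, seed=42):
    """Shuffle tagged sentences reproducibly and split into train and test sets."""
    shuffled = list(sentences)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder corpus: each item stands for one tagged sentence.
corpus = [f"sentence_{i}" for i in range(10)]
train, test = split_corpus(corpus)
print(len(train), len(test))  # 8 2
```

Splitting at the sentence level, rather than the token level, keeps every sentence intact and ensures that no test sentence is seen during training.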

3.4.0.3 Word Vector Generation

DL models do not operate on raw text. They only work with numbers,
generically called vectors. So, it is mandatory to convert the raw text in
the corpus into vectors to make it ready for the DL models to train on. This
is also an essential task in NLP[10]. Different vectorization techniques are
in use, such as One-Hot Encoding (OHE), Term Frequency-Inverse Document
Frequency (TF-IDF), Bag-of-Words (BOW) and Word Embedding[2]. The OHE and BOW
techniques produce very high-dimensional and sparse vector representations;
OHE in particular represents each word with a vector whose dimension equals
the size of the vocabulary. This leads to higher memory and computational
requirements. In addition, it makes the ML algorithms harder to train[45].
TF-IDF, BOW and OHE also do not capture semantic relationships between words
in a sentence[10]. To overcome this issue, another technique called word
embedding was introduced[45]. Word embedding methods such as Word2Vec use a
shallow feed-forward neural network to learn the word vector representations.
They represent words in a lower-dimensional vector space depending on their
linguistic context, so that similar words have similar vector
representations[2, 45]. Several word embedding techniques with pre-trained
models are available, such as Word2Vec, GloVe, FastText, ELMo and BERT[26].
However, the Hadiyyisa language does not have any of these pre-trained
models. So, in this study, custom embeddings for Hadiyyisa will be trained
with both the FastText and Word2Vec tools, and their performance will be
compared to check which one works better on a small dataset. After that, the
embedding weights learned for each word will be mapped to the corresponding
word in the training dataset, and the resulting vector representations will
be fed to the DL-based POS tagging model for training.
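The difference between one-hot vectors and embedding lookup can be made concrete with a small sketch. The vocabulary, the three-dimensional vectors and the helper names below are hypothetical; the toy vectors only stand in for weights that would come from Word2Vec or FastText training, and returning zeros for unknown words is an assumption.

```python
def build_vocab(sentences):
    """Map every distinct word to an integer index (0 is reserved for padding)."""
    vocab = {"<PAD>": 0}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word, vocab):
    """One-hot vector: as long as the vocabulary, with a single 1."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def embed_sentence(sentence, embeddings, dim):
    """Replace each word by its dense vector; unknown words get zeros."""
    return [embeddings.get(word, [0.0] * dim) for word in sentence]


# Placeholder tokens, not real Hadiyyisa words.
sentences = [["w1", "w2", "w3"], ["w2", "w4"]]
vocab = build_vocab(sentences)

# Hypothetical 3-dimensional vectors standing in for trained embeddings.
toy_embeddings = {"w1": [0.1, 0.3, -0.2], "w2": [0.0, 0.5, 0.4]}

print(len(one_hot("w2", vocab)))                        # grows with the vocabulary
print(embed_sentence(["w2", "w9"], toy_embeddings, 3))  # fixed size per word
```

The one-hot length grows with every new word in the corpus, while the embedding size stays fixed. FastText can additionally compose vectors for unseen words from character n-grams, which is one reason to compare it with Word2Vec on a small corpus.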

3.5 Model Design and Development

The general structure of the proposed POS tagging model for Hadiyyisa is
presented in Figure 5. The major components of the model architecture are the
annotated corpus, pre-processing, vectorization, model building and
evaluation.
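Between vectorization and model building, recurrent models are usually trained on batches of equal-length sequences, so tagged sentences are commonly converted to padded integer sequences first. The sketch below illustrates that step under stated assumptions: the `<PAD>` and `<UNK>` entries, the `max_len` parameter and the function name `encode_and_pad` are hypothetical and not part of the proposed architecture.

```python
def encode_and_pad(tagged_sentences, word2idx, tag2idx, max_len):
    """Turn tagged sentences into equal-length lists of word ids and tag ids."""
    X, y = [], []
    for sentence in tagged_sentences:
        # Unknown words fall back to the <UNK> index; long sentences are cut.
        word_ids = [word2idx.get(w, word2idx["<UNK>"]) for w, _ in sentence][:max_len]
        tag_ids = [tag2idx[t] for _, t in sentence][:max_len]
        pad = max_len - len(word_ids)
        X.append(word_ids + [word2idx["<PAD>"]] * pad)
        y.append(tag_ids + [tag2idx["<PAD>"]] * pad)
    return X, y


# Placeholder vocabulary and tag set for illustration only.
word2idx = {"<PAD>": 0, "<UNK>": 1, "w1": 2, "w2": 3}
tag2idx = {"<PAD>": 0, "N": 1, "V": 2}
data = [[("w1", "N"), ("w2", "V")], [("w9", "N")]]

X, y = encode_and_pad(data, word2idx, tag2idx, max_len=3)
print(X)
print(y)
```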

3.5.1 Development Tools

As stated in earlier sections, the proposed POS tagging model will be
implemented with gated RNN variants, namely LSTM, BiLSTM, GRU and BiGRU. The
performance of each RNN will be tested and compared with the others to check
which one works best. As for the programming language, Python will be used to
implement the model during the experiments. The Anaconda distribution will be
used as the Python development environment to execute the code. There are
different Python libraries for machine learning, such as Keras, TensorFlow,
pandas, scikit-learn and NumPy, each with a different role[32]. In this
research, some of those libraries will be used during the experiments for
pre-processing, model building, and visualization of data.
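The comparison of the four RNN variants can be sketched as a small evaluation helper. It assumes token-level accuracy as the comparison metric, which is an assumption made here for illustration. The function names are hypothetical, and the predictions in the example are invented placeholders, not experimental results.

```python
def token_accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        for g, p in zip(gold_tags, pred_tags):
            correct += g == p
            total += 1
    return correct / total if total else 0.0


def best_model(gold, predictions_by_model):
    """Score every model on the same test set and return the best one."""
    scores = {
        name: token_accuracy(gold, preds)
        for name, preds in predictions_by_model.items()
    }
    return max(scores, key=scores.get), scores


# Invented gold tags and predictions, for illustration only.
gold = [["N", "V"], ["N", "ADJ", "N"]]
predictions = {
    "LSTM": [["N", "N"], ["N", "ADJ", "V"]],
    "BiLSTM": [["N", "V"], ["N", "ADJ", "V"]],
}

winner, scores = best_model(gold, predictions)
print(winner, scores)
```

Scoring all models on the same held-out test sentences keeps the comparison fair, since differences then reflect the model and not the data.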

Figure 5: The Proposed Hadiyyisa POS tagging Model Architecture

4 Timeline

Figure 6: Research Time Line

5 Budget

Table 1: Budget breakdown

Task               Material      Unit    Quantity  Unit Price  Total
Literature Review  Paper         Packet  1         300         300
                   Pen           Packet  1         300         300
                   Notebook      Number  1         300         300
Data Collection    Flash Disk    32GB    1         600         600
Data Annotation    Hired expert  -       -         -           15000
Contingency                                                    2000
Total Cost                                                     18500

References
[1] Tsegaye Abebe and Esubalew Alemneh. Amharic text corpus based on
parts of speech tagging and headwords. In 2021 International Confer-
ence on Information and Communication Technology for Development for
Africa (ICT4DA), pages 77–82. IEEE, 2021.

[2] Haisal Dauda Abubakar, Mahmood Umar, and Muhammad Abdullahi
Bakale. Sentiment classification: Review of text vectorization methods:
Bag of words, tf-idf, word2vec and doc2vec. SLU Journal of Science and
Technology, 4(1&2):27–33, 2022.

[3] Sisay Fissaha Adafre. Part of speech tagging for amharic using conditional
random fields. In Proceedings of the ACL workshop on computational
approaches to semitic languages, pages 47–54, 2005.

[4] KK Akhil, R Rajimol, and VS Anoop. Parts-of-speech tagging for
malayalam using deep learning techniques. International Journal of
Information Technology, 12:741–748, 2020.

[5] Mequanent Argaw. Amharic parts-of-speech tagger using neural word
embeddings as features, 2019.

[6] Abraham Gizaw Ayana. Towards improving brill’s tagger lexical and
transformation rule for afaan oromo language. Department of Geographic
Information Science, Hawassa University, Hawassa, 2015.

[7] Wubetu Barud Demilie. Parts of speech tagger for awngi language. 07
2020.

[8] Worku Kelemework Birhanie and Miriam Butt. Automatic amharic part
of speech tagging (aapost): A comparative approach using bidirectional
lstm and conditional random fields (crf) methods. In Advances of Science
and Technology: 7th EAI International Conference, ICAST 2019, Bahir
Dar, Ethiopia, August 2–4, 2019, Proceedings 7, pages 512–521. Springer,
2020.

[9] Alebachew Chiche and Betselot Yitagesu. Part of speech tagging: a sys-
tematic review of deep learning and machine learning approaches. Journal
of Big Data, 9(1):1–25, 2022.

[10] Mwamba Kasongo Dahouda and Inwhee Joe. A deep-learned embedding
technique for categorical features encoding. IEEE Access,
9:114381–114391, 2021.

[11] Tusarkanta Dalai, Tapas Kumar Mishra, and Pankaj K Sa. Part-of-speech
tagging of odia language using statistical and deep learning-based ap-
proaches. arXiv preprint arXiv:2207.03256, 2022.

[12] WB Demilie. Analysis of implemented part of speech tagger approaches:
the case of ethiopian languages. Indian J Sci Technol, 13(48):4661–71,
2020.

[13] Rushali Dhumal Deshmukh and Arvind Kiwelekar. Deep learning tech-
niques for part of speech tagging by natural language processing. In 2020
2nd International Conference on Innovative Mechanisms for Industry Ap-
plications (ICIMIA), pages 76–81. IEEE, 2020.

[14] Kacha Desta. Part of speech tagger for hadiyyisa language. 2019.

[15] Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, and Zhigang
Luo. Audio visual speech recognition with multimodal recurrent neu-
ral networks. In 2017 International Joint Conference on neural networks
(IJCNN), pages 681–688. IEEE, 2017.

[16] Ricardo Freitas. Scientific research methods and computer science. In
Proceedings of the MAP-i seminars workshop, 2009.

[17] Ibrahim Gashaw and H L Shashirekha. Machine learning approaches for
amharic parts-of-speech tagging. arXiv preprint arXiv:2001.03324, 2020.

[18] Gebremeskel Hagos Gerbremedhin. Design and development of part of
speech tagger for geez language using hybrid approach. The International
Journal of Science & Technoledge, 7(12), 2019.

[19] Gezahegn Gaje Hadaro. Development Of Parts Of Speech Tagger For
Gamo Language: Comparative Analysis Of Deep Learning And Stochas-
tic Approaches. PhD thesis, 2020.

[20] Sintayehu Hirpassa and GS Lehal. Improving part-of-speech tagging in
amharic language using deep neural network. Heliyon, 2023.

[21] Sepp Hochreiter and Jürgen Schmidhuber. Lstm can solve hard long
time lag problems. Advances in neural information processing systems,
9, 1996.

[22] Mohammed Hussen. Part of speech tagger for afaan oromo language using
transformational error driven learning (tel) approach. Master’s Thesis,
Addis Ababa University, Addis Ababa, unpublished, 2010.

[23] Fatima Jahara, Adrita Barua, MD Asif Iqbal, Avishek Das, Omar Sharif,
Mohammed Moshiul Hoque, and Iqbal H Sarker. Towards pos tagging
methods for bengali language: a comparative analysis. In International
Conference on Intelligent Computing & Optimization, pages 1111–1123.
Springer, 2020.

[24] Christian Janiesch, Patrick Zschech, and Kai Heinrich. Machine learning
and deep learning. Electronic Markets, 31(3):685–695, 2021.

[25] Asnak Yihunie Kassahun and Tessfu Geteye Fantaye. Design and de-
velop a part of speech tagging for ge’ez language using deep learning
approach. In 2022 International Conference on Information and Commu-
nication Technology for Development for Africa (ICT4DA), pages 66–71.
IEEE, 2022.

[26] Saurav Kumar, Saunack Kumar, Diptesh Kanojia, and Pushpak Bhat-
tacharyya. “a passage to india”: Pre-trained word embeddings for indian
languages. In Proceedings of the 1st Joint Workshop on Spoken Language
Technologies for Under-resourced languages (SLTU) and Collaboration
and Computing for Under-Resourced Languages (CCURL), pages 352–
357, 2020.

[27] Ambuj Mehrish, Navonil Majumder, Rishabh Bharadwaj, Rada Mihalcea,
and Soujanya Poria. A review of deep learning techniques for speech
processing. Information Fusion, page 101869, 2023.

[28] Siraj Mohammed. Using machine learning to build pos tagger for under-
resourced language: the case of somali. International Journal of Informa-
tion Technology, 12(3):717–729, 2020.

[29] Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao
Pham. A robust transformation-based learning approach using ripple
down rules for part-of-speech tagging. AI communications, 29(3):409–
422, 2016.

[30] Greeshma Prabha, PV Jyothsna, KK Shahina, B Premjith, and KP Soman.
A deep learning approach for part-of-speech tagging in nepali language.
In 2018 International Conference on Advances in Computing, Communications
and Informatics (ICACCI), pages 1132–1136. IEEE, 2018.

[31] KK Purnamasari and IS Suwardi. Rule-based part of speech tagger for
indonesian language. In IOP Conference Series: Materials Science and
Engineering, volume 407, page 012151. IOP Publishing, 2018.

[32] Sebastian Raschka, Joshua Patterson, and Corey Nolet. Machine learning
in python: Main developments and technology trends in data science, ma-
chine learning, and artificial intelligence. Information, 11(4):193, 2020.

[33] Mohammad Javad Rezai and Tayebeh Mosavi Miangah. Farsitag: a part-
of-speech tagging system for persian. Digital Scholarship in the Humani-
ties, 32(3):632–642, 2017.

[34] Birhanesh Fikre Shirko. Part of speech tagging for wolaita language using
transformation based learning (tbl) approach, 2020.

[35] LR Swaroop, Rakshith Gowda GS, U Sourabh, and Shriram Hegde. Parts
of speech tagging for kannada. In Proceedings of the Student Research
Workshop Associated with RANLP 2019, pages 28–31, 2019.

[36] Martha Yifiru Tachbelie and Wolfgang Menzel. Amharic part-of-speech
tagger for factored language modeling. In Proceedings of the Interna-
tional Conference RANLP-2009, pages 428–433, 2009.

[37] Yemane Tedla and Kazuhide Yamamoto. Analyzing word embeddings
and improving pos tagger of tigrinya. In 2017 International Conference
on Asian Language Processing (IALP), pages 115–118. IEEE, 2017.

[38] Virginia Teller. Speech and language processing: an introduction to
natural language processing, computational linguistics, and speech
recognition, 2000.

[39] Senait Gebremichael Tesfagergish and Jurgita Kapočiūtė-Dzikienė.
Part-of-speech tagging via deep neural networks for northern-ethiopic
languages. Information technology and control, 49(4):482–494, 2020.

[40] Atnafu Lambebo Tonja, Tadesse Destaw Belay, Israel Abebe Azime,
Abinew Ali Ayele, Moges Ahmed Mehamed, Olga Kolesnikova, and
Seid Muhie Yimam. Natural language processing in ethiopian lan-
guages: Current state, challenges, and opportunities. arXiv preprint
arXiv:2303.14406, 2023.

[41] Addisu Bole Tunsisa. COLLEGE OF NATURAL AND COMPUTATIONAL
SCIENCES SCHOOL OF INFORMATION SCIENCE. PhD thesis, Addis Ababa
University, 2016.

[42] Peilu Wang, Yao Qian, Frank K Soong, Lei He, and Hai Zhao. Part-
of-speech tagging with bidirectional long short-term memory recurrent
neural network. arXiv preprint arXiv:1510.06168, 2015.

[43] Getachew Mamo Wegari and M Meshesha. Parts of speech tagging for
afaan oromo. International Journal of Advanced Computer Science and
Applications, 1(3):1–5, 2011.

[44] ACL Wiki. Pos tagging (state of the art), 2014.

[45] Zeyu Xiong, Qiangqiang Shen, Yueshan Xiong, Yijie Wang, and Weizi
Li. New generation model of word vector representation based on cbow
or skip-gram. Computers, Materials & Continua, 60(1), 2019.

[46] Alebachew Zewdu, Hiwot Kadi, and Tibebu Bekele. A hidden markov
model-based part of speech tagger for shekki’noono language. Interna-
tional Journal of Computing, pages 587–595, 12 2021.
