
Improving the Performance of a Part-of-Speech Tagger for Kafi-noonoo Text Using a Deep Learning Approach

Abbreviations

ELRC Ethiopian Language Research Center

HMM Hidden Markov Model

IE Information Extraction

MaxEnt Maximum Entropy

MT Machine Translation

NER Named Entity Recognition

NLP Natural Language Processing

POS Part-of-Speech

SVM Support Vector Machine

WIC Walta Information Center

Table of Contents
Abbreviations
Abstract
1. Introduction
2. Literature review
3. Statement of the problem
4. Objectives of the study
4.1. General objective
4.2. Specific Objective
5. Methodology
5.1. Literature Review
5.2. Corpus collection and Data preparation
5.3. Development Tools
5.4. Evaluation techniques
6. Scope of the Study
7. Significance of the Study
8. References

Abstract

1. Introduction

Much research in Natural Language Processing (NLP) has been dedicated to highly resourced languages such as English. Ethiopian languages like Kafi-noonoo, which is spoken by about 3 million people in the south-western part of Ethiopia, have however received far too little attention. In fact, the language is spoken by fewer and fewer people. Kafi-noonoo belongs to the Afro-Asiatic language super family, in the North Omotic Southern Gonga sub-group [1]. Kafi-noonoo uses the Latin script for writing. It has 22 consonant phonemes; six of these occur both long and short, and five are borrowed from English and Amharic. In addition to the consonants, it has five vowels, each with a long and a short form. Long vowels and consonants are written by doubling the corresponding short vowel or consonant, and differences in length of both vowels and consonants induce differences in meaning.

Nowadays, the number of Kafi-noonoo documents on the web (news, articles, messages, research papers, and other machine-readable texts) is increasing from time to time. As a result of this growth, a huge amount of text containing valuable information, which could be used for knowledge representation, lies hidden under the unstructured representation of the textual data. Getting the right information for decision making from this abundant unstructured text is therefore a big challenge, and the lack of tools that can tag this valuable information effectively has been a major problem. One basic natural language processing task for addressing this problem is part-of-speech (POS) tagging: the process of assigning a part-of-speech tag, such as noun, verb, pronoun, preposition, adverb, adjective, or another lexical class marker, to each word in a text [2][3][4]. POS tagging is not useful by itself, but it is generally accepted as the first step towards understanding a natural language, and most other tasks and applications depend heavily on it. POS tagging is also seen as a prototype problem because many NLP problems can be reduced to a tagging problem. For example, machine translation can be seen as tagging the words of one language with words of another; speech recognition can be seen as tagging signals with letters; and so on. In general, the input-output relationship can be as complex as sequences, sets, trees, and other structures; POS tagging represents the simplest of these problems.

At first sight, the solution to the POS tagging problem may seem trivial, but it is actually very hard. No known method solves the problem with complete accuracy for any language. The reason is partly the inconsistency of our understanding of word categories: even trained human annotators disagree about the category of a word 3-4% of the time [5]. The other reason arises from language ambiguity and the ineffectiveness of the methods for resolving it. Language expressions are ambiguous, and computers lack the common sense and world knowledge that humans bring to communication.
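As a toy illustration of this ambiguity (using invented English examples rather than Kafi-noonoo data), the same surface form can admit several tags, and only context resolves the choice:

```python
# Toy lexicon mapping each word to its possible POS tags.
# The classic garden-path phrase "the old man boats" shows why a
# per-word lookup is not enough: several words are ambiguous.
lexicon = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],    # "the old man" vs "the old (people)"
    "man": ["NOUN", "VERB"],   # "a man" vs "man the boats"
    "boats": ["NOUN", "VERB"],
}

sentence = ["the", "old", "man", "boats"]
for word in sentence:
    print(word, lexicon[word])
```

A tagger must choose exactly one tag per word from these candidate sets, which is precisely where context, statistics, or learned representations come in.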

These different meanings are caused by a number of ambiguities. The first is part-of-speech ambiguity; the others relate to semantics and syntax. The purpose of tagging is therefore to give the computer as much information and knowledge as necessary to assign each word the correct tag for the given context. There are three approaches to the tagging problem, based on two fundamental concepts: rules and statistics. Rule-based taggers use handcrafted, linguistically motivated rules. Stochastic taggers, by contrast, use probabilistic mathematical models and a corpus. The third approach combines the best of both concepts. None of them is perfect for all languages and all purposes [5]; the relevance and effectiveness of each approach depends on the purpose and the language. In this paper, we propose POS tagging for the Kafi-noonoo language using deep learning methods.

2. Literature review

NLP research on Ethiopian languages started fairly recently and has been constrained by a lack of linguistic resources and of an authoritative body to define and develop them. Unlike English and other well-resourced languages, Kafi-noonoo does not yet have a treebank. Even so, NLP researchers, both native speakers and non-speakers, have shown interest in the language and have developed prototypes by applying some state-of-the-art tagging models.

Gamback et al. (2009) [6] and Tachbelie and Menzel (2009) [7] applied different tagging methods and obtained performances worse than the state-of-the-art results for Arabic or English. Gamback conducted detailed experiments using TnT (Brants, 2000) [8], the Support Vector Machine (SVM) Tool [9], and Mallet [10] on three different tag sets. The overall accuracies using the Ethiopian Language Research Center (ELRC) tag set are 85.56% for TnT, 88.30% for SVM, and 87.87% for Maximum Entropy (MaxEnt). Similarly, Tachbelie and Menzel (2009) [7] conducted experiments using the TnT and SVM Tool models, with overall accuracies of 82.99% for TnT and 84.44% for SVM. In both sets of experiments the best performance is achieved by SVM, but Gamback's SVM performs better (88.30% against 84.44%).

Adafre (2005) [11] carried out a POS tagging experiment for the Amharic language. He collected five news articles from the Walta Information Center (WIC) and manually annotated them, which he then used for both training and testing of a stochastic model based on conditional random fields [12]. He obtained an average accuracy of 74% on a 5-fold cross-validation in which one file is used for testing and the remaining files for training.

The work most closely related to our problem is that of Zelalem and Yaregal (2014) [13], who developed a part-of-speech tagger for the Kafi-noonoo language using a hybrid approach (rule-based and statistical). They applied a set of transformational rules to improve the performance of the proposed model; they chose this approach because of the limited size of the corpus. They also used a Hidden Markov Model (HMM) to model the lexical and transitional probabilities of the word classes. They tested the proposed system on a test corpus, using 90% of the corpus for training, and obtained an accuracy of 80.47%.

The poor performance can be explained by four factors. First, the corpus used is small. Second, the taggers use no knowledge source beyond a pre-tagged training corpus. Third, the quality of the corpus is poor. Fourth, the approaches used for the task have several limitations.

Therefore, this study attempts to improve performance by addressing these factors, which together should contribute to a more accurate part-of-speech tagger for the Kafi-noonoo language.

3. Statement of the problem

POS tagging is the process of assigning a word in a text a particular word class, based on both its definition and its context, i.e., its relationship with adjacent and related words within a phrase, sentence, or paragraph [3], [4]. In other words, a POS tagger reads text in a given language and assigns a part of speech, such as noun, verb, or adjective, to each word in the text. It is an important component of high-level natural language processing applications and plays an important role in parsing, machine translation, grammar checking, speech synthesis, information retrieval, word sense disambiguation, etc. Most tagging algorithms fall into one of two classes [4]: rule-based and statistical taggers.

A rule-based tagger uses a set of rules encoding morphological, syntactic, and lexical information. Because the rules must encode this information, considerable human effort is needed to construct handcrafted linguistic patterns, including lexico-syntactic and semantic patterns, in order to tag words in texts. This approach has several limitations: it needs extensive manual preprocessing, it is domain specific, and it requires knowledge experts to develop clues and constraints manually [5].
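As a sketch of what such a handcrafted rule looks like in practice (the tags and the trigger condition are purely illustrative, not actual Kafi-noonoo rules), a Brill-style transformation retags a word when a contextual pattern matches:

```python
# A toy Brill-style transformation rule of the kind a rule-based (or
# hybrid) tagger applies after an initial tagging pass.
def apply_rule(tagged):
    """Rule: change tag V to N when the previous tag is DET."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "V" and out[i - 1][1] == "DET":
            out[i] = (word, "N")
    return out

print(apply_rule([("the", "DET"), ("run", "V")]))
```

A real rule-based tagger maintains hundreds of such rules, each written and debugged by a linguist, which is exactly the manual effort this proposal aims to avoid.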

On the other hand, statistical methods assign a tag to a word by calculating the most likely tag in the context of the word and its immediate neighbors [14]. The main idea behind all statistical taggers is a simple generalization: pick the most likely tag for the word [5]. Statistical approaches include the most-frequent-tag baseline, n-gram models, and HMMs. However, extracting different features from text requires many external NLP toolkits, which may be imperfect and will propagate errors into a POS tagging system for the Kafi-noonoo language.
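The HMM variant of this idea can be sketched with a minimal bigram Viterbi decoder. The probability tables below are invented for illustration, not estimated from a real corpus:

```python
# Minimal Viterbi decoding for a bigram HMM tagger: pick the tag
# sequence that maximizes start * transition * emission probabilities.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # V[i][t]: probability of the best path ending in tag t at word i
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p] * trans_p[p][t])
            V[i][t] = (V[i - 1][best_prev] * trans_p[best_prev][t]
                       * emit_p[t].get(words[i], 1e-6))
            back[i][t] = best_prev
    # Backtrack from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"fish": 0.6, "can": 0.2},
          "VERB": {"fish": 0.3, "can": 0.5, "swim": 0.6}}

print(viterbi(["fish", "can", "swim"], tags, start_p, trans_p, emit_p))
# -> ['NOUN', 'VERB', 'VERB']
```

Note that the decoder itself needs no linguistic knowledge; all of it is packed into the probability tables, which is why the quality and size of the tagged corpus dominate HMM performance.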

In recent years, with the emergence of deep learning [15], researchers have designed models that avoid complicated feature engineering and reduce reliance on NLP toolkits for feature extraction. However, these models rely on a large amount of training data, which should cover all the POS tags of the language. Because of the above limitations of the rule-based and statistical approaches, we will use a deep learning approach to improve the performance of a POS tagging system for the Kafi-noonoo language.
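To make the contrast concrete, the following is a minimal window-based neural tagger in plain NumPy: word embeddings plus a hidden layer replace hand-engineered features. The sentences, dimensions, and learning rate are toy placeholders, and a real system along the lines proposed here would use a recurrent architecture such as a BiLSTM trained on an actual Kafi-noonoo corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: (sentence, tag sequence) pairs
sents = [(["the", "dog", "runs"], ["DET", "NOUN", "VERB"]),
         (["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]
words = sorted({w for s, _ in sents for w in s}) + ["<pad>"]
tags = sorted({t for _, ts in sents for t in ts})
w2i = {w: i for i, w in enumerate(words)}
t2i = {t: i for i, t in enumerate(tags)}

D, H = 8, 16                               # embedding and hidden sizes
E = rng.normal(0, 0.1, (len(words), D))    # word embeddings
W1 = rng.normal(0, 0.1, (3 * D, H))        # 3-word context window -> hidden
W2 = rng.normal(0, 0.1, (H, len(tags)))    # hidden -> tag scores

def features(sent, i):
    # Concatenated embeddings of (previous, current, next) word
    ctx = ["<pad>"] + sent + ["<pad>"]
    return np.concatenate([E[w2i[w]] for w in ctx[i:i + 3]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Plain SGD on cross-entropy loss
for _ in range(300):
    for sent, ts in sents:
        for i, t in enumerate(ts):
            x = features(sent, i)
            h = np.tanh(x @ W1)
            p = softmax(h @ W2)
            dy = p.copy()
            dy[t2i[t]] -= 1.0                    # gradient of cross-entropy
            dh = (dy @ W2.T) * (1 - h ** 2)      # backprop through tanh
            W2 -= 0.1 * np.outer(h, dy)
            W1 -= 0.1 * np.outer(x, dh)

def tag(sent):
    return [tags[int(np.argmax(np.tanh(features(sent, i) @ W1) @ W2))]
            for i in range(len(sent))]

print(tag(["the", "dog", "sleeps"]))
```

The point of the sketch is that no morphological or lexical features are written by hand; the embeddings and weights are learned directly from the tagged data, which is exactly the property that motivates the deep learning approach for Kafi-noonoo.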

To these ends, we set the following research questions to examine the problem of Kafi-noonoo language text POS tagging.

1. How can a POS tagging model for the Kafi-noonoo language be developed?

2. To what extent can a deep learning approach improve the performance of a POS tagging system for the Kafi-noonoo language?

3. Which deep learning approaches are better for a Kafi-noonoo language POS tagging system?

4. Objectives of the study


4.1. General objective
The general objective of this study is to improve the performance of a Kafi-noonoo POS tagging system using a deep learning approach.

4.2. Specific Objective


To meet the above main objective of the study, the following specific objectives are formulated:

 To review relevant literature in the area, identify gaps and existing techniques, and gain a better understanding of the study
 To build an architecture for Kafi-noonoo POS tagging
 To develop suitable approaches and techniques for the Kafi-noonoo language
 To design a prototype system that demonstrates the potential of the Kafi-noonoo POS tagging system
 To prepare a dataset covering different domains of Kafi-noonoo text
 To evaluate the performance of the developed Kafi-noonoo POS tagging model

5. Methodology

Methodology is the systematic process by which a research problem is solved [16]. This research will be conducted to identify the challenges of implementing a POS tagging system for the Kafi-noonoo language. Towards achieving the main objective of the study, the following step-by-step procedures will be followed.

5.1. Literature Review

To understand the gaps in previous work and to gain a full view of POS tagging, literature on Kafi-noonoo and other languages will be reviewed. To understand the problem domain and its concepts, literature directly related to the study, especially on the Kafi-noonoo language, will be emphasized.

5.2. Corpus collection and Data preparation

Relevant Kafi-noonoo text data will be collected from different sources and prepared for the experiments. Facts about the Kafi-noonoo language, such as its grammatical and syntactic structure, will also be studied in order to understand the nature of the language with respect to POS tagging.
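As a sketch of the data preparation step, assuming the collected corpus is annotated in a simple word/TAG format (the format and the sample line below are assumptions for illustration, not actual Kafi-noonoo annotation):

```python
# Parse one line of a word/TAG annotated corpus into (word, tag) pairs,
# the form most tagger training code consumes.
def parse_tagged_line(line):
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST "/" so words containing "/" survive
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

line = "asho/N beeto/N hammiye/V"   # invented sample, not real annotation
print(parse_tagged_line(line))
```

A full pipeline would additionally normalize punctuation, split the data into training and test portions, and build the word and tag vocabularies from the parsed pairs.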

5.3. Development Tools


In order to develop the prototype system, appropriate tools will be used. The Python programming language will be used to implement the language-specific algorithms developed.

5.4. Evaluation techniques

The study involves designing a Kafi-noonoo POS tagging model and implementing a prototype of the model. The performance of the system will be evaluated using standard measures such as accuracy, precision, recall, and F1-score.

6. Scope of the Study

The focus of the study is developing a POS tagging model for the Kafi-noonoo language using a deep learning approach. The scope covers only Kafi-noonoo language text; other data types such as video, audio, and graphics are not the focus of the study. Although a deep learning approach needs a huge amount of data to reach its best performance, collecting such data is time consuming and challenging, so the study will be bounded by a limited corpus for evaluating the performance of the developed system.

7. Significance of the Study

The POS tagging model for the Kafi-noonoo language, which is the core of this study, will benefit a variety of natural language processing applications.

The first is Machine Translation (MT), an application that translates text in one language into its equivalent in another language. It is one of the most widely used and important natural language processing applications today. MT uses a bilingual dataset and other resources to perform the translation. POS tagging is an essential first preprocessing step for machine translation, used to determine the grammatical structure of the languages concerned [17].

The second is Information Extraction (IE), the task of automatically extracting structured facts from an unstructured dataset. It involves steps such as identifying a predefined set of concepts and deciding whether a text is relevant to a certain domain. As indicated in [18], information extraction involves six main tasks: POS tagging, named entity recognition, syntax analysis, co-reference and discourse analysis, extraction patterns, and bootstrapping.

The third is Named Entity Recognition (NER), the task of classifying proper nouns in a given text into predefined categories such as person, location, organization, and date [19]. POS tagging is a prerequisite for named entity recognition, used to extract the nouns from an annotated corpus; proper nouns can then be further refined from the collection of nouns for use in NER application development.

The last is syntactic parsing: the process of converting a flat sentence into a hierarchical structure that corresponds to the units of meaning in the sentence. It is a procedure for finding ways to combine grammatical rules so as to generate a tree representing the structure of the input sentence. Having a set of grammatical rules is one of the requirements of a functional syntactic parser [20].

8. References

[1] Harold, F.: The Non-Semitic Languages of Ethiopia. Michigan State University, Michigan (1976)

[2] Allen, J.: Natural Language Understanding. The Benjamin/Cummings Publishing Company, Redwood (1995)

[3] Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Cambridge (2009)

[4] Jurafsky, D., Martin, J.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, New Jersey (2000)

[5] Marcus, M. P., Marcinkiewicz, M. A., Santorini, B.: Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)

[6] Gamback, B., Olsson, F., Argaw, A. A., Asker, L.: Methods for Amharic Part-of-Speech Tagging. In: AfLaT '09: Proceedings of the First Workshop on Language Technologies for African Languages, pp. 104–111. Association for Computational Linguistics, Morristown, NJ, USA (2009)

[7] Tachbelie, M., Menzel, W.: Amharic Part-of-Speech Tagger for Factored Language Modeling (2009)

[8] Brants, T.: TnT: A Statistical Part-of-Speech Tagger (2000)

[9] Gimenez, J., Marquez, L.: SVMTool: A General POS Tagger Generator Based on Support Vector Machines. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, pp. 43–46 (2004)

[10] McCallum, A.: MALLET: A Machine Learning for Language Toolkit (2002)

[11] Adafre, S. F.: Part of Speech Tagging for Amharic Using Conditional Random Fields. In: Semitic '05: Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pp. 47–54. Association for Computational Linguistics, Morristown, NJ, USA (2005)

[12] Lafferty, J.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Morgan Kaufmann, pp. 282–289 (2001)

[13] Zelalem M., Yaregal A.: A Hybrid Approach to the Development of Part-of-Speech Tagger for Kafi-noonoo Text. Springer-Verlag Berlin Heidelberg, pp. 214–224 (2014)

[14] Dandapat, S., Sarkar, S., Basu, A.: Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario. Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (2007)

[15] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)

[16] Kothari, C. R.: Research Methodology: Methods and Techniques, 2nd edn. New Age International Ltd, New Delhi (2004)

[17] Zewgneh, S.: English-Amharic Document Translation Using Hybrid Approach. MSc Thesis, Addis Ababa University, Ethiopia (2017)

[18] Hirpassa, S.: Information Extraction System for Amharic Text. International Journal of Computer Science Trends and Technology (IJCST) 5(2) (2017)

[19] Demissie, D.: Amharic Named Entity Recognition Using Word Embedding as a Feature. MSc Thesis, Addis Ababa University, Addis Ababa, Ethiopia (2017)

[20] Megersa, D.: An Automatic Sentence Parser for Oromo Language Using Supervised Learning Technique. MSc Thesis, Addis Ababa University, Ethiopia (2002)

