Development of Morphological Analyzer For Af-Somali

MAHDI YONIS KAYAD
A Thesis Submitted to the Department of Computer Science in Partial Fulfillment for the Degree of
Master of Science in Computer Science
Addis Ababa, Ethiopia
May, 2017
Advisor: Dr. Yaregal Assabie
This is to certify that the thesis prepared by Mahdi Yonis, titled Development of Morphological
Analyzer for Af-Somali and submitted in partial fulfillment of the requirements for the Degree of
Master of Science in Computer Science, complies with the regulations of the University and meets
the accepted standards with respect to originality and quality.
Signed by the Examining Committee:
Name __________________________Signature__________ Date_______
Advisor:_______________________________________
Examiner:______________________________________
Examiner:______________________________________
ABSTRACT
Morphological analysis is critical for natural language processing tasks, especially for
inflectional languages. This thesis presents the implementation details of a morphological
analyzer for Af-Somali, which is an inflectional language. A detailed computational analysis of
Af-Somali morphology, including the formalization of its alternation and morphotactic rules, is
worked out in order to build the analyzer. In the implementation, the alternation and
morphotactic rules of Af-Somali are represented by two-level morphology rules. This is the first
detailed computational analysis of Af-Somali from a morphological point of view. The work is
based mainly on the dictionary by Annarita Puglielli, known as Qaamuus, and on Andrzejewski's
declensions of nouns. It follows the finite-state two-level approach using the Xerox finite state
toolkit and is done in two parts: the lexicon is encoded in the lexicon formalism (lexc), and the
alternation rules are implemented in xfst.
We evaluated the morphological analyzer by comparing the number of word tokens it analyzed
correctly with the number it processed incorrectly. We manually annotated 218 tokens (90 nouns,
120 verbs and 8 adjectives) taken from the Qaamuus dictionary. Of these, 77 nouns, 105 verbs and
6 adjectives were correctly analyzed, that is, 85.5% of the nouns, 87.5% of the verbs and 75% of
the adjectives. Overall, 86.2% of the 218 tokens were correctly analyzed and 13.76% were wrongly
analyzed, and a total of 10 tokens failed to be analyzed by the system. The results were
evaluated by a human reader familiar with the language. The result is encouraging and constitutes
preliminary work towards the computational development of Af-Somali.
Keywords: Natural Language Processing (NLP), morphological analyzer, finite state transducer
(FST), Xerox finite state toolkit (XFST), lexical formalism (LEXC).
ACKNOWLEDGEMENTS
I thank all who in one way or another contributed to the completion of this thesis. First, I
give thanks to Allah, who gives me protection and the ability to work. I am grateful to
Addis Ababa University, the College of Natural Science and the Department of Computer Science
for making it possible for me to study here. I give deep thanks to the lecturers at the
Department of Computer Science, the librarians, and the other workers of the faculty. My special
and heartfelt thanks go to my advisor, Dr. Yaregal Assabie, who encouraged and directed me. His
challenges brought this work towards completion, and it is through his advice that this work
came into existence. For any faults I take full responsibility. My special gratitude and
appreciation also go to Annarita Puglielli and Cabdalla Cumar Mansuur for their
invaluable contribution to the Af-Somali dictionary, which was the first fully written
dictionary with full grammatical information. Their discussions and comments on
Af-Somali lexicons and morphology have been the basis of this work. Moreover, I am
grateful to many friends and colleagues through these difficult years. I appreciate my dear
mother and my goodhearted brothers, Mr Abdirashid Yonis and Hamse Yonis, who have
supported and helped me through many setbacks, and I greatly value their contribution.
Table of Contents
List of Figures
List of Tables
Chapter 1 : Introduction
1.1 Background of the Study
1.2 Morphological Analysis
1.3 Statement of the Problem
1.4 Objectives
1.5 Methodology
1.5.1 Literature Review
1.5.2 Data Collection and Classification
1.5.3 Analysis
1.5.4 Implementation
1.5.5 Testing
1.6 Application of the Result
1.7 Scope and Limitation
1.8 Organization of the Thesis
Chapter 2 : Literature Review
2.1 Introduction
2.2 Introduction to Morphological Analysis
2.2.1 Morphemes
2.2.2 Affixes
2.2.3 Types of Morphological Processes
2.2.4 Inflection
2.2.5 Derivation
2.2.6 Compounding
2.3 AF-Somali Morphology
2.3.1 AF-Somali Phonetics
2.3.2 Basic Characteristics of Af-Somali
2.4 Inflectional Process of AF-Somali
2.4.1 Nouns
2.4.2 AF-Somali Noun Determiners
2.4.3 Adjectives
2.4.4 The Verb
2.4.5 Classification AF-Somali Verbs
2.5 Derivational System of AF-Somali
2.6 Approaches to Morphological Analysis
2.6.1 Corpus-based Approaches
2.6.2 Rule-based Approach
2.7 Finite State Technology
2.7.1 Finite State Machines
2.7.2 Finite-state transducers
2.7.3 Two Level Morphological Approach
2.7.4 The Xerox Finite State Frame work
2.8 Summary
Chapter 3 : Related work
3.1 Introduction
3.2 Morphological Analyzer for European Languages
3.3 Morphological Analyzer for Asian Languages
3.4 Morphological Analyzer for Ethiopian Languages
3.5 Summary
Chapter 4 : Design of Af-Somali Morphological Analyzer
4.1 Introduction
4.2 General Architecture of AF-Somali Morphological Analyzer
4.2.1 Lexicon/ Morph-tactics
4.2.2 Alternation Rules
4.3 The Design of AF-Somali Part-Of-Speech Lexicon and Alternation Rules
4.3.1 AF-Somali Verb Lexicon Design
4.3.2 Alternation Rules of AF-Somali Verbs
4.3.3 Noun Lexicon Design
4.3.4 Alternation Rules of AF-Somali Nouns
4.3.5 Adjectives Lexicon Design
Chapter 5 : Experimentation and Evaluation
5.1 Introduction
5.2 Experimentation
5.3 Discussion and Evaluation
Chapter 6 : Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix-A: Alternation Rules for Noun and Verb
Appendix-B: Af-Somali verb Lexicon
Appendix-C: Af-Somali Noun lexicon
List of Figures
Figure 2-1: Example of two level representation of Af-Somali
Figure 2-2: Creation of a lexical transducer. The .o. operator represents the composition operation
Figure 4-1: Af-Somali morphological analyzer architecture design
Figure 4-2: Af-Somali verb lexicon
Figure 4-3: Af-Somali verbs finite state networks
Figure 4-4: Example representation of Af-Somali second and third group verb FSN
Figure 4-5: Af-Somali verbs alternation rules
Figure 4-6: Alternation rule representation with xfst
Figure 4-7: person morpheme realization
Figure 4-8: Af-Somali noun lexicon
Figure 4-9: Af-Somali noun suffixes
Figure 4-10: Af-Somali verb finite state networks
Figure 4-11: Af-Somali noun alternation rules
Figure 4-12: Af-Somali adjective lexicon
Figure 4-13: Af-Somali Adjective finite state networks
Figure 5-1: AF-Somali Verb to suffix attachment
List of Tables
Table 2.1: Pluralization system of Af-Somali
Table 2.2: Derivational inflected plural form of Af-Somali
Table 2.3: Af-Somali Gender Markers
Table 2.4: Example of noun with gender markers
Table 2.5: Example of Af-Somali pluralization and declension formation
Table 2.6: Example of Af-Somali Articles
Table 2.7: Af-Somali Demonstratives
Table 2.8: AF-Somali possessive
Table 2.9: Interrogative representation of Af-Somali
Table 2.10: Pluralization of adjectives
Table 2.11: Example of person agreement with tenses
Table 2.12: First conjugation representation of Af-Somali verbs
Table 2.13: Second Af-Somali verb conjugation (toosi)
Table 2.14: Example of Af-Somali 3rd. conjugation representation
Table 2.15: Fourth Af-Somali verb conjugation representation
Table 2.16: Example of Af-Somali two level representation
Table 4.1: Tags of AF-Somali grammatical information
Table 4.2: Mappings of root words and their morphemes
Table 4.3: An example of Af-Somali verb morphotactics
Table 4.4: Realization with sh when it suffixed with t
Table 4.5: Example of noun declension 2 morphotactics
Table 4.6: Partial reduplication of nouns
Table 4.7: The alternation of declension 5 representation
Table 4.8: Example of adjective morphotactics
Table 5.1: Overall accuracy of the system
List of Abbreviations
Af-Somali   Somali Language
FSA         Finite State Automata
FST         Finite State Transducers
IR          Information Retrieval
MT          Machine Translation
NLP         Natural Language Processing
POS         Part-Of-Speech
SOV         Subject-Object-Verb
Chapter 1 : Introduction
1.1 Background of the Study
Natural language is the preferred medium of communication for people. It can be in spoken or
written form, neither of which computers can readily understand. Making it understandable to
computers requires a mechanism with enough information about the language, including its word
grammar and sentence structure. The processing of this information by a computer is known as
natural language processing (NLP). NLP is used both for generating human-readable information
from computer systems and for converting human language into more formal structures that a
computer can understand [6]. It is a field of study that comprises different levels of
linguistic analysis, such as phonetic, morphological, syntactic and semantic analysis, and
morphological analysis is the basic level for different NLP applications.
1.2 Morphological Analysis
Morphological analysis is the process of segmenting words into morphemes, assigning grammatical
information to grammatical categories, and assigning lexical information to a particular lexeme
or lemma [30]. It retrieves the grammatical features and properties of an inflected word. The
analyzer breaks the word into minimal meaning-bearing morphemes and produces morphosyntactic
features such as the root, tense, person and number. Words are formed by combining one or more
free morphemes with zero or more bound morphemes. In spoken language, morphemes are composed of
phonemes, the smallest linguistically distinctive units of sound. For example, re-, de-, un-,
-ish, -ly, -ceive, -mand, tie, boy and like are the morphemes of receive, demand, untie, boyish
and likely. Morphology is seen as ‘the study of words that are formally and semantically
related’. To be considered an expression, a word must have three features: a phonological form,
a category or word class, and a meaning. Morphology is concerned with the study of the internal
structure of words. Morphological analysis consists of identifying the parts, or constituents,
of words. For example, the word toosi (straighten) in Af-Somali consists of two constituents:
the root word toos (straight) and the imperative marker (i). Morphological analysis primarily
consists of breaking words up into their parts and establishing the rules that govern the
co-occurrence of these parts. Morphology can be viewed as the process of building words by
inflection and word formation. The task of morphological analysis is therefore to take word
forms and relate them to other word forms while deriving information about each form [30].
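The segmentation described above can be sketched in a few lines of Python. This is only a toy
illustration and not the analyzer built in this thesis: the root list, the suffix table and the
function name segment are assumptions introduced here, and they cover nothing beyond the toosi
example from the text.

```python
# Toy illustration only: the root list and suffix table are hypothetical and
# cover just the example discussed in the text (toosi = toos + i).
ROOTS = {"toos"}                                        # assumed root lexicon
SUFFIXES = {"": "bare root", "i": "imperative marker"}  # assumed suffix -> label


def segment(word):
    """Return every (root, suffix, label) split licensed by the toy lexicon."""
    analyses = []
    for cut in range(1, len(word) + 1):
        root, suffix = word[:cut], word[cut:]
        # A split is accepted only if both parts are known to the lexicon.
        if root in ROOTS and suffix in SUFFIXES:
            analyses.append((root, suffix, SUFFIXES[suffix]))
    return analyses


print(segment("toosi"))  # one analysis: root "toos" plus suffix "i"
print(segment("xyz"))    # unknown word: empty list
```

A word that is not covered by the lexicon simply receives no analysis, which is also how a
rule-based analyzer signals that a form lies outside its coverage.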
A morphological analyzer is an essential and basic tool for building any natural language
processing application, such as a machine translation system, and it is an essential technology
for most text analysis applications, such as information retrieval (IR) and text summarization.
The most obvious applications are found in the areas of lexicography and computational
linguistics [24]. Two factors are essential to achieving accurate automatic morphological
analysis: one is the construction of a set of morphological rules (morphotactics), and the other
is the morphological analysis procedure [24]. The absence or underperformance of either impairs
the overall ability of the morphological analyzer.
For example, in the word "dogs", "dog" is the root form and "s" is the affix. Here the affix
gives the number information of the root word. Morphological analysis is thus centered on the
analysis and generation of word forms. It deals with the internal structure of words and how
words can be formed. Morphological analysis also plays an important role in applications such as
spell checking, electronic dictionary interfaces and information retrieval systems, where it is
important that words that are only morphological variants of each other are identified and
treated similarly [30]. In NLP, and especially in machine translation (MT) systems, we need to
identify words in texts in order to determine their syntactic and semantic properties.
Morphological study helps by providing rules for analyzing the structure and formation of words.
Therefore, having a morphological analyzer is a vital step in starting natural language
processing for any natural language. For lesser-studied and under-resourced languages in
particular, it is often a practical and extremely valuable first step, making use of corpora,
lexicons, morphological grammars and phonological rules already produced by field linguists and
descriptive linguists [9]. Several morphological analyzers have been developed for
well-documented languages such as English [30] and Arabic [13]. There are also some significant
studies in the area of computational morphology for Ethiopian languages such as Amharic
[5, 8, 21, 22, 29], Oromo [22] and Tigrinya [22]. Moreover, there is work on Afaraf [2] and on
Ge’ez by Yitayal Abate [34]. However, to the best of our knowledge, no academic or published
study has so far been made to develop a morphological analyzer for Af-Somali.
1.3 Statement of the Problem
Af-Somali is the official language of Somalia and of the Ethiopian Somali region, and it is the
working language of the Kenyan Northern Province and Djibouti [26]. It is also the medium of
instruction in all the schools of these countries, which means that the language is spoken by
many people and deserves attention so that it can be processed computationally. Furthermore, a
large number of official documents, religious books and computerized documents are written in
Af-Somali, which makes the language widely used in word processing activities in different
areas. In addition, some NLP applications have been developed for Af-Somali, such as the machine
translation system by Google, a bilingual English-to-Somali electronic dictionary project, and a
Somali speech corpus by Niman Abdillahi [26]. These applications need to identify words in texts
in order to determine their syntactic and semantic properties and each word's lexical category.
For example, when translating an Af-Somali word into English using the electronic dictionary,
users cannot find the exact meaning or the corresponding English word. This process first needs
a morphological analyzer to distinguish the word category, for example to tell whether the word
is past or present, and to identify its part-of-speech. Furthermore, conducting NLP research and
accessing the resources available in different Af-Somali formats requires computational
processing of the language or, alternatively, translation of the language into well-developed
languages.
Considerable research has been done on NLP systems for the main Ethiopian languages in general,
including various works on computational morphology for Amharic [5, 8, 21, 22, 29], Afaan
Oromo [22], Tigrinya [22] and Afaraf [2]. However, no research has so far been conducted on an
automatic morphological analyzer for Af-Somali. The absence of morphological analysis systems
limits efforts to make computers work comfortably with Af-Somali. Af-Somali shares its Cushitic
origin with Afaraf, Afaan Oromo and the other languages of the Cushitic family, and it has much
in common with them in vocabulary and grammatical structure; for example, they all follow the
SOV structure. However, it differs extensively from them in terms of noun and verb focus
markers, morphology and word order, which resemble those of the Semitic family, to which Arabic
belongs. It is also unique in that its modifiers occupy a single position, and in its
pluralization patterns and word formation processes; hence, it needs its own independent
morphological analyzer.
Af-Somali is morphologically rich, and word formation in the language involves a number of
different morphological features, including complex verb and noun inflection, derivation and
compounding. Because of this complexity, an automated morphological analyzer is difficult and
challenging to construct. Moreover, Af-Somali verbs have particularly complex inflection, in
which a large number of affixes are added to the stem, and morphological analysis is vital for
the development of many practical natural language processing systems such as machine-readable
dictionaries, machine translation, information retrieval, spell-checkers and speech
recognition. Therefore, the aim of this work is to study Af-Somali morphology in a way that can
be implemented from a computational point of view, to analyze the word and morphological
categories and the word formation process in the language, and to model computational
morphological analysis for Af-Somali.
1.4 Objectives
General Objective
The main objective of this thesis work is to develop a morphological analyzer for Af-Somali
word morphology.
Specific Objectives
In order to achieve the above general objective, this thesis work has the following specific
objectives:
• Studying and understanding the word and morphological categories in Af-Somali.
• Studying and understanding the phonological and morphological alternation rules
involved in Af-Somali word formation and conjugation.
• Assessing the different techniques and approaches employed so far in morphological
analysis tasks and selecting the ones appropriate to the properties of Somali
inflectional morphology.
• Designing a morphological analyzer for Af-Somali.
• Formulating the phonological/orthographic rules involved in inflectional morphological
processes in the language.
• Testing the prototype morphological analyzer to measure its performance.
1.5 Methodology
1.5.1 Literature Review
A literature review was conducted to understand the language's morphology for developing the
morphological analyzer. Scholars in the area of Af-Somali morphology were consulted to better
understand the morphology of the language and to obtain information helpful for the thesis
work. Developing a morphological analyzer requires analyzing and identifying the properties of
Somali word formation, so it was important to review the research done on the development of
morphological analyzers for other languages. It was also important to study the available
approaches to morphology and select one suitable for Af-Somali. Besides this, literature in the
area of morphological analysis in particular and computational linguistics in general (e.g.
approaches) was reviewed to better understand how words are analyzed. As a result, the finite
state transducer based approach to morphological analysis was selected to analyze and derive
the root and grammatical properties of Somali words.
1.5.2 Data Collection and Classification
Any study needs to collect and analyze the data important for the research. In this thesis
work, a corpus in the form of an electronic list of Af-Somali words was collected from the book
known as Qaamuus and from different magazines on the internet. The unique word forms were
classified into categories such as nouns, verbs and adjectives, and further subdivisions were
made according to their morphosyntactic behavior using the Xerox finite state tool.
1.5.3 Analysis
The classified data were analyzed into roots (or stems) and affixes for each category using the
Xerox finite state tool in the lexicon formalism. Phonological rules were then identified and
formalized for each category using the xfst tool.
1.5.4 Implementation
Finite state transducers were created for each group of words, following the concept of the
‘finite state transducer’. Then, a computational model of Af-Somali inflectional morphology was
implemented using the Xerox Finite State Tool (xfst), developed by the two principal
researchers at the Xerox Palo Alto Research Center.
1.5.5 Testing
In this thesis work, the finite state approach will be used to develop the morphological analyzer.
A word list of surface word forms (tokens) will be extracted from the Af-Somali dictionary
(Qaamuus) and fed into the prototype to be analyzed. An output is considered correct only if it
contains all legal combinations of roots and grammatical structure for a given word form and
includes no incorrect roots or structures.
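The correctness criterion above amounts to an exact match between the set of analyses the analyzer returns and the set of legal analyses. A minimal sketch of such a check is given here in Python. The function names, the tag strings and the two sample words are hypothetical and serve only to illustrate the criterion; they are not taken from the actual test data.

```python
def is_correct(predicted, gold):
    """Return True only if every legal analysis was found and nothing else."""
    return set(predicted) == set(gold)


def accuracy(outputs, references):
    """Fraction of word forms whose set of analyses matches the reference exactly."""
    if not references:
        return 0.0
    correct = sum(
        is_correct(outputs.get(word, []), gold) for word, gold in references.items()
    )
    return correct / len(references)


# Hypothetical reference analyses and analyzer outputs (tag names are invented).
references = {
    "ninka": ["nin+N+Masc+Def"],
    "cunay": ["cun+V+1Sg+Past", "cun+V+3SgMasc+Past"],
}
outputs = {
    "ninka": ["nin+N+Masc+Def"],
    "cunay": ["cun+V+1Sg+Past"],  # one legal analysis is missing, so this counts as wrong
}
score = accuracy(outputs, references)
print(score)
```

Under this criterion a word form with a missing or a spurious analysis counts as an error, which is stricter than checking whether at least one analysis is right.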
1.6 Application of the Result
Since morphological analysis is a vital first step in natural language processing for any language,
the Af-Somali morphological analyzer is developed to enable more efficient and improved NLP
applications such as spelling and grammar checkers, POS taggers and machine translation systems.
It also helps linguists to analyze the morphological properties of the language easily. When
applications for Af-Somali are developed, end users who seek information stored in Af-Somali can
benefit from the analyzer, because it identifies the morphological category and properties of a
word. In this regard, this work is a useful basis for the technological development of the language.
The computational analysis of Af-Somali morphology would be a central and essential component in
the development of other Af-Somali processing applications.
1.7 Scope and Limitation
Somali linguistic varieties are divided into three main groups: Northern, Benadir and Maay.
Northern Somali forms the basis for Standard Af-Somali. The scope of this study is therefore
limited to developing a morphological analyzer for standard (Northern) Af-Somali morphology. It
does not include other dialects of Af-Somali. The study focuses mainly on the written form of
words. Derivation and compounding are also morphologically important, but they are not dealt with
in this thesis work. Although there are a number of models and approaches for computational
analysis in the literature, a finite state approach is employed in this thesis work.
1.8 Organization of the Thesis
This thesis is structured into six chapters. The first chapter gives background information on the
thesis work: it introduces natural language processing and morphological analysis, and presents
the problems that motivated the work, the objectives and the methodologies followed. It also
describes the importance and the scope of the thesis work. Chapter 2 presents the literature
reviewed for the thesis work. It looks into general Af-Somali word morphology and the general
characteristics of the Af-Somali parts of speech, and it presents the approaches to morphological
analysis. The studies related to this thesis work are presented in Chapter 3. Chapter 4 describes
the design and implementation of the analyses done in the preceding chapters. Chapter 5 discusses
the experimentation and evaluation. Chapter 6 concludes the thesis work and gives directions for
future work.
Chapter 2 : Literature Review
2.1 Introduction
This chapter presents the literature reviewed for the development of the Af-Somali morphological
analyzer. It mainly presents Af-Somali morphology, with emphasis on the morphological processes
involved in word formation and generation. It also presents background information on Af-Somali
and its phonetics. In addition, the chapter reviews the different computational approaches
employed in natural language processing systems and morphological analysis.
2.2 Introduction to Morphological Analysis
2.2.1 Morphemes
Morphs are the phonological or orthographical realizations of morphemes. A single morpheme
may be realized by more than one morph. In such cases, the morphs are said to be
allomorphs of a single morpheme. The following examples demonstrate the concept of
morphemes and their realization as morphs. Free morphemes are morphemes that can stand by
themselves as single words, such as town and dog in English, which can also appear with
other lexemes (as in town hall or dog house), or caleemo (‘leaves’) and saar (‘get off’) in
Af-Somali. Bound morphemes (or affixes), in contrast, never stand alone. They always appear
attached to other morphemes, as “un-” does in English. Bound morphemes in general tend to
be prefixes and suffixes. For example, in Af-Somali, the morph ‘in’ is the realization of
the morpheme that marks the verb infinitive.
Verbs such as “afuri”, “ababi” and “toosi”, like other verbs of the second group of
Af-Somali verbs, take “in” as infinitive marker, which gives “afurin”, “ababin” and
“toosin”. But when the same morpheme is attached to a different word, it may be realized
as a different morph. So, the same morpheme can be realized by different morphs in a
language. These different morphs of the same morpheme are called allomorphs. An allomorph
is a special variant of a morpheme. For example, the second person singular marker in
Af-Somali is sometimes realized as o, t or s, and the definite marker of feminine nouns,
the morpheme -t, has the morph “-t” in birta (‘the metal’) but “-d” in mindida (‘the
knife’). These are allomorphs of “-t”. A group of allomorphs makes up one morpheme class.
In addition to this, morphology deals with all the combinations that form words or parts of
words. The two broad classes of morphemes are stems and affixes. The stem is the
“main morpheme” of the word, supplying the main meaning. For example, in “guriga”,
guri (‘house’) is the stem and “ga” is the affix, which adds the meaning ‘the’.
2.2.2 Affixes
An affix is a bound morph that is realized as a sequence of phonemes. Affixes are classified
according to whether they are attached before or after the form to which they are added.
Prefixes are attached before and suffixes after. Most Af-Somali words use suffixes, and a
small number of verbs also use prefixes. Languages can further be classified as
concatenative or non-concatenative based on the morphology they possess. Non-concatenative
morphology is also called templatic or root-and-pattern morphology, and Af-Somali shows it
in the plural formation of some nouns. An example is the duplifix “aC” of the fourth noun
declension, as in buug (plural buugag) and fool (plural foolal).
2.2.3 Types of Morphological Processes
A word is defined as the smallest unit of thought that can be expressed vocally, composed of
one or more sounds combined in one or more syllables. A word is a minimum free form
consisting of one or more morphemes. There are three broad classes of ways to form words
from morphemes, and Af-Somali makes use of all three in word formation: inflection,
derivation and compounding.
2.2.4 Inflection
Inflection is the combination of a word stem with a grammatical morpheme, usually
resulting in a word of the same class as the original stem and usually filling some syntactic
function. Inflection is productive, and the meaning of the resulting word is easily
predictable, e.g. the imperative verb toos (‘direct!’) and toos+i (‘straighten’).
Inflectional morphemes modify a word's tense, number, aspect, and so on.
2.2.5 Derivation
Derivation is the combination of a word stem with a grammatical morpheme, usually
resulting in a word of a different class, often with a meaning that is hard to predict
exactly. In derivation, the part of speech (POS) of the derived word may change. Af-Somali
mostly uses inflectional word formation, although some words are formed by derivation.
2.2.6 Compounding
Compounding is the joining of two or more base forms to form a new word. Such root-root
fusions are very common in written Af-Somali. Compounds are formed by combining noun forms
that carry semantic content with verbal forms. For example, the Af-Somali plural noun
“buugag” (‘books’) combines with the verbal form sheeg to give another noun,
“buugagsheeg” (‘bibliography’).
2.3 AF-Somali Morphology
The Somali language (Af-Somali) is an Afro-Asiatic language belonging to the Cushitic branch of
the family. It is a Lowland East Cushitic language spoken by up to roughly 16 million people in
Somalia, Somaliland, Puntland, Djibouti, Ethiopia (Somali Region) and Kenya (Northeastern
Province) [25]. Somali linguistic varieties are divided into three main groups: Northern, Benadir
and Maay. Northern Somali (or Northern-Central Somali) forms the basis for the standard Somali
language. The Northern Somali dialect, commonly known simply as the Somali language, is spoken in
Djibouti, Ethiopia, Puntland and north of the Wabi-Shabeele, and represents the spoken standard of
literary Somali [26]. The writing system of the language was adopted in 1972, and there are no
textual archives before this date. It uses Roman letters and does not mark the tonal accent [26].
2.3.1 AF-Somali Phonetics
The phonetic structure of Af-Somali has 22 consonants and 10 vowels, 5 short and 5 long [33].
Af-Somali is also a tone accent language with 2 to 3 lexical tones. Af-Somali consonants follow
the same order and have the same values as the equivalent letters of the Arabic alphabet, except
G. Some of the letters represent sounds that are not found in English and that are similar to
their Arabic counterparts. The Af-Somali alphabet begins with ' (alif), which is followed by the
21 consonant letters B, T, J, X, KH, D, R, S, SH, DH, C, G, F, Q, K, L, M, N, W, H and Y. The ten
vowels of the Somali language are a, i, e, u and o and their long counterparts aa, ii, ee, uu
and oo. The Latin script poses no difficulty, and the vowels have the same values as in Spanish
or Italian.
2.3.2 Basic Characteristics of AF-Somali
The syllable structure of the Somali language is (C)V(C)(C), where items in parentheses are
optional. Most words have a di- or trisyllabic structure, and root morphemes and affixes are
usually mono- or disyllabic [33]. Af-Somali has the same Cushitic origin as Afaraf, Afaan Oromo
and the other languages of the Cushitic family, and resembles them in its vocabulary and in its
basic word order, which is SOV. The most distinguishing characteristic of Af-Somali, however, is
its double pluralization processes such as the ones illustrated in Table 2.1, where an
independently productive plural suffix -yáal can be added to already plural forms such as nim-á-n
‘men’ or naag-ó ‘women’.
Table 2.1: Pluralization system of Af-Somali
Singular word          Simple plural            Plural of plural
Nin(ka), masculine     Niman(ka), masculine     Niman-yaal, feminine
‘the man’              ‘men’                    ‘groups of men’
Roob(ka), masculine    Roobab(ka), masculine    Roobab-yow(ga), masculine
‘the rain’             ‘rains’
The other important characteristic that distinguishes Af-Somali from the other Cushitic
languages is the existence of an unquestionably derivational process that takes inflected plural
forms as its base, as illustrated in Table 2.2.
Table 2.2: Derivational inflected plural form of Af-Somali
Root word   Inflected plural   Derivation   Word/English
Buug        ag                 sheeg        Buugagsheeg/bibliography
Buug        ag                 haye         Buugaghaye/librarian
Geed        o                  aqoon        Geedaqoon/botany
Xagl        o                  gooye        Xaglogooye/diagonal
Like any other language, Af-Somali has some notable characteristics: its inflectional system,
inflected forms in composition and derivation, conjugational classes, affixation and
reduplication. In addition, there are three broad ways to form words from morphemes in Af-Somali,
namely inflection, derivation and compounding. This work considers the analysis of the
inflectional word formation processes of the most important Af-Somali parts of speech. These are
nouns, verbs and adjectives, and their word formation processes are presented in the following
sections.
2.4 Inflectional Process of AF-Somali
2.4.1 Nouns
Grammatically, Af-Somali nouns are encoded morphologically by affixation to roots and stems. As
in other related languages, Af-Somali nouns are inflected for gender, number and person. Nouns in
Af-Somali, as in any other language, are the names of persons, places, things and abstract
entities. Nouns are inherently masculine or feminine. In general, a noun consists of a root and
affixes, which provide a combination of gender and number marking. The main complication is that
there are several declension classes, with specific singular and plural suffixes for groups of
classes. Af-Somali nouns are thus marked for gender, pluralization and determiners, as presented
below.
Gender Markers
Somali nouns can be marked for gender to distinguish between masculine and feminine. Some
Af-Somali nouns are distinguished only by a difference in accentual tone, but this thesis
considers only the nouns that are marked for gender by affixes. The gender markers of Af-Somali
are all suffixes. The markers for the feminine and masculine are shown in Table 2.3: “ka”, “ha”,
“a” and “ga” are masculine markers, and “ta”, “da” and “sha” are feminine markers.
Table 2.3: Af-Somali Gender Markers
Masculine markers   ka   ha   ga   a
Feminine markers    ta   da   sha
The gender markers in Af-Somali are attached to nouns as suffixes to differentiate between the
masculine and feminine. Table 2.4 shows how the markers are suffixed to Af-Somali nouns.
Table 2.4: Example of noun with gender markers
Word                 Masculine marker   Word                 Feminine marker
Ninka (the man)      ka                 Gabadha (the girl)   da
Guriga (the house)   ga                 Badda (the sea)      da
Aabaha (a father)    ha                 Hasha (the camel)    sha
Although the nouns and the way gender markers are suffixed to them have been presented, there
are further rules that have to be captured in this study. In Af-Somali the basic markers for
gender are “ka” and “kii” for masculine and “ta” and “tii” for feminine, but the markers can
change based on the last character of the word. For example, if a masculine noun ends with the
vowel i or e, the marker ka changes into ga or ha respectively. If a feminine noun ends with the
consonant l, the feminine marker “ta” changes into “sha” and the “l” is deleted. Finally, all
feminine words that end with the vowel o take the gender marker “da”.
Pluralization System of AF-Somali Nouns
There are different rules for changing singular Af-Somali nouns into plural, depending on the
gender of the word. Most Af-Somali noun pluralization is inflectional, which means it does not
change the grammatical word category, and most nouns become plural simply by taking a suffix.
As shown in Table 2.5, monosyllabic Af-Somali words can form the plural by partial reduplication:
the last consonant is copied and the vowel ‘a’ is inserted between the two copies. If a singular
Af-Somali noun ends with a consonant such as b, d, n, l or r, the last consonant of the word is
doubled and the vowel ‘o’ is added to make the word plural, and the gender changes to feminine.
If the noun ends with s, q, c, f, x or i, the suffix ‘yo’ is added to the root word to form the
plural. Some disyllabic singular nouns form the plural by adding the suffix ‘o’ and deleting the
letter before the last consonant, and their gender remains unchanged. In addition, nouns that end
with -e form the plural by adding the suffix -yaal. Some words borrowed from Arabic form the
plural as in Arabic. As a result, Af-Somali nouns are classified into seven declensions, as shown
in Table 2.5, based on how they become plural and on the gender of the plural with respect to the
singular. For example, if the singular word is masculine and changes into feminine when it
becomes plural, the word is in declension one [1].
Table 2.5: Example of Af-Somali pluralization and declension formation
Word     Gender      Number     Plural form    Gender      Number   English    Declension
Miis     Masculine   Singular   Miis+as        Masculine   Plural   Tables     Dec-4
Baal     Feminine    Singular   Baal+al        Feminine    Plural   Diagonal   Dec-4
Erey     Masculine   Singular   Erey+yo        Masculine   Plural   Words      Dec-2
Mindi    Feminine    Singular   Mindi+yo       Masculine   Plural   Knives     Dec-2
Naag     Feminine    Singular   Naag+o         Masculine   Plural   Women      Dec-1
Ilig     Masculine   Singular   Ilk+o          Masculine   Plural   Teeth      Dec-3
Gabadh   Feminine    Singular   Gabdh+o        Masculine   Plural   Girls      Dec-3
Dameer   Masculine   Singular   Dameer+ro      Feminine    Plural   Donkeys    Dec-5
Hooyo    Feminine    Singular   Hooyo+oyin     Masculine   Plural   Mothers    Dec-6
Sheeko   Feminine    Singular   Sheeko+oyin    Masculine   Plural   Stories    Dec-6
Aabe     Masculine   Singular   Aabe+yaal      Feminine    Plural   Fathers    Dec-7
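The partial reduplication used by the fourth declension (a copy of the last consonant preceded by ‘a’) can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the xfst implementation of this thesis. The function names are hypothetical, and treating dh, kh and sh as single consonants is an assumption based on the Af-Somali alphabet. The function does not decide whether a noun belongs to the fourth declension; that information has to come from the lexicon.

```python
DIGRAPHS = ("dh", "kh", "sh")  # consonants written with two letters (assumption)


def last_consonant(noun):
    """Return the final consonant of a noun, treating dh, kh and sh as one unit."""
    lower = noun.lower()
    for digraph in DIGRAPHS:
        if lower.endswith(digraph):
            return digraph
    return lower[-1]


def plural_declension4(noun):
    """Form the plural by suffixing 'a' plus a copy of the last consonant."""
    return noun + "a" + last_consonant(noun)


for singular in ["miis", "baal", "buug", "fool", "roob"]:
    print(singular, "->", plural_declension4(singular))
```

The five nouns are the fourth-declension examples that appear in this chapter (miis, baal, buug, fool and roob).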
2.4.2 AF-Somali Noun Determiners
The determiners are modifiers that add meaning to the noun by attaching to it as a suffix. They
are classified into four types according to the meaning they add to the noun: articles
(Qodob), demonstratives (Tilmaame), interrogatives (Weydimo) and possessives (Lahaansho).
Articles
Af-Somali articles take different forms: -ka and -kii for masculine nouns, and -ta and -tii
for feminine nouns. If the person being talked about is far away, or the thing being reported
is in the past, -ka/-ta changes into -kii/-tii respectively. The form of the article also
changes depending on the last letter of the noun to which it is attached. For example, when
the article “ka” is added to the noun “kabo”, “ka” changes into “ha” and the word becomes
“kabaha”. Table 2.6 shows which article is attached to each noun and how it changes. As
indicated in Table 2.6, the article marker -k changes into -g when it is suffixed to masculine
nouns that end with -g, -w, -aa, -u, -y or -i, and the article is reduced to -a when the
masculine noun ends with -h, -x, -q, -c or -kh. In addition, the feminine article marker -ta
can change into -da or -sha: -t becomes -d if it is suffixed to a noun that ends with -o, -d,
-c, -x, -h, -y or ('), and becomes -sh when it is suffixed to a feminine noun that ends with
-l, in which case the “l” is deleted.
Table 2.6: Example of Af-Somali Articles
Root word   Gender      Article   Formed word
Kabo        Masculine   ka        Kaba(ha)
Buug        Masculine   ka        Buug(ga)
Magac       Masculine   ka        Magac(a)
Maro        Feminine    ta        Mara(da)
Bac         Feminine    ta        Bac(da)
Ul          Feminine    ta        Usha
Demonstrative suffixes
Like the articles, demonstratives are suffixed to nouns and modify their meaning by indicating
distance or location. The choice depends on the relationship between the subject and the object,
or on the distance between the speaker and what is being talked about. Af-Somali has three
demonstrative markers, as described in Table 2.7: nearness (kan), farness (kaas) and to the
left/right (keer) for masculine nouns, and nearness (tan), farness (taas) and to the left/right
(teer) for feminine nouns.
Table 2.7: Af-Somali Demonstratives
Word     Gender      Near       Farness     To left/right
Gabadh   Feminine    tan        taas        teer
Gabadh   Feminine    Gabadhan   Gabadhaas   Gabadheer
Nin      Masculine   kan        kaas        keer
Nin      Masculine   Ninkan     Ninkaas     Ninkeer
Table 2.7 also shows that whenever a suffix starting with t is added to a feminine noun whose
last character is “dh”, the t is deleted and only the remaining part of the suffix is attached.
Possessive Suffixes
In Af-Somali, as in other languages, possessive suffixes express ownership or possession. They
are classified into masculine and feminine forms and vary with person and number, which gives
the different possessives shown in Table 2.8.
Table 2.8: AF-Somali possessive
Person   Masculine   Feminine   Root noun   Gender      Word
1st.Sg   Kayga       Tayda      Buug        Masculine   Buug-gayga
2nd.Sg   Kaaga       Taada      Gabadh      Feminine    Gabadh-aada
3rd.Sg   Kiisa       Tiisa      Buug        Masculine   Buug-giisa
3rd.Sg   Keeda       Teeda      Gabadh      Feminine    Gabadh-eeda
1st.Pl   Keenna      Teenna     Nin         Masculine   Nin-keenna
2nd.Pl   Kiinna      Tiinna     Wiil        Masculine   Wiil-kiinna
3rd.Pl   Kooda       Tooda      Bac         Feminine    Bac-dooda
Interrogative Suffixes
The interrogative suffixes are determiners that add a question-like meaning. Like the other
determiners, they have masculine and feminine forms: the suffix -kee is used for masculine nouns
and the suffix -tee for feminine nouns, as shown in Table 2.9.
Table 2.9: Interrogative representation of Af-Somali
Root noun   Gender      Interrogative suffix   Word
Dal         Masculine   kee                    Dalkee
Sacad       Feminine    tee                    Sacaddee
Meel        Feminine    tee                    Meeshee
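The alternations of the initial k or t of the determiner suffixes described in this section can be collected into one small function. The following Python sketch is an illustration only and is not the two-level rule set of this thesis. The function name is hypothetical. It assumes that the rules stated for the articles apply in the same way to the first consonant of every determiner suffix, as the examples in Tables 2.6 to 2.9 suggest. The change of a final e or o to a (as in kabo, kabaha) is inferred from the tables and is shown in this chapter only for the article. Cases not stated in this section, such as feminine nouns ending in i, are not covered.

```python
def attach_determiner(noun, gender, suffix):
    """Attach a determiner suffix that starts with k (masculine) or t (feminine)."""
    stem = noun.lower()
    rest = suffix[1:]  # the suffix without its initial k or t
    if gender == "masculine":
        if stem.endswith(("kh", "h", "x", "q", "c")):
            return stem + rest                # k is dropped: magac + ka -> magaca
        if stem.endswith(("e", "o")):
            return stem[:-1] + "ah" + rest    # kabo + ka -> kabaha (vowel change assumed)
        if stem.endswith(("g", "w", "y", "i", "u", "aa")):
            return stem + "g" + rest          # buug + ka -> buugga
        return stem + "k" + rest              # nin + ka -> ninka
    if gender == "feminine":
        if stem.endswith("dh"):
            return stem + rest                # t is dropped: gabadh + tan -> gabadhan
        if stem.endswith("l"):
            return stem[:-1] + "sh" + rest    # l is deleted: ul + ta -> usha
        if stem.endswith("o"):
            return stem[:-1] + "ad" + rest    # maro + ta -> marada (vowel change assumed)
        if stem.endswith(("d", "c", "x", "h", "y", "'")):
            return stem + "d" + rest          # bac + ta -> bacda
        return stem + "t" + rest              # bir + ta -> birta
    raise ValueError("gender must be 'masculine' or 'feminine'")


print(attach_determiner("buug", "masculine", "kayga"))
print(attach_determiner("meel", "feminine", "tee"))
```

The check for “dh” has to come before the check for a final “h”; otherwise gabadh would wrongly take the -d form.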
2.4.3 Adjectives
Adjectives, in turn, do not belong to a clearly defined category in Af-Somali. Items such as yár
‘small’ and wéyn ‘big’ are best interpreted as state verbs displaying a particular defective
paradigm. Adjectives are inflectionally pluralized through reduplication. The reduplicated plural
is formed by prefixing a copy of the first syllable to the stem. Only the second syllable bears
the high tone. Besides this, adjectives can be marked for person and definiteness and can take
tense markers. For example, the plural forms of the adjectives cad, cusub and yar are shown in
Table 2.10.
Table 2.10: Pluralization of adjectives
Root adjective word   Number   Word       Number
Cad (white)           Sg       Cadcad     Plural
Cusub                 Sg       Cuscusub   Plural
Yar                   Sg       Yaryar     Plural
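The reduplicated plurals in Table 2.10 can be reproduced with a short Python sketch. It is an illustration under an assumption, not the implementation of this thesis: judging from the three examples, the copied part is taken to be the initial consonant(s), the first vowel (short or long) and one following consonant. The function names are hypothetical, and tone is not modelled.

```python
VOWELS = set("aeiou")
DIGRAPHS = ("dh", "kh", "sh")  # consonants written with two letters (assumption)


def first_cvc(word):
    """Return the initial consonant(s), the first vowel run and one following consonant."""
    lower = word.lower()
    i = 0
    while i < len(lower) and lower[i] not in VOWELS:  # leading consonant(s)
        i += 1
    while i < len(lower) and lower[i] in VOWELS:      # short or long vowel
        i += 1
    if lower.startswith(DIGRAPHS, i):                 # following consonant is a digraph
        i += 2
    elif i < len(lower):                              # following single consonant
        i += 1
    return lower[:i]


def plural_adjective(adjective):
    """Prefix a copy of the first CVC sequence to the adjective."""
    return first_cvc(adjective) + adjective.lower()


for adjective in ["cad", "cusub", "yar"]:
    print(adjective, "->", plural_adjective(adjective))
```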
2.4.4 The Verb
The verb is the most important part of speech in Af-Somali, and it is inflectionally more complex
than the other parts of speech. A typical verb consists of a root plus a number of affixes. These
include derivational affixes (Somali includes a passivizing form which can only be applied to
verbs which have a ‘causative’ argument, and a causative affix which adds such an argument) and a
set of inflectional affixes which mark aspect, tense and agreement [25]. The verb has complex
alternation patterns, and its basic building blocks are the root word, modifiers, person and
conjugation. The most important of these to describe is conjugation, so some of its properties
are presented below with examples. Conjugation shows the tense, aspect and mood of the verb. The
agreement of person and tense produces six different forms of a word, as illustrated in Table
2.11. The table also shows the person agreement with tenses and the person markers for each of
the six forms.
Table 2.11: Example of person agreement with tenses
Person        Root verb   Present tense   Past tense   Present verb   Past verb
1st.Sg        Cun         Cun-0-aa        Cun-0-ay     Cunaa          Cunay
2nd.Sg        Cun         Cun-t-aa        Cun-t-ay     Cuntaa         Cuntay
3rd.Sg.masc   Cun         Cun-0-aa        Cun-0-ay     Cunaa          Cunay
3rd.Sg.fem    Cun         Cun-t-aa        Cun-t-ay     Cuntaa         Cuntay
1st.Pl        Cun         Cun-n-aa        Cun-n-ay     Cunnaa         Cunnay
2nd.Pl        Cun         Cun-t-aan       Cun-t-een    Cuntaan        Cunteen
As shown in Table 2.11, (0) marks the 1st person singular, 3rd person singular masculine and 3rd
person plural. The suffix -t marks the 2nd person singular, 3rd person singular feminine and 2nd
person plural, and the suffix -n marks the 1st person plural. These markers are followed by -ay
or -een in the past tense and by -aa or -aan in the present tense. Af-Somali verbs are classified
into five conjugation categories based on their imperative markers.
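The pattern of Table 2.11 (root, then person marker, then tense ending) can be written down as a small lookup. The Python sketch below is an illustration only and covers just the six forms of the table for first conjugation verbs such as cun. The names are hypothetical, and no alternation rules are applied at the morpheme boundaries.

```python
# Person markers from Table 2.11; the empty string stands for the zero marker (0).
PERSON_MARKER = {
    "1sg": "",
    "2sg": "t",
    "3sg.masc": "",
    "3sg.fem": "t",
    "1pl": "n",
    "2pl": "t",
}

# (ending for most persons, ending for the 2nd person plural) per tense.
TENSE_ENDINGS = {"present": ("aa", "aan"), "past": ("ay", "een")}


def conjugate(root, person, tense):
    """Concatenate root, person marker and tense ending as in Table 2.11."""
    short_ending, long_ending = TENSE_ENDINGS[tense]
    ending = long_ending if person == "2pl" else short_ending
    return root + PERSON_MARKER[person] + ending


for person in PERSON_MARKER:
    print(person, conjugate("cun", person, "present"), conjugate("cun", person, "past"))
```

Because several persons share the same marker, forms such as cunaa are ambiguous between the 1st person singular and the 3rd person singular masculine, so an analyzer has to return more than one analysis for them.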
2.4.5 Classification of AF-Somali Verbs
Based on conjugation, Af-Somali verbs are classified into two broad categories: a large number of
verbs that take only suffixes, and a small number of verbs that take both prefixes and suffixes.
Let us first consider the conjugation of verbs that take only suffixes, which are the most common
in Af-Somali. These verbs are classified into five conjugations, known as the 1st, 2nd, 3rd, 4th
and 5th conjugations. Verbs of the 1st conjugation do not use an imperative marker and are mostly
monosyllabic words, as shown in Table 2.12.
Table 2.12: First conjugation representation of Af-Somali verbs
Verb   Person   Tense   Imperative   The word
Cun    0        ay      0            Cunay
Jab    0        ay      0            Jabay
Qor    t        ay      0            Qortay
Secondly, verbs of the 2nd conjugation are mostly formed from other verbs and are suffixed with
the imperative marker “i”. For example, the verb “toos” is suffixed with “i” to become a 2nd
conjugation verb, as shown in Table 2.13.
Table 2.13: Second Af-Somali verb conjugation (toosi)
Person/number   Imperative   Habitual present   Present continuous   Past        Past continuous
1st.Sg          i            Toosiyaa           Toosinayaa           Toosiyay    Toosinayay
2nd.Sg          i            Toosisaa           Toosinaysaa          Toosiyay    Toosinaysay
3rd.Sg.masc     i            Toosiyaa           Toosinayaa           Toosiyay    Toosinayay
3rd.Sg.fem      i            Toosisaa           Toosinaysaa          Toosisay    Toosinaysay
1st.Pl          i            Toosinaa           Toosinaynaa          Toosinay    Toosinaynay
2nd.Pl          i            Toosisaan          Toosinaysaan         Toosiseen   Toosinayseen
3rd.Pl          i            Toosiyaan          Toosinayaan          Toosiyeen   Toosinayeen
Verbs of the 3rd conjugation are characterized by the imperative marker “ee”. Some of the verbs
in this conjugation are listed as examples in Table 2.14.
Table 2.14: Example of Af-Somali 3rd. conjugation representation
Root verb   Imperative   Infinitive   The verb   In English
Dhab        ee           eyn          Dhabeyn    Make the truth
Ciid        ee           eyn          Ciideyn    Put the soil
Lastly, verbs of the 4th conjugation are characterized by the imperative marker “o”, and verbs of
the 5th conjugation by the imperative marker “so”. Table 2.15 gives an example of the 4th
conjugation that shows its inflection for person, number and tense, and how the agreement of
person with number and tense produces seven different verb forms.
Table 2.15: Fourth Af-Somali verb conjugation representation
Person/number   Imperative   Habitual present   Present continuous   Past        Past continuous
1st.Sg          o (dhaqo)    Dhaqdaa            Dhaqanayaa           Dhaqday     Dhaqanayay
2nd.Sg          o            Dhaqataa           Dhaqanaysaa          Dhaqatay    Dhaqanaysay
3rd.masc        o            Dhaqdaa            Dhaqanayaa           Dhaqday     Dhaqanayay
3rd.fem         o            Dhaqataa           Dhaqanaysaa          Dhaqatay    Dhaqanaysay
1st.Pl          o            Dhaqannaa          Dhaqanaynaa          Dhaqannay   Dhaqanaynay
2nd.Pl          o            Dhaqataan          Dhaqanaysaan         Dhaqateen   Dhaqanayseen
3rd.Pl          o            Dhaqdaan           Dhaqanayaan          Dhaqdeen    Dhaqanayeen
2.5 Derivational System of AF-Somali
Morphologically, Af-Somali words are mostly inflectional, like those of other Cushitic languages,
but some words are derivational. The derived words in Af-Somali are mostly verbs and adjectives,
which can be formed from other categories of words, and most adjectives are formed from verbs.
Some nouns are morphologically derived from other word classes in the process of word formation.
Most verbs in Af-Somali can be changed into nouns by taking the suffix (a) and doubling the last
consonant. For example, the verb “dil” is changed into a noun by simply adding “aa”, giving
“dilaa”, and the verb “cun” is changed into a noun by adding “to”, giving “cunto”.
Verb morphology is slightly more complex: again, a typical verb consists of a root plus a number
of affixes. These include derivational affixes (Somali includes a passivizing form which can only
be applied to verbs which have a ‘causative’ argument, and a causative affix which adds such an
argument) [25]. For example, the noun aadaan (‘prayer’) becomes the verb “aadanay” (‘praying’),
and the noun iskaashato (‘cooperation’) is changed into the verb iskashi.
Like other parts of speech, Af-Somali adjectives also have a derivational process. There are two
sorts of adjectives: ‘basic adjectives’ (a small number), such as yár ‘small’ and wéyn ‘big’, and
those formed from nouns and verbs by the addition of lexical suffixes, such as caan-sán ‘famous’
(cáan ‘fame’), wanaag-sán ‘good’ (wanáag ‘goodness’) and jar-án ‘chopped’ (jár ‘to break’).
Compounding, on the other hand, creates a derived word from two different words, such as a verb
and a noun or an adjective and a noun.
2.6 Approaches to Morphological Analysis
There are a number of approaches which are widely used in computational morphology. Some of
these approaches are based on concepts in automata theory, probability, principle of analogy, and
information theory. The computational morphological approaches are broadly categorized into
rule-based and corpus-based approaches.
2.6.1 Corpus-based Approaches
Corpus-based approaches are statistical in nature and do not strictly follow an explicit theory
of linguistics [32]. A suitable machine learning algorithm is used to train the system and to
collect the necessary information and features from the corpus. The knowledge acquired is then
used to perform the morphological analysis task [32]. Based on the type of text corpora used,
corpus-based approaches can be further categorized into supervised and unsupervised approaches.
Supervised approaches use annotated text corpora, while unsupervised approaches use natural
corpora such as those found in newspapers and books. These approaches need a huge corpus of words
to train the algorithm. They are therefore difficult to apply to under-resourced languages like
Somali, for which they may not produce efficient, high-quality output. The machine learning
approach has mostly been used for well-resourced languages, since it requires large word corpora,
electronic dictionaries, newspapers and other documents found on the Internet. These languages
used the approach to overcome the overhead created by the rule-based approach; examples are
English [30] and Arabic [13]. Limited research has been done with corpus-based approaches for
local languages such as Amharic [22] and Ge’ez [34]. Most local languages have instead used a
rule-based approach, specifically two-level morphological analysis.
2.6.2 Rule-based Approach
The rule-based approach strictly follows an explicit theory of linguistics, that is, a theory of
morphology laid down by an expert. Kazakov and Munandhar [32] stated that this approach makes it
possible to incorporate sophisticated linguistic theories such as generative phonology into
computational morphology processes. Because of their reliance on linguistic theories, systems
developed using rule-based approaches are often efficient and produce better quality outputs [28].
Different rule-based methods are used to develop morphological analyzers, among them the
paradigm-based method and finite state automata.
In the paradigm-based method for a particular language, each word category, such as nouns, verbs,
adjectives, adverbs and postpositions, is classified into certain types of paradigms based on its
morphophonemic behavior. A paradigm-based morphological compiler program is then used to develop
the morphological analyzer.
The Finite State Automata (FSA) based method uses regular expressions and is used to accept or
reject a string in a given language. In general, an FSA is used to study the behavior of a system
composed of states, transitions and actions. When an FSA starts working, it is in the initial
state, and if the automaton is in any one of the final states when the input ends, it accepts its
input and stops working. Within computational morphology, a very significant advance came with
the demonstration that phonological rules could be implemented as finite state transducers (FSTs),
and that rule ordering could be dispensed with by using FSTs that relate the surface and lexical
levels directly, the so-called “two level” morphology (TLM). A transducer that performs analysis
(surface input to lexical output) can also be converted to one that performs generation (lexical
input to surface output) [32]. TLM is devised to handle morphological analysis and generation in a
bi-directional way. The approach is based on two lexica (one for the underlying and the other for
surface word forms), and a set of morphological rules. The rules establish whether a given
sequence of characters at the surface level (as it appears in the text) can correspond to a
sequence of symbols used to represent the morphemes in the lexicon. In other words, the rules map
the two strings to each other. TLM is currently a very popular method in computational morphology
[32]. The main benefits of FSTs for NLP stem from several properties of finite-state devices:
true representation, modularity, compactness, efficiency and reversibility.
True representation means that the kind of phonological and morphological rules that are common
in linguistic theories can be directly implemented as finite-state relations. The implementation of
linguistically motivated rules in FST is therefore straightforward and direct. Modularity follows from the
closure properties of regular languages and relations, which provide various means for combining regular
expressions and support a variety of operations on the languages these expressions denote. For
example, closure under union facilitates the separate development of two grammar fragments, which
can then be directly combined in a single operation. The most useful operation under which
transductions are closed is probably composition, which is the central vehicle for implementing
replace rules. Compactness means that finite-state automata can be minimized, guaranteeing that for a
given language an automaton with a minimal number of states can always be generated.
Toolboxes can apply minimization either explicitly or implicitly to reduce
storage requirements. Efficiency follows from determinism: when an automaton is deterministic, recognition is
optimally efficient (linear in the length of the string to be recognized). Automata can always be determinized, and
toolboxes can take advantage of this to improve time efficiency. Finally, reversibility follows from the fact that
finite-state automata and transducers are inherently declarative; it is the application program that
implements either recognition or generation. In particular, transducers can be used to map strings from
the upper language to the lower language or vice versa with no changes in the underlying finite-state
device [28].
2.7 Finite State Technology
Finite-state technology (FST) denotes the use of finite-state devices, such as automata and
transducers, in natural language processing. Since the early works that demonstrated the
applicability of this technology to linguistic representation, FST has been considered adequate for
describing the phonological and morphological processes of the world’s languages [32]. In order
to understand how to build a linguistic application, we first need to be acquainted with the basics
of how a finite-state machine works.
2.7.1 Finite State Machines
A finite-state machine (FSM) is an abstract machine that implements a regular language.
Regular languages can be described formally in a concise notation, through regular expressions.
A finite-state machine is a network of states with one start state and one or more
final states. Transitions between states are possible only if the required input is recognized. A path
is a sequence of transitions over arcs to a particular state. In computational morphology, a path
corresponds to a sequence of symbols equivalent to a word in a natural language. A technology that
uses finite-state networks in the process of creating an application is therefore called a finite
state technology. A finite state automaton, however, only accepts a word and checks whether it is a valid
word of the language. It does not produce any output.
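The accept-or-reject behaviour described above can be sketched in Python. This is only an illustration and not part of the thesis implementation: the toy network, its state numbering and the helper name `accepts` are assumptions made for the example, and the network recognizes just the two forms tag and tagay.

```python
# Minimal sketch of a finite-state acceptor (hypothetical toy network).
# States are integers; TRANSITIONS maps (state, symbol) -> next state.
START = 0
FINALS = {3, 5}  # state 3 is reached after "tag", state 5 after "tagay"
TRANSITIONS = {
    (0, "t"): 1,
    (1, "a"): 2,
    (2, "g"): 3,
    (3, "a"): 4,
    (4, "y"): 5,
}


def accepts(word):
    """Return True if the network is in a final state after reading word."""
    state = START
    for symbol in word:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no arc for this symbol: reject
        state = TRANSITIONS[key]
    return state in FINALS
```

The function only answers yes or no for a given string, which is exactly the limitation of plain automata noted above.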
2.7.2 Finite-state transducers
So far, the analysis of words in a network has simply yielded one of two responses, either accept,
indicating that the word is in the language of the network, or a reject, indicating that the word is
not in the language. While this can be valuable, as for instance in spell-checking, finite-state
networks are capable of storing and returning much more interesting information [28].
Within computational morphology, a very significant advance came with the demonstration
that phonological rules could be implemented as finite state transducers [11] and that the rule
ordering could be dispensed with using FSTs that relate the surface and lexical levels directly [11],
so-called “Two-level” morphology. A second important advance was the recognition by [11] that
a cascade of composed FSTs could implement the two-level model. Finite-state techniques are
probably the most prevalent approach employed by automatic morphology systems, as their
simplicity and outstanding efficiency are unequaled.
FSAs can be used to recognize particular patterns, but don’t, by themselves, allow for any analysis
of word forms. Hence for morphology, we use finite state transducers (FSTs) which allow the
surface structure to be mapped into the list of morphemes. FSTs are useful for both analysis and
generation, since the mapping is bidirectional [28].
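To make the bidirectional mapping concrete, the following Python sketch models a very small transducer as a list of arcs and uses the same arcs for analysis and for generation. It is an assumption-laden toy: the arc list, the function names and the single verb form are invented for illustration, and the tag spellings `+V` and `+paste` simply follow the tag set used later in this thesis.

```python
# Toy transducer: each arc is (source, upper symbol, lower symbol, target).
# "" stands for the empty string. Tags such as "+V" are multi-character
# symbols on the upper (lexical) side.
ARCS = [
    (0, "t", "t", 1),
    (1, "a", "a", 2),
    (2, "g", "g", 3),
    (3, "+V", "", 4),
    (4, "+paste", "ay", 5),
]
START, FINALS = 0, {5}


def paths(state=START, upper=(), lower=""):
    """Yield (upper symbols, lower string) for every complete path."""
    if state in FINALS:
        yield upper, lower
    for src, up, low, dst in ARCS:
        if src == state:
            yield from paths(dst, upper + (up,), lower + low)


def analyze(surface):
    """Map a surface form to its lexical strings (lower to upper)."""
    return ["".join(up) for up, low in paths() if low == surface]


def generate(lexical):
    """Map a lexical string to its surface forms (upper to lower)."""
    return [low for up, low in paths() if "".join(up) == lexical]
```

Enumerating all paths is only workable for a tiny acyclic network like this one; real toolkits apply the transducer to the input symbol by symbol instead.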
2.7.3 Two Level Morphological Approach
The two-level morphology approach to morphological analysis is a language-independent general
formalism for the analysis and generation of word forms [28]. Kimmo Koskenniemi introduced this approach in 1983.
The generative phonology approach creates unnecessary intermediate levels and is also
uni-directional. Koskenniemi decided to eliminate the intermediate levels. This created a new approach,
which has only two levels, the lexical level and the surface level, hence the name Two-Level
Morphology. This model has the added advantage of being bi-directional, meaning that both
analysis and generation can be done using the same system, which was not possible with the
earlier uni-directional approaches. Two-level morphology depends heavily on finite
state methods, which are well known and are often described as elegant [28]. The two-level
approach has already been used successfully to develop comprehensive morphological analyzers
for Swahili (a Bantu language), Amharic, Afan Oromo and Tigrinya. The examples
in Table 2.16 show the two-level representation of the Af-Somali words tagay (he went)
and waddooyin (roads). The surface level is the inflected word form and the lexical level defines
the stem plus a set of morphological feature tags relating to the word.
Table 2.16: Example of Af-Somali two level representation
Level          Word      Word class  Inflectional type     Generated word
Lexical level  tag (go)  Verb        past 3rd.per.Sg.masc  tagay
Surface level  tag       0           ay 0                  tagay
Lexical level  waddo     Noun        Pl                    waddooyin
Surface level  waddo     0           oyin                  waddooyin
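The correspondences in Table 2.16 can be written down as aligned lexical:surface pairs. The short Python sketch below is illustrative only: the tag spellings (for example `+3rd.Sg.masc`) are assumed, and the helper names are hypothetical. As in the table, "0" marks a lexical symbol that has no surface counterpart.

```python
# Each pair is (lexical symbol, surface realization); "0" marks a lexical
# symbol with no surface counterpart, as in Table 2.16.
TAGAY = [("tag", "tag"), ("+V", "0"), ("+paste", "ay"), ("+3rd.Sg.masc", "0")]
WADDOOYIN = [("waddo", "waddo"), ("+N", "0"), ("+pl", "oyin")]


def lexical_form(pairs):
    """Concatenate the lexical side of the aligned pairs."""
    return "".join(lex for lex, _ in pairs)


def surface_form(pairs):
    """Concatenate the surface side, skipping the empty symbol 0."""
    return "".join(surf for _, surf in pairs if surf != "0")
```

Reading the pairs from the surface side to the lexical side corresponds to analysis, and reading them in the other direction corresponds to generation.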
2.7.4 The Xerox Finite State Framework
The Xerox research institute has developed a set of finite-state tools which provide a means of
implementing two-level morphologies. The tools are natural language independent and have been
used to implement morphologies for many major languages, such as English, Spanish, French,
German and Arabic, as well as for Afaraf, Afan Oromo, Amharic and others. Xerox finite state
technology (XFST) is a programming language for regular expressions, which can be compiled
into finite state networks, and it is used here for the analysis of Af-Somali morphology. It comes bundled
with a set of tools for compiling and working with FSTs. XFST includes two components known
as lexc and xfst. lexc is a compiler for lexicons written in the lexc language, which is specifically designed
for handling morphotactics (the syntax of morphemes) in natural languages. xfst is the core
tool; it provides an interface to the finite state calculus for building, accessing and manipulating finite
state networks, together with a compiler for regular expressions and replacement rules, which is essential
for any such work.
Lexicon Compiler
Lexicon compiler (Lexc) is the finite-state tool which has been developed by Xerox for defining
two-level lexicons. Lexc is just one of several ways to specify finite-state transducers, but it is
especially designed to facilitate the work of the lexicographer [28].
Lexicons and morphotactic information are encoded in the lexc language, which is a kind of right
recursive phrase-structure grammar, and are compiled into finite-state transducers as shown in
Figure 2-1. Finite-state transducers (FSTs) are data structures that encode regular relations [28],
which are mappings between two regular languages. For convenience, we can visualize
a finite-state relation as having an upper-side regular language and a lower-side regular language,
where each string in one language is related to one or more strings in the other language. By
convention, the upper-side or analysis strings of an FST compiled from a lexc description consist
of underlying morphemes (strings of phonemes and morphophonemes) and multi-character symbol
tags like +Noun, +Verb, +Adj (adjective), +Conj (conjugations), +ImpeV (imperative verb),
+Masc (masculine), +Fem (feminine), +Sg (singular), +Pl (plural), etc. that identify the morphemes
[3]. lexc accepts a text file containing a user-defined lexicon encoded using the following syntax.
Lexical-item Continuation-class;
The lexical item is usually the unmarked form of the word (the root or headword given in a
dictionary). In the context of this work the lexical item is the stem (the root in most cases) to
which inflectional affixes are attached, i.e. a free morpheme. The continuation class can be
a pointer to another lexicon or it can be the end-of-string marker. The example in
Figure 2-1 shows two entries for ‘tag (go)’, one of which is followed by the end-of-string marker
‘#’ and the second of which points to the continuation class past Tense, where the aspect form of the
word will be defined.
Figure 2-1: Example of two level representation of Af-Somali

We make use of the two-level representation to encode valuable morphological information
about the words, as the above example shows. The symbols to the left of the colon represent the
lexical level ‘Verb+tag’, and the symbols to the right of the colon represent the surface form ‘tag’.
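The way continuation classes chain lexicon entries together can be imitated in a few lines of Python. This is a rough sketch under stated assumptions, not lexc itself: the lexicon names `Root` and `PastTense`, the upper-side spelling `tag+V`, and the function `expand` are all hypothetical, chosen to mirror the two entries for tag described for Figure 2-1.

```python
# Each lexicon maps to a list of (upper, lower, continuation) entries.
# "#" is the end-of-string marker; lexicon and tag names are illustrative.
LEXICONS = {
    "Root": [
        ("tag+V", "tag", "#"),          # entry that ends the word
        ("tag+V", "tag", "PastTense"),  # entry that continues to a suffix
    ],
    "PastTense": [("+paste", "ay", "#")],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (upper, lower) string pair licensed by the lexicons."""
    for up, low, continuation in LEXICONS[lexicon]:
        if continuation == "#":
            yield upper + up, lower + low
        else:
            yield from expand(continuation, upper + up, lower + low)
```

Calling `list(expand())` enumerates the bare stem and the suffixed form, each paired with its lexical-side string. A real lexc grammar is compiled into a transducer rather than enumerated like this.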
Xerox Finite State Technology Interface
The xfst part of this framework is mainly concerned with realization, i.e. surface forms and
phonological alternation rules. This component takes as input the output of the lexc transducer (lexical
grammar), which has stems with grammatical features labeled with tags, and passes it
through additional rules to obtain the acceptable surface forms. The xfst component compiles
the lexc grammar and the rules into FSTs, using the lexc files and rule files
respectively. Figure 2-2 illustrates the components of a morphological
analyzer built with finite state transducers, where the .o. operator represents the composition operation.
Figure 2-2: Creation of a lexical transducer
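The composition step can be illustrated on finite relations represented as sets of string pairs. The Python sketch below makes several simplifying assumptions: the `compose` helper, the two lexicon entries and the single boundary-deleting rule are invented for illustration, and the rule relation is restricted to the strings the toy lexicon actually produces, whereas a real rule transducer covers infinitely many strings.

```python
def compose(first, second):
    """Compose two finite relations given as sets of (upper, lower) pairs.

    A pair (a, c) is in the result when some b exists with (a, b) in first
    and (b, c) in second, which is what the .o. operator computes.
    """
    return {(a, c) for a, b in first for b2, c in second if b == b2}


# Hypothetical lexicon output: lexical strings paired with intermediate
# forms that still contain the ^ morpheme boundary.
lexicon = {("tag+V+paste", "tag^ay"), ("waddo+N+pl", "waddo^oyin")}

# Hypothetical rule relation that only deletes the boundary symbol.
rules = {(middle, middle.replace("^", "")) for _, middle in lexicon}

# The composed relation maps lexical strings directly to surface forms.
lexical_transducer = compose(lexicon, rules)
```

After composition the intermediate forms are no longer visible, so the result relates only the lexical and surface levels.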
2.8 Summary
In this chapter, we introduced background information on Af-Somali, its morphology and its
most important parts of speech. We also described finite state technology, which has been successfully
applied to computational morphology. A regular expression can be compiled into a finite state
network that encodes the regular language the expression denotes. Complex finite state
networks can be built from smaller networks using various
mathematical operations such as union, concatenation, composition, complementation, subtraction
and intersection.
Chapter 3 : Related work
3.1 Introduction
In this chapter, we present systems developed for computational morphological analysis of
different languages of the world, and we look at the approaches used to
develop these morphological analyzers. Specifically, we look in detail at the rule-based approach
of finite state technology as used for the morphological analyzers of Ethiopian and
Cushitic languages, which are related to Af-Somali.
Creating an automatic morphological analyzer/generator is just one step in starting natural
language processing for any language; but especially for minority, emerging or generally
lesser-studied languages, it is often a practical and extremely valuable first step, making use of corpora,
lexicons, morphological grammars and phonological rules already produced by linguists and
descriptive linguists [6].
3.2 Morphological Analyzer for European Languages
Cagri [17] developed TRmorph, a two-level morphological analyzer for Turkish. The system is
implemented entirely with the freely available Stuttgart finite state transducer tools (SFST). As
Cagri [17] presented, SFST is a freely available finite state tool set aimed particularly at
implementing morphological analyzers. The tool uses a simple specification language mainly
based on regular expressions, with the addition of the well-known two-level operators that are
particularly useful in implementing phonological (or orthographic) alternations. TRmorph
was analyzed and evaluated with real-world data during its development, and the system has been
tested on two relatively large corpora, the METU corpus and Turkish Wikipedia.
Cagri [17] reported that the same process was repeated for the successfully analyzed words, where there
were no errors but some ambiguous analyses.
Elaine [18] also developed a morphological analyzer for the Irish language. The system was developed
using a finite-state two-level description with the Xerox Finite-State Tools. The system encodes the
inflectional morphology of all inflected parts of speech in modern Irish; the morphotactics of
stems and affixes are encoded in the lexicon, and word mutations are implemented as a series of
replace rules encoded as regular expressions. A major advantage that Elaine [18] obtained from the
finite-state two-level implementation of morphology is its inherent bi-directionality; the same system
is used for both analysis and generation of word forms in the language. The system, designed for
broad coverage of the language, is evaluated against the most frequently used words in a corpus
of contemporary Irish texts. Finally, Elaine [18] suggests including derivational
morphology and dialectal or historical word forms, which the system did not implement.
In general, morphological analyzer systems can be used as a component
in many NLP applications such as spelling checkers/correctors, stemmers, and text-to-speech
synthesizers [18].
In addition to this, Xuri [30] developed an English morphological analyzer using a machine learning
approach. The system consists of two closely related components: morphological rule learning
and morphological analysis. As Xuri [30] presented, unsupervised learning was employed to
obtain a set of affix transformational rules, and the experiment presented shows that the analyzer
has a satisfactory performance.
However, as stated in [30], problems remain, and the most difficult is combinatory ambiguity. This
shows that a larger context, such as part of speech or context between words, is needed for a correct
analysis of these words. Machine learning approaches therefore mostly require a large
word list in a training corpus, and they give an analysis which does not exactly follow the linguistic rules
of the language.
3.3 Morphological Analyzer for Asian Languages
Gulshat and Ilyas [19] developed a rule-based morphological analyzer and a morphological
disambiguator for the Kazakh language. Their work gives the implementation details of a rule-based
morphological analyzer for Kazakh, which is an agglutinative language. In the
implementation of the morphological analyzer, the alternation and morphotactic rules
are represented by two-level morphology rules, and the Foma finite state compiler is employed. As
Gulshat and Ilyas [19] presented, the morphotactic rules and possible morphemes are defined
in the lexicon file, the alternation rules are defined separately, and the rules are composed with
the lexicon file in a Foma file. The system was tested and evaluated, and it represents early work
on the development of a morphological analyzer for Kazakh. The system works in both
directions, between the lexical and surface levels. Because of the ambiguities in the language there is no
one-to-one mapping between the surface and lexical forms of words, so the system can produce more than
one result.
Kenneth [20] also developed a morphological analysis and generation system for Arabic. The
system uses the Xerox finite state transducer toolkit for its implementation. Kenneth [20] described
that the lexicons and morphotactic information are encoded in the lexc language, which is a kind
of right recursive phrase-structure grammar, and are compiled into finite-state transducers.
Alternation rules that perform deletion, epenthesis, assimilation and metathesis are written in the
twolc language and/or in a notation known as REPLACE rules. The system, which contains about
4930 roots, was tested and evaluated with encouraging performance. Having a morphological
analyzer is thus a step forward in language technology for any language.
3.4 Morphological Analyzer for Ethiopian Languages
Michael [22] developed a morphological analyzer called HornMorpho for three Ethiopian languages: Amharic,
Afaan Oromo and Tigrinya. The system uses finite state transducers integrated
with the Python programming language for its implementation, with a separate finite
state transducer for each language. The system was evaluated with a web crawler
developed by Biniam Gebremicheal, and Michael Gasser [22] stated that, although more testing is called
for, this evaluation suggests excellent coverage of Amharic and Tigrinya verbs for which the roots
are known. Although Oromo, a Cushitic language, does not exhibit the root+template morphology
that is typical of Semitic languages, it is also convenient to handle its morphology using the same
technique, because there are some long-distance dependencies and because it is useful to have the
grammatical output that this approach yields for analysis. For Amharic, the system is
apparently able to analyze at least the great majority of nouns and adjectives. The system treats all
Amharic words other than verbs, nouns and adjectives as unanalyzed lexemes. However, the tool is less
suited to Afaan Oromo, because analysis of the language is complicated by the great variation in the
use of double consonants and vowels by Oromo writers [22].
The other closely related language is Afaraf, for which Ali Mohamed [2] developed the first
morphological analyzer, using a finite state transducer. As Ali described,
the analyzer was evaluated on 312 manually annotated tokens: 200 verbal (100 consonant-initial and 100
vowel-initial), 80 nominal and 32 adjectival words, taken from three popular Afar magazines published in
Ethiopia and Djibouti. Of these, 192 verbal, 75 nominal and 28 adjectival words were correctly analyzed, and
the results were evaluated by a human reader familiar with the language. An output was
considered correct only if it found all legal combinations of roots and grammatical structure for a
given word form and included no incorrect roots or structures [2].
3.5 Summary
Limited research has been conducted on developing morphological analyzers for Cushitic
languages such as Afaan Oromo [22] and Afaraf [2], and the analyzers for both languages used a rule-based
approach with finite state transducers. To the best of our knowledge, however, no research has been
conducted so far in the area of automatic morphological analysis for Af-Somali. The absence of
morphological analysis systems limits the effort of making computers work comfortably with
Somali.
Chapter 4 : Design of Af-Somali Morphological Analyzer
4.1 Introduction
This chapter presents the design of the Af-Somali morphological categories and phonological rules used to
build a computational model with the Xerox finite state toolkit. It presents the general
architecture of the lexical FSTs for Af-Somali morphological analysis and the morphotactics of the
language, that is, how the morphemes co-occur. It also shows the morphotactics for each
word class separately in the lexc formalism, and the alternation rules using the xfst interface.
The main objective in the design of the morphological analyzer is to construct a network
which accepts all and only the valid Somali words and delivers the right analysis. In this
chapter, we therefore present a detailed overview of the design of the morphological analyzer system and
its components.
4.2 General Architecture of AF-Somali Morphological Analyzer
The construction of the morphological analyzer system using finite state transducers is broken
down into two large components: the lexicon/morphotactics part and the phonological or alternation rules
part. The morphotactics of the language, which describe what stems and affixes can co-occur and in what
order, are captured in the lexicon, while phonological and morphophonological alternations
between underlying forms and surface spoken or written forms are implemented using alternation
rules.
In order to be analyzed, a word follows the path lexicon→morphotactic rules→alternation
rules→surface. Before the result of the morphological analyzer appears at the surface, the word follows
the lexicon path to determine its actual morphemes. After leaving the lexicon,
the word is analyzed by the morphotactic and morphophonemic rules. Only after
these rules have been applied is the result of the morphological analysis of that
word delivered, as shown in Figure 4-1.
Figure 4-1: Af-Somali morphological analyzer architecture design
Other common applications of finite-state techniques include handling words whose roots or
stems are not found in the lexicon by using guessers, in which the lexical component is replaced by
a phonotactic component characterizing the possible shapes of roots or stems. A guesser is used to
recognize words that are not found in the lexicon, because it is not possible, or is too time
consuming, to collect all words.
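The idea of a guesser can be sketched with a regular expression that stands in for the phonotactic component. Everything specific here is an assumption made for illustration: the simplified syllable shape (C)V(V)(C), the consonant inventory, the single suffix, the `+Guess` tag and the function name `guess` are hypothetical and are not taken from the analyzer described in this thesis.

```python
import re

# Assumed, simplified phonotactics: a stem is one or more syllables of the
# shape (C)V(V)(C). Digraph consonants are listed before single letters.
CONSONANT = r"(?:kh|sh|dh|[bcdfghjklmnqrstwxy'])"
VOWEL = r"(?:aa|ee|ii|oo|uu|[aeiou])"
STEM = re.compile(rf"(?:{CONSONANT}?{VOWEL}{CONSONANT}?)+")


def guess(word, suffix="ay", tags="+V+paste"):
    """Guess an analysis for a word whose stem is not in the lexicon."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if STEM.fullmatch(stem):
            # Mark the stem as guessed so it is not mistaken for a lexicon entry.
            return stem + "+Guess" + tags
    return None
```

A guesser of this kind accepts any stem with a plausible shape, so its analyses are less reliable than lexicon-based ones and are usually consulted only when the lexicon fails.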
4.2.1 Lexicon/ Morph-tactics
The design of the tags has become very important in the development of morphological analyzers,
since the tags will deliver linguistic information that occurs on a word being analyzed. The
morphological analyses of Somali word forms are presented in this system in terms of the
following symbols found in Table 4.1.
Table 4.1: Tags of AF-Somali grammatical information
No.  Grammatical information  Tags
1    POS                      +N (noun), +V (verb), +Adj (adjective)
2    Number                   +Sg (singular), +pl (plural)
3    Definiteness             +def (definite), +indef (indefinite)
4    Gender                   +fem (feminine), +masc (masculine)
5    Tenses                   +pres (present tense), +paste (past tense), +pres.conti (present
                              continuous tense), +paste.conti (past continuous tense)
6    Imperative               +imp (imperative)
7    Demonstratives           +close, +far, +near
8    Possessives              1st.Sg, 2nd.Sg, 3rd.masc, 3rd.fem, 1st.pl, 2nd.pl, 3rd.pl
9    Interrogatives           +inter (interrogative)
10   Infinitive               +inf (infinitive)
After the various affixes in the morphology were identified, the order in which these affixes are
attached to the verbal, nominal and adjectival stems was determined in the lexicon database.
The lexicon component is a transducer that accepts as input only valid Somali stems/roots
followed by a legal sequence of tags, and produces as output an intermediate form
in which the tags are replaced by the morphemes that they correspond to. Within a lexicon, stems
are assigned to separate classes depending on the inflection they require. Each
stem class has an associated continuation class where morphological tags and affixes are
concatenated to the stem. Internal modifications (ablaut) to stems have also been implemented in
the lexicon. The part that accomplishes this, the lexicon transducer, is written in a formalism
called lexc. The lexc formalism is well suited to lexicon construction and to expressing
morphotactics. For example, in the analyzer to be constructed, the lexicon component FST
performs the mappings shown in Table 4.2.
Table 4.2: Mappings of root words and their morphemes
Level          Word     Verb  Imperative  Infinitive
Lexical level  Caddeyn  +V    +imp        +inf
Surface level  caddeyn  +0    Ee          Eyn
All root words and morphotactic rules were entered into the lexicon database, and all spelling rules
were entered into the rules database. Separate FSTs were created for the lexicon and the rules, and these were then
combined into one large FST by applying the FST composition operation. For each word
class we therefore created a separate lexicon and set of alternation rules, described in the following sections.
4.2.2 Alternation Rules
Having accomplished the first part of the grammar construction, we now turn to the alternation
rules component. The idea is to construct a set of ordered rule transducers that modify the
intermediate forms output by the lexicon component. At the very least we need to remove
the ^ symbol, which is used to mark morpheme boundaries, before we produce valid surface
forms. The role of the alternation rules is to modify the output of the lexicon transducer according
to phonological and morphophonological rules. In the example in Table 4.2, we have seen
that a root of the Af-Somali verb3 word class, cadd, concatenated with the imperative marker ee and the
infinitive marker eyn gives caddeyn (clarifying). However, when the infinitive marker eyn is suffixed to a
double vowel (ee), the last vowel e of the double vowel is replaced with the character y.
One way to describe the process of forming the correct verb3 word forms is to always represent the
infinitive suffix as the morpheme eyn, as we have done, and then subject these word forms to alternation
rules that eliminate the final double vowel and add only the infinitive suffix. Producing the valid
surface forms from the intermediate forms output by the lexicon transducer is, among others,
the task of the alternation rules component. Since alternation rule FSTs that are
conditioned by their environment are very difficult to construct by hand, we use the replacement
rules formalism in xfst to compile the necessary rules into FSTs. The resulting rule FSTs are combined
with the lexicon FST using the regular expression composition operator (.o.).
Somali has several phonological alternations involving reduplication, lenition, vowel harmony and
tone. In the following sections we describe the design of the alternation rules and illustrate them
with examples.
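The effect of two such ordered rules on an intermediate form can be imitated with string rewriting in Python. This is a sketch under assumptions rather than the xfst rules of the analyzer: the intermediate form `caddee^eyn`, the exact formulation of the first rule and the function name `apply_rules` are invented here to reproduce the caddeyn example.

```python
import re


def apply_rules(intermediate):
    """Apply two illustrative ordered rules to an intermediate form."""
    # Rule 1 (assumed formulation): a stem-final "ee" is dropped before the
    # infinitive suffix "eyn", so caddee^eyn becomes cadd^eyn.
    form = re.sub(r"ee(?=\^eyn)", "", intermediate)
    # Rule 2: delete the ^ morpheme boundary to obtain the surface form.
    return form.replace("^", "")
```

The order matters: the first rule needs the boundary symbol as part of its context, so the boundary can only be deleted afterwards.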
4.3 The Design of AF-Somali Part-Of-Speech Lexicon and Alternation Rules
As described in Chapter 2, a number of approaches have been used to develop morphological
analyzers for many languages, but for this thesis work we have chosen the rule-based
approach, using finite state transducer technology with the Xerox finite state toolkit. As
mentioned in the previous section, the rule-based approach requires two components:
the lexicon and the alternation rules of the language. Therefore, for the development of the Af-Somali
morphological analyzer we created separate lexicons for the morphotactics of the most
important Af-Somali parts of speech (verbs, nouns and adjectives), and the rules are captured with the
xfst tool.
4.3.1 AF-Somali Verb Lexicon Design

Verbs in Af-Somali express actions and states. They agree in person, number and gender. We
classified the verbs into five interrelated groups based on their imperative markers. Their
representation and encoding in the finite state transducer lexc formalism, written in a plain text
editor (Notepad), is shown in Figure 4-2.
As mentioned above, in the development of the Af-Somali verb lexicon we classified the verbs into
five groups known as V1, V2, V3, V4 and V5; the V1 verbs are illustrated in Figure 4-2.
The figure also shows that there is a lexicon called verbs which contains the five sub-lexicons
v1, v2, v3, v4 and v5, which in turn have a sub-lexicon called v_suffixing. A detailed
description of the lexicon is found in Appendix-B.
The v_suffixing sub-lexicon contains all the suffixes attached to the root verbs, which are
defined in separate lexicons as shown in Figure 4-2. In this lexicon, we present the
morphemes that go with the root verbs and the order in which they co-occur with the verbs.
Figure 4-2: Af-Somali verb lexicon

In addition to this, the development of a morphological analyzer requires building finite state
networks which show how the morphemes and the root word can co-occur. We therefore
present the Af-Somali verb finite state networks, which show the morphemes, the root verb
and their order, in Figure 4-3. In this process the states are described following the conventions
of the Xerox finite state tools, starting from the root verb until the word ends. In Figure 4-3, the
circles represent states, the arrows carry the tags, and a double circle indicates that the state
is a final state.
Figure 4-3: Af-Somali verbs finite state networks
In general, we describe the root/stem lexicon and its morphotactics with examples,
as shown in Table 4.3. For example, the morphotactics of the second (V2) and third (V3) groups of
Af-Somali verbs are illustrated by the finite state network in Figure 4-4, which uses the verbs
“toosi” and “caddee” to show how verbs of these groups are generated and the order in which
their morphemes co-occur.
Table 4.3: An example of Af-Somali verb morphotactics
Lexical level  Toos  +V  +imp  +Sg  +inf  +pers  +paste  The word
Surface level  toos  0   I     0    In    0      ay      toosinay
Figure 4-4: Example representation of Af-Somali second and third group verb FSN
4.3.2 Alternation Rules of AF-Somali Verbs

Af-Somali has a number of morphophonemic alternations that a morphological analyzer has to
consider. These alternations are dependent on the phonological context, where the features of
individual morphemes in the context affect the process. The alternation rules of Af-Somali are defined
and composed with the lexicon file in an xfst file. Af-Somali has several phonological
alternations involving reduplication, lenition, vowel harmony and tone.
In order to construct a finite state transducer for the alternation rules, we first defined the
Af-Somali alphabet: ‘, b, t, j, x, kh, d, r, z, sh, q, k, l, m, n, w, h, y, (‘, B, T, J, X, KH, D, R, S,
SH, DH, C, G, F, Q, K, L, M, N, W, H, Y and the five vowels a, e, i, o, u. Af-Somali also has
five long vowels, which are aa, ee, ii, oo, uu. Some vowels in certain words are dropped if a
suffix starting with a vowel is attached. A detailed description of the Af-Somali alternations is
presented in Appendix-A.
Figure 4-5: Af-Somali verbs alternation rules

For example, caddee is an imperative verb, and if we suffix it with the infinitive marker eyn, one of the
two final e vowels of the imperative is replaced with y, as shown in Figure 4-5.
Table 4.4: Realization of l as sh when suffixed with t
The root   English   Person   Past   The verb   Alternation

Maqal      Listen    t        ay     Maqashay   l->sh
Hadal      Talk      t        ay     Hadashay   l->sh
Dil        Kill      t        ay     Dishay     l->sh
Partial ablaut occurs in verbal infinitives with almost any word of the pattern CaC: when the
infinitive ending <i> is appended, <a> raises to <e>. It also occurs around person suffixes and
tense endings in <e>. For example, tag ‘go’ takes the infinitive marker i and becomes tagi ‘to go’,
but when the 3rd.Pl past tense ending een is added the verb becomes tageen ‘they went’, that is, i
is replaced with e. In Af-Somali verbs we also have to consider the replacement of l with sh: when
a verb ending in l takes the 3rd.Sg.fem marker t, the sequence is realized as sh, as represented in
Figure 4-6 and exemplified in Table 4.4.
Figure 4-6: Alternation rule representation with xfst
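The l to sh alternation of Table 4.4 can be mimicked in a few lines of Python. This is a sketch under the assumption that the rule applies to every l immediately followed by the person marker t; it is not the xfst rule shown in the figure, and the function names are hypothetical.

```python
import re


def realize_l_t(lexical):
    """Rewrite every l immediately followed by t as sh."""
    return re.sub(r"lt", "sh", lexical)


def past_with_t(root):
    """Concatenate root + person marker t + past ending ay, then apply the rule."""
    return realize_l_t(root.lower() + "t" + "ay")


for root in ("Maqal", "Hadal", "Dil"):
    print(root, "->", past_with_t(root))
```

Roots that do not end in l pass through unchanged, which is the behaviour expected of an alternation rule that is conditioned only on its context.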

The realization of the personal suffixes on verbs is somewhat complex: it depends mostly on the
declension type and on whether or not the suffix is preceded by the progressive. The realization of
these suffixes is currently handled entirely in xfst, as shown in Figure 4-7.
Figure 4-7: person morpheme realization
4.3.3 Noun Lexicon Design

Nouns are handled in a separate lexicon, Nounlex.lexc, written in the lexc formalism. Nouns follow
different paradigms depending on their morphophonological properties, and they are split into
subgroups that correspond to pluralization patterns. The Af-Somali noun lexicon in this study is
therefore classified into seven declensions based on the pluralization pattern. Nominals marked
for gender undergo gender polarity in the plural. We mark gender (+Masc, +Fem) to make
disambiguation easier; because the gender of the lemma is not predictable from a given plural
form, it has to be recorded, so we created a lexicon database that lists the gender of each noun.
Nominals also take the demonstrative markers aas, eer and an. We have therefore defined a root
lexicon, Nouns, which in turn contains seven sub-lexicons, one for each declension, and these are
suffixed with the morphemes of the Af-Somali nouns, as shown for the first declension in Figure 4-8.
Figure 4-8: Af-Somali noun lexicon
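The way a root lexicon hands over to a declension sub-lexicon can be sketched in Python as follows. This is an illustration of lexc-style continuation classes, not the lexc compiler itself; the entries are a small subset copied from Appendix-C, and the `LEXICONS` dictionary and the `expand` function are hypothetical names. The surface strings are the raw concatenations, before any alternation rule is applied.

```python
# Each lexicon maps to a list of (upper, lower, continuation) entries;
# "#" marks the end of a word, as in lexc.
LEXICONS = {
    "Root": [("", "", "Nouns")],
    "Nouns": [("aalad", "aalad", "N1"), ("bad", "bad", "N1")],
    "N1": [("+N1+Sg", "", "#"), ("+N1+Pl", "o", "#"), ("+N1+defF", "ta", "#")],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface) pair reachable from the given lexicon."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, cont in LEXICONS[lexicon]:
        yield from expand(cont, upper + up, lower + low)


pairs = dict(expand())
for analysis, surface in pairs.items():
    print(analysis, "->", surface)
```

Reading the dictionary from analysis to surface corresponds to generation, and inverting it corresponds to analysis, which is the bidirectional behaviour a lexical transducer provides.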
In addition, a separate lexicon holds the suffix tags and the order in which these suffixes
co-occur with the root nouns, as illustrated in Figure 4-9. The general co-occurrence of the root
noun with its morphemes is represented with finite state networks, which show the states through
which the transducer passes. Figure 4-9 shows only the first declension for feminine nouns, D1_f;
the detailed description of the noun lexicon is given in Appendix-C.
Figure 4-9: Af-Somali noun suffixes
In general, the morphemes attached to the root nouns are number (Sg, Pl), definiteness (def, indef),
interrogatives (inter), possessives and demonstratives, as presented in Figure 4-10, which shows
the finite state network of the Af-Somali nouns.
Figure 4-10: Af-Somali noun finite state networks

For example, the morphotactics of a feminine noun of declension 2, as found in the finite state
networks above, are described in Table 4.5.
Table 4.5: Example of noun declension 2 morphotactics
Lexical level   Mindi   D2_F   +Pl   +def   +inter   The noun
Surface level   mindi   0      yo    ha     ee       mindiyahee
4.3.4 Alternation Rules of AF-Somali Nouns

To develop a lexicon and alternation rules with the Xerox finite state toolkit, we first have to
define the characters used in the language. In the following sections we therefore define the
variables of Af-Somali and the rules used to implement the transducer. In declension 5, the final
consonant of some nouns is doubled when the noun is made plural. This process is captured by the
alternation rule components shown in Figure 4-11, and a detailed description of the alternation
rules is presented in Appendix-A.
Figure 4-11: Af-Somali noun alternation rules

The other rule that has to be considered in xfst is the deletion of <k> when it follows a back
consonant (other than <k> itself). For example, the Af-Somali noun magac shows this behaviour
when it is suffixed with the definite marker ka: magac + ka is realized as magaca.
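A hedged Python sketch of the <k> deletion is given below. The text names only magac as an example, so the set of triggering consonants used here is an assumption made for illustration, as are the function names; the xfst rule itself is described in Appendix-A.

```python
# Assumed set of "back" consonants that trigger the deletion (illustrative only).
BACK_CONSONANTS = {"kh", "c", "x", "q", "h", "'"}
DIGRAPHS = ("kh", "sh", "dh")


def final_symbol(noun):
    """Return the last alphabet symbol, treating digraphs as one symbol."""
    for digraph in DIGRAPHS:
        if noun.endswith(digraph):
            return digraph
    return noun[-1]


def attach_definite_ka(noun):
    """Attach the definite marker ka, deleting k after a back consonant."""
    if final_symbol(noun) in BACK_CONSONANTS:
        return noun + "a"
    return noun + "ka"


print(attach_definite_ka("magac"))
print(attach_definite_ka("dal"))
```

Checking the final symbol rather than the final letter keeps a stem ending in sh or dh from being mistaken for one ending in h.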
Af-Somali has two kinds of reduplication: partial and full. Reduplication is typically a strategy
for marking the plural of nouns and adjectives in some declensions, but it also appears in verbs as
a derivational process. The inflectional processes are quite productive, whereas the derivational
processes are less so. Partial reduplication occurs in the 4th declension of nouns, although a
subtype of the 4th declension nouns has full reduplication. Partial reduplication involves
epenthesis of <a>, and in nouns it is suffixing. The template is also slightly different, as the
examples in Table 4.6 show.
Table 4.6: Partial reduplication of nouns
Root noun   English           Suffixing   Number   The noun

Af          Mouth, language   af          Pl       Afaf
Qoys        Family            as          Pl       Qoysas
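The suffixing pattern of Table 4.6 can be sketched in Python as an epenthetic a followed by a copy of the final consonant. The sketch assumes a root that ends in a single-letter consonant, and the function name is hypothetical.

```python
def partial_reduplicate(noun):
    """Form the plural by adding epenthetic a plus a copy of the final consonant."""
    return noun + "a" + noun[-1]


for noun in ("af", "qoys"):
    print(noun, "->", partial_reduplicate(noun))
```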
The consonant doubling of declension 5 is illustrated with examples in Table 4.7.
Table 4.7: The alternation of declension 5 representation
Noun     Declension   Plural marker   Plural form

Sacab    Dec-5        CCo             Sacabbo
Dameer   Dec-5        CCo             Dameerro
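The CCo pattern of declension 5 amounts to doubling the final consonant and adding o. A minimal Python sketch, again assuming a single-letter final consonant and using a hypothetical function name:

```python
def plural_declension5(noun):
    """Double the final consonant of the noun and add the plural vowel o."""
    return noun + noun[-1] + "o"


for noun in ("sacab", "dameer"):
    print(noun, "->", plural_declension5(noun))
```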
4.3.5 Adjectives Lexicon Design

The Af-Somali adjective is formed by an adjectival root and the inflected forms of the reduced
paradigm of the verb yahay ‘to be’. A reduced paradigm is characterized by reduced distinctions
in subject marking.
Figure 4-12: Af-Somali adjective lexicon
Reduced present forms are identical to the root, whereas past forms display distinct inflectional
endings. Af-Somali adjectives are few in number. As shown in Figure 4-12, we defined a root
lexicon, adjectives, and a sub-lexicon, Ad_suffix, which lists the suffixes attached to the
adjectives, using the lexc formalism. Af-Somali adjectives inflect with person markers and tenses,
which must agree in number, as the example in Table 4.8 shows.
Table 4.8: Example of adjective morphotactics
Adjective   1st.Sg   1st.Pl   2nd.masc   2nd.fem   Pres          Past     The word

fiican      0        n        y          t         ahay/ihiin    0/een    fiicanahay
fiican      0        n        y          t         ahay/ihiin    0/een    fiicantahay
fiican      0        n        y          t         ahay/ihiin    0/een    fiicannahay
fiican      0        n        y          t         ahay/ihiin    0/een    fiicanyihiin
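The four word forms in Table 4.8 follow from concatenating the adjectival root, a person marker and a present form of the copula. The Python sketch below reproduces that concatenation; the person labels are taken from the column headers of the table, and the dictionary and function names are hypothetical.

```python
# Person markers as listed in the columns of Table 4.8 (0 is the empty string).
PERSON_MARKERS = {"1st.Sg": "", "1st.Pl": "n", "2nd.masc": "y", "2nd.fem": "t"}


def inflect_adjective(root, person, copula):
    """Concatenate adjectival root + person marker + copula form."""
    return root + PERSON_MARKERS[person] + copula


print(inflect_adjective("fiican", "1st.Sg", "ahay"))
print(inflect_adjective("fiican", "2nd.fem", "ahay"))
print(inflect_adjective("fiican", "1st.Pl", "ahay"))
print(inflect_adjective("fiican", "2nd.masc", "ihiin"))
```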
In addition, the morphotactics of adjectives are presented in Figure 4-13, which describes the
order in which the suffixes attach to the adjectives.
Figure 4-13: Af-Somali Adjective finite state networks
Chapter 5 : Experimentation and Evaluation
5.1 Introduction
This chapter discusses the testing and evaluation of the Af-Somali morphological analyzer. The
discussion focuses on the outputs produced and the test results obtained. Testing any sizable
natural language processing system is notoriously difficult [8], yet a morphological analyzer is
an essential and basic tool for building language processing applications, such as a machine
translation system, for a natural language.
5.2 Experimentation
We developed the morphological analyzer with the XFST tool developed by Xerox. It supports
UTF-8 character encoding, which is important for implementing the computational morphology of
Af-Somali. The tool is based on a lexicon and a set of rules for roots and morphemes. The lexicon
contains the list of root words, each followed by its category and separated from it by a tab. The
analyzer fails when a complex word is given as input and the corresponding root word does not
exist in the lexicon file. We developed the Af-Somali lexicon and the rules file required for the
analysis. The lexicon is designed to reflect the word categories of the Af-Somali language.
The lexicon contains different states for each of the root words, starting with the declaration of
the tags. For example, the verb lexicon is illustrated in Figure 4-2. Each entry consists of a root
word followed by its category and is terminated by a semicolon, as shown for the Af-Somali verbs
in Figure 5-1. The left side of the colon represents the upper side, or analysis form, of the
transducer, and the right side represents the lower side, or surface form, as presented in
Appendix-B. The hash symbol at the end of an entry indicates the end of the transition, so that
state is a final state. The analyzer takes the surface form as input and returns the grammatical
structure of the word, that is, the lexical form.
Figure 5-1: AF-Somali Verb to suffix attachment
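The entry format described above (upper side, colon, lower side, continuation class, semicolon) can be made concrete with a small Python parser. It is only a reading aid for entries such as those in Appendix-B, not part of the toolkit; the function name `parse_entry` is hypothetical, and the treatment of 0 as the empty string follows the lexc convention.

```python
def parse_entry(line):
    """Split a lexc entry such as '+V1+Sg+inf:i #;' into its three parts."""
    body = line.strip().rstrip(";").strip()
    form, continuation = body.rsplit(None, 1)
    upper, sep, lower = form.partition(":")
    if not sep:
        # No colon: the upper and lower sides are identical.
        lower = upper
    if lower == "0":
        # 0 stands for the empty string on the surface side.
        lower = ""
    return upper, lower, continuation


print(parse_entry("+V1+Sg+inf:i #;"))
print(parse_entry("toosi V2suffixing;"))
```

The first call returns the analysis tags, the surface suffix i and the final-state marker #, while the second returns a root whose two sides are identical and whose continuation class is V2suffixing.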
5.3 Discussion and Evaluation
Evaluation of a morphological analyzer can be performed by comparing its output with that of a
reliable broad-coverage morphological analyzer, or by having human experts annotate a text
manually. The former option was not possible because no such tool has been developed for
Af-Somali. The latter option is laborious and can be done only on relatively small texts.
In general, evaluating a morphological analyzer requires measuring the following: the number of
word tokens correctly accepted by the analyzer versus the number of tokens processed incorrectly;
the percentage of tokens that are correctly analyzed in context versus the percentage of tokens
that are not analyzed at all; the percentage of tokens whose analysis is linguistically wrong
regardless of context; and, finally, the number of correct analyses that the analyzer fails to
output for a token.
We therefore manually annotated 218 tokens (90 nouns, 120 verbs and 8 adjectives) taken from the
dictionary known as Qaamuus. Of these, 77 nominals, 105 verbs and 6 adjectives were correctly
analyzed. The results were evaluated by a human reader familiar with the language. An output was
considered correct only if it contained all legal combinations of roots and grammatical structure
for a given word form and included no incorrect roots or structures. Thus, the overall accuracy of
the system is 86.2%, as shown in Table 5.1.
Table 5.1: Overall accuracy of the system
Nominal   Verbs   Adjectives   Nominal correct %   Verbs correct %   Adjectives correct %   Total   Correct   Wrong   Correct in %   Wrong in %
90        120     8            85.55%              87.5%             75%                    218     188       30      86.2%          13.76%
From this we can see that the total number of tokens analyzed was 218; of these, 86.2% were
correctly analyzed and 13.76% were wrongly analyzed, and a total of 10 tokens failed to be
analyzed by the system.
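The percentages can be recomputed from the token counts reported in this section. The short Python sketch below does only that arithmetic; the variable names are illustrative.

```python
# Token counts as reported in this section.
totals = {"nominal": 90, "verbal": 120, "adjective": 8}
correct = {"nominal": 77, "verbal": 105, "adjective": 6}

for category in totals:
    share = 100 * correct[category] / totals[category]
    print(f"{category}: {share:.2f}% correct")

total = sum(totals.values())
total_correct = sum(correct.values())
accuracy = 100 * total_correct / total
print(f"tokens: {total}, correct: {total_correct}, wrong: {total - total_correct}")
print(f"overall: {accuracy:.2f}% correct, {100 - accuracy:.2f}% wrong")
```

The counts sum to 218 tokens with 188 correct and 30 wrong, which gives the overall figures listed in Table 5.1.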
Lastly, we observed that errors occurred because of the limited size of the lexicon we annotated
and because we have not incorporated a guesser component, which helps to guess words that are not
found in the lexicon. In addition, Af-Somali authors spell words in different ways, which leads to
one word being analyzed in different ways. For example, some writers write Dawlad while others
write Dowlad (government).
Chapter 6 : Conclusion and Future Work
6.1 Conclusion
Language is one of the main tools of communication, and its investigation provides better
perspectives on all other aspects related to NLP. However, the formalization and computational
analysis of Af-Somali morphology had not been worked out; in other words, there was a lack of
tools for analyzing Af-Somali morphology from a computational point of view. Moreover, grammar
resources vary from scholar to scholar. For example, some resources treat adjectives as verbs,
whereas others describe adjectives as a separate word class. Building a correctly working
morphological analysis system that combines all this information is therefore valuable for further
research on the language. In this thesis, a detailed analysis of Af-Somali has been performed, and
the rules covering the morphotactics of Af-Somali have been formalized. By combining this
information, a morphological analyzer has been constructed. This thesis reports on an attempt to
develop an Af-Somali morphological analysis system using the finite state two-level approach. The
report started with a brief introduction to the concepts and principles used in the study. The
introduction also described morphological analysis and the unique features of Af-Somali words
along with their peculiar morphemic components.
The different subcategories of rule-based approaches were described briefly. In this study, the
finite state two-level approach was adopted. The finite state transducer is the main tool for the
development of the morphological analyzer, and the implementation is based on [8]. Two-level
morphology proved to be well suited to Af-Somali morphology. A major advantage of finite state
two-level implementations of morphology is their inherent bidirectionality: the same system is
used for both analysis and generation of word forms in the language. An additional advantage is
the high efficiency of finite state networks, which allows even large words to be processed within
a few seconds. We presented the design and implementation of the analyzed categories as a finite
state transducer using the Xerox Finite State Toolkit in Chapter 4. First, all forms of verbs,
nouns and adjectives were implemented in separate lexc files. The rules identified were
implemented in the corresponding xfst files. The finite state transducer of each category was
composed with the finite state transducer of the rules for that category. Finally, all the finite
state transducers were combined into a single lexical finite state transducer, which can be used
as a morphological analyzer and generator.
However, the study was carried out under a number of constraints. The main challenge was to work
out the linguistic details, especially the exact morphotactic details needed for analysis and
generation. The lack of linguistic lexical resources, in particular a word list for Af-Somali in
electronic form, made the work demanding. It was also difficult to identify the morphological
rules used in the system.
6.2 Future Work
The morphological analyzer/generator can be useful for linguists who wish to understand
the morphological processes of Af-Somali, as well as for language learners, to aid their
language comprehension and their practice of word conjugation or declension. The main weakness
of the system results from the limited number of roots and stems in the lexicon and from the
absence of a guesser. It can thus be improved by increasing the number of stems and phonological
alternation rules and by adding a guesser component.
As this work deals only with inflectional morphology and the northern Somali dialect, there is a
need to extend the system to include derivational and compounding morphology as well as the
Benaadir and Maay dialects of Af-Somali.
Finally, it is worth noting that once SoMorph describes Af-Somali morphology completely, it will
be a useful tool for large-scale NLP applications such as machine translation and POS checkers.
References
[1] Annarita Puglielli and Cabdalla Cumar Mansuur, “QAAMUUSKA AF‒SOOMAALIGA”,
Roma TrE-Press, 2012.
[2] Ali Mohamed “Development of morphological analyzer for afaraf”, M.sc Thesis, Debra
Birhan University, 2014.
[3] Andrzejewski, B. W. the Declensions of Somali Nouns, London: School of Oriental and
African Studies, 1964
[4] Banti, G. ‘Two Cushitic Systems: Somali and Oromo Nouns’, in H., 1988
[5] BATI, T. B., AUTOMATIC MORPHOLOGICAL ANALYZER FOR AMHARIC, 2002.
[6] Beesley K. R., Morphological Analysis and Generation:A First-Step in Natural Language
Processing, 2004, p. 1
[7] Elaine Uí Dhonnchadha, A Two-level Morphological Analyser and Generator for Irish
using Finite-State Transducers, Institiúid Teangeolaíochta Éireann 31 Plás Mhic Liam,
Baile Átha Cliath 2, Éire, and Dublin City University Glasnevin, Dublin 11 , Ireland
[8] Fissaha and Haller, “First larger-scale morphological analyzer for Amharic verbs used
XFST”, the Xerox Finite State Tools, 2003.
[9] Jackson Muhirwe, Computational Analysis of Kinyarwanda Morphology: The
Morphological Alternations. Advances in Systems Modelling and ICT Applications.
[10] Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall, 2000.
[11] Karttunen, Lauri, Kaplan & Zaenen, Two-level morphology with composition, 1992.
[12] Kenneth Beesley and Lauri Karttunen, Finite State Morphology, CSLI Studies in
Computational Linguistics, 2003.
[13] Kenneth Beesley , Finite-State Morphological Analysis and Generation of Arabic at Xerox
Research, Status and Plans in 2001, Xerox Research Centre Europe
[14] Kenneth Beesley, Finite state morphology / Kenneth Beesley and Lauri Karttunen, p. cm.
- (Studies in computational linguistics; 3), 1954.
[15] Koskenniemi, K., Two-level morphology: a general computational model for word-form
recognition and production. Ph.D. thesis, University of Helsinki, 1983.
[16] Lauri Karttunen, Constructing lexical transducers. In the proceedings of the fifteenth
international conference on computational linguistics, 1994.
[17] Çagrı Çöltekin, A Freely Available Morphological Analyzer for Turkish, Center for Language and
Cognition (CLCG) University of Groningen
[18] Elaine Uí Dhonnchadha, A Two-level Morphological Analyser and Generator for Irish using
Finite-State Transducers, institute of technology of Éireann 31 Plás Mhic Liam, Baile Átha
Cliath 2, Éire, and Dublin City University Glasnevin, Dublin 11, Ireland
[19] Gulshat Kessikbayeva and Ilyas Cicekli, A Rule Based Morphological Analyzer and A
Morphological Disambiguator for Kazakh Language, Linguistics and Literature Studies,
2016
[20] Kenneth R. Beesley, Finite-State Morphological Analysis and Generation of Arabic, Xerox
Research Centre Europe 6, chemin de Maupertuis 38240 MEYLAN, France, 2001
[21] Mesfin Abate, Yaregal Assabie (2014).”Development of Amharic morphological analyzer
using memory based approach”, 9th International Conference on NLP, PolTAL, Warsaw,
Poland, September 17-19, 2014. Proceedings.
[22] Michael Gasser (2009). “HornMorpho1.0: a system for morphological processing of
Amharic, Oromo, and Tigrinya”.
[23] Khumbar Debbarma, Braja Gopal Patra, Dipankar Das, Sivaji Bandyopadhyay,
Morphological Analyzer for Kokborok
[24] Koray Ak, Olcay Taner Yıldız, 2011. Unsupervised Morphological Analysis Using Tries,
Dept. of Computer Science and Engineering. Isık University
[25] Nicola Lampitelli, Evaluative morphology in Somali, Université Paris Diderot-Paris
[26] Nimaan Abdillahi, Building and Evaluating Af-Somali Corpora, Proceedings of the 2014
Workshop on the Use of Computational Methods in the Study of Endangered Languages,
pages 73–76
[27] R.Akilan* and Prof. E.R.Naganathan , Morphological Analyzer for Classical Tamil Texts:
A Rulebased approach, Research Scholar, (Department of Computer Science, Bharathiar
University, Coimbatore) Programmer, Central Institute of Classical Tamil, Chennai.
[28] Shuly Wintner and Gelbukh: Finite-State Technology as a Programming Environment,
CICLing 2007, LNCS 4394, pp. 97–106, 2007.
[29] Saba Amsalu, Girma A. Demeke. (2006). Non-concatinative Finite State Morphotactics of
Amharic Simple Verbs.
[30] Xuri TANG , English Morphological Analysis with Machine-learned Rules, Dept. Foreign
Languages, Wuhan University of Science and Engineering, 430073， Wuhan, P. R. China
[31] Nicola Lampitelli, The morphophonology of Somali nouns, June, 15-18 2011
[32] Kazakov Dimater & Manandhar Suresh (2000) Unsupervised Learning for Word
Segmentation Rules with Genetic Algorithms and Inductive Logic Programming.
[33] John I. Saeed, “Somali Reference Grammar”, the University of Virginia, 26 Sep 2007
[34] Yitayal Abate, 2013.” Morphological analyzer for Ge’ez verbs using machine learning
approach”, in the thesis of Addis Ababa University.
[35] Shlomo Yona , A finite-state based morphological analyzer for Hebrew, thesis in
Department of Computer Science, November, 2004.
1.9 Appendix-A: Alternation Rules for Noun and Verb
1.10 Appendix-B: Af-Somali verb Lexicon
!!Somorph-lex.txt
LEXICON Root
Verb;

LEXICON Verb
aammus V1suffixing;
abab V1suffixing;
aadaan V1suffixing;
aammin V1suffixing;
daab V1suffixing;
daabul V1suffixing;
edeg V1suffixing;
faalal V1suffixing;
faan V1suffixing;
gaad V1suffixing;
gaadaan V1suffixing;
gaaddabbuur V1suffixing;
gaadh V1suffixing;
haajir V1suffixing;
habaar V1suffixing;
kaah V1suffixing;
kaamil V1suffixing;
naafow V1suffixing;
naanays V1suffixing;
naaqus V1suffixing;
qaad V1suffixing;
raac V1suffixing;
raadgoob V1suffixing;
raadgur V1suffixing;
saacid V1suffixing;
saaf V1suffixing;

!!aaddi V2suffixing;
!!aammusi V2suffixing;
!!baafi V2suffixing;
!!baahi V2suffixing;
!!caafi V2suffixing;
!!faafi V2suffixing;
!!faahi V2suffixing;
!!gaabi V2suffixing;
!!gaadhsii V2suffixing;
!!haadi V2suffixing;
toosi V2suffixing;
maadi V2suffixing;
maahi V2suffixing;
maalgeli V2suffixing;
qaawi V2suffixing;
rafaadi V2suffixing;
rafaaji V2suffixing;
ragaadi V2suffixing;
saaci V2suffixing;
taabsii V2suffixing;
uburi V2suffixing;
ubxi V2suffixing;
xaadi V2suffixing;
xaadiri V2suffixing;

caddee V3suffixing;
dhabee V3suffixing;
aabee V3suffixing;
aafee V3suffixing;
aaladee V3suffixing;
baabee V3suffixing;
baadiyee V3suffixing;
baahee V3suffixing;
caalsaaree V3suffixing;
caanee V3suffixing;
hallee V3suffixing;
hambalyee V3suffixing;
hambee V3suffixing;
qaaligaree V3suffixing;
qaaliyee V3suffixing;
qaamee V3suffixing;
raacdee V3suffixing;
saamee V3suffixing;
saandambee V3suffixing;
saawee V3suffixing;
taakee V3suffixing;
taallee V3suffixing;
tabaabulee V3suffixing;
waayee V3suffixing;
yaree V3suffixing;

abyood V4suffixing;
adaadumo V4suffixing;
baaho V4suffixing;
caashaqo V4suffixing;
lifaaqo V4suffixing;
liidaanyoo V4suffixing;
liido V4suffixing;
qaysho V4suffixing;
qiiroo V4suffixing;
qodo V4suffixing;
rigoo V4suffixing;
riiqo V4suffixing;
riiqo V4suffixing;
saloolo V4suffixing;
tacdaaro V4suffixing;
tafaxaydo V4suffixing;
tafwareemo V4suffixing;
unko V4suffixing;
urugoo V4suffixing;
waabo V4suffixing;
xeroo V4suffixing;
xeydo V4suffixing;
yeelo V4suffixing;
yeelo V4suffixing;

aammiinso V5suffixing;
aamuso V5suffixing;
badso V5suffixing;
bahayso V5suffixing;
caddayso V5suffixing;
cadgooso V5suffixing;
galdhacso V5suffixing;
hakaabso V5suffixing;
halabayso V5suffixing;
ilaaleyso V5suffixing;
janjeerso V5suffixing;
naso V5suffixing;
qaawiso V5suffixing;
qalayso V5suffixing;
raacdayso V5suffixing;
samayso V5suffixing;
tallaabso V5suffixing;
tallaabso V5suffixing;
taraarayso V5suffixing;
ubaxayso V5suffixing;
udgoonso V5suffixing;
waabariiso V5suffixing;
xabeebso V5suffixing;
xabkayso V5suffixing;

LEXICON V1suffixing
+V1+Sg+1P:0 #;
+V1+Pl:a #;
+V1+Sg+inf:i #;
+V1+2P:s #;
+V1+Sg+3Pfem:t #;
+V1+1PPl:n #;
+V1+pres:aa #;
+V1+1P+pres:naa #;
+V1+paste:ay #;
+V1+2P+paste:tay #;
+V1+1PPl+paste:nay #;
+V1+3Pfem+paste+1PPl:teen #;
+V1+paste+1PPl:een #;
+V1+1Ppres.conti:ayaa #;
+V1+2Ppres.conti:aysaa #;
+V1+1PPlpres.conti:aynaa #;
+V1+2PPlpres.conti:aysaan #;
+V1+3PPl+pres.conti:ayaan #;

LEXICON V2suffixing
+V2:i #;
+V2+Sg:0 #;
+V2+Pl:ya #;
+V2+Sg+inf:in #;
+V2+1PSg:y #;
+V2+3PSgmasc:y #;
+V2+3PPl:y #;
+V2+3PSgfem:s #;
+V2+2PSg:s #;
+V2+2PPl:s #;
+V2+1PPl:n #;
+V2+pres:aa #;
+V2+2P+pres:saan #;
+V2+3PPl+paste:yay #;
+V2+Sg+inf+paste:nay #;
+V2+3PPl+paste:yaan #;
+V2+paste:ay #;
+V2+2PSg+paste:seen #;
+V2+3PPl+paste:yeen #;

LEXICON V3suffixing
+V3:ee #;
+V3+Sg:0 #;
+V3+Pl:ya #;
+V3+Sg+inf:yn #;
+V3+Sg+3PSgmasc:y #;
+V3+3PSgfem:s #;
+V3+1PPl:n #;
+V3+pres:aa #;
+V3+3PSgfem+pres:saan #;
+V3+Sg+3PSgmasc+paste:yaan #;
+V3+paste:ay #;
+V3+3PSgfem+paste:seen #;
+V3+Sg+3PSgmasc+paste:yeen #;

LEXICON V4suffixing
+V4:o #;
+V4+Sg:0 #;
+V4+Pl:da #;
+V4+Sg+inf:an #;
+V4+Sg+1PSg:0 #;
+V4+3PSgfem:t #;
+V4+1PPl:n #;
+V4+Sg+pres:aa #;
+V4+3PSgfem+pres:taan #;
+V4+Sg+1PSg+paste:aan #;
+V4+Sg+paste:ay #;
+V4+3PSgfem+paste:teen #;
+V4+Sg+1PSg+paste:een #;

LEXICON V5suffixing
+V5+Sg:0 #;
+V5+Sg+3Pmasc:0 #;
+V5+Pl:da #;
+V5+Sg+inf:an #;
+V5+3PSgfem:t #;
+V5+Sg+1PPl:n #;
+V5+Sg+pres:aa #;
+V5+3PSgfem+pres:taan #;
+V5+Sg+3Pmasc+pres:aan #;
+V5+Sg+paste:ay #;
+V5+3PSgfem+paste:teen #;
+V5+Sg+3Pmasc+paste:een #;
1.11 Appendix-C: Af-Somali Noun lexicon
!!Somorph-lex.txt
LEXICON Root
Nouns;

LEXICON Nouns
aalad N1;
abaar N1;
bad N1;
beer N1;
hees N1;
kab N1;
kal N1;
naag N1;
qayb N1;
saacad N1;
sannad N1;
shimbir N1;
suurad N1;
toobad N1;

aroos N2MYo;
asaas N2MYo;
dalool N2MYo;
dheri N2MYo;
erey N2MYo;
magac N2MYo;
qori N2MYo;
qurub N2MYo;
ubax N2MYo;
unug N2MYo;
xijaab N2MYo;

qor N2FYo;
quraac N2FYo;
sabti N2FYo;
subax N2FYo;
mindi N2FYo;

gabadh N3F2V;
gacan N3F2V;
galab N3F2V;
kibis N3F2V;
xubin N3F2V;

garab N3M2V;
hilib N3M2V;
ilig N3M2V;
jilib N3M2V;
xadhig N3M2V;

baal N4FaC;
seef N4FaC;
weel N4FaC;
wiil N4FaC;

af N4MaC;
baaf N4MaC;
ceel N4MaC;
dal N4MaC;
fal N4MaC;
miis N4MaC;
qoys N4MaC;
riig N4MaC;
shil N4MaC;
weel N4MaC;

aabbur N5MCC;
albaab N5MCC;
alool N5MCC;
baabuur N5MCC;
dagaal N5MCC;
dameer N5MCC;
hoteel N5MCC;
ijaar N5MCC;
sacab N5MCC;
shaqal N5MCC;
wadaad N5MCC;
yaraan N5MCC;

daymo N6Moyin;
dhismo N6Moyin;
barkimo N6Moyin;

abeeso N6Foyin;
daawo N6Foyin;
darajo N6Foyin;
hooyo N6Foyin;
magalo N6Foyin;
taallo N6Foyin;
ujeeddo N6Foyin;
waddo N6Foyin;

aabbe N7Myaal;
beenaale N7Myaal;
biyoole N7Myaal;
caanoole N7Myaal;
fure N7Myaal;
gacaliye N7Myaal;
jaalle N7Myaal;
walaale N7Myaal;
waraabe N7Myaal;
yeele N7Myaal;

LEXICON N1
+N1+Sg:0 #;
+N1+Pl:o #;
+N1+defM:ka #;
+N1+defF:ta #;
+N1+defF+inter:tee #;
+N1+defM+inter:kee #;
+N1+defF+1PSg:tayda #;
+N1+defF+2PSg:taada #;
+N1+defF+3Pmasc:tiisa #;
+N1+defF+3Pfem:teeda #;
+N1+defF+1PPl:taayada #;
+N1+defF+close:tan #;
+N1+defF+near:tas #;
+N1+defF+far:teer #;

LEXICON N2MYo
+N2M+Sg:0 #;
+N2M+Pl:yo #;
+N2M+defM:ka #;
+N2M+defM+inter:kee #;
+N2M+defM+1PSg:kayga #;
+N2M+defM+2PSg:kaaga #;
+N2M+defM+3Pmasc:kiisa #;
+N2M+defM+3Pfem:keeda #;
+N2M+defM+1PPl:kaayaga #;
+N2M+defM+close:kan #;
+N2M+defM+near:kas #;
+N2M+defM+far:keer #;

LEXICON N2FYo
+N2F+Sg:0 #;
+N2F+Pl:yo #;
+N2F+Pl:O #;
+N2F+defF:ta #;
+N2F+defF:ha #;
+N2F+defF+inter:yahee #;
+N2F+defF+inter:tee #;
+N2F+defF+1stSg:tayda #;
+N2F+defF+2ndSg:taada #;
+N2F+defF+3rdmasc:tiisa #;
+N2F+defF+3rdfem:teeda #;
+N2F+defF+1stPl:taayada #;
+N2F+defF+close:tan #;
+N2F+defF+near:tas #;
+N2F+defF+far:teer #;

LEXICON N3F2V
+N3F+Sg:0 #;
+N3F+Pl:0 #;
+N3F+defF:ta #;
+N3F+defF+inter:tee #;
+N3F+defF+1PSg:tayda #;
+N3F+defF+2PSg:taada #;
+N3F+defF+3Pmasc:tiisa #;
+N3F+defF+3Pfem:teeda #;
+N3F+defF+1PPl:taayada #;
+N3F+defF+close:tan #;
+N3F+defF+near:tas #;
+N3F+defF+far:teer #;

LEXICON N3M2V
+N3M+Sg:0 #;
+N3M+Pl:0 #;
+N3M+defM:ka #;
+N3M+defM+inter:kee #;
+N3M+defM+1PSg:kayga #;
+N3M+defM+2PSg:kaaga #;
+N3M+defM+3Pmasc:kiisa #;
+N3M+defM+3Pfem:keeda #;
+N3M+defM+1PPl:kaayaga #;
+N3M+defM+close:kan #;
+N3M+defM+near:kas #;
+N3M+defM+far:keer #;

LEXICON N4FaC
+N4F+Sg:0 #;
+N4F+Pl:aC #;
+N4F+defF:ta #;
+N4F+defF+inter:tee #;
+N4F+defF+1PSg:tayda #;
+N4F+defF+2PSg:taada #;
+N4F+defF+3Pmasc:tiisa #;
+N4F+defF+3Pfem:teeda #;
+N4F+defF+1PPl:taayada #;
+N4F+defF+close:tan #;
+N4F+defF+near:tas #;
+N4F+defF+far:teer #;

LEXICON N4MaC
+N4M+Sg:0 #;
+N4M+Pl:aC #;
+N4M+defM:ka #;
+N4M+defM+inter:kee #;
+N4M+defM+1PSg:kayga #;
+N4M+defM+2PSg:kaaga #;
+N4M+defM+3Pmasc:kiisa #;
+N4M+defM+3Pfem:keeda #;
+N4M+defM+1PPl:kaayaga #;
+N4M+defM+close:kan #;
+N4M+defM+near:kas #;
+N4M+defM+far:keer #;

LEXICON N5MCC
+N5M+Sg:0 #;
+N5M+Pl:CC #;
+N5M+defM:ka #;
+N5M+defM+inter:kee #;
+N5M+defM+1PSg:kayga #;
+N5M+defM+2PSg:kaaga #;
+N5M+defM+3Pmasc:kiisa #;
+N5M+defM+3Pfem:keeda #;
+N5M+defM+1PPl:kaayaga #;
+N5M+defM+close:kan #;
+N5M+defM+near:kas #;
+N5M+defM+far:keer #;

LEXICON N6Moyin
+N6M+Sg:0 #;
+N6M+Pl:oyin #;
+N6M+defM:ka #;
+N6M+defM+inter:kee #;
+N6M+defM+1PSg:kayga #;
+N6M+defM+2PSg:kaaga #;
+N6M+defM+3Pmasc:kiisa #;
+N6M+defM+3Pfem:keeda #;
+N6M+defM+1PPl:kaayaga #;
+N6M+defM+close:kan #;
+N6M+defM+near:kas #;
+N6M+defM+far:keer #;

LEXICON N6Foyin
+N6F+Sg:0 #;
+N6F+Pl:oyin #;
+N6F+defF:ta #;
+N6F+defF+inter:tee #;
+N6F+defF+1PSg:tayda #;
+N6F+defF+2PSg:taada #;
+N6F+defF+3Pmasc:tiisa #;
+N6F+defF+3Pfem:teeda #;
+N6F+defF+1PPl:taayada #;
+N6F+defF+close:tan #;
+N6F+defF+near:tas #;
+N6F+defF+far:teer #;

LEXICON N7Myaal
+N7M+Sg:0 #;
+N7M+Pl:yaal #;
+N7M+defM:ka #;
+N7M+defM+inter:kee #;
+N7M+defM+1PSg:kayga #;
+N7M+defM+2PSg:kaaga #;
+N7M+defM+3Pmasc:kiisa #;
+N7M+defM+3Pfem:keeda #;
+N7M+defM+1PPl:kaayaga #;
+N7M+defM+close:kan #;
+N7M+defM+near:kas #;
+N7M+defM+far:keer #;
Submitted by:
Mahdi Yonis _____________________ May 30, 2017
Student Signature Date
Approved by:
1. Yaregal Assabie ______________________ May 30, 2017
Advisor Signature Date
2. ______________________ ______________________ ____________________
Chairman, Dept’s Signature Date
Graduate Committee
3. _______________________ ______________________ ___________________
Chairman, Faculty’s Signature Date
Graduate Commission
4. _______________________ ______________________ ___________________
Dean, Graduate School Signature Date
