
DSpace Institution

DSpace Repository http://dspace.org


Computer Science thesis

2020-05-21

AUTOMATIC SPELLING CHECKER FOR AMHARIC LANGUAGE

TILAHUN, MELAKU

http://hdl.handle.net/123456789/10846
Downloaded from DSpace Repository, DSpace Institution's institutional repository
BAHIR DAR UNIVERSITY

BAHIR DAR INSTITUTE OF TECHNOLOGY

SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES

FACULTY OF COMPUTING

AUTOMATIC SPELLING CHECKER FOR AMHARIC LANGUAGE

MELAKU TILAHUN ASRESS

BAHIR DAR, ETHIOPIA

OCTOBER 16, 2017


AUTOMATIC SPELLING CHECKER FOR AMHARIC LANGUAGE

MELAKU TILAHUN ASRESS

A thesis submitted to the School of Research and Graduate Studies of Bahir Dar Institute
of Technology, BDU, in partial fulfillment of the requirements for the degree of
Master in Computer Science in the Faculty of Computing

Advisor Name: Dr. Tesfa Tegegne

Bahir Dar, Ethiopia

October 2017

DECLARATION

I, the undersigned, declare that the thesis comprises my own work. In compliance with
internationally accepted practices, I have acknowledged and referenced all materials
used in this work. I understand that non-adherence to the principles of academic
honesty and integrity, misrepresentation/ fabrication of any idea/data/fact/source will
constitute sufficient ground for disciplinary action by the University and can also
evoke penal action from the sources which have not been properly cited or
acknowledged.

Name of the student______________________________ Signature _____________

Date of submission: ________________

Place: Bahir Dar

This thesis has been submitted for examination with my approval as a university
advisor.

Advisor Name: __________________________________

Advisor’s Signature: ______________________________

© 2017

MELAKU TILAHUN ASRESS

ALL RIGHTS RESERVED

Bahir Dar University

Bahir Dar Institute of Technology

School of Research and Graduate Studies

Faculty of computing

THESIS APPROVAL SHEET

Student:
Melaku Tilahun Asress __________________________ ____________________
Name Signature Date

The following graduate faculty members certify that this student has successfully presented the
necessary written final thesis and oral presentation for partial fulfillment of the thesis
requirements for the Degree of Master of Science in computer science.
Approved By:
Advisor:
Dr. Tesfa Tegegne _____________________ ____________________
Name Signature Date
External Examiner:

Dr. Adane Letta _______________________ ____________________
Name Signature Date
Internal Examiner:
________________ _____________________ ____________________
Name Signature Date
Chair Holder:
___________________ _______________________ ____________________
Name Signature Date
Faculty Dean:
___________________ _______________________ ____________________
Name Signature Date

To my mother, father and my wife

ACKNOWLEDGEMENTS

I would like to express my gratitude to my advisor, Dr. Tesfa Tegegne, for his excellent
advice and support throughout the completion of this thesis. I am also grateful to
Mr. Mekonnen Fentaw for his willingness to point out open areas and to advise on how my
research work should proceed, and to Mr. Belisty Yalew, Mr. Fentahun Mekuriaw,
Mr. Bawoke Wondem and Mr. Elias Wondemagegn for their professional guidance and
assistance. I thank all my colleagues at the Department of Computer Science for their
cooperation. Finally, I thank my wife, Yirgalem Tadesse, for her support and for giving me
the time to accomplish this research work.

ABSTRACT

In government and non-government organizations, document preparation is one of the
day-to-day tasks. Spelling errors can occur when people use text processing applications
to produce electronic documents. Some work exists on Amharic spell checking, but it does
not cover internal inflection of words or repeated words, which is unsatisfactory for a
language with complex morphology. For this reason, it is common to find Amharic books
and newsletters published with misspelled words. In this study, an attempt has been made
to design and implement a spell checker for the Amharic language that works on the
inflection of Amharic words (including internal inflection and inflection by reduplication).
The design has five components: an input component, a normalization component, an error
detection component, a morphological analyzer component, and a spelling error correction
and suggestion component.

The system has been evaluated with four sets of data. The first and second sets were taken
from the 2009 annual report of the Amhara National Regional State Science, Technology
and Information Communication Commission. The third set was taken from the Afar Region
ICT 2009 annual report, and the fourth from the Harari Region ICT 2009 annual report. The
performance of the system is evaluated using precision and recall.

The system was evaluated in five experiments and achieved an overall performance of
97.27%. As recommendations for future work, we suggest detecting and correcting
real-word errors; comparing the performance of the spelling error detection and correction
algorithm used here, edit distance, with other spelling error correction techniques; and
integrating this work with other Amharic NLP work such as automatic spelling error
correction and suggestion.

TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
CHAPTER ONE
1. INTRODUCTION
1.1. Background
1.2. Motivation
1.3. Statement of the Problem
1.4. Objective of the Study
1.4.1. General Objective
1.4.2. Specific Objectives
1.5. Scope and limitation of the Study
1.5.1. Scope of the Study
1.5.2. Limitation of the Study
1.6. Significance of the Study
1.7. Research Methodology
1.7.1. Literature Review
1.7.2. Data Collection and Preparation
1.7.3. Implementation tools
1.7.4. Performance Evaluation techniques
1.7.5. Organization of the thesis
CHAPTER TWO
LITERATURE REVIEW AND RELATED WORKS
2.1. Literature Review
2.1.1. Amharic text spell checker
2.1.2. Types of Spelling Errors
2.1.3. Core functionalities of spell checkers
2.1.4. Spelling checker tools
2.2. Related works
2.2.1. Nepali Spell Checker
2.2.2. Spelling Checker for Afaan Oromo Language
2.2.3. Spell Checker for Bangla
2.2.4. Spell Checker for Arabic language
2.2.5. Spelling Checker for Amharic Language
CHAPTER THREE
DESIGN AND DEVELOPMENT OF AMHARIC LANGUAGE SPELLING CHECKER
3.1. Amharic Language Spelling Checking
3.1.1. Amharic Language Inflection
3.1.2. Amharic spelling error patterns
3.1.3. Affix Rules Development
3.1.4. Dictionary development
3.1.5. Lexicon lookup
3.2. Design of Amharic Spelling Checker (AMSPCH)
3.2.1. Design Requirements
3.2.2. Architecture of the Amharic Spell Checker
CHAPTER FOUR
EXPERIMENT, RESULT AND DISCUSSION OF AMHARIC SPELL CHECKER
4.1. Introduction
4.2. Prototype
4.2.1. Input word processing
4.2.2. Implementation of spell checker in Open Office using Hunspell
4.3. Experiment result and Discussion
4.3.1. Evaluation Criteria
4.3.2. Experiment
4.3.3. Discussion
CHAPTER FIVE
CONCLUSIONS AND RECOMMENDATIONS
5.1. Conclusions
5.2. Recommendations
REFERENCE
APPENDIX
Appendix A: Sample of Amharic words taken for experiment
Appendix B: Amharic alphabets with their seven orders
Appendix C: Prefix, Infix and suffix lists used in this thesis work.
Appendix D: sample screen shot
Appendix E: Prefix and suffix lists used in this thesis work.
Appendix F: Min Edit Distance Algorithm
Appendix G: Steps we followed for configuration, compilation, and execution of Hunspell
Appendix H: Word counter Python code

LIST OF ABBREVIATIONS

BDU – Bahir Dar University

NLP – Natural Language Processing

OCR – Optical Character Recognition

OOo – OpenOffice.org

OS – Operating System

POS – Part of Speech

ANRS – Amhara National Regional State

STICC – Science Technology and Information Communication Commission

AMSPCH – Amharic Spelling Checker

LIST OF FIGURES

FIGURE 2.1 ARCHITECTURE FOR NEPALI SPELL CHECKER [6]

FIGURE 3.1 SAMPLE LIST OF DICTIONARY

FIGURE 3.3 ARCHITECTURE OF AMHARIC SPELL CHECKER ADOPTED FROM [3]

FIGURE 3.4 ALGORITHM FOR INPUT COMPONENT ADOPTED FROM [3]

FIGURE 3.5 ALGORITHM FOR MORPHOLOGICAL ANALYSIS ADOPTED FROM [8]

FIGURE 4.1 SAMPLE AFFIX RULE

FIGURE 4.3 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 1

FIGURE 4.4 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 3

FIGURE 4.5 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 4

FIGURE 4.6 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 6

LIST OF TABLES

TABLE 3.1 INFLECTION OF NOUNS BY ADDING SUFFIX “ኦች”
TABLE 3.2 INFLECTION OF NOUNS BY ADDING SUFFIX “ዎች”
TABLE 3.3 INFLECTION OF NOUNS BY REDUPLICATION
TABLE 3.4 INFLECTION OF VERBS
TABLE 3.5 INFLECTION OF TRANSITIVE
TABLE 3.6 NEGATIVE INFLECTION OF VERBS
TABLE 3.7 INTERNAL INFLECTION
TABLE 3.8 AMHARIC PUNCTUATION MARKS
TABLE 4.1 EVALUATION RESULT FOR EXPERIMENT 1
TABLE 4.2 EVALUATION RESULT FOR EXPERIMENT 2
TABLE 4.3 EVALUATION RESULT FOR EXPERIMENT 3
TABLE 4.4 EVALUATION RESULT FOR EXPERIMENT 4
TABLE 4.5 EVALUATION RESULT FOR EXPERIMENT 5
TABLE 4.6 EXPERIMENT RESULT SUMMARY
TABLE 4.7 AVERAGE PERFORMANCE CALCULATED FROM OVERALL PERFORMANCE OF EACH EXPERIMENT
TABLE 4.8 EVALUATION RESULT FOR EXPERIMENT 6

CHAPTER ONE

1. INTRODUCTION
1.1. Background

Amharic is the official language of the Federal Democratic Republic of Ethiopia, which has
a population of over 92.21 million. There are 21.6 million native Amharic speakers and
4 million second-language speakers, and 3 million emigrants outside Ethiopia speak
Amharic, for a total of 28.6 million Amharic speakers [2].

Amharic language users use Amharic script for document preparation on a daily basis, but
spelling errors are a recurring problem. To minimize spelling errors, spelling checker tools
are used in text processing applications. Spelling checkers have become essential parts of
any text processing application software, and different types of spelling checker
applications have been implemented in text processing tools for different languages.

Most commercially available word processors have a spell checker, a grammar checker and
even a word list lookup facility as an essential part for several languages such as English,
French, Portuguese, Spanish, Arabic, etc. [3].

For most African languages no spell checkers exist, and for those languages that do have
spell checkers, their adequacy in actual use is questionable [4]. This is also true for
Amharic, the official language of the Federal Democratic Republic of Ethiopia [4].

When people use a spelling checker during document preparation, they save money and
time and can produce better-quality, more acceptable documents. For example, the number
of spelling mistakes in English newspapers has dropped considerably through the use of
text processing tools with spelling correctors [5].

Amharic language users are not benefiting from spelling checking and correcting tools,
because text processing tools are not integrated with Amharic spelling checkers and
spelling correctors. As a result, it requires excessive effort and manpower to minimize
misspelled words in written documents (newspapers, books, reports, plans, and different
publications). A spell checker is therefore important for saving time and money and for
producing quality documents. It also reduces the dangerous consequences of mistyped
electronic texts in areas such as courts, health, the military and other related cases.

1.2. Motivation

Different spell checking and correcting techniques have been developed and implemented
for languages such as English, Arabic, Bangla, and so on. However, a spell checker and
corrector tool developed and implemented for one language cannot be applied to others
directly, because spell checkers depend on the characteristics of the language. Hence,
specific spell checkers are available for English and some other Latin-script-based
languages [6]. Existing word processing tools support language-specific utilities such as
grammar checking, vocabulary, lexicon, translators, etc. for many languages. However, the
absence of a spell checker and corrector tool for Amharic has made document preparation
difficult: it requires excessive effort to edit and correct documents, reduces document
quality and wastes time. As a result, it is common to see spelling errors in Amharic
newspapers and published documents. For example, Figure 1.1 is taken from a published
document [7] that has up to 9 misspelled words on a single page.

Figure 1.1. Sample Amharic text taken from a published document

To reduce typing errors, some research work has been done using Hunspell for uninflected
and inflected Amharic words in a Linux OS environment [1, 6]. However, the previous
works do not consider internally inflected and repeated words, for example:
ገጣጠመ, ሰባበረ, ቆራረጠ, ጌጣጌጥ and so on.

1.3. Statement of the Problem

Most government and private-sector organizations in the Federal Democratic Republic of
Ethiopia use Amharic script for document preparation, and many documents are prepared
in day-to-day activities. Among the resulting problems is verifying or editing documents,
whether written by the worker or by someone else.

For the English language, computers have considerably reduced this problem, since they
automatically detect and correct spelling as well as grammatical mistakes. Because of this,
writers not only save a considerable amount of time and money but have also started
producing relatively better documents. The number of spelling mistakes in English
newspapers has thus decreased considerably because of the use of automatic spelling
correctors [5].

Currently, there is no applicable Amharic text spell checker integrated with any text
processing tool. As described in [4], the reasons include the lack of standardization and
the complex morphology of Amharic, the absence of clearly defined spelling rules, and the
existence of more than one alphabet character for the same sound.

There is no Amharic text spell checker tool or software which has the following features:

1. Amharic spelling checker system with internal inflection,

2. Amharic spelling checker that can check spelling on the fly,

3. Usable application of Amharic spell checker,

4. Spell checker program with exhaustive rules that incorporate repeated words.

Therefore, it is essential to develop a spelling checker that satisfies the above criteria and
to integrate it into the OpenOffice word processor. Additionally, language-specific problems
such as the lack of standardization can be solved temporarily by using available resources
such as dictionaries and Amharic books in the development of the project.

Due to this problem, Amharic language users prefer English to Amharic. Shewangizaw [4]
and Mekonnen [5] developed Amharic language spelling checkers, and Gaddisa [8]
developed a morphology-based Afaan Oromo spelling checker. All of these works [1,5,8]
use the Hunspell tool, but internal inflection was not considered and the rules were not
exhaustive; for example, compound words such as ጋሻ-አጃግሬ and እራስ-አገዝ were not covered.
Neither Mekonnen [5] nor Shewangizaw [4] considered internal inflection, repeated words,
or exhaustive rules (compound words); these were left as future work. This research
therefore builds on the works mentioned above: it tries to cover compound words
exhaustively and to incorporate internal inflection and repeated words, and finally designs
and implements an Amharic language spelling checker (AMSPCH) based on the previous
related works.

Thus, this study tries to address the following questions:

• What is a suitable tool and algorithm for designing the system?

• What is the performance of the system?

1.4. Objective of the Study

The following are the general and specific objectives of the study.

1.4.1. General Objective

The general objective of the study is to design and develop an automatic Amharic language
spelling checker for the OpenOffice word processor.

1.4.2. Specific Objectives

To achieve the general objective, the following specific objectives are accomplished:

• Review different related documents to understand the concept, identify the gap, and
study the structure of Amharic words and their derivations,

• Develop an Amharic root word dictionary,

• Explore Amharic word formation rules from a root word,

• Design and develop the prototype,

• Evaluate the performance of the system.

1.5. Scope and limitation of the Study

1.5.1. Scope of the Study

The study attempts to collect and analyze different related documents and to design and
implement the Amharic language spelling checker by considering internal inflection of
words, repeated words, and compound words. Finally, the spelling checker is integrated into
the OpenOffice word processor and the performance of the system is studied. However,
real-word error checking and correction (i.e. error checking and correction using contextual
information) is outside the scope of the study.

1.5.2. Limitation of the Study

In this study, spelling error correction techniques in other languages were investigated.
However, due to time constraints, we do not consider automatic suggestion and correction
of words; instead, the Levenshtein edit distance spelling error correction technique, as
implemented in Hunspell, is adopted. A further limitation comes from Hunspell itself: affix
rules work only for the first 65535 Unicode characters.
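
To make the adopted technique concrete, the following is a minimal sketch of Levenshtein edit distance and of ranking dictionary entries by that distance. It is an illustration only and is not Hunspell's internal code; the function names `levenshtein` and `suggest`, the `max_distance` parameter and the two-word lexicon are assumptions introduced here.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def suggest(word, lexicon, max_distance=2):
    """Return lexicon entries within `max_distance` edits, closest first.
    (Hypothetical helper; the threshold is an illustrative choice.)"""
    scored = sorted((levenshtein(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance]


# Toy lexicon for illustration only.
toy_lexicon = ["ኢትዮጵያ", "ገበሬ"]
print(suggest("ኢዮጵያ", toy_lexicon, max_distance=1))
```

Because each Ethiopic syllable is a single Unicode code point, the distance here is counted per written character, which matches the single-letter error categories discussed in Chapter Two.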

1.6. Significance of the Study

Government organizations, students, journalists, teachers and basically anyone who uses
the Amharic language to prepare documents will benefit from this study. The work is
significant in different areas, some of which are listed below:

• For press companies, in the preparation of Amharic books, journals, newspapers, etc.

• For teaching and learning processes, in the preparation of lecture notes, handouts,
reports and assignments.

• For business organizations, in the preparation of their routine and regular reports,
plans, etc.

• For governmental organizations, in the preparation of rules and regulations, etc.

• For anyone who wants to write a sensitive report, in avoiding or reducing dangerous
consequences (courts, governments, agreements).

1.7. Research Methodology

This study is experiment-based and uses the following methodology and tools.

1.7.1. Literature Review

For a proper understanding of the problem and successful completion of this study,
different relevant global and local literature, such as journal articles, conference papers,
reports, books, manuals and relevant resources from the internet, was reviewed to achieve
the study objectives. The study builds on previous research works and literature related to
Amharic language spelling checkers. We also reviewed different types of spelling checker
tools to identify their pros and cons.

1.8.2. Data Collection and Preparation

Since there is no ready-made Amharic root word dictionary for the Amharic language
spelling checker (AMSPCH), different Amharic electronic documents were collected and
studied to analyze the errors encountered in Amharic documents, which helps to
characterize different types of spelling errors. The Amharic spell checker’s word list was
built by combining an Amharic dictionary, lists of some common names in Amharic, a list
of Ethiopian person names, a list of common places in Ethiopia, a list of abbreviations and
lists of some countries in the world, collected from different books and research works.

1.8.3. Implementation tools

For this study we use the following off-the-shelf components:

• Amharic Unicode fonts, used to type Amharic text.

• OpenOffice word processor: although an Amharic spell checker can be integrated into
closed proprietary word processors such as Microsoft Word, we chose the OpenOffice
word processor because of the accessibility of its tools and code.

• Hunspell tools: Hunspell is a spell checker and morphological analyzer library and
program designed for languages with rich morphology and complex word compounding or
character encoding.

• Cygwin terminal tool, used for interfacing.

• Word counter Python code, used to count the number of words in the dataset; it is
shown in Appendix H.
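
As an illustration of what such a word counter might look like, the sketch below counts the words in a text with Python's standard library. It is not the code from Appendix H; the function name `count_words` and the tokenization rule (runs of Unicode word characters, so that Ethiopic punctuation such as ፡ and ። acts as a separator) are assumptions made here.

```python
import re
from collections import Counter


def count_words(text):
    """Return (total number of words, per-word frequency table).

    A word is taken to be a maximal run of Unicode word characters, so
    whitespace and punctuation marks (including Ethiopic ones) split words.
    """
    tokens = re.findall(r"\w+", text)
    return len(tokens), Counter(tokens)


# Toy input for illustration only.
total, frequencies = count_words("ኢትዮጵያ፡ገበሬ። ኢትዮጵያ")
print(total)
print(frequencies.most_common(1))
```

For a dataset stored in a file, the same function could be applied to the result of reading the file with `encoding="utf-8"`.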

1.8.4. Performance Evaluation techniques

To measure the performance of the new system, recall and precision were taken as the
major criteria. The test data was collected from the annual report documents of different
regions. Valid uninflected and inflected Amharic words were used in addition to misspelled
Amharic words.
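
The sketch below shows how precision and recall can be computed for error detection under the standard definitions, which are assumed here because this section does not spell them out: precision is the share of flagged words that are truly misspelled, and recall is the share of truly misspelled words that were flagged. The function name `precision_recall` and the placeholder tokens are hypothetical and do not come from the thesis data.

```python
def precision_recall(flagged, misspelled):
    """Precision and recall of spelling error detection.

    flagged    -- words the spell checker reported as errors
    misspelled -- words that are actually misspelled (ground truth)
    """
    flagged, misspelled = set(flagged), set(misspelled)
    true_positives = len(flagged & misspelled)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(misspelled) if misspelled else 0.0
    return precision, recall


# Placeholder tokens only: 4 words flagged, 5 truly misspelled, 3 in common.
p, r = precision_recall({"w1", "w2", "w3", "w4"}, {"w1", "w2", "w3", "w5", "w6"})
print(f"precision={p:.2f} recall={r:.2f}")
```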

1.8.5. Organization of the thesis

The remainder of the thesis is organized as follows.

This introductory chapter gave an overview of the background and the statement of the
problem, the objectives, scope and limitations of the study, the significance of the study,
and a description of the methodology used to conduct it.

Chapter two reviews the literature and related works. It provides background information
on how spell checkers work and on the types of spelling errors, and describes spell checker
work done for languages such as Nepali, Bangla, Arabic, and Amharic.

Chapter three provides information about the Amharic language and its writing system, the
design requirements for the Amharic spell checker, and the architecture of the designed
spell checker. Chapter four deals with the experiments conducted to evaluate the
performance of the spell checker and discusses the results obtained. The last chapter,
chapter five, presents the overall conclusions drawn from the studies reported in this
thesis; finally, recommendations are given and areas open to future research are
identified.

CHAPTER TWO

LITERATURE REVIEW AND RELATED WORKS

This chapter discusses the theoretical concepts and types of spelling checkers, the core
functionalities of spelling checking systems, related work on spelling checking, and finally
the approaches used in developing spelling checker systems.

2.1. Literature Review

2.1.1. Amharic text spell checker

A spell checker is a tool that checks the spelling of the words in a text file, validates
whether they are correctly or incorrectly spelled and, where it has doubts about the
spelling of a word, suggests possible alternatives [8].

A spell checker operates on a single word at a time and is either dictionary-based or
rule-based. A dictionary-based spell checker can be designed in two ways. In the first, the
dictionary contains all root words and their inflected forms. This approach is not suitable
for languages with rich morphology, such as Amharic and Arabic, because the spell checker
must be able to handle highly inflected words. It is easy to develop an Amharic spell
checker using uninflected word collections; however, as stated in [4], it performs poorly and
consumes a lot of memory. In the second, the dictionary contains only root words, which
gives better performance and lower memory consumption. Stemming is important for
building a root word dictionary from an existing electronic dictionary: it is the process of
reducing morphological variants of a word to a common form, particularly by removing
prefixes and suffixes. The affix rules and dictionary of Amharic words can be in Ethiopic
script or Unicode data.
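
The second design can be sketched as a lookup that strips known affixes and checks the remaining stem against a root word dictionary. This is a simplified illustration and not the thesis's Hunspell affix rules: the function name `is_valid` and the toy root, prefix and suffix lists are assumptions, and the sketch ignores the character changes that Amharic affixation can cause at the boundary between stem and affix.

```python
# Toy data for illustration only; a real system would load these from
# a root word dictionary and an affix rule file.
ROOTS = {"ኢትዮጵያ", "ገበሬ"}
PREFIXES = ("የ",)
SUFFIXES = ("ዎች",)


def is_valid(word, roots=ROOTS, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Accept `word` if it is a root, or a root with one optional known
    prefix and one optional known suffix attached by plain concatenation."""
    for prefix in ("",) + tuple(prefixes):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + tuple(suffixes):
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in roots:
                return True
    return False


print(is_valid("ገበሬዎች"))   # root + suffix
print(is_valid("ገበረዎች"))   # stem is not in the toy dictionary
```

Storing only roots plus a short affix list keeps the dictionary small, which is the memory advantage described above; the cost is that every lookup has to try the affix combinations.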

2.1.2. Types of Spelling Errors

There are two types of spelling errors: real-word spelling errors and non-word spelling
errors [9].

• Real-word spelling errors

In a real-word spelling error, a word is correctly spelled but not contextually correct [10].
That is, it is impossible to decide whether the word is wrong without some contextual
information. Spelling errors that result in a token that is a correctly spelled word, though
not the one the user intended [11], are real-word spelling errors.

• Non-word spelling errors

Non-word spelling errors occur when the user misspells or mistypes a word [12]. In our
research work, we focus on non-word spelling errors. As stated in [13], non-word errors are
mainly classified into typographic and cognitive errors.

Typographic Errors

Typographic errors occur when the writer knows the correct spelling of the word but
mistypes it (for example, ሞራረደ vs. ሞረራደ). These errors are mostly related to the
keyboard shift key.

As stated in [1, 5], typographic errors are classified into four major types: substitution,
deletion, insertion, and transposition errors.

These error types can be of multi error misspellings and single error misspelling. Multi-
error misspellings are errors that contain more than one instance of error, whereas, single
error misspellings are a single instance of an error in the given word. As stated in [14] the
majority of (80%) wrong spellings happen because of one of the following four categories:

11
 Single letter insertion, e.g. typing ነኢትዮጵያ for ኢትዮጵያ
 Single letter deletion, e.g. typing ኢዮጵያ for ኢትዮጵያ
 Single letter substitution, e.g. typing ኢተዮጵያ for ኢትዮጵያ
 Transposition of two adjacent letters, e.g. typing ኢዮትጵያ for ኢትዮጵያ

In most cases typographic errors are related to keyboard adjacencies, and the most common
typographic errors are substitution errors. This error type is mainly caused by replacing a
letter with another letter whose key is adjacent to the correct letter’s key on the keyboard.
As shown in Kukich’s study [15], 58% of the errors involved adjacent typewriter keys for the
English language.
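
The four single-error categories listed above can be told apart mechanically. The following is
a minimal sketch, not part of the cited works: the function name classify_single_error is
hypothetical, and it assumes each Ethiopic character is one Unicode code point, so that Python
string indexing works character by character.

```python
def classify_single_error(typed, intended):
    """Label the single edit that turns `intended` into `typed` (illustrative helper)."""
    if typed == intended:
        return "none"
    if len(typed) == len(intended) + 1:
        # The typed word has exactly one extra character somewhere.
        if any(typed[:i] + typed[i + 1:] == intended for i in range(len(typed))):
            return "insertion"
    elif len(typed) == len(intended) - 1:
        # The typed word is missing exactly one character.
        if any(intended[:i] + intended[i + 1:] == typed for i in range(len(intended))):
            return "deletion"
    elif len(typed) == len(intended):
        diffs = [i for i, (a, b) in enumerate(zip(typed, intended)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typed[diffs[0]] == intended[diffs[1]]
                and typed[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    # Anything else needs more than one edit.
    return "multi-error"


if __name__ == "__main__":
    for typed in ["ነኢትዮጵያ", "ኢዮጵያ", "ኢተዮጵያ", "ኢዮትጵያ"]:
        print(typed, classify_single_error(typed, "ኢትዮጵያ"))
```

Anything that cannot be explained by one of the four edits is reported as a multi-error
misspelling.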

Cognitive errors

Cognitive errors, also called orthographic errors [13], occur when the writer does not know or
has forgotten the correct spelling of a word. It is assumed that, in the case of cognitive
errors, the misspelling results from mispronouncing the correct word, especially for foreign
words used in Amharic (e.g., ኢንፎርሸሚን -> ኢንፎርሜሽን, ኮርኮሬሽን -> ኮርፖሬሽን).

2.1.3. Core functionalities of spell checkers

Spelling error detection and spelling error correction are the two core functionalities of a
spell checker. Error detection verifies the validity of a word in the language, while error
correction suggests corrections for the misspelled word [16].

According to [14], spelling error correction can be interactive or automatic. An interactive
spell checker can suggest more than one alternative correction for each error, and the user
selects one of the suggestions as the replacement. In automatic correction, the spell checker
selects the single best correction and automatically replaces the misspelled word with it.

Automatic error correction is required for speech processing and Natural Language Processing
(NLP) related systems where human intervention is not possible [14]. The spell checking process
can generally be divided into three steps: detecting errors, finding corrections, and ranking
corrections. Detection and correction are discussed above; ranking is the listing of suggested
corrections in decreasing order of their likelihood of being the intended word.

2.1.3.1. Spelling Error Detection

Spelling error detection verifies the validity of a word in the language. It includes
identifying and flagging misspelled words using different detection algorithms. The two main
approaches for non-word error detection are dictionary lookup and n-gram analysis [17].

Dictionary Lookup Technique

The dictionary lookup technique checks whether every word of the input text is present in the
dictionary. If the word is present in the dictionary, it is a correct word; otherwise it is a
misspelled word. The most common technique for gaining fast access to a dictionary is the use
of a hash table. To look up an input string, one simply computes its hash address and retrieves
the word stored at that address in the pre-constructed hash table. If the word stored at the
hash address is different from the input string or is null, a misspelling is indicated [18].
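
As an illustration of hash-based lookup, the sketch below uses Python's built-in set, which is
implemented as a hash table. The lexicon contents and the function name find_misspellings are
assumptions made for this example, and tokenization is simplified to whitespace splitting.

```python
# Hypothetical miniature lexicon; a real one would hold thousands of entries.
LEXICON = {"ኢትዮጵያ", "በሬ", "ቤት"}


def find_misspellings(text, lexicon):
    """Return the tokens of `text` that are not found in the hash-based lexicon."""
    # Membership tests on a set hash the token and compare it with the stored entry.
    return [token for token in text.split() if token not in lexicon]


if __name__ == "__main__":
    print(find_misspellings("ኢትዮጵያ ኢዮጵያ ቤት", LEXICON))
```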

The challenges of this approach are as follows. A lexicon containing all correct words could be
extremely large, which increases the space required and the search time, and for
morphologically complex languages it is practically impossible to list all correct words. So,
instead of storing every word form in the lexicon, rules can be applied to reduce a given word
to its root word. This can be done by storing only root words in the lexicon together with
prefix, infix and suffix information. Rules can then be applied to the root words of the
lexicon, using the prefix, infix and suffix information, to generate derived words.
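
A minimal sketch of this idea follows. It handles suffixes only, and the root list, the suffix
list and the function name reduce_to_root are hypothetical choices for illustration; a real
system would also need prefix and infix rules and conditions on which root accepts which affix.

```python
ROOTS = {"በሬ", "ገበሬ", "ተማሪ"}   # hypothetical root lexicon
SUFFIXES = ["ዎች"]               # hypothetical suffix list with one plural suffix


def reduce_to_root(word, roots, suffixes):
    """Return the root of `word` if it is a root or root + suffix, else None."""
    if word in roots:
        return word
    for suffix in suffixes:
        # Strip the suffix and check whether what remains is a stored root.
        if word.endswith(suffix) and word[:-len(suffix)] in roots:
            return word[:-len(suffix)]
    return None


if __name__ == "__main__":
    print(reduce_to_root("በሬዎች", ROOTS, SUFFIXES))
```

With this design the lexicon stores each root once, and inflected forms are accepted only when
a rule links them back to a stored root.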

N-gram Analysis Technique

The N-gram analysis or independent spelling error detection method does not use a wordlist
or lexicon; instead it uses statistical means to detect misspelled words [4].

This method works by using a large corpus of text from the desired language and computing a
table of character n-grams from it. A character n-gram is a sequence of characters where n is
the number of letters in the sequence.

One, two and three letter n-grams are often referred to as unigrams, bigrams and trigrams,
respectively. For example, a trigram analysis of the word ኢትዮጵያ gives the 3-gram set
{ኢትዮ, ትዮጵ, ዮጵያ}. By using this technique, strings that contain unusual n-grams can be
identified as possible spelling errors. N-gram techniques usually require a large corpus or
lexicon as training data so that an n-gram table of possible letter combinations can be
compiled.
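
The n-gram table and the check for unusual n-grams can be sketched as follows. The function
names and the min_count threshold are assumptions for illustration, and the one-word corpus in
the usage example is far too small for real use.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the list of character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_ngram_table(corpus_words, n):
    """Count every character n-gram that occurs in the corpus words."""
    table = Counter()
    for word in corpus_words:
        table.update(char_ngrams(word, n))
    return table


def has_unusual_ngram(word, table, n, min_count=1):
    """Flag `word` if any of its n-grams occurs fewer than `min_count` times."""
    return any(table[gram] < min_count for gram in char_ngrams(word, n))


if __name__ == "__main__":
    print(char_ngrams("ኢትዮጵያ", 3))
    table = build_ngram_table(["ኢትዮጵያ"], 2)
    print(has_unusual_ngram("ኢዮጵያ", table, 2))
```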

According to [15], the n-gram analysis technique is very useful for detecting errors that occur
in machine-generated texts, such as texts generated by OCR. Its main advantage is that it works
without a lexicon. However, for human-generated errors, most spell checkers rely on dictionary
lookup for error detection, and some applications use a hybrid of the two methods. To use the
dictionary lookup technique, we need to be careful about the lexicon size and use an efficient
lookup algorithm.

2.1.3.2. Spelling error correction

Non-word spelling error correction is the process of detecting incorrectly spelled words in a
text and providing suggestions for them. The spell checker can suggest one or more corrections
for each error, and the user selects the best word from the list to replace the misspelled
word. Non-word spelling error correction can be done without considering contextual
information, which is called isolated word error correction [14].

The isolated word error correction approach is very helpful for handling non-word spelling
errors. In this approach, knowledge about error patterns is very useful. Most misspellings are
within one or two characters in length of the correct word, so while searching for the correct
spelling we do not usually need to look at words whose length differs by more than two
characters. Kukich [15] also mentioned that the number of errors occurring at the beginning of
a word is minimal. Because the probability of an error in the first letter of a word is low,
the process of error correction can be sped up by concentrating on the remaining letters of the
word.
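
These two observations suggest a cheap pre-filter that can run before any costlier comparison.
The sketch below is illustrative only: the function name, its parameters and the sample lexicon
are assumptions, and requiring the same first letter is a heuristic that will miss the rare
errors in the first position.

```python
def prefilter_candidates(misspelled, lexicon, max_len_diff=2, same_first_letter=True):
    """Keep lexicon words of similar length that share the first letter."""
    kept = []
    for word in lexicon:
        if abs(len(word) - len(misspelled)) > max_len_diff:
            continue  # length differs by more than two characters
        if same_first_letter and word[:1] != misspelled[:1]:
            continue  # assume the first letter was typed correctly
        kept.append(word)
    return kept


if __name__ == "__main__":
    lexicon = ["ኢትዮጵያ", "ኢትዮጵያዊነት", "በሬ"]  # hypothetical sample
    print(prefilter_candidates("ኢዮጵያ", lexicon))
```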

Generally, isolated word error correction techniques can be divided into the following
subcategories [14]:
1. Edit distance techniques
2. Similarity Key techniques
3. Probabilistic Techniques
4. N-Grams Based Techniques
5. Phonetics based techniques

Minimum Edit Distance Technique

Edit distance is one of the most effective techniques for generating alternatives for wrongly
spelled words. In this approach, the word containing the spelling mistake is compared to every
word in the dictionary, and the operations (insertions, deletions, substitutions and
transpositions) needed to turn it into each dictionary word are counted. The total number of
such operations is referred to as the distance. The minimum edit distance is the minimum number
of operations (insertions, deletions and substitutions) required to transform one text string
into another [19]. In its original form, a minimum edit distance algorithm requires one
comparison between the misspelled string and every word in the dictionary [15]. After
comparison, the words with the minimum edit distance are chosen as correction candidates.
Several algorithms compute such distances, including the Levenshtein, Hamming and Longest
Common Subsequence algorithms.
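
The following is a minimal sketch of the Levenshtein distance computed by dynamic programming,
together with a brute-force search for the closest dictionary words. The function names and the
sample dictionary are assumptions for illustration. Plain Levenshtein distance counts
insertions, deletions and substitutions only, so a transposition of two adjacent letters costs
two operations here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest_words(misspelled, dictionary):
    """Return the dictionary words (non-empty dictionary) at minimum distance."""
    distances = {word: levenshtein(misspelled, word) for word in dictionary}
    best = min(distances.values())
    return sorted(word for word, dist in distances.items() if dist == best)


if __name__ == "__main__":
    print(levenshtein("ኢዮጵያ", "ኢትዮጵያ"))
    print(closest_words("ኢዮጵያ", ["ኢትዮጵያ", "በሬ"]))
```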

Similarity key technique

The similarity key technique maps every string into a key such that similarly spelled strings
have similar keys. Thus, when the key is computed for a misspelled string, it provides a
pointer to all similarly spelled words in the lexicon [4].
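
One possible key for Ethiopic text, shown here purely as an assumption and not as a method from
the cited works, replaces every syllable character by the first-order form of its consonant, so
that words differing only in vowel order share a key. The sketch relies on the Unicode Ethiopic
block placing each consonant row in eight consecutive code points starting at U+1200.

```python
from collections import defaultdict

ETHIOPIC_FIRST = 0x1200  # first Ethiopic syllable in Unicode
ETHIOPIC_LAST = 0x135A   # last Ethiopic syllable in the basic block


def similarity_key(word):
    """Map each Ethiopic syllable to the first code point of its 8-wide row."""
    key = []
    for char in word:
        code = ord(char)
        if ETHIOPIC_FIRST <= code <= ETHIOPIC_LAST:
            code -= (code - ETHIOPIC_FIRST) % 8
        key.append(chr(code))
    return "".join(key)


def build_key_index(lexicon):
    """Group lexicon words by their similarity key."""
    index = defaultdict(list)
    for word in lexicon:
        index[similarity_key(word)].append(word)
    return index


if __name__ == "__main__":
    index = build_key_index(["ኢትዮጵያ", "በሬ"])
    print(index.get(similarity_key("ኢተዮጵያ"), []))
```

A misspelling such as ኢተዮጵያ then retrieves ኢትዮጵያ with a single index lookup instead of a
comparison against every lexicon entry.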

Rule Based Technique

Rule based techniques are algorithms that attempt to represent knowledge of common spelling
error patterns in the form of rules for transforming misspellings into valid words. The
candidate generation process consists of applying all applicable rules to a misspelled string
and retaining every valid dictionary word that results [20].

Probabilistic Techniques

Two types of probabilities have been exploited in probabilistic techniques: transition
probabilities and confusion probabilities. Transition probabilities represent the probability
that a given letter will be followed by another given letter. Confusion probabilities are
estimates of how often a given letter is mistaken for, or substituted by, another given letter.
Confusion probabilities are source dependent: because different optical character recognition
(OCR) devices use different techniques and features to recognize characters, each device will
have a unique confusion probability distribution.

N-gram Based Techniques

Letter n-grams, including tri-grams, bi-grams and unigrams have been used in a variety of
ways in text recognition and spelling correction techniques. They have been used by OCR
correctors to capture the lexical syntax of a dictionary and to suggest legal corrections.

Phonetics based techniques

These techniques work on the phonetics of the misspelled string. The goal is to find the word
in the dictionary that is phonetically closest to the misspelling [14].

2.1.4. Spelling checker tools

Ispell, Aspell, MySpell and Hunspell are some of the open source spell checker tools integrated
with open source word processors such as LibreOffice and OpenOffice [5].

Ispell is a spelling checker for UNIX that supports most Western languages (English (United
Kingdom), English (United States), French, German, and Spanish). It offers several interfaces,
including a programmatic interface for use by editors such as Emacs (a popular text editor used
mainly on Unix-based systems by programmers, scientists, engineers, students, and system
administrators). Ispell only suggests corrections that are within a Damerau-Levenshtein
distance of 1. It does not attempt to guess more distant corrections based on English
pronunciation rules. The generalized affix description system introduced by Ispell has been
imitated by other spelling checkers such as MySpell [2].

Like most computerized spelling checkers, Ispell works by reading an input file word by word,
stopping when a word is not found in its dictionary. Ispell attempts to generate a list of
possible corrections and presents the incorrect word and any suggestions to the user, who can
then choose a correction, replace the word with a new one, leave it unchanged, or add it to the
dictionary [5].

Another open source spelling checker tool is Aspell, a spell checker program designed to
replace Ispell. Its primary advantage over Ispell and other existing spell checkers is the
quality of the replacements it suggests for a misspelled word. Aspell can also spell check
UTF-8 encoded documents without the use of an additional dictionary, and it supports multiple
dictionaries at once, which Ispell does not. MySpell is a spell checker based on Ispell. It is
used by OpenOffice.org and Firefox/Mozilla and works on both Windows and Linux [5].

Hunspell is the next generation of MySpell. It has been improved to support additional features
for different languages, especially Hungarian, as well as other languages such as German and
Turkish [21].

In general, Hunspell is a spell checker and morphological analyzer library and program designed
for languages with rich morphology and complex word compounding or character encoding. Hunspell
is an attractive spell checker for many languages, such as Amharic and Arabic, because of the
following features [5]:

• Unicode support
• Morphological analysis and stemming
• Support for complex compounding
• Support for language specific features
• Handling of conditional affixes, circumfixes, forbidden words, pseudo roots and homonyms
• Free and open source software

Ispell, Aspell, MySpell and Hunspell all use a dictionary file (.dic) and an affix file (.aff).
The dictionary file (.dic) is a list of words with their corresponding affix rules.

The affix file describes each of the prefix, infix and suffix based rules. An affix is a
linguistic element added to a word to produce an inflected or derived form. An affix can be
placed at the beginning (prefix), middle (infix), or end (suffix) of the root or stem of a word
[21]. However, the affix rules used in the spell checkers mentioned above are either prefix,
infix or suffix rules. A spell checker for Amharic electronic text should include features such
as analysis of the rich morphological structure of the Amharic language and support for Unicode
encoding. The Hungarian spell checker, which is based on Hunspell, is capable of analyzing the
complex morphological nature of a language and supports Unicode encoding. So in our study the
Hunspell spelling checker tool is used.

2.2. Related works

Different language specific spell checkers have been developed to address spelling error
problems that arise from the specific nature of a language when documents are prepared with
word processors. In this section we present some of those language specific spell checkers. In
addition, we examine the sources of spelling error variations for the Amharic language.

2.2.1. Nepali Spell Checker

Keyboard adjacencies, shift key characters, phonetic similarity, and visual similarity are
indicated as the main causes of spelling mistakes in the Nepali writing system [22]. As shown
in Figure 2.1, the Nepali spell checker has three components, namely the Morphological
Preprocessing Module, the Lexicon Lookup/Error Detection Module and the Suggestion Module.

Each module can easily be incorporated into a new spell checker for other languages and can
also be used to devise new techniques and procedures for the Nepali language.

Figure 2.1 Architecture for Nepali spell checker [6]

Lexicon

Because the size of the lexicon is an important factor for the efficiency of a spell checker,
only Nepali root words are stored.

Error Detection Module

The error detection module looks up the input word in the lexicon. A token (a Nepali word) is
the input to the error detection module. The module searches for the input word in the lexicon;
if the word is not found, it is sent to the morphological preprocessing module. The error
detection module then accepts the word as broken down by the morphological preprocessing module
and searches for its root in the lexicon. If the root word is not found in the lexicon, a
spelling error is detected, and the word is sent to the suggestion module for correction.

Morphological Preprocessing

Morphologically complex words are broken down into root words in this module, and the roots
are then searched for in the lexicon. The researchers used morphological rules in this way to
reduce the size of the lexicon. The morphological preprocessing module uses a Nepali Porter
stemmer to break down morphologically complex words into roots and affixes.

Spelling Error Correction and Suggestion Module

The suggestion module receives a token when a spelling error is detected. For spelling error
correction and suggestion, it uses the edit distance algorithm (more specifically, the
Levenshtein edit distance algorithm).

Evaluation

The researchers used lexical recall (the percentage of valid words correctly accepted), error
recall (the percentage of invalid words correctly flagged), precision (the percentage of
flagged words that are correctly flagged), and suggestion adequacy (how adequate the correct
suggestion is) as evaluation metrics for their spell checker.
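
Under the definitions above, the three detection metrics can be computed from four counts. The
sketch below is an interpretation, not the cited authors' code: the function name, the argument
names and the example counts are assumptions, and results are returned as fractions rather than
percentages.

```python
def detection_metrics(valid_accepted, valid_flagged, invalid_flagged, invalid_accepted):
    """Compute lexical recall, error recall and precision from raw counts."""
    return {
        # Share of valid words that the checker accepted.
        "lexical_recall": valid_accepted / (valid_accepted + valid_flagged),
        # Share of invalid words that the checker flagged.
        "error_recall": invalid_flagged / (invalid_flagged + invalid_accepted),
        # Share of flagged words that were really invalid.
        "precision": invalid_flagged / (invalid_flagged + valid_flagged),
    }


if __name__ == "__main__":
    # Made-up counts, used only to show the call.
    print(detection_metrics(90, 10, 45, 5))
```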

2.2.2. Spelling Checker for Afaan Oromo Language

As stated in Gaddisa [8], Afaan Oromo belongs to the Cushitic language family. It is an
official language of the Oromiya regional state and has a very rich morphology. The system is
designed as a dictionary look-up with morphological rules.

Morphological rules in the Afaan Oromo language address word categories and their possible
inflections, derivation and compounding.

The architecture has eight components: Tokenizer, Knowledge base, Error detection,
Morphological analyzer, Error correction, Morphological generator, Suggestion ranker and Word
assembler. The system uses Latin (English) characters, and the inflection of its words differs
from the inflection of Amharic words. In the system, the Levenshtein edit distance algorithm is
used to rank the suggestions, and the suggested word with the shortest distance to the
misspelled word is considered the best suggestion.

The research used accuracy, performance, precision, and recall to evaluate the spell checker.
Accuracy measures how well the prototype suggests corrections for the generated errors.
Performance measures the efficiency of the prototype in terms of the time it takes to generate
the correct suggested word. Precision and recall measure the number of correct suggestions in
the total number of spelling suggestions [8].

2.2.3. Spell Checker for Bangla

According to [23], the phonetic similarity of Bangla characters and the difference between the
grapheme representation and the phonetic utterance are the most common reasons for spelling
errors in written Bangla. To produce good suggestions for these spelling errors, methods based
on edit distance and fuzzy string matching algorithms have been developed for the language.

The research work [24], done by Naushad UzZaman and Mumit Khan from BRAC
University, presents a double metaphone encoding algorithm for Bangla that can be used
by spell checkers to improve the quality of suggestions for misspelled words in the
language. The researchers presented how this encoding system effectively encapsulates the
complex rules for Bangla and dialectic pronunciation differences that are not possible to
handle by using the traditional edit-distance methods. They compared the proposed Double
Metaphone algorithm with the edit distance based methods in producing suggestions for
misspelled words.

2.2.4. Spell Checker for Arabic language

In 2003, Khaled Shaalan and Amin Allam [25], University of Cairo, developed an Arabic
morphological analyzer. In addition, they devised techniques for spelling error detection and
correction for the Arabic language by investigating common spelling errors in Arabic text
writing.

The researchers analyzed and classified common spelling errors in written Arabic words as:
• Reading errors: these spelling errors can occur when the writer copies a word from a written
document, or because of the visual similarity of some characters in the language.
• Hearing errors: these errors can occur when the writer is taking dictation and recognizes
one character as another. This might result from pronunciation differences.
• Morphological errors: errors in this category might be made by non-native speakers of Arabic
or by writers who are not well educated.
• Editing errors: these are the errors most common in other languages [4], such as insertions,
deletions, substitutions, and transpositions.

Their study mainly focused on spelling error correction for isolated words. They proposed
spelling correction methods categorized as ‘add missing character’, ‘replace incorrect
character’, ‘remove excessive character’, and ‘adding a space’ to split a misspelled word into
two or more words.

In adding a missing character, the spell checker adds a character in every possible position.
If the modified word matches a word in the lexicon, the new word is added to the list of
candidates. Similarly, in replacing an incorrect character, their tool replaces every character
with one of its neighbors according to some rule, and if the new word is found in the lexicon
it is added to the list. For adding a space to split words, the tool adds a space in every
possible position, and the newly formed words are added to the suggestion list when they are
found in the lexicon.
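
The four correction methods described above can be sketched as one candidate generator. This is
an illustrative reading of the description, not the authors' tool: the function name is
hypothetical, the neighbor rule for replacement is not specified in the text and is approximated
here by trying every character of a caller-supplied alphabet, and Amharic sample words are used
instead of Arabic ones.

```python
def generate_candidates(word, lexicon, alphabet):
    """Collect lexicon words reachable by add, replace, remove or split."""
    found = set()
    # Add a missing character in every possible position.
    for i in range(len(word) + 1):
        for char in alphabet:
            candidate = word[:i] + char + word[i:]
            if candidate in lexicon:
                found.add(candidate)
    for i in range(len(word)):
        # Replace the character at position i (stand-in for the neighbor rule).
        for char in alphabet:
            candidate = word[:i] + char + word[i + 1:]
            if candidate != word and candidate in lexicon:
                found.add(candidate)
        # Remove an excessive character at position i.
        candidate = word[:i] + word[i + 1:]
        if candidate in lexicon:
            found.add(candidate)
    # Add a space in every possible position; keep splits with two valid words.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            found.add(left + " " + right)
    return found


if __name__ == "__main__":
    lexicon = {"ኢትዮጵያ", "በሬ", "ቤት"}  # hypothetical sample lexicon
    print(generate_candidates("ኢዮጵያ", lexicon, "ትተ"))
```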

2.2.5. Spelling Checker for Amharic Language

Different researchers have conducted studies on AMSPCH. Among these, a Hunspell based
typographic Amharic spell checker was developed by Getaneh Woldeyesus at the Graduate School of
Telecommunications and Information Technology, Ethiopia. This work mainly focuses on how
Hunspell, an open source spell checker and morphological analyzer library originally developed
for Hungarian, can be used for typographic spell checking of uninflected Amharic words. The
research used the Hunspell model as a solution for Amharic spelling error detection and
correction [4].

First, the study generates typographic errors for the Amharic language. The error generation
technique works by selecting words that have three or more characters from the lexicon and then
randomly selecting a position at which to start the error generation.
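
A minimal sketch of such random error generation is given below. Only the length filter and the
random position come from the description above; the choice among four edit operations, the
alphabet parameter, the seed and the function name are assumptions made for this example.

```python
import random


def generate_errors(lexicon, alphabet, seed=0):
    """Return (erroneous word, original word) pairs for words of 3+ characters."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    pairs = []
    for word in lexicon:
        if len(word) < 3:
            continue  # only words with three or more characters are used
        pos = rng.randrange(len(word))
        operation = rng.choice(["insert", "delete", "substitute", "transpose"])
        if operation == "insert":
            error = word[:pos] + rng.choice(alphabet) + word[pos:]
        elif operation == "delete":
            error = word[:pos] + word[pos + 1:]
        elif operation == "substitute":
            error = word[:pos] + rng.choice(alphabet) + word[pos + 1:]
        else:
            pos = min(pos, len(word) - 2)  # keep the swapped pair inside the word
            error = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
        pairs.append((error, word))
    return pairs


if __name__ == "__main__":
    for error, original in generate_errors(["በሬ", "ኢትዮጵያ", "ወንበር"], "አበ"):
        print(error, "<-", original)
```

A substitution or transposition can occasionally reproduce the original word; a fuller
generator would retry in that case.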

In the implementation part of this work, Hunspell was modified by removing Hungarian specific
function calls, removing capitalization checking, and truncating the affixation rules. In
addition, a new word list was generated from the original word list to exclude inflected words.

The research used accuracy, performance, precision, and recall to evaluate the proposed spell
checker for the Amharic language. Accuracy measures how well the prototype suggests corrections
for the generated errors. Performance measures the efficiency of the prototype in terms of the
time it takes to generate the correct suggested word. Precision and error recall measure the
number of correct suggestions in the total number of spelling suggestions [4].

The work is done on uninflected Amharic words, and the errors are generated randomly. First,
randomly generated errors do not reflect the actual trends of spelling errors in Amharic text
writing. Second, Amharic is a Semitic language with complex morphology, which implies that
morphological analysis must be considered in developing Amharic spell checkers.

For these reasons, much more effort is needed to study the spelling errors that occur in
Amharic text writing and to develop spell checking for inflected Amharic words, especially
internally inflected Amharic words.

On the other hand, [5] describes Amharic words and their inflections, such as the inflection of
nouns, verbs and adjectives, and Shewangizaw [4] identified and studied Amharic error patterns
and listed the affixes of Amharic words for each Amharic part of speech. The works of Mekonnen
[5], Gaddisa [8] and Shewangizaw [4] were implemented on the framework of Hunspell, which is the
default OpenOffice spell checker. The spell checker of Shewangizaw [4] checks Amharic text
written in another word processor by manually copying and pasting it into OpenOffice. Its
dictionary file and affix file use Latin script, with transliteration components that convert
Amharic text to Latin and Latin back to Amharic, and internal inflection is not considered.

In Mekonnen’s [5] study, root words are stored in the dictionary and prefixes and suffixes in
the affix rules, without transliteration. In both [4] and [5] the following were not addressed:

• Management of internal inflection. For example, ቆራረጠ is derived from the root word ቆረጠ: ራ is
added inside the word ቆረጠ before the character ረ.
• Real word error checking and correction based on contextual information.
• Auto correction, extra space removal and repeated word removal.
• Font independent Amharic spell checking. The previous works depend on Unicode and do not
support other Amharic fonts.
• Exhaustive prefix and suffix rules.

So in this study, internal inflection of words, compound words, repeated words, prefixes and
suffixes are handled exhaustively.

CHAPTER THREE

DESIGN AND DEVELOPMENT OF AMHARIC LANGUAGE SPELLING CHECKER

3.1. Amharic Language Spelling Checking

3.1.1. Amharic Language Inflection

As stated in Getahun [26], Amharic parts of speech (POS) are categorized into different classes,
namely Noun, Verb, Adjective, Preposition, and Adverb. These classes can be inflected by number,
gender, case, definiteness, pronoun, tense, and person [5].

In this study we investigate how root words are inflected and develop rules based on the
investigation.

Inflection of nouns
Amharic nouns can be inflected by number, gender, case and definiteness. Nouns are inflected by
adding affixes and by reduplication [2, 18]. Inflection for number is done by adding one of two
suffixes, “ዎች” or “ኦች”. By adding the suffix “ዎች” the word “በሬ” is inflected to “በሬዎች”, and by
adding the suffix “ኦች” the word “ዶክተር” is inflected to “ዶክተሮች”. Examples are shown in Tables 3.1
and 3.2.

Table 3.1 Inflection of nouns by adding suffix “ኦች”

Singular (ነጠላ ቁጥር)    Root + suffix    Plural (ብዙ ቁጥር)
ዶክተር                  ዶክተር-ኦች          ዶክተሮች
አያት                   አያት-ኦች           አያቶች
ቤት                    ቤት-ኦች            ቤቶች
ወንበር                  ወንበር-ኦች          ወንበሮች
ፍየል                   ፍየል-ኦች           ፍየሎች
በግ                    በግ-ኦች            በጎች
እግር                   እግር-ኦች           እግሮች
እስስት                  እስስት-ኦች          እስስቶች

Table 3.2 Inflection of nouns by adding suffix “ዎች”

Singular (ነጠላ ቁጥር)    Root + suffix    Plural (ብዙ ቁጥር)
ገበሬ                   ገበሬ-ዎች           ገበሬዎች
እርሻ                   እርሻ-ዎች           እርሻዎች
ሸማኔ                   ሸማኔ-ዎች           ሸማኔዎች
ነጋዴ                   ነጋዴ-ዎች           ነጋዴዎች
በሬ                    በሬ-ዎች            በሬዎች
ተማሪ                   ተማሪ-ዎች           ተማሪዎች
Another technique for the derivation or inflection of nouns is reduplication, which is done by
repeating the word itself with some modification. The final character, typically in its sixth
order form, is converted into its fourth order form, and the word is then repeated. Examples
are shown in Table 3.3.

Table 3.3 Inflection of nouns by reduplication

Singular (ነጠላ)    Plural (ብዙ)
ግርድ               ግርዳግርድ
ጌጥ                ጌጣጌጥ
ትል                ትላትል
ብረት               ብረታብረት
ጥሬ                ጥራጥሬ
ሸቀጥ               ሸቀጣሸቀጥ
ጨርቅ               ጨርቃጨርቅ
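
The reduplication pattern of Table 3.3 can be sketched as follows. The function name is
hypothetical, and the code assumes the Unicode Ethiopic layout in which the fourth-order form of
a consonant sits three code points after the first-order form of its row.

```python
ETHIOPIC_BASE = 0x1200  # start of the Unicode Ethiopic block


def reduplicate(noun):
    """Convert the final character to fourth order, then repeat the noun."""
    last = ord(noun[-1])
    row_start = last - (last - ETHIOPIC_BASE) % 8  # first-order form of the row
    fourth_order = chr(row_start + 3)
    return noun[:-1] + fourth_order + noun


if __name__ == "__main__":
    for noun in ["ጌጥ", "ትል", "ብረት", "ጥሬ"]:
        print(noun, reduplicate(noun))
```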

Amharic nouns can also be inflected for gender by adding “ኢት”: the word “በግ” is inflected to
“በግ-ኢት” and “ልጅ” to “ልጅ-ኢት”. Another form of noun inflection is based on case, which concerns
the usage of the word in a sentence, such as subject and object. It is done by adding suffixes
such as “-ን”, “-ኤ” and “-ህ”. For example, by adding “-ን” to the word “ልጅ” we get “ልጁን”, and by
adding “-ኤ” (“ልጅ-ኤ”) we get “ልጄ”. The last form of noun inflection is based on definiteness,
which is done by adding the suffixes “ኢቱ”, “ዉ”, “ኡ”, “ዋ” and “ይቱ” [26].

Inflection of Verbs

Verbs have affixes that show the subject and object of a sentence [5]. These affixes include
“ሽ”, “ች”, “ህ”, etc. Verbs are inflected by person, gender, number and tense. For example, the
affix “ሽ” in “ሄድሽ” shows second person, singular, feminine, past tense; similarly, the affix
“ህ” in “ሄድህ” shows second person, singular, masculine, past tense. As stated in [5], verbs are
the most inflected part of speech in the Amharic language. As shown in the first column of
Tables 3.4, 3.5 and 3.6, verbs in the perfect tense form, or verbs that indicate third person
singular masculine, are considered root words in this study. All root verbs are inflected to
the compound imperfect, gerund, contingent and infinitive forms. The compound imperfect form is
derived from the root word by adding affixes such as ይ--አል, ት--አለች, ይ--አሉ, ን--አለን, ት--አላችሁ,
ት--አለህ, ት--አለሽ, etc. The gerund form of a verb is obtained by adding the suffix ኦ. The
contingent and infinitive forms of a verb are obtained by adding ይ-- and መ respectively, as
shown in Table 3.4. Tables 3.5 and 3.6 show the affixes used in transitive verbs and in the
negation of verbs. Table 3.7 shows the internal inflection of words.

Table 3.4 Inflection of verbs

Root    Compound imperfect (ይ--አል)    Gerund (--ኦ)    Contingent (ይ--እ)    Infinitive (መ--እ)
ሄደ      ይሄዳል                          ሄዶ              ይሂድ                  መሄድ
ቆረጠ     ይቆርጣል                         ቆርጦ             ይቁረጥ                 መቁረጥ
ሮጠ      ይሮጣል                          ሮጦ              ይሩጥ                  መሮጥ

Table 3.5 Inflection of transitive

Root    አስ---     ተ---
ገደለ     አስገደለ     ተገደለ
በላ      አስበላ      ተበላ
በላች     አስበላች     ተበላች

Table 3.6 Negative inflections of verbs

Root     አት--ም      አል--ም      አይ--ም      አን--ም
ተማረች     አትማርም      አልማርም      አይማርም      አንማርም
በላች      አትበላም      አልበላም      አይበላም      አንበላም
ገደለ      አትገድልም     አልገድልም     አይገድልም     አንገድልም
በላ       አትበላም      አልበላም      አይበላም      አንበላም

Table 3.7 Internal inflections

Root     Inflection
ገጠመ      ገጣጠመ
ቆረጠ      ቆራረጠ
ሰበረ      ሰባበረ
ቀመሰ      ቀማመሰ
ቆመጠ      ቆማመጠ
ከተፈ      ከታተፈ
ገመጠ      ገማመጠ
ደረበ      ደራረበ
ከመረ      ከማመረ
ሰነጠቀ     ሰነጣጠቀ
ገነጠለ     ገነጣጠለ
ገለጠ      ገላለጠ
መረጠ      መራረጠ
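
Every row of Table 3.7 follows the same pattern: the fourth-order form of the second-to-last
character is inserted immediately before that character. The sketch below encodes only this
observed pattern. It is not the set of internal inflection rules developed in this work, the
function name is hypothetical, and it again assumes the eight-per-row Unicode Ethiopic layout.

```python
ETHIOPIC_BASE = 0x1200  # start of the Unicode Ethiopic block


def internal_inflection(verb):
    """Insert the fourth-order form of the penultimate character before it."""
    if len(verb) < 2:
        raise ValueError("the verb needs at least two characters")
    penultimate = ord(verb[-2])
    row_start = penultimate - (penultimate - ETHIOPIC_BASE) % 8
    fourth_order = chr(row_start + 3)
    return verb[:-2] + fourth_order + verb[-2:]


if __name__ == "__main__":
    for verb in ["ቆረጠ", "ሰበረ", "ሰነጠቀ"]:
        print(verb, internal_inflection(verb))
```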

Inflection of Adjectives

Adjectives are inflected by number, case and definiteness. The inflection of adjectives by
number, case and definiteness is similar to the inflection of nouns. For example, by adding ኦች
the words ጅል, ብልህ, ጠባብ and ጎበዝ are inflected to ጅሎች, ብልሆች, ጠባቦች and ጎበዞች respectively.

3.1.2. Amharic spelling error patterns

In addition to the Amharic spelling error patterns presented by Shewangizaw [4] and Daniel [27],
compound words, abbreviations and mistyping are identified as sources of spelling error
variations for the Amharic language. These sources of error variation, together with the trends
of spelling errors in Amharic, are presented below.

Compound words

The Amharic writing system uses different techniques for writing compound words, and there is
no standard for whether a compound is written as two separate words or as a single word [5]. As
a result, Amharic words with the same meaning are written in different ways. For example, it is
not clear which of these forms is right to use: ሰዉ ሰራሽ or ሰዉ-ሰራሽ; ወፍ ዘራሽ or ወፍ-ዘራሽ; እጅ አዙር or
እጅ-አዙር; ወጥቤት or ወጥ-ቤት; አየርወለድ or አየር-ወለድ; አጥር ግቢ or አጥር-ግቢ. So in our study we use the compound
word writing system of Getahun [26], which concatenates the two words using a hyphen. Examples:
አጥር-ግቢ, ልብስ-ሰፊ, ሰዉ-ሰራሽ, እንጀራ-ጋጋሪ, ሰርቶ-አዳሪ, አዉቆ-አበድ, ሰርጎ-ገብ, መንፈቀ-ሌሊት, ቤተ-ክርስተያን.

Abbreviations

In the English language, words are commonly written in abbreviated forms, for example Dr., Mr.,
etc. Similarly, the Amharic language allows different abbreviated forms for a single word or
phrase, which can be a source of Amharic spelling error variation. For example, the phrase
ጠቅላይ ሚኒስትር can be abbreviated as ጠ/ሚ or ጠ/ሚኒስትር, ዶክተር can be written as ዶ/ር, and ክፍለከተማ can
be written as ክ/ከተማ. Therefore, these kinds of words should be handled by a spell checker
application when a user enters any of them.

Syllographic redundancy

Most Amharic vocabulary originates from the Geez language [4]. However, Amharic does not
preserve Geez’s phonology even though it keeps Geez’s symbols for some characters. As a result,
some characters have the same pronunciation while each has its own set of orders. Such issues
are inherent in Amharic symbol redundancy and need to be addressed in the spell checking
process. Example: “አለምፀሀይ” and “አለምጸሀይ” for “ዓለምፀሐይ”.
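
One way to address this redundancy is to normalize same-sounding characters to one canonical
form before comparing words. The sketch below is an assumption for illustration, not the method
of this work: the homophone rows it merges are a commonly used grouping that the text itself
does not list, the direction of each mapping is arbitrary, and the function name is
hypothetical.

```python
# (start of variant row, start of canonical row) in the Unicode Ethiopic block.
HOMOPHONE_ROWS = [
    (0x1210, 0x1200),  # HHA row (ሐ) -> HA row (ሀ)
    (0x1280, 0x1200),  # XA row (ኀ) -> HA row (ሀ)
    (0x1220, 0x1230),  # SZA row (ሠ) -> SA row (ሰ)
    (0x12D0, 0x12A0),  # PHARYNGEAL A row (ዐ) -> GLOTTAL A row (አ)
    (0x1340, 0x1338),  # TZA row (ፀ) -> TSA row (ጸ)
]
# Fourth-order forms assumed to sound like the first-order form of their row.
SAME_SOUND_ORDERS = {0x1203: 0x1200, 0x12A3: 0x12A0}  # HAA -> HA, GLOTTAL AA -> GLOTTAL A


def normalize(word):
    """Replace redundant same-sounding characters by a canonical character."""
    result = []
    for char in word:
        code = ord(char)
        for variant, canonical in HOMOPHONE_ROWS:
            if variant <= code < variant + 7:  # the seven basic orders of the row
                code = canonical + (code - variant)
                break
        code = SAME_SOUND_ORDERS.get(code, code)
        result.append(chr(code))
    return "".join(result)


if __name__ == "__main__":
    print(normalize("ዓለምፀሐይ") == normalize("አለምፀሀይ") == normalize("አለምጸሀይ"))
```

With such a normalization, the three spellings in the example above compare as equal, so a
lexicon can store one form and still accept the others.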

Glypheme misidentification

This source of errors arises from the visual similarity of some Amharic characters. Most of the
time, the characters ‘ው’ and ‘ዉ’, ‘ፖ’ and ‘ፓ’, ‘ዪ’ and ‘ዩ’, ‘ጕ’ and ‘ጒ’, and ‘ቁ’ and ‘ቍ’ are used
interchangeably in the Amharic writing system. Because the characters look alike, users may
simply choose the form that is easiest to write by hand or type into a computer. Writing “ነዉ”
instead of “ነው” is an example of a glypheme misidentification error.

False Geezims

This type of source of variations in Amharic words occurs due to inserting the wrong letter,
mostly characters that have to be silent. Example: “ምልክት” vs. “ምልእክት”.

Assimilation and Alternations

It is common in Amharic that, ‘ም’ may be exchanged for ‘ን’ before ‘በ’, as in “ሽንብራ” vs
“ሽምብራ”. This is one source of spelling variation in Amharic writing system.

Foreign language transcription

Amharic has some words that are taken from other languages, very often technology terms. Until
some convention emerges, there will be conflicting ways of transcribing those words. Example:
“ኮምፒዩተር” vs. “ኮምፒውተር”.

Dialect variation

Regional dialects can also affect word formation at the basic level, where words are more
likely to be written following their spoken form. “ሆመጠጠ” vs “ኮመጠጠ”, “ሂጂ” vs “ሂጅ”, “አይዶለም” vs
“አይደለም” and “ዓጤ” vs “ዓፄ” are some examples.

Mistyping

It is common to type a misspelled word in place of the correct one during document
processing. Two common reasons for mistyping Amharic words are that a keyboard has fewer
keys than the script has symbols, and that different Amharic word processing tools use
different keyboard layouts for entering Amharic words. In phonetic-based input methods,
mistyping also comes from “shift-slip”, for example “ቴና” for “ጤና” [4].

3.1.3. Affix Rules Development

As described in Section 2.1.3.1, the drawback of storing all forms of words in the lexicon
is that a lexicon containing all correct words would be extremely large. It would need more
space and slow down searching, and in practice it is impossible to list all correct words.
To minimize this problem we stored only root words in the lexicon. An input word received
from the input component is checked against the lexicon by considering both root words and
the inflected words described by the affix rules.

During affix rule development, lists of prefixes, infixes and suffixes were collected
manually from different documents.

The dictionary is built from 24649 words, and the affix rules are built from 60 prefixes,
1752 suffixes, and 86 rules for internally inflected words. Sample prefixes, infixes and
suffixes used in this work are listed in Appendix C.

The identified prefix, infix and suffix lists then need to be categorized so that they can
be attached to each lexicon entry. According to [26], Amharic parts of speech fall into
five major classes, namely Noun, Verb, Adjective, Pronoun, and Adverb. Nouns can be
inflected for number, gender, case and definiteness. Verbs can be inflected for person,
gender, number, mood, and tense. Adjectives are inflected for number, case, and
definiteness. Based on this, we categorized the identified suffix lists and gave each
category a unique identifier.

After the suffix and prefix lists are categorized, rules are needed that indicate which
affixes a given word may take. For example, the suffix “-ዎች” is allowed for the word በሬ.
Rules were developed to handle such cases.

3.1.4. Dictionary development

The Amharic root word dictionary is compiled from different sources. Amsalu Aklilu [28],
the Concise Amharic Dictionary (Amharic to English and English to Amharic) [29],
ጌታሁን አማረ [26] and ባየ ይማም [20] are taken as base dictionaries because they give the part
of speech for many words, compound words and phrases.

While developing the dictionary, this study uses the following steps:

• Remove inflected words from the dictionary.

• Remove phrases made of two or more words.

• Add some verbs that are not available in the Amsalu Aklilu dictionary or in the
Concise Amharic Dictionary (Amharic to English and English to Amharic).

• Add country names and common person names.

• Normalize dictionary entries.

• Append rules to each word in the dictionary, partially based on Amharic part of
speech.

• Produce a text file consisting of a list of Amharic words, one per line.

As shown in Figure 3.1, words are listed one per line, each followed by the affix rule
identifiers that apply to that word. The first line gives the approximate number of
words. Every word in the dictionary is followed by a forward slash and zero or more flag
identifiers. The output is a file (am_ET.dic) with the .dic extension, which is the input
to the Amharic spell checker program.

6

ሱቅ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN

ሰበረ/MKIMINPOHN

ዜማ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN

ሰለለ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN

ሰገረ/MKNIMINASAGAWANMWMOCWCPOAUAEETNNO2WMWN

ቢሮ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN

Figure 3.1 Sample list of dictionary

As can be seen in Figure 3.1, the first entry, the number 6, indicates the estimated
number of root words in the dictionary. The Amharic words starting from the second line
are the root words. The slash after each Amharic word marks the end of the root word and
the beginning of the rule identifiers. All characters after the slash are rule
identifiers defined in the affix file. For example, the rule identifier MK points to affix
rules that attach the verb prefixes ለ, በ, ከ, የ, ስለ, እንደ, ያል and እስከ.
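
The entry format described above can be read with a few lines of code. The following is a minimal sketch, not the thesis implementation: it assumes that every rule identifier is exactly two characters long (as MK is), and the names `parse_dic_entry` and `load_dic` are hypothetical.

```python
def parse_dic_entry(line, flag_width=2):
    """Split one .dic line into (word, [rule identifiers]).

    flag_width=2 is an assumption: identifiers such as MK are two characters.
    """
    word, _, flag_str = line.strip().partition("/")
    flags = [flag_str[i:i + flag_width] for i in range(0, len(flag_str), flag_width)]
    return word, flags


def load_dic(lines):
    """Build {word: flags} from the lines of a .dic file."""
    it = iter(lines)
    next(it, None)  # the first line holds the approximate number of words
    return dict(parse_dic_entry(line) for line in it if line.strip())


# Two-entry sample: one word with rule identifiers, one without any.
sample = ["2", "ሰበረ/MKIMINPOHN", "በሬ"]
lexicon = load_dic(sample)
```

A word without a slash simply gets an empty identifier list, which matches the "zero or more flag identifiers" wording.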

The dictionary is built from 24649 words, and the affix rules are built from 60 prefixes,
1752 suffixes, and 86 rules for internally inflected words.

3.1.5. Lexicon lookup

A lexicon lookup implemented as a long linked list may have to go all the way to the end
of the list. Checking every element for equality with a given input word is inefficient
and slow, especially for a large lexicon. Hence, a hash table dictionary lookup method was
implemented for the lexicon lookup process. In this work, we adopted the dictionary lookup
algorithm developed by the author of Hunspell [30].

The hash table lookup first calculates a hash value for the given string. This value is
obtained by manipulating the bytes of the string [6] and is then used to locate the entry
in the table.
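
The idea can be illustrated with a small hash table built by hand. This is a sketch under stated assumptions: the byte-folding function below is a generic textbook hash, not the function used by Hunspell, and `string_hash` and `HashLexicon` are hypothetical names.

```python
def string_hash(word, table_size):
    """Fold the UTF-8 bytes of a word into a bucket index (illustrative only)."""
    h = 0
    for byte in word.encode("utf-8"):
        h = (h * 31 + byte) % table_size
    return h


class HashLexicon:
    """Lexicon stored in buckets so a lookup scans one short bucket only."""

    def __init__(self, words, table_size=1024):
        self.table_size = table_size
        self.buckets = [[] for _ in range(table_size)]
        for word in words:
            self.add(word)

    def add(self, word):
        bucket = self.buckets[string_hash(word, self.table_size)]
        if word not in bucket:
            bucket.append(word)

    def __contains__(self, word):
        return word in self.buckets[string_hash(word, self.table_size)]


lexicon = HashLexicon(["ሱቅ", "ቢሮ", "ዜማ"])
found = "ሱቅ" in lexicon
```

Compared with a linked list, only the words that fall into the same bucket are compared with the input word.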

3.2. Design of Amharic Spelling Checker (AMSPCH)

The design of a spell checker has to incorporate the general features of a spell checker
and language-specific components for the target language. In this section, we discuss the
general and language-specific requirements of the designed Amharic spell checker and the
issues addressed in the design of AMSPCH.

To perform spell checking, we first need to read the input words and present them as
tokens; hence, we have introduced an input component. The other components of our spell
checker are the normalization component, the error detection component, the morphological
analyzer, and the error correction and suggestion component. All of these components are
briefly discussed in Section 3.2.2.

3.2.1. Design Requirements

Lexicon lookup speed, the choice of technique for detecting and correcting spelling
errors, and storage requirements are the general factors to be considered in designing a
spell checker. Besides these general requirements, one has to consider language-specific
features. Our spell checker takes into account the following features of the Amharic
language that affect the spell checking process.

Morphological variants of words

Amharic is a morphologically rich language. As discussed earlier, one of the tasks in
building a spell checker is developing a lexicon. There are two options. The first is to
store all forms of words in the lexicon and look words up in this list. The second is to
store only root words and create affix rules or an algorithm that validates all acceptable
(inflected) words of the language. The first option has two problems: performance, and
the difficulty of collecting all word forms of the language. Developing affix rules
(prefix, infix, and suffix rules) avoids these problems [6], so in this study we use the
second option.

Encoding issue

Previously, Amharic electronic documents were produced with mostly incompatible software
based on different encoding systems. However, vendors of Amharic word processing software
have recently started to use Unicode, and Unicode appears to be the preferred encoding for
Amharic documents. This study focuses on Amharic documents written using Unicode
encoding.

3.2.2. Architecture of the Amharic Spell Checker

Figure 3.2 depicts the architecture of the Amharic spelling checker and presents its
components.

Our spell checker is designed to check whether a given Amharic word is correctly typed and
to give suggestions for incorrectly typed words. To achieve this goal, five components
are introduced.

The components are:

• Input Component
• Normalization Component
• Error Detection Component
• Morphological Analyzer Component
• Error Correction and Suggestion Component

Of these five components, the Normalization and Morphological Analyzer components are
language specific and must address the characteristics of Amharic that are relevant to the
spell checking process.

Figure 3.2 Architecture of Amharic spell checker adopted from [3].

Input Component

The input component is responsible for reading characters from OpenOffice and tokenizing
them. When the user types a word, the input module reads the characters one by one from
the OpenOffice word processor. If the user presses the space bar or one of the punctuation
marks shown in Table 3.8, or pastes an electronic document, the input component tokenizes
the text. If the text contains visually or syllographically redundant characters, the
input module replaces them with their predefined representatives. After tokenizing the
text, it passes each input word to the Normalization component for further processing.
The algorithm for this component is presented in Figure 3.3.

Table 3.8 Amharic punctuation marks

Word        Period   Comma   Colon   Semicolon   Preface   Question   Exclamation
separator                                        colon     mark       mark
:           ::       ፣       ፤       ÷           :-        ?          !

Begin
1: Make token empty
2: Read a character
   2.1 If the character is one of the Amharic punctuation marks, call the
       Normalization module
       Else if the character is end of file, Exit
       Else append the read character to the token
3: Move pointer to the next character, then go to step 2

Figure 3.3 Algorithm for Input component adopted from [3]
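
The algorithm above can be sketched in Python as follows. This is an illustration under assumptions, not the code of the prototype: the separators are taken to be white space plus the Unicode Ethiopic punctuation code points U+1361 to U+1367 and the ASCII marks `?` and `!`, and instead of calling the Normalization module at each boundary the hypothetical function `tokenize` simply collects the tokens in a list.

```python
# Assumed separator set: Ethiopic wordspace, full stop, comma, semicolon,
# colon, preface colon and question mark, plus ASCII ? and !
ETHIOPIC_MARKS = "\u1361\u1362\u1363\u1364\u1365\u1366\u1367"
SEPARATORS = set(ETHIOPIC_MARKS + "?!")


def tokenize(text):
    """Return the words of text, split at white space and punctuation marks."""
    tokens, token = [], []
    for ch in text:
        if ch.isspace() or ch in SEPARATORS:
            if token:  # a boundary closes the current token
                tokens.append("".join(token))
                token = []
        else:
            token.append(ch)
    if token:  # end of input closes the last token
        tokens.append("".join(token))
    return tokens


words = tokenize("ሱቅ\u1361ቢሮ\u1362 ዜማ")
```

Each returned token would then be handed to the Normalization component.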

Normalization Component

As discussed in Section 2.2.5, one source of spelling variation in Amharic is syllographic
redundancy, that is, the presence of redundant characters that can be used interchangeably
in Amharic words. A spell checker for such a language should be able to address this issue
by bringing such words into a common form. Hence, one use of this component is to apply a
normalization rule to input words that contain redundant characters.
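
A normalization rule of this kind can be written as a character translation table. The sketch below is only an illustration: the choice of which character in each redundant group is the representative is an assumption (the thesis does not list its mapping here), and `SERIES`, `TABLE` and `normalize` are hypothetical names. Each series covers the seven orders of a character, addressed by Unicode code point.

```python
# (first code point of redundant series, first code point of representative series)
SERIES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1338, 0x1340),  # ጸ series -> ፀ series
]
TABLE = {src + k: dst + k for src, dst in SERIES for k in range(7)}
# Assumed extra rule: the first and fourth orders of አ/ዐ sound alike.
TABLE[0x12A3] = 0x12A0  # ኣ -> አ
TABLE[0x12D3] = 0x12A0  # ዓ -> አ (overrides the series mapping above)


def normalize(word):
    """Replace every redundant character by its assumed representative."""
    return word.translate(TABLE)


variants = ["አለምፀሀይ", "አለምጸሀይ", "ዓለምፀሐይ"]
normalized = {normalize(w) for w in variants}
```

With this table the three spellings from Section 2.2.5 collapse to one form, so a single lexicon entry is enough, provided the lexicon is normalized in the same way.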

Error Detection Component

This component accepts the decomposed word from the Morphological analyzer component and
checks whether the returned word exists in the lexicon. If it does not, the error
detection component passes the non-word to the Error correction component so that the user
gets a list of suggestions.

The Amharic spell checker is a dictionary-based spell checker: misspelled words are
identified using a dictionary lookup algorithm. First, the token produced by the
Normalization module is looked up in the dictionary. If it exists, it is a root word and is
treated as correctly spelled. Otherwise, it is passed to the morphological analyzer to
check whether it is an inflected word. If the morphological analyzer can strip affixes from
the token, the resulting root word is passed back to the error detection module to check
its existence in the dictionary [5].
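
The detection flow can be summarized in a few lines. This is a simplified sketch: `is_correct` is a hypothetical name, the lexicon is a plain set, and `toy_analyzer` stands in for the morphological analyzer by stripping only the plural suffix “-ዎች”.

```python
def is_correct(token, lexicon, analyze):
    """Return True if token is a root word or reduces to one.

    lexicon -- set of root words
    analyze -- callable returning candidate roots for a token
               (placeholder for the morphological analyzer)
    """
    if token in lexicon:
        return True  # the token itself is a root word
    return any(root in lexicon for root in analyze(token))


def toy_analyzer(token):
    """Strip the plural suffix if present; a stand-in for real affix rules."""
    return [token[:-2]] if token.endswith("ዎች") else []


roots = {"በሬ"}
checks = [is_correct(w, roots, toy_analyzer) for w in ["በሬ", "በሬዎች", "በራዎች"]]
```

A full implementation would also verify that the rule identifiers stored with the root allow the stripped affix, which this sketch omits.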

Morphological Analyzer Component

As described in Section 2.2.6, Amharic is a morphologically complex language whose basic
units are mostly consonantal roots. As a result of this complexity, words of all classes
are highly inflected and a single word contains a lot of information.

This situation has to be addressed when designing a spell checker for such a language. In
other words, we need to accept all valid inflected words in addition to the root words.

Besides spell checking, morphological analyzers are used for information retrieval, POS
tagging, machine translation, etc. [4]. The task of our Morphological analyzer component
is to accept a word from the error detection component, decompose it into stem and affixes
based on predefined Amharic word formation rules, and then pass the resulting stem and
affixes to the error detection module.

This morphological analyzer is limited to inflectional morphology. However, it also
covers internal derivational morphology (for example ፈለገ into ፈላለገ, ገጠመ into ገጣጠመ). Words
of this kind are treated as internal inflection and rules are applied to them.

We adopted the morphological analysis methods used by Hunspell [30] for words flagged as
misspelled, first by developing word formation rules for the Amharic language. The
details of these rules are presented next.

Input: word I_Word from Error detection component
Output: list of affix and root words
Start
1. Scan the input word from right to left and from left to right to
   look for valid suffixes and prefixes
   For each valid suffix in I_Word, strip it and store the
   result in a buffer
   For each valid prefix in I_Word, strip it and store the
   result in a buffer
   // pass the list of affixes and stems to the error detection module
   Return root and affix

Figure 3.4 Algorithm for Morphological Analysis adopted from [8]
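
The stripping step of Figure 3.4 can be sketched as follows. The function name `analyze` and the tiny affix lists are assumptions for illustration; the real lists are the ones in Appendix C, and Hunspell applies its own affix handling.

```python
def analyze(word, prefixes, suffixes):
    """Return (prefix, stem, suffix) candidates for word.

    An empty string means that no prefix or suffix was stripped, so the
    unchanged word is always the first candidate.
    """
    results = []
    for p in [""] + [p for p in prefixes if word.startswith(p)]:
        rest = word[len(p):]
        for s in [""] + [s for s in suffixes if rest.endswith(s)]:
            stem = rest[:len(rest) - len(s)]
            if stem:  # never return an empty stem
                results.append((p, stem, s))
    return results


candidates = analyze("ለበሬዎች", prefixes=["ለ", "በ"], suffixes=["ዎች"])
```

The error detection module would accept the word if any candidate stem is in the lexicon and its rule identifiers permit the stripped affixes.
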
Error Correction and Suggestion Component

After an input word is flagged as a non-word, the spelling error has to be corrected: the
user is given a list of suggested words to select from. The error correction and
suggestion component was designed to accomplish this task. It takes a word from the Error
detection component, searches the lexicon for all possible corrections, and ranks the
resulting list of suggestions.

In this work, Levenshtein edit distance has been used for error correction and suggestion.
For details, see Appendix F.

The Levenshtein edit distance algorithm works with a threshold value that sets the maximum
distance at which a lexicon word is still offered as a suggestion. Shewangizaw [4] found
that single-error misspellings account for 88.8% of all errors.
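
A compact version of this approach is sketched below. It is not the code of Appendix F: the threshold value of 2, the ranking by distance alone, and the names `levenshtein` and `suggest` are assumptions made for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(word, lexicon, max_distance=2):
    """Return lexicon words within max_distance of word, closest first."""
    scored = [(levenshtein(word, w), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= max_distance]


suggestions = suggest("ቴና", ["ጤና", "ዜማ"], max_distance=1)
```

Because each Amharic syllable is one Unicode character, the shift-slip error “ቴና” for “ጤና” is a single substitution and falls within a threshold of 1.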

CHAPTER FOUR
EXPERIMENT, RESULT AND DISCUSSION OF AMHARIC SPELL
CHECKER

4.1. Introduction

This chapter describes the experiments in detail, based on the sources of error variation,
the error detection and correction techniques, and the spelling error trends in Amharic
documents.

4.2. Prototype

4.2.1. Input word processing

As stated in Section 3.2.2, the designed Amharic spell checker has an input component that
takes a text file as input and applies tokenization and normalization to generate words
for the error detection component.

Tokenization is the process of breaking up a given text into units called tokens. The
tokens may be words, numbers or punctuation marks. Tokenization can occur at a number of
different levels: paragraphs, sentences, words, syllables, or phonemes [31]. The process
requires knowledge of the word boundaries or punctuation marks and of the encoding of the
given language. In this work, only words in Unicode encoding are tokenized.

As discussed in Section 3.2.2, Amharic has its own punctuation marks that demarcate words,
sentences, etc. In electronic documents, however, white space rather than the Amharic word
separator is used to demarcate words. Therefore, tokenization of Amharic text is done by
considering white space together with all Amharic punctuation marks (i.e. word separator,
period, comma, colon, semicolon, preface colon, question mark, and exclamation mark).

4.2.2. Implementation of spell checker in Open Office using Hunspell

As discussed in Section 2.2.6, Hunspell is the default spell checker of OpenOffice.org. It
requires two files to define spell checking for a language. The first file is a lexicon
containing the words of the language (Amharic words in our case), and the second is an
affix file that defines the meaning of the special flags used in the lexicon. The affix
file contains the prefix and suffix rules associated with the words in the lexicon.

As shown in Figure 3.1, the lexicon file (am_ET.dic) contains a list of Amharic root
words. The first line of the lexicon contains the approximate number of entries in the
file. Each word may optionally be followed by a slash (“/”) and one or more flags, which
represent prefix, infix and suffix rules.

An affix file (*.aff) may contain many optional attributes. For example, SET sets the
character encoding of the affix and lexicon files, while PFX and SFX define prefix and
suffix classes, respectively, each named by an affix flag. The following example shows
the structure of a Hunspell affix file.

Affix file:

SET UTF-8

1. SFX OC Y 27
2. SFX OC 0 ዎች [^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ]
3. SFX OC ህ ሆች ህ
4. SFX OC ል ሎች ል
5. SFX OC ቅ ቆች ቅ
...
6. SFX OC ፕ ፖች ፕ

Figure 4.1 Sample affix rule

As shown in Figure 4.1, the characters ህ, ል, ቅ and ፕ in the third column are stripped from
the word before affixation. The same characters in the fifth column are conditions: the
rule applies only if the last character of the word is that character. The 0 in the third
column of line two indicates that nothing is stripped when ዎች is affixed. ሆች, ሎች, ቆች and
ፖች are the affixes that are attached when the specified condition is fulfilled.
[^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ] is a condition that checks that the last character is not one of
the listed sixth-order characters of the Amharic script. Figure 4.1 thus shows how to
define rules that derive the plural forms of nouns.
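
To make the three rule fields concrete, the sketch below applies strip, affix and condition to a word in the way the paragraph above describes. It is an illustration only: the rule list is reconstructed from that description and is an assumption, `apply_suffix_rules` is a hypothetical helper, and Hunspell itself evaluates affix rules internally rather than through Python regular expressions.

```python
import re

# (strip, affix, condition) triples, following the description of Figure 4.1
PLURAL_RULES = [
    ("", "ዎች", "[^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ]"),
    ("ህ", "ሆች", "ህ"),
    ("ል", "ሎች", "ል"),
    ("ቅ", "ቆች", "ቅ"),
    ("ፕ", "ፖች", "ፕ"),
]


def apply_suffix_rules(word, rules):
    """Return every form produced by a rule whose condition matches word."""
    forms = []
    for strip, affix, condition in rules:
        # a suffix condition is tested against the end of the word
        if re.search(f"(?:{condition})$", word):
            stem = word[:len(word) - len(strip)] if strip else word
            forms.append(stem + affix)
    return forms


plural_of_shop = apply_suffix_rules("ሱቅ", PLURAL_RULES)
plural_of_ox = apply_suffix_rules("በሬ", PLURAL_RULES)
```

For ሱቅ only the ቅ rule fires, and for በሬ only the zero-strip rule fires, so each noun gets exactly one plural form.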

Amharic verbs are more highly inflected than the other Amharic parts of speech. They are
inflected for person, tense, gender, and case. Some prefixes and suffixes of verbs depend
on each other. This dependency is controlled by the CIRCUMFIX command.

CIRCUMFIX XX
PFX EE Y 1
PFX EE 0 እን/XX
SFX TA Y 2
SFX TA 0 ለን/EEXX [ሃላማሳራሻቃባታቻናኛዛዣያዳጃጳጻፋፓ]
SFX TA ደ ዳለን/EEXX ደ

If ሄደ/TA and በላ/TA are in the dictionary, እንሄዳለን and እንበላለን are valid inflected words,
whereas ሄዳለን, በላለን, እንሰብረ and እንበላ are invalid and are marked as misspelled.

Amharic verbs also take subject and/or object indicator suffixes. For the root word ገደለ,
we get the inflected form ገደሉዋቸዉ by adding ዉ, which is a subject indicator, and ኣቸዉ, which
is an object indicator.

The rule identifiers and their corresponding affixes used in this work are listed in
Appendix B. Because of their complexity, Amharic verbs need exhaustive rules. The output
is a file named am_ET.aff with the .aff extension, which is an input to the error
detection and correction modules.

4.3. Experiment result and Discussion

This section presents the evaluation criteria for the prototype, followed by a description
of how the training and testing data was prepared. The results obtained are then presented
and discussed.

4.3.1. Evaluation Criteria

The system is evaluated to test its effectiveness. Different research works have proposed
various criteria for evaluating a spell checker. Shewangizaw [4], Gaddisa [8] and
Mekonnen [5] recommend recall, precision and suggestion adequacy for the evaluation of a
spell checker algorithm.

The performance of the system is evaluated using precision and recall. Precision can be
seen as a measure of exactness, whereas recall is a measure of completeness. For
information retrieval, precision and recall are defined in [5, 33] as follows. Precision
is the number of relevant documents retrieved by a search divided by the total number of
documents retrieved by that search. Recall is the number of relevant documents retrieved
by a search divided by the total number of existing relevant documents (which should have
been retrieved).
For statistical classification tasks, precision and recall are defined in [5, 33] in the
following way.

Precision for a class is the number of true positives divided by the total number of
elements labeled as belonging to the positive class [32]. The formula for calculating
precision is

\[ \text{Precision} = \frac{\sum \text{True Positive}}{\sum \text{True Positive} + \sum \text{False Positive}} \]

Recall in this context is the number of true positives divided by the total number of
elements that actually belong to the positive class [32]. The formula for calculating
recall is

\[ \text{Recall} = \frac{\sum \text{True Positive}}{\sum \text{True Positive} + \sum \text{False Negative}} \]

• True Positive: the spell checker identifies a correctly spelled word as correct.

• False Positive: the spell checker treats a misspelled word as correctly spelled.

• False Negative: the spell checker flags a correctly spelled word as incorrect.

• True Negative: the spell checker identifies a misspelled word as misspelled.
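
The two formulas translate directly into code. The sketch below uses hypothetical function names and, as sample input, the counts reported for Experiment 1 in Table 4.1 (TP = 193, FP = 0, FN = 1).

```python
def precision(tp, fp):
    """Precision in percent: share of accepted words that are truly correct."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Recall in percent: share of correct words that the checker accepts."""
    return 100.0 * tp / (tp + fn)


# Counts of Experiment 1 as given in Table 4.1
p1 = precision(tp=193, fp=0)
r1 = recall(tp=193, fn=1)
```

With these counts the precision is 100% and the recall is about 99.48%.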

The dataset is taken from regional reports, which were also used to study error trends in
Amharic texts. Five sets of test data were collected from summarized reports of the Amhara
National Regional State Science, Technology and Information Communication Commission
(ANRS STICC), the Afar Region, and the Harari Region. For each misspelled word, the
intended valid Amharic word was supplied manually. The evaluation result is presented in
Table 4.8.

4.3.2. Experiment

The experiments were conducted to measure the effectiveness of the Amharic spell checker.
As mentioned in Section 4.3.1, we used precision and recall to measure its accuracy. Five
experiments (1, 2, 3, 4 and 5) were done to evaluate the accuracy of the system.

I. Experiment 1

For this experiment, the text was taken directly from the 2009 annual report of the Amhara
Science, Technology and Information Communication Commission. The text was then checked
against the lexicon to evaluate the accuracy of the system.

The data for this experiment contains 199 Amharic words, of which 5 (2.5%) are misspelled.
The system detected all of the misspelled words, but it also marked one correct word as
misspelled.

All words in the sample data that the spell checker recognizes as misspelled are flagged
automatically. Figure 4.2 shows a screen shot of the output of the spelling checker on the
sample data, and Table 4.1 shows the result of Experiment 1.

Figure 4.2 Sample text screen shot of Experiment 1

Results of Experiment 1

The precision and recall obtained with the formulas above are shown below.

Table 4.1 Evaluation Result for Experiment 1

Results                Count
True positive (TP)     193
False negative (FN)    1
False positive (FP)    0
True negative (TN)     5

Precision = TP/(TP+FP) x 100 = 193/(193+0) x 100 = 100%
Recall    = TP/(TP+FN) x 100 = 193/(193+1) x 100 = 99.4%

II. Experiment 2

To see how the system performs on more data, we took a larger sample from the annual
report of the Amhara National Regional State Science, Technology and Information
Communication Commission. The sample data for this experiment contains 840 Amharic words,
of which 16 (1.9%) are misspelled. The system detected all of the misspelled words, but the
spelling checker also marked 20 (2.4%) correct words as misspelled. The evaluation result
of Experiment 2 is shown in Table 4.2.

All words in the sample data that the spell checker recognizes as misspelled are flagged
automatically. Screen shots of the output of the spelling checker on the sample data are
shown in Appendix D. As observed in Experiment 2, the number of false negatives increases
as the amount of test data increases.

Table 4.2 Evaluation Result for Experiment 2

Results                Count
True positive (TP)     804
False negative (FN)    20
False positive (FP)    0
True negative (TN)     16

Precision = TP/(TP+FP) x 100 = 804/(804+0) x 100 = 100%
Recall    = TP/(TP+FN) x 100 = 804/(804+20) x 100 = 97.6%

III. Experiment 3

For Experiment 3, the text was taken directly from the Afar Region ICT 2009 annual report.
It consists of 181 Amharic words, of which 7 (3.9%) are detected as misspelled. In
addition, three correct words are marked as misspelled.

The screen shot is presented in Figure 4.3. The evaluation result is presented in
Table 4.3.

Table 4.3 Evaluation Result for Experiment 3

Results                Count
True positive (TP)     171
False negative (FN)    3
False positive (FP)    0
True negative (TN)     7

Precision = TP/(TP+FP) x 100 = 171/(171+0) x 100 = 100%
Recall    = TP/(TP+FN) x 100 = 171/(171+7) x 100 = 96.1%

Figure 4.3 Sample text screen shot of Experiment 3

IV. Experiment 4

As in Experiments 1, 2 and 3, the text was taken directly from an annual report, here the
Harari Region ICT 2009 annual report. It consists of 94 Amharic words, of which 9 (9.5%)
are detected as misspelled. In addition, 6 correct words are marked as misspelled by the
system.

Figure 4.4 shows a screen shot of the output of the spelling checker on the sample data.
The precision and recall for this experiment are shown in Table 4.4 below.

Table 4.4 Evaluation Result for Experiment 4

Results                Count
True positive (TP)     79
False negative (FN)    6
False positive (FP)    0
True negative (TN)     9

Precision = TP/(TP+FP) x 100 = 79/(79+0) x 100 = 100%
Recall    = TP/(TP+FN) x 100 = 79/(79+9) x 100 = 89.8%

Figure 4.4 Sample text screen shot of Experiment 4

V. Experiment 5

As in Experiments 1 to 4, we used precision and recall to measure the accuracy of the
Amharic spell checker. To compare the output of the system with a manual check, the data
used for Experiment 4 was also evaluated by a language expert. The expert worked on the
same 94 words and identified 7 (7.4%) invalid or misspelled words, two fewer than the
system found in Experiment 4. The expert marked no correct word as misspelled, and two
misspelled words were not identified by the expert.

Expert Evaluation

To compare the accuracy of the system with a manual check, the same dataset was given to
the language expert. The expert identified 87 words as correct and 7 as misspelled, and
marked 0 correct words as misspelled.

The precision and recall for this experiment are shown in Table 4.5 below.

Table 4.5 Evaluation Result for Experiment 5

Results                Expert   System
True positive (TP)     87       79
False negative (FN)    0        6
False positive (FP)    2        0
True negative (TN)     7        9

Expert: Precision = TP/(TP+FP) x 100 = 87/(87+2) x 100 = 97.75%
        Recall    = TP/(TP+FN) x 100 = 87/(87+7) x 100 = 92.55%
System: Precision = TP/(TP+FP) x 100 = 79/(79+0) x 100 = 100%
        Recall    = TP/(TP+FN) x 100 = 79/(79+9) x 100 = 89.8%

As can be seen in the table, the results of the system and the expert differ. The reason
for the difference is examined in Experiment 6.

4.3.3. Discussion

Generally, the above experiments (1, 2, 3, 4, and 5) are summarized in Table 4.6 below.

Table 4.6 Experiment result summary

Experiment  Total      Total       Total detected  Total undetected  Correct words  Precision  Recall
            number     misspelled  misspelled      misspelled        marked as
            of words   words       words           words             misspelled
1           199        5           5               0                 1              100        99.4
2           840        16          16              0                 20             100        97.6
3           181        7           7               0                 3              100        96.1
4           94         9           9               0                 6              100        89.8
5           94         9           7               2                 0              97.75      92.55

The overall performance measure indicates how accurate a spelling checker is. It is
calculated using the following formula [33], and the results are shown in Table 4.7.

P = (Tp+Tn)/(Tp+Tn+Fp+Fn)
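
As a cross-check, the formula can be applied to the counts of the five experiments. This is a small sketch with hypothetical names; the counts are the TP, TN, FP and FN values listed in Table 4.7.

```python
def overall_performance(tp, tn, fp, fn):
    """Overall performance in percent: correct decisions over all decisions."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


# experiment number -> (TP, TN, FP, FN), copied from Table 4.7
COUNTS = {
    1: (193, 5, 0, 1),
    2: (804, 16, 0, 20),
    3: (171, 7, 0, 3),
    4: (79, 9, 0, 6),
    5: (87, 7, 2, 0),
}
scores = {exp: overall_performance(*c) for exp, c in COUNTS.items()}
average = sum(scores.values()) / len(scores)
```

Rounded to one decimal place, the average over the five experiments is 97.4, in line with the last row of Table 4.7.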

Table 4.7 Average performance calculated from overall performance of each Experiment

Experiment            TP    TN   FP   FN   Precision   Recall   Overall performance
1                     193   5    0    1    100         99.4     99.5
2                     804   16   0    20   100         97.6     97.62
3                     171   7    0    3    100         96.1     98.34
4                     79    9    0    6    100         89.8     93.62
5                     87    7    2    0    97.75       92.55    97.92
Average performance                                             97.4

Across all experiments, 1408 words were taken from different sources. Of these, 46 words
are misspelled: 44 misspelled words were detected, and 2 misspelled words went undetected
by the language expert. In addition, 30 correct words were marked as misspelled.

As shown in Table 4.7, the precision of the system is higher than that of the manual
check. The average recall and precision of the system in Experiments 1, 2, 3 and 4 are
95.75 and 100 respectively, compared with 92.55 and 97.75 for Experiment 5, which was
checked manually by the language expert. Based on these results, the system performs
better. The overall performance of the system is 97.27%.

The experiments show that the Amharic spell checker lacks completeness, as indicated by
the recall values in all experiments.

To find the reason for this lack of completeness, we randomly selected Experiment 4 from
all the experiments and tried to exhaust the affix rules in the experiment below.

VI. Experiment 6

We selected Experiment 4 at random. As presented in Experiment 4, the text was taken
directly from the Harari Region ICT 2009 annual report and consists of 94 words.

Figure 4.5 shows a screen shot of the output of the spelling checker after affix rules and
words were added to the lexicon. The precision and recall for this experiment are
presented in Table 4.8 below.

Table 4.8 Evaluation Result for Experiment 6

Results                Count
True positive (TP)     84
False negative (FN)    1
False positive (FP)    0
True negative (TN)     9

Precision = TP/(TP+FP) x 100 = 84/(84+0) x 100 = 100%
Recall    = TP/(TP+FN) x 100 = 84/(84+1) x 100 = 98.82%

The overall performance of the system based on Experiment 6 is

P = (Tp+Tn)/(Tp+Tn+Fp+Fn)
  = (84+9)/(84+9+0+1) x 100
  = 98.93%

Compared to Experiment 4, Experiment 6 shows better performance. As can be seen in Table
4.8, the number of false negatives is reduced from 6 to 1. Based on this experiment, the
reasons for the lack of completeness are:

1. The affix rules defined during development are not exhaustive.
2. Because of the complexity of the Amharic language, not all words are included in the
   dictionaries.

To enhance the performance of the system, both problems should be addressed.

Figure 4.5 Sample text screen shot of experiment 6

CHAPTER FIVE

CONCLUSIONS AND RECOMMENDATIONS

5.1. Conclusions

Document preparation is one of the main tasks in government and non-government
organizations. Spelling errors may occur when people use text processing applications.
Hence, text processing software has integrated spell checkers and grammar checkers for
some languages. For Amharic, however, such tools are not integrated. Thus, it is common to
find Amharic books and newsletters published with misspelled words. This research was done
to design and develop a spell checker tool for Amharic texts. It involved a study of the
spelling errors that can occur in Amharic writing and the development of an Amharic spell
checker. In addition, we adopted word formation rules for Amharic that can be integrated
with the lexicon used by the spell checker. This lexicon was compiled from ጌታሁን አማረ [26],
ባየ ይማም [20], the Concise Amharic Dictionary, a list of names, and a list of countries.

We demonstrated the Amharic spell checker by integrating it into OpenOffice.

The Amharic spell checker is integrated into the OpenOffice word processor in as-you-type
mode, and its design depends on word formation rules and a lexicon. It is a word-level
spell checker, specifically a non-word error detector. That is, it does not consider
real-word errors, grammatical errors or white space. It is a customized version of the
Hunspell spell checker, and its algorithms and architecture inherently depend on Hunspell.

In this work we added some new features that were not addressed in previous works. These
are support for internally inflected words and for the repeated words described by
previous researchers [3, 4], and a dictionary that does not require transliteration when a
token is accepted for spell checking or when suggestion lists are generated. Using Unicode
data is expected to improve the performance of spell checking by avoiding transliteration.
Finally, we measured the performance of the system in 5 experiments by calculating recall
and precision, and obtained an overall performance of 97.27%. Recommendations are given in
Section 5.2.

5.2. Recommendations

The following recommendations are made for further research and improvement.

• Amharic documents display real-word errors in addition to non-word spelling
  errors. Hence, there is a need for detection and correction of real-word errors
  that can occur in Amharic documents;
• Dialect variations, false Geezims, and assimilation and alternations are sources
  of error variation that are not handled in this thesis work. A method that
  handles these issues in the input component might improve performance;
• The performance of the spelling error detection and correction algorithm, which
  is edit distance, needs to be compared with other spelling error correction
  techniques;
• Integrating this work with other Amharic NLP works, such as:

  • Amharic search engine applications

  • Amharic speech synthesis applications

• Automatic spelling error correction and suggestion.

REFERENCES

[1] Shewangizaw Gulilat, DESIGN AND IMPLEMENTAION OF SPELL CHECKER FOR AMHARIC.
ADDIS ABABA, February, 2009.

[2] ANRS Plan Comission, "Development Indicator of Amhara National State," p. 83, 2017.

[3] (2017, September) 2007 census. [Online].
https://en.wikipedia.org/wiki/Amharic#cite_note-Lewis-2

[4] Mekonnen Fentaw, INTEGRATION OF AMHARIC SPELL CHECKER IN A WORD PROCESSOR.
Addis Ababa, 2009.

[5] (2009, ነሐሴ) ሜዲካል.

[6] Gaddisa Olani Ganfure and Dr. Dida Midekso, "Design And Implementation Of Morphology
Based Spell Checker," vol. 3, p. 8, December 2014.

[7] Bidyut B. Chaudhuri, "A simple real-word error detection and correction using local word
bigram and trigram," p. 10, 2013.

[8] Donald C. Comeau and W. John Wilbur, "Non-Word Identification or Spell Checking
Without a Dictionary," p. 9, Jan. 2004.

[9] Aminul Islam and Diana Inkpen, "Real-Word Spelling Correction using GoogleWeb 1T 3-
grams," p. 9, Aug. 2009.

[10] Biswajit Sarma, Debashree Goswami, and Geetoshree Goswami, "Assamese Spell
Checker Design and Implementation," p. 4, Feb. 2016.

[11] Tahira Naseem, "A Hybrid Approach for Urdu Spell Checking," p. 87, 2004.

[12] Karen Kukich, "Techniques for Automatically Correcting Words in Text," ACM Computing
Surveys, p. 63, 1992.

[13] Kyongho Min, William H. Wilson, and Yoo-Jin Moon, "TYPOGRAPHICAL AND
ORTHOGRAPHICAL SPELLING ERROR," p. 5.

[14] Baljeet Kaur and Harsharndeep Singh, "Design and Implementation of HINSPELL -Hindi
Spell Checker," vol. 3, p. 5, 2015.

[15] Harpreet Kaur and Navroop Kaur, "SPELL CHECKING AND ERROR CORRECTING SYSTEM
FOR TEXT PARAGRAPHS WRITTEN IN PUNJABI LANGUAGE USING HYBRID APPROACH," p. 4,
Feb. 2016.

[16] Ritika Mishra and Navjot Kaur, "A Survey of Spelling Error Detection and Correction
Techn," vol. 3, 2013.

[17] Hsuan Lorraine, "SPELL CHECKERS AND CORRECTORS," p. 119, Oct. 2008.

[18] ባየ ይማም, "DESIGN AND IMPLEMENTAION OF SPELL CHECKER FOR AMHARIC," የአማርኛ ሰዋሰው
የተሸሻለ ሁለተኛ እትም, p. 492, sep 2000.

[19] (2017, Aug.) wikipedia. [Online]. https://en.wikipedia.org/wiki/Ispell

[20] Taha Zerrouki, "Implementation of infixes and circumfixes in," p. 6, 2015.

[21] Chandan Raj Rupakheti and Chiranjivi Upreti, "Nepali Spell Checker Project".

[22] Nepal Dhulikhel, Nepali Spell Checker Project. Kathmandu University.

[23] Khan Md. Anwarus Salam, "Phonetic Bengali Input Method for Computer and Mobile".

[24] Naushad UzZaman, "A COMPREHENSIVE ROMAN (ENGLISH)-TO-BANGLA".

[25] Khaled Shaalan, Amin Allam, and AbdAllah Gomah, "Towards automatic spell checking for
Arabic," p. 9, 2003.

[26] Daniel Yacob, "Application of the Double Metaphone Algorithm to Amharic Orthography ".

[27] ጌታሁን አማረ, የአማርኛ ሰዋሰዉ በቀላል አቀራረብ.: EMPDA, 1989.

[28] Amsalu Aklilu, english to amharic Dictionary., 1974.

[29] Wole leslau,., p. 536.

[30] (2017, Aug.) Hunspell – spell checker, stemmer and morphological analyzer. [Online].
http://hunspell.github.io/

[31] (2008, Sep.) ccl.pku.edu.cn. [Online]. http://ccl.pku.edu.cn/doubtfire/NLP/lexical_analysis/

[32] Chi-Fu and Patrick, "Final Year Project: A Spell Checker For the ABC System," p. 87, Jan.
2012.

[33] Gerhard B. van Huyssteen, Roald Eiselen, and Martin Puttkammer, "Evaluating
Evaluation Metrics for Spelling Checker Evaluations," p. 9, 2002.

[34] Neha Gupta, "Spell Checking Techniques in NLP," International Journal of Advanced
Research in, vol. 2, p. 5, 2012.

[35] Jennifer Pedler, "Computer Correction of Real-word Spelling Errors," p. 239, 2007.

[36] (2017, 23) Levenshtein Distance in three flavours. [Online]. http://www.merriampark.com/

[37] Patrick Chi-Fu Chan. (2012, May) Final Year Project: A Spell Checker For the ABC System.

APPENDIX

Appendix A: Sample of Amharic words taken for experiment

Appendix B: Amharic alphabets with their seven orders

Appendix C: Prefix, Infix and suffix lists used in this thesis work.

SET UTF-8

LANG am_ET

FLAG long

REP 36

REP ሃ ሀ

REP ሐ ሀ

REP ሀ ሐ

REP ሀ ሃ

REP ሀ ሓ

REP ሀ ኀ

REP ሓ ሀ

REP ኄ ሀ

REP ዉ ው

REP ው ዉ

REP ዪ ይ

REP ይ ዪ

REP የ ዬ

REP ዬ የ

REP ጨ ጬ

REP ጬ ጨ

REP ጪ ጭ

REP ጭ ጪ

REP ቸ ቼ

REP ቼ ቸ

REP ች ቺ

REP ቺ ች

REP ሼ ሸ

REP ሸ ሼ

REP ሺ ሽ

REP ሽ ሺ

REP ኘ ኜ

REP ኜ ኘ

REP ኚ ኝ

REP ኝ ኚ

REP ጀ ጄ

REP ጄ ጀ

REP ጂ ጅ

REP ጅ ጂ

REP ወ ዎ

REP ዎ ወ

PFX MK Y 11

PFX MK 0 ለ

PFX MK 0 በ

PFX MK 0 ከ

PFX MK 0 የ

PFX MK 0 እነ

PFX MK 0 ስለ

PFX MK 0 እንደ

PFX MK 0 ከነ

PFX MK 0 ያለ

PFX MK 0 እስከነ

PFX MK 0 ንም

# Verb suffixes

PFX AH Y 3

PFX AH 0 አስ

PFX AH 0 ተ

PFX AH 0 አ

# Plural-forming suffixes ዎች and ኦች

SFX OC Y 27

SFX OC 0 ዎች [^ህልምስርሽቅብትችንኝክዝይድጅግጥጭጵፍፕ]

SFX OC ህ ሆች ህ

SFX OC ል ሎች ል

SFX OC ል ላዊ ል

SFX OC ም ሞች ም

SFX OC ስ ሶች ስ

SFX OC ር ሮች ር

SFX OC ሽ ሾች ሽ

SFX OC ቅ ቆች ቅ

SFX OC ብ ቦች ብ

SFX OC ት ቶች ት

SFX OC ች ቾች ች

SFX OC ን ኖች ን

SFX OC ኝ ኞች ኝ

SFX OC ክ ኮች ክ

SFX OC ው ዎች ው

SFX OC ዝ ዞች ዝ

SFX OC ይ ዮች ይ

SFX OC ድ ዶች ድ

SFX OC ጅ ጆች ጅ

SFX OC ግ ጎች ግ

SFX OC ጥ ጦች ጥ

SFX OC ጭ ጮች ጭ

SFX OC ጵ ጶች ጵ

SFX OC ጽ ጾች ጽ

SFX OC ፍ ፎች ፍ

SFX OC ፕ ፖች ፕ

# Infix or internally inflected words

PFX IM Y 32

PFX IM ገጠ ገጣጠ ገ

PFX IM ል ልማ ል

PFX IM ተ ተለማ ተ

PFX IM ተ አላ ተ

PFX IM ሊ ላ ሊ

PFX IM ለ አላ ለ

PFX IM ለ ተላ ለ

PFX IM ለ ለዋ ለ

PFX IM ለ ል ለ

PFX IM ለ ልዉ ለ

PFX IM መ መላ መ

PFX IM ቆ ቆራ ቆ

PFX IM ሰ ሰባ ሰ

PFX IM ከሰ ከሳሰ ከ

PFX IM ቆ ቆማ ቆ

PFX IM ከተ ከታተ ከተ

PFX IM ከተ ከታተ ከተ

PFX IM ገመ ገማመ ገመ

PFX IM ገመ ገማመ ገመ

PFX IM ገጠ ገጣጠ ገጠ

PFX IM ከረ ከራረ ከረ

PFX IM ከረ ከራረ ከረ

PFX IM ደረ ደራረ ደረ

PFX IM ከመ ከማመ ከመ

PFX IM ሰነጠ ሰነጣጠ ሰነጠ

PFX IM ገነጠ ገነጣጠ ገነጠ

PFX IM ገለ ገላለ ገለ

PFX IM መረ መራረ መረ

PFX IM አሰ አሳሰ አሰ

PFX IM ገደ ገዳደ ገደ

PFX IM ነከ ነካከ ነከ

PFX IM ለየ ለያየ ለየ

SFX IN Y 47

SFX IN ጠ ጥን ጠ

SFX IN ጠ ጠች ጠ

SFX IN ጠ ጣችሁ ጠ

SFX IN ጠ ጡ ጠ

SFX IN ረ ረች ረ

SFX IN ረ ሩ ረ

SFX IN ረ ራችሁ ረ

SFX IN ረ ርን ረ

SFX IN ረ ሩ ረ

SFX IN ፈ ፉ ፈ

SFX IN ፈ ፍን ፈ

SFX IN ፈ ፋችሁ ፈ

SFX IN ፈ ች ፈ

SFX IN ፈ ፍሁ ፈ

SFX IN ፈ ፍሁ ፈ

SFX IN መ ሙ መ

SFX IN መ ምን መ

SFX IN መ ምሁ መ

SFX IN መ መች መ

SFX IN መ ማችሁ መ

SFX IN በ ባችሁ በ

SFX IN በ ብን በ

SFX IN በ ቡ በ

SFX IN በ በች በ

SFX IN ቀ ቁ ቀ

SFX IN ቀ ቀች ቀ

SFX IN ቀ ቃችሁ ቀ

SFX IN ቀ ቅን ቀ

SFX IN ለ ሉ ለ

SFX IN ለ ላችሁ ለ

SFX IN ለ ልን ለ

SFX IN ለ ለች ለ

SFX IN ለ ልሁ ለ

SFX IN ሰ ሱ ሰ

SFX IN ሰ ሰች ሰ

SFX IN ሰ ሳችሁ ሰ

SFX IN ሰ ስን ሰ

SFX IN ነ ነች ነ

SFX IN ነ ኑ ነ

SFX IN ነ ናችሁ ነ

SFX IN ነ ን ነ

SFX IN ነ ንሁ ነ

SFX IN የ ዩ የ

SFX IN የ የች የ

SFX IN የ የን የ

SFX IN የ የሁ የ

SFX IN የ ያችሁ የ
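
To make the rule format above concrete, the following is a minimal sketch of how a single suffix rule of the form `SFX flag strip add condition` could be applied to a stem. It is an illustration only, not Hunspell's implementation: the function name `apply_suffix` is hypothetical, the condition is treated as a regular-expression fragment matched at the end of the stem, and the example stems ክፍል and ተማሪ are assumed for illustration.

```python
import re
from typing import Optional


def apply_suffix(stem: str, strip: str, add: str, condition: str) -> Optional[str]:
    """Apply one Hunspell-style SFX rule; return None if the rule does not apply.

    strip:     characters removed from the end of the stem ("0" means none)
    add:       characters appended after stripping
    condition: pattern the end of the stem must match ("." means any stem)
    """
    if condition != "." and not re.search(f"(?:{condition})$", stem):
        return None
    if strip != "0":
        if not stem.endswith(strip):
            return None
        stem = stem[: -len(strip)]
    return stem + add


# Rule "SFX OC ል ሎች ል": strip a final ል and append ሎች.
print(apply_suffix("ክፍል", "ል", "ሎች", "ል"))

# Rule "SFX OC 0 ዎች [^...]": append ዎች unless the stem ends in a listed character.
EXCLUDED = "[^ህልምስርሽቅብትችንኝክዝይድጅግጥጭጵፍፕ]"
print(apply_suffix("ተማሪ", "0", "ዎች", EXCLUDED))
print(apply_suffix("ክፍል", "0", "ዎች", EXCLUDED))  # None: ል is in the excluded set
```

Prefix rules work the same way at the start of the word, with `add` placed before the stem instead of after it.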

Appendix D: Sample screenshots

Appendix E: Prefix and suffix lists used in this thesis work.

I. Prefix lists used in this work






እነ
ስለ
እንደ
ከነ
ያለ
እስከነ
እየ
በየ

አስ


ገጣጠ
አላ

አላ
ተላ
ለዋ

ልዉ
መላ
ቆራ
ሰባ
ከሳሰ
ቆማ
ከታተ
ከታተ

ገማመ
ገማመ
ገጣጠ
ከራረ
ከራረ
ደራረ
ከማመ
ሰነጣጠ
ገነጣጠ
ገላለ
መራረ
አሳሰ
ገዳደ
ነካከ
ለያየ
II. Suffix lists used in this work








በት
ነት
ነትን


ዎችን
ዎችና
ዎች
ንም
ነቱን
ነትን
ዉን
ቸዉን
ቸውን


ዉን
ሆች
ሎች
ላዊ
ሞች
ሶች
ሮች
ራት
ሾች
ቆች
ቦች
ቶች
ቷል
ቾች
ኖች
ኞች
ኞች
ኮች
ዎች
ዞች

ዮች
ዶች
ዶችን
ጆች
ጎች
ጦች
ጡን

ጶች
ጾች
ፎች
ፖች

Appendix F: Min Edit Distance Algorithm
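
A minimal sketch of a minimum edit distance computation is given below. It assumes the standard Levenshtein formulation with a unit cost for insertion, deletion, and substitution; the thesis implementation inside Hunspell may weight operations differently. The function name `min_edit_distance` is hypothetical.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two words using two rolling rows."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters turns the prefix into ""
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(min_edit_distance("ሠላም", "ሰላም"))  # one substitution
```

Candidate corrections can then be ranked by this distance, with the closest lexicon words suggested first.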

Appendix G: Steps we followed for configuration, compilation, and execution
of Hunspell

• We downloaded and installed the Cygwin terminal
• We downloaded the Hunspell 1.3.2 source code
• We installed the required libraries from the Synaptic package manager
• We made modifications in the Hunspell code, as it is stated in section
• We executed the command ./configure --with-warnings --with-experimental --within
• We executed the make command
• We put the lexicon and the affix file in the am folder
• To run the program from a terminal window: $ hunspell -d am_ET < amhara.txt
• To work with OpenOffice.org (OOo), we configured the language settings of OpenOffice Writer
• We put the lexicon and affix file in the OOo dictionary folder
• We entered Amharic text in the OpenOffice word processor
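
The terminal step above can also be scripted. The sketch below is an assumption-laden illustration, not part of the thesis setup: it runs Hunspell with the `-a` (ispell-compatible pipe) option rather than the plain invocation listed above, and parses the result lines, where `*`, `+`, and `-` mark accepted words, `&` marks a miss with suggestions, and `#` marks a miss without suggestions. The function names are hypothetical, and `check_text` needs `hunspell` and the `am_ET` dictionary to be installed.

```python
import subprocess
from typing import List, Optional, Tuple

Result = Tuple[str, Optional[str], List[str]]


def parse_pipe_line(line: str) -> Optional[Result]:
    """Parse one pipe-mode output line into (status, word, suggestions)."""
    line = line.rstrip("\n")
    if not line or line.startswith("@"):
        return None  # blank separator line or version banner
    if line[0] in "*+-":
        return ("ok", None, [])  # accepted: exact match, via affix, or compound
    if line.startswith("&"):
        # Format: "& word count offset: suggestion1, suggestion2, ..."
        head, _, tail = line.partition(": ")
        return ("miss", head.split()[1], tail.split(", "))
    if line.startswith("#"):
        # Format: "# word offset" (no suggestions available)
        return ("miss", line.split()[1], [])
    return None


def check_text(text: str, dictionary: str = "am_ET") -> List[Result]:
    """Run hunspell in pipe mode on `text` and return the parsed results."""
    proc = subprocess.run(
        ["hunspell", "-a", "-d", dictionary],
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    parsed = (parse_pipe_line(out_line) for out_line in proc.stdout.splitlines())
    return [result for result in parsed if result is not None]
```

Only `parse_pipe_line` can be exercised without a Hunspell installation; for example, `parse_pipe_line("# xyzzy 5")` yields a miss for `xyzzy` with an empty suggestion list.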

Appendix H: Word counter Python code
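
A minimal sketch of a word counter for Amharic text is given below. It assumes whitespace tokenization after Ethiopic and common Latin punctuation marks are replaced by spaces; the names `ETHIOPIC_PUNCT` and `count_words` are hypothetical and the sample sentence is illustrative.

```python
from collections import Counter

# Ethiopic punctuation marks (wordspace, full stop, comma, and others).
ETHIOPIC_PUNCT = "፡።፣፤፥፦፧፨"
LATIN_PUNCT = ".,;:!?\"'()[]"


def count_words(text: str) -> Counter:
    """Count word frequencies after replacing punctuation with spaces."""
    table = str.maketrans({ch: " " for ch in ETHIOPIC_PUNCT + LATIN_PUNCT})
    return Counter(text.translate(table).split())


counts = count_words("ሰላም ዓለም። ሰላም፣ ቤት")
print(sum(counts.values()), "tokens,", len(counts), "distinct words")
```

The total token count and the number of distinct words are the two figures usually reported when describing a test corpus.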
