
International Journal of Advanced Science and Technology

Vol. 29, No. 7, (2020), pp. 2641-2648

Multilingual Spelling Checker for Selected Ethiopian Languages


Wubetu Barud Demilie, Department of Information Technology, Wachemo University, Hossana,
Ethiopia, P.O. Box 667

Wubetubarud@gmail.com or wubetubarud@yahoo.com

Abstract
A multilingual spelling checker is a tool that benefits all users, and a spelling checker is a
prerequisite for digitizing a language. Spelling checking is an application of natural language processing
that detects and corrects errors in natural-language text. Spelling checkers that have been developed can
also be integrated with other natural language processing applications. This paper proposes a model of a
multilingual spelling checker that is based on a dictionary-based technique and applies it to error
detection and correction for five selected Ethiopian languages: Amharic, Afan Oromo, Tigrinya, Hadiyyisa
and Awngi. The model provides correction by selecting the most suitable entry from a list of corrective
suggestions based on lexical resources and dictionary-based statistics, and it depends on the lexicons of
the five selected languages.
The evaluation of the model uses Amharic, Afan Oromo, Tigrinya, Hadiyyisa and Awngi words in dictionary
form for each of the languages. For every language, spelling errors are detected automatically against the
list of words prepared in the dictionary and are marked with a red zigzag line. The approach detects
errors efficiently and in a short time. After evaluating the model that I developed for the selected
languages, precision, recall and F-measure were calculated.
Keywords: Error Correction, Error Detection, Multilingual, Suggestion, Spelling Checker, Types of
Errors

I. INTRODUCTION
A multilingual spelling checker identifies which natural language is being processed and switches to the
appropriate spelling checker for the language the user is interested in.
Language is a medium of communication that helps human beings to exchange ideas and information.
A spelling checker for a language is used to check text for any kind of spelling error.
Spelling error detection and correction tools work at the word level and use a dictionary-based technique.
Every word from the text is looked up in the speller lexicon. When a word is not in the dictionary, it is
detected as an error. In order to correct the error, the spelling checker searches the dictionary for words
that resemble the erroneous word. These words are then suggested to the user, who chooses the word that was
intended. Spelling checkers are used in various Natural Language Processing Applications (NLPA),
including part-of-speech taggers [1] [2] and grammar checkers for natural languages [3].
Spelling checking involves two main tasks: error detection and error correction. The errors themselves are
of two types: non-word errors and real-word errors. There are many techniques available for detection and
correction. In this paper, I have designed, implemented and evaluated an end-to-end system that performs
spelling checking and automatic correction for multiple Ethiopian languages.

ISSN: 2005-4238 IJAST 2641

Copyright ⓒ 2020 SERSC



II. SIGNIFICANCE OF THE STUDY


Learning to spell helps to cement the connection between letters and their sounds, and learning
high-frequency words to mastery level improves both reading and writing. The connection between
spelling and reading comprehension is strong because both depend on a common denominator (i.e. skill with
language). The more deeply and thoroughly a person knows a word, the more likely he or she is to
recognize it, spell it, define it, and use it properly in speech and writing [4]. Many researchers in the
area have developed spelling checkers for foreign and Ethiopian languages. However, none of them, Ethiopian
researchers in particular, has developed a spelling checker for multiple languages within one system
[5][6][7]. The application that I have developed, “Multilingual spelling checker for selected Ethiopian
languages”, therefore has an option that lets users select the language they are interested in. I also
want to acknowledge the researchers who have carried out studies on Ethiopian languages, including their
grammatical rules, word formation, sentence structure and other related concepts [8][9].
III. RELATED WORKS
Spelling checker methods address two tasks: error detection and error correction. The two commonly
used methodologies for error detection are dictionary lookup and n-gram analysis. Most of the spell
checker systems described in the literature use dictionaries as lists of correct spellings that help the
system to find the intended words. A dictionary is a lexical resource that contains a list of correct words
for a specific language.
The isolated-word methods are the most studied spelling correction algorithms. They are: edit distance,
similarity keys, rule-based techniques, n-gram-based techniques, probabilistic techniques and neural
networks [10][11][12][13][14][15][16][17][18][19].
IV. USED METHODOLOGY
There are many methodologies for detecting and correcting spelling errors in written text. For this study,
I have used a dictionary-based methodology, which compares input strings against a dictionary, a lexicon,
a corpus or a combination of lexicons and corpora. The datasets or lexicon files for the selected
Ethiopian languages have been collected from different genres so that the corpora and lexicons are
balanced. Exact string matching is used for spelling error detection and correction.
If a string or word is not present in the chosen lexicon or corpus, it is considered to be a misspelled or
invalid word. At this stage, the researcher assumes that all words in the lexicon or corpus are
morphologically complete, i.e. all inflected forms are included in the dictionary.
Attention focuses on reducing dictionary search time via efficient dictionary lookup and/or pattern
matching techniques, via dictionary partitioning schemes and via morphological processing techniques.
The most important dictionary lookup techniques are hashing, binary search trees, and finite state automata.
From those approaches, I have selected hashing since it is a well-known and efficient dictionary lookup
strategy.
According to [14], the basic idea of hashing is to apply an efficient computation to an input string in
order to determine where a matching entry can be found. In the spelling checking context, if the word
stored at the hash address is identical to the input string, there is a match. However, if the input word
and the retrieved word are not the same, or the word stored at the hash address is null, the input word is
flagged as a misspelling.
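The hashing-based lookup described above can be sketched in Java as follows. This is a minimal illustration rather than the system's actual code: the class name `HashLexicon` and its methods are hypothetical, and the sketch assumes the lexicon is already available as a list of strings with one fully inflected word form per entry.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: hash-based dictionary lookup for one language. */
final class HashLexicon {
    // HashSet computes a hash address for each word and stores the word there.
    private final Set<String> words = new HashSet<>();

    HashLexicon(List<String> entries) {
        for (String entry : entries) {
            String word = entry.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
    }

    /** True when the exact input string is stored in the lexicon. */
    boolean contains(String word) {
        return words.contains(word);
    }

    /** A word that is absent from the lexicon is flagged as a misspelling. */
    boolean isMisspelled(String word) {
        return !contains(word);
    }
}
```

Because the match is exact, the sketch only behaves as intended when every inflected form is present in the lexicon, which is the assumption stated in this section.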


V. BASIC WORKFLOW OF THE PROPOSED SYSTEM


The following diagram shows the working principle of the spelling checker for the selected Ethiopian
languages.

[Flowchart. Start. Does the dictionary.txt file exist for all languages? If no, ask for the dictionary file
of each language. If yes, create the hashtable, load it into memory and save the hashtable. Take user
input. Do the words exist in the hashtable? If yes, the word is correct. If no, search the hashtable for
the word that has the minimum edit distance from the given word. Is the distance less than or equal to the
minimum allowed distance? If yes, find the transformation between these two words; the word is misspelled,
so print the correct spelling and the distance between the words. If no, the word does not exist. End.]

Figure 1: Work flow of the Proposed System
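The workflow of Figure 1 can be sketched in Java as follows. The sketch is illustrative only: the class and method names are hypothetical, the lexicon is assumed to be loaded already into a set, and the plain Levenshtein distance (insertion, deletion and substitution, each at cost 1) is assumed as the edit distance because the paper does not specify the variant used.

```java
import java.util.Set;

/** Hypothetical sketch of the Figure 1 workflow for a single input word. */
final class WorkflowSketch {

    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Follows the decisions of the flowchart and returns a short report. */
    static String check(String word, Set<String> lexicon, int maxDistance) {
        if (lexicon.contains(word)) {
            return "correct";
        }
        // Linear scan for the closest entry; adequate for a sketch, slow for large lexicons.
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : lexicon) {
            int d = editDistance(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        if (best != null && bestDistance <= maxDistance) {
            return "misspelled: " + best + " (distance " + bestDistance + ")";
        }
        return "not found";
    }
}
```

When several entries share the smallest distance, the entry returned depends on the iteration order of the set, so a real system would need an explicit tie-breaking rule.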

VI. TYPES OF ERRORS


In this study, I classify spelling errors as typographic and cognitive errors.
VI.I. Typographic Errors
These errors arise when the correct spelling of the word is known but the word is mistyped by
mistake. They are frequently related to the keyboard and therefore do not follow any linguistic
rules.
Generally, these kinds of errors can be grouped into the following four main categories.


1. One-character addition (insertion).
Example, typing “acress” for the word “cress”.
2. One-character removal (deletion).
Example, typing “acress” for the word “actress”.
3. One-character substitution.
Example, typing “acress” for the word “across”.
4. Swapping of two neighboring characters (transposition).
Example, typing “acress” for the word “caress”.
An error that is produced by exactly one of the above editing operations is also called a single error.
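The four single-error categories above can be told apart mechanically by comparing the typed string with the intended word. The Java sketch below is one possible way to do this; the class name, the enum and the method are hypothetical and are not taken from the system described in this paper.

```java
/** Hypothetical classifier for the four single-error categories. */
final class SingleEditClassifier {

    enum Edit { NONE, INSERTION, DELETION, SUBSTITUTION, TRANSPOSITION, OTHER }

    /** Classifies how the typed string differs from the intended word. */
    static Edit classify(String typed, String intended) {
        if (typed.equals(intended)) {
            return Edit.NONE;
        }
        int lt = typed.length();
        int li = intended.length();
        // Length of the common prefix of the two strings.
        int p = 0;
        while (p < lt && p < li && typed.charAt(p) == intended.charAt(p)) {
            p++;
        }
        if (lt == li + 1) {
            // One extra character was typed: skip it and compare the remainders.
            return typed.substring(p + 1).equals(intended.substring(p)) ? Edit.INSERTION : Edit.OTHER;
        }
        if (lt + 1 == li) {
            // One character is missing: skip it in the intended word.
            return typed.substring(p).equals(intended.substring(p + 1)) ? Edit.DELETION : Edit.OTHER;
        }
        if (lt == li) {
            if (typed.substring(p + 1).equals(intended.substring(p + 1))) {
                return Edit.SUBSTITUTION;
            }
            if (p + 1 < lt
                    && typed.charAt(p) == intended.charAt(p + 1)
                    && typed.charAt(p + 1) == intended.charAt(p)
                    && typed.substring(p + 2).equals(intended.substring(p + 2))) {
                return Edit.TRANSPOSITION;
            }
        }
        return Edit.OTHER;
    }
}
```

Applied to the examples above, `classify("acress", "cress")` yields `INSERTION`, `classify("acress", "actress")` yields `DELETION`, `classify("acress", "across")` yields `SUBSTITUTION` and `classify("acress", "caress")` yields `TRANSPOSITION`.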
VI.II. Cognitive Errors
These errors occur when the correct spelling of the word is not known. In this kind of error, the
pronunciation of the misspelled word is the same as, or similar to, that of the intended correct word.
Example, typing “acheive” for the word “achieve”.
VII. ERROR DETECTION
Error detection commonly consists of testing whether an input string is a valid index or dictionary
word. Efficient methods have been developed for detecting such errors. The two best-known methods are
n-gram analysis and dictionary lookup. Spelling checkers commonly rely on dictionary lookup, while text
recognition systems rely on n-gram techniques [13].

VII.I. N-gram Analysis


According to [13][20][21], n-gram analysis is a technique for finding incorrectly spelled words in a
body of text. Instead of comparing each complete word in a text to a dictionary, only its n-grams are
checked. The check uses an n-dimensional matrix in which the frequencies of real n-grams are stored. If a
non-existent or rare n-gram is found, the word is flagged as a misspelling; otherwise it is not.
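A minimal Java sketch of n-gram based detection is given below for n = 2. It is an illustration under stated assumptions, not part of the proposed system: the class name `BigramDetector` is hypothetical, a hash set of observed bigrams stands in for the n-dimensional matrix, the markers `^` and `$` are assumed as word-boundary padding, and only non-existent bigrams (not rare ones) are treated as evidence of an error.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical bigram-based detector built from a word list. */
final class BigramDetector {
    // All character bigrams observed in the training lexicon.
    private final Set<String> known = new HashSet<>();

    BigramDetector(Iterable<String> lexicon) {
        for (String word : lexicon) {
            known.addAll(bigrams(word));
        }
    }

    /** Character bigrams of a word, padded with start and end markers. */
    static Set<String> bigrams(String word) {
        Set<String> out = new HashSet<>();
        String padded = "^" + word + "$";
        for (int i = 0; i + 2 <= padded.length(); i++) {
            out.add(padded.substring(i, i + 2));
        }
        return out;
    }

    /** A word is flagged when at least one of its bigrams was never observed. */
    boolean looksMisspelled(String word) {
        return !known.containsAll(bigrams(word));
    }
}
```

With a lexicon containing only "spell" and "check", the string "shell" is flagged because the bigram "sh" was never observed, whereas the non-word "chell" passes because all of its bigrams occur in the two known words. This is the kind of miss that makes dictionary lookup the more common choice for spelling checkers.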
VII.II. Dictionary Lookup
According to [13], a dictionary is a list of words that are assumed to be correct. It can be represented
in different ways, each with its own speed and storage requirements. Dictionary lookup and construction
methods must be tailored to the purpose of the dictionary. Too small a dictionary gives the user too many
false rejections of valid words, while too large a dictionary can accept a high number of valid
low-frequency words. Hash tables are the most commonly used means of gaining fast access
to a dictionary. To look up a string, one computes its hash address and retrieves the word
stored at that address in the pre-built hash table. If the word stored at the hash address differs from
the input string, a misspelling is flagged. The main advantage of a hash table is its random-access nature,
which eliminates the large number of comparisons otherwise required to search the dictionary files.
To store a word in the dictionary, the system computes each hash function for the word and sets the
vector entries corresponding to the computed values to true. To find out whether a word belongs to the
dictionary, it computes the hash values for that word and looks them up in the vector. If all entries
corresponding to the values are true, the word belongs to the dictionary; otherwise it does not.
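The vector-based scheme in the last paragraph corresponds to what is commonly called a Bloom filter. The Java sketch below shows one way it could look; the class name, the vector size and the way the hash functions are derived from `String.hashCode()` are assumptions made for illustration only.

```java
import java.util.BitSet;

/** Hypothetical bit-vector dictionary with several hash functions per word. */
final class BitVectorDictionary {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BitVectorDictionary(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** i-th hash of the word, derived from two base hashes (double hashing). */
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Stores a word by setting every vector entry selected by its hashes. */
    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(word, i));
        }
    }

    /** True only if all entries for the word are set; false positives are possible. */
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A stored word is always reported as present, but an unknown string can occasionally be accepted when other words happen to have set all of its entries. The vector therefore trades a small error rate for compact storage, and it cannot return the stored word itself for comparison.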
VIII. ERROR CORRECTION
According to [14][16], the correction of spelling errors in natural languages, including the languages
covered in this study, is an old problem. Much research has been done in this area over
decades. Most spelling correction methods focus on isolated words, without taking into account
the textual context in which the string appears.
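One common way to correct an isolated word is to generate every string that lies a single edit away from it and to keep those that occur in the lexicon. The Java sketch below illustrates this idea; it is not claimed to be the method used by the proposed system, the class and method names are hypothetical, and the alphabet is assumed to be the set of characters that occur in the lexicon.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical isolated-word corrector based on single-edit candidates. */
final class OneEditSuggester {

    /** Returns the lexicon words reachable from the input by one edit, sorted. */
    static TreeSet<String> suggest(String word, Set<String> lexicon) {
        // Alphabet: every character that occurs somewhere in the lexicon.
        Set<Character> alphabet = new HashSet<>();
        for (String entry : lexicon) {
            for (char c : entry.toCharArray()) {
                alphabet.add(c);
            }
        }
        TreeSet<String> found = new TreeSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (!tail.isEmpty()) {
                // Undo an accidental insertion by dropping one character.
                keep(head + tail.substring(1), lexicon, found);
                // Undo a substitution by replacing one character.
                for (char c : alphabet) {
                    keep(head + c + tail.substring(1), lexicon, found);
                }
            }
            if (tail.length() > 1) {
                // Undo a transposition by swapping two neighboring characters.
                keep(head + tail.charAt(1) + tail.charAt(0) + tail.substring(2), lexicon, found);
            }
            // Undo an accidental deletion by inserting one character.
            for (char c : alphabet) {
                keep(head + c + tail, lexicon, found);
            }
        }
        found.remove(word);
        return found;
    }

    private static void keep(String candidate, Set<String> lexicon, Set<String> found) {
        if (lexicon.contains(candidate)) {
            found.add(candidate);
        }
    }
}
```

For a lexicon containing "cress", "actress", "across", "caress" and "checker", the input "acress" yields the suggestions across, actress, caress and cress. The list is sorted alphabetically here; a real corrector would rank it, for example by word frequency.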
IX. EVALUATION
To evaluate the performance of the selected approach and to demonstrate its portability across the
selected Ethiopian languages, I evaluated each language in turn. The evaluations were based on Amharic,
Afan Oromo, Tigrinya, Hadiyyisa and Awngi test data, in that order, each drawn from the dictionary file
list of the respective language. In order to evaluate the spelling error detection capability of the
selected approach for all five languages, precision, recall, and F1 measure were used as metrics. The
relative positions of the correct spellings in the suggestion lists were used to evaluate spelling error
correction.
IX.I. Test Data
I used manually prepared spelling error test corpora to evaluate performance. For each of the selected
Ethiopian languages, the test corpus was collected from different sources so that it is balanced.
I prepared word dictionaries for all languages as follows.
Table 1: Word dictionaries for all languages

No.   Language      Number of words (dictionary file)
1     Amharic       993,072
2     Afan Oromo    866,328
3     Tigrinya      966,328
4     Hadiyyisa     987,176
5     Awngi         678,534

To compute the actual scores, the manually compiled test data were used as the gold standard for the
evaluation.
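$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

Here TP, FP and FN denote the numbers of true positives, false positives and false negatives.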
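Equations (1) to (3) can be computed directly from the three counts. The Java sketch below is a small illustration; the class name is hypothetical, and it assumes the conventional reading in which a true positive is a misspelled word that the checker flags, a false positive is a correct word that it flags, and a false negative is a misspelled word that it misses. The counts in the usage note are made up for the example and are not results from this study.

```java
/** Hypothetical helper that evaluates equations (1) to (3). */
final class DetectionMetrics {

    /** Equation (1): share of flagged words that are real errors. */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : tp / (double) (tp + fp);
    }

    /** Equation (2): share of real errors that were flagged. */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : tp / (double) (tp + fn);
    }

    /** Equation (3): harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return precision + recall == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }
}
```

For instance, with 8 true positives, 2 false positives and 8 false negatives, precision is 0.8, recall is 0.5 and F1 is 0.8 / 1.3, or about 0.615.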
The quality of the suggestions offered by a spelling corrector is measured by the relative positions of the
correct spellings in the suggestion list drawn from the dictionary. In the best case, the right
correction always appears at the top of the list.

Table 2: Spelling error detection results for the five Ethiopian languages

Metric       Amharic   Afan Oromo   Tigrinya   Hadiyyisa   Awngi
Precision    86.6%     85.3%        83.9%      82.8%       84.7%
Recall       84.7%     81.9%        82.4%      81.6%       81.9%
F Measure    85.65%    83.6%        83.15%     82.2%       83.3%

X. RESULTS AND DISCUSSIONS


Separate experiments were carried out for each of the selected languages to evaluate the quality of the
proposed spelling checker, as shown above. For this purpose, I prepared words in dictionary form
for each of the selected languages. Of these languages, Amharic, Tigrinya and Awngi use the
Ge’ez script, so I have included a font identifier in the system.
The remaining languages use the Latin script and can be used as they are, without any modification of
the font.
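The distinction between Ge’ez-script and Latin-script input can be made from the characters themselves. The Java sketch below uses the standard `Character.UnicodeScript` class for this; the class name `ScriptDetector` and the majority rule are assumptions for illustration and do not describe the font identifier of the proposed system.

```java
/** Hypothetical helper that reports the dominant script of a text. */
final class ScriptDetector {

    /** Returns "ETHIOPIC", "LATIN" or "UNKNOWN" by majority of the letters. */
    static String dominantScript(String text) {
        int ethiopic = 0;
        int latin = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // digits, spaces and punctuation carry no script information here
            }
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.ETHIOPIC) {
                ethiopic++;
            } else if (script == Character.UnicodeScript.LATIN) {
                latin++;
            }
        }
        if (ethiopic == 0 && latin == 0) {
            return "UNKNOWN";
        }
        return ethiopic >= latin ? "ETHIOPIC" : "LATIN";
    }
}
```

The script only narrows the choice to a group of languages (Amharic, Tigrinya and Awngi on one side, Afan Oromo and Hadiyyisa on the other), so the user's explicit language selection is still needed.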
We can consider the following sample source code for the model:


import java.awt.Font;

import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JTextPane;

import com.inet.jortho.FileUserDictionary;
import com.inet.jortho.SpellChecker;

public class SampleApplication extends JFrame {

    public static void main(String[] args) {
        new SampleApplication().setVisible(true);
    }

    private SampleApplication() {
        super("Multilingual Spelling Checker/የአምስት ቋንቋዎች ቃላት አፃፃፍ ስርዓት");

        // Text component that will be spell checked, shown in a bold 22-point font.
        JEditorPane text = new JTextPane();
        Font font = new Font("", Font.BOLD, 22);
        text.setText("Multilingual Spelling Checker "
                + "የአምስት ቋንቋዎች ቃላት አፃፃፍ ስርዓት");
        add(text);
        text.setFont(font);

        // Basic window setup.
        setSize(200, 160);
        setDefaultCloseOperation(EXIT_ON_CLOSE);
        setLocationRelativeTo(null);

        // Keep words added by the user in a file-based user dictionary.
        SpellChecker.setUserDictionaryProvider(new FileUserDictionary());
        // Load the dictionary configuration; null arguments select the default location and language.
        SpellChecker.registerDictionaries(null, null);
        // Enable spell checking on the text component.
        SpellChecker.register(text);
    }
}
For example, running the above Java source code first displays a GUI that looks like the following:

Figure 2: Proposed Model

As the above GUI shows, words are underlined with a red zigzag line, which indicates that they are
not in the dictionary file lists. The user should click on an underlined word (or press "F7" on the
keyboard) to display the list of alternatives. After this, the user gets the following GUI.


Figure 3: Model for words which are not in dictionary

I tested the proposed system by creating commonly known errors in each of the selected languages.
Since the dictionary files of each language were collected from different genres, the system detects
errors easily and suggests the best alternative from the list of words provided in the
dictionary. The user first has to select the language. The proposed model then suggests the
appropriate word from the dictionary based on the user’s query.
Similarly, if the system’s dictionary file contains several related words, it suggests all of them.
For example, consider the sentence "የኢንፎርሜሽን ቴክኖሎጂ ትምህርትን ክፍል" and its list of suggestions.

Figure 4: Sample model for word suggestion


The same procedure can be followed to handle spelling issues for all of the selected languages.

XI. CONCLUSION
Spelling checkers depend heavily on the words in the lexicon. Some words have very few similarly spelled
neighbors, so the correct word can be recovered even after several errors. Other words have many
similarly spelled neighbors, so a single error may make correction difficult or impossible. This paper
proposes a multilingual spelling checker for selected Ethiopian languages that is based on a
dictionary-based method. It is used to detect and correct different classes of spelling errors. The main
features of the proposed model can be summarized as offering suggestions for detected errors and providing
the correction automatically using the first suggestion. Furthermore, the proposed model was evaluated
using dictionary-based data sets for all the languages that the researcher selected for the study.
REFERENCES

1. W. B. Demilie, “Parts of Speech Tagger for Awngi Language,” vol. 9, no. 9, 2019.
2. K. Desta, “Part of Speech Tagger for Hadiyyisa Language.”
3. D. Tesfaye, “A rule-based Afan Oromo Grammar Checker,” vol. 2, no. 8, pp. 126–130, 2011.
4. “Importance of Spelling.”
5. A. M. Gezmu, A. Nürnberger, and B. E. Seyoum, “Portable Spelling Corrector for a Less-Resourced
Language: Amharic,” pp. 4127–4132, 2014.
6. G. O. Ganfure and D. Midekso, “Design And Implementation Of Morphology Based Spell Checker,”
vol. 3, no. 12, pp. 118–125, 2014.
7. M. D. Jeldu and R. Mehta, “Rule Based Afan Oromo Analyzer for Spell Checker,” no. 7, pp. 36–39,
2018.
8. P. dr. B. Y. (Addis A. University), A Typology of Verbal Derivation in Ethiopian Afro-Asiatic
Languages.
9. W. T. A. D. Tamirat, “Afan Oromo Sentence Structure.”
10. V. J. Hodge and J. Austin, “A Comparison of Standard Spell Checking Algorithms and a Novel
Binary Neural Approach,” vol. 15, no. 5, pp. 1073–1081, 2003.
11. B. O. Connor, “Edit Distance, Spelling Correction, and the Noisy Channel,” 2015.
12. F. Ahmed, E. W. De Luca, and A. Nürnberger, “Revised N-Gram based Automatic Spelling
Correction Tool to Improve Retrieval Effectiveness,” 2009.
13. S. M. El Atawy, “Automatic Spelling Correction based on n-Gram Model,” vol. 182, no. 11, pp. 5–
9, 2018.
14. H. L. Liang, “Spell Checkers and Correctors,” no. November, 2008.
15. A. Samuelsson, “Weighting Edit Distance to Improve Spelling Correction in Music Entity Search,”
2017.
16. D. Sundby, “Spelling correction using N-grams.”
17. R. Kumar, M. Bala, and K. Sourabh, “A study of spell checking techniques for Indian Languages,”
no. March, pp. 105–113, 2018.
18. A. A. Patil and P. R. Sharma, “Study and Review of Selective Spell Checking,” pp. 1049–1056, 2019,
doi: 10.15680/IJIRSET.2019.0802064.
19. O. Wilde, “Spelling Correction and the Noisy Channel,” 2019.
20. T. A. Pirinen and M. Silfverberg, “Improving Finite-State Spell-Checker Suggestions with Part of
Speech N-Grams,” vol. 3, no. 2, pp. 153–166, 2012.
21. T. A. Pirinen, Weighted Finite-State Methods for Spell-Checking and Correction. 2014.
