Abstract— Spell correction is used to detect and correct orthographic mistakes in texts. Most of the time, traditional dictionary lookup with string-similarity methods is suitable for languages with a less complex structure, such as English. However, Azerbaijani has a more complex structure: due to its morphology, word derivation is so productive that many words are formed by adding suffixes and affixes to stems. Therefore, in this paper a sequence-to-sequence model with an attention mechanism is used to develop spelling correction for Azerbaijani. In total, 12,000 wrong/correct sentence pairs are used for training, the model is tested on 1,000 real-world misspelled words, and the F1-score results are 75% for distance 0, 90% for distance 1, and 96% for distance 2.

Keywords— Spell Correction, Autoencoder, Attention, Natural Language Processing, Deep Neural Networks

I. INTRODUCTION

We all know how important data is today. However, the correctness and clarity of the transmitted information is even more important. Most orthographical errors are found in information transmitted on social media, in chat applications, and in emails. Spelling correction is also important in search engines and translation systems. The expressions and meaning of the sentences that make up a text matter, and it is not pleasant to read content that contains many grammatical and spelling mistakes.

The main reason for this research is that users who write Azerbaijani in their blogs and on social media have no way to be sure about the spelling of the texts they have written, so there is a need for a new algorithm that checks orthographical mistakes. It is observed that the majority of Azerbaijani writers do not use some of our letters ("ö ğ ı ə ç ş") because they write their texts on an English keyboard. So the main part of this project is not only to catch misspelled words but also to correct the letters that users type on an English keyboard.

To develop effective spelling correction, the typing behavior of users is studied to find out what kinds of mistakes and errors they make when they write; for this purpose, Azerbaijani comments on Facebook are examined. Azerbaijani is an agglutinative language, so other languages with a complex morphological structure are studied as well. One of the problems of this research is the scarcity of labeled Azerbaijani sentences: incorrect sentences and their correct versions (labels) are needed in order to apply advanced techniques such as deep neural networks. To overcome this data insufficiency, different methods are used, such as turning correct sentences into incorrect ones by inserting, deleting, and transposing letters, and other data augmentation techniques are applied.

Many researchers have dealt with spell checking and have introduced traditional techniques such as dictionary lookup and distance metrics for spelling correction.

The Levenshtein distance algorithm, one of the string-similarity algorithms, uses edit distance to measure the similarity between two strings [1] [2]. The easiest way to check spelling is to construct a word dictionary or corpus that contains all correctly written words and to use a string-similarity algorithm to find the dictionary words most similar to the input word [3] [4]. Plain dictionary lookup takes a long time, because it matches the given word against every word in the dictionary: if a sentence has five words and the dictionary corpus has 500 thousand words, the cost of the similarity check is the number of input words multiplied by the number of words in the dictionary corpus [5].

A more advanced variant of the dictionary method is lexicon lookup. A lexicon here is a data structure that stores words in a tree according to their shared prefixes, which makes searching the tree fast [6] [7]. Another solution, an improved version of the Levenshtein algorithm, is the Damerau-Levenshtein distance: besides the insertion, deletion, and substitution operations, Damerau added one more operation, the transposition of two successive letters [8] [9]. Assigning transposition costs to such adjacent letter pairs is required to get better results with the Damerau-Levenshtein algorithm [10] [11].

There are also different probabilistic models for word corpora and sequences. Models that assign probabilities to words are called language models [12] [13]. Instead of string similarity, such a model uses the occurrence of the input word in the given context. One of the language models is the n-gram model, where an n-gram can be a unigram (one word), a bigram (two words), a trigram (three words), and so on; the model estimates the probability of the last word of an n-gram given the words preceding it.
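The lookup-plus-edit-distance pipeline surveyed above can be sketched as follows (an illustrative sketch, not code from the cited works; the function names and the toy dictionary are ours):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution, and
    transposition of two adjacent characters (Damerau's extension)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def correct(word: str, dictionary: list[str]) -> str:
    """Plain dictionary lookup: compare against every entry, which is why
    the cost is |input words| x |dictionary words|."""
    return min(dictionary, key=lambda w: damerau_levenshtein(word, w))
```

For example, `correct("slaam", ["salam", "kitab", "dil"])` returns `"salam"`, since a single adjacent-letter transposition turns "slaam" into "salam".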
In the following parts, related work is presented that introduces different neural-network and sequence-to-sequence solutions for problems similar to spelling correction. After the related work, the research methods are discussed: data collection, data pre-processing, the structure of the sequence-to-sequence model, the loss function, and the attention mechanism. After that, the proposed model and its results are discussed.

II. RELATED WORK

Many researchers have presented solutions for sequence-to-sequence problems, where both the input and the output of the neural network model are sequences, and different hybrid methods have been presented for spelling correction.

To address data scarcity, experiments were performed on synthetic datasets created for the Indic languages Hindi and Telugu by injecting character-level spelling mistakes, and a sequence-to-sequence deep learning model was used [14]. Spelling correction consists of two steps: error detection and error correction. Detection employs a neural network with LSTM layers to find incorrect words and their positions; the correction step generates candidate words for the incorrect words based on probability [15].

The fixed-length context vector is a bottleneck for the basic encoder-decoder architecture, and [16] proposes to extend this architecture by allowing the model to automatically search for the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as an explicit hard segment. They achieved high performance on English-to-French translation [16].

The Bi-Directional Attention Flow network uses an attention-flow mechanism to obtain a context-aware representation of the query without early summarization. Its experiments achieved good results on the Stanford Question Answering Dataset (SQuAD) and the CNN/DailyMail cloze test [17].

A neural probabilistic language model has been proposed that significantly improves on state-of-the-art statistical language models and can take advantage of longer contexts [18]. Local and global attention mechanisms have been used for machine translation; an ensemble of different attention architectures achieved a large improvement on the English-to-German translation task [19].

An autoencoder architecture with bidirectional LSTMs and an attention mechanism has been used for Azerbaijani spelling correction. Instead of simulated data, manually labeled words were used for training. Although the proposed model speeds up traditional dictionary lookup and corrects one word at a time, it could be improved to correct whole sentences at a time while considering contextual meaning [20].

A. Data Collection and Pre-Processing

For spelling correction, wrong and correct sentence pairs are needed to feed into the deep neural network. Due to the scarcity of labeled data, a data augmentation technique is used to generate synthetic data from the original data: wrong sentences are generated from correct sentences by applying different operations such as insertion, deletion, transposition, and substitution.

These operations are performed randomly on each word in a sentence. Insertion, deletion, and transposition are each performed less than 40 percent of the time. For substitution, however, a more intelligent scheme is used. As discussed in the introduction, people type some English letters (e, w, c, g) instead of the Azerbaijani letters (ə, ş, ç, ğ). So, instead of substituting a letter with a random one, it can be replaced with the letter it is confused and mistyped with most often. To analyze this mistyping behavior, people's comments on different Facebook pages were scraped and analyzed. Table 1 lists the letters together with their most frequently misspelled and confused counterparts.

Table 1. Correct letters and their most mistyped versions.

Correct letter   Misspelled letters
'ə'              ['e', 'i', 'a', 'ı', 'ə']
'ş'              ['w', 's', 'sh', 'i']
'ç'              ['c', 'ch']
'q'              ['g', 'x', 'k', 'ğ']
'ı'              ['i', 'l', 'u', 'ə', 'o', 's', 'a']
'r'              ['t', 'l', 'z', 'x']
'ğ'              ['g', 'q']
'l'              ['d', 'n', 'o']
'd'              ['n', 't', 'x']
'u'              ['h', 'i', 'ı', 'o']
'ö'              ['o', 'i', 'ü', 'e']
'ü'              ['u', 'i', 'e']
't'              ['d', 'r']
'y'              ['g', 't', 'j']
's'              ['z', 'c', 'd', 'ss']
'z'              ['x', 's']
'g'              ['q', 'y', 'k', 'ğ']
'n'              ['m', 'v', 'l']
'i'              ['u', 'y', 'ı']
'a'              ['i', 'u', 's', 'e', 'ə', 'm', 'ı', 'v']
'k'              ['y', 'g', 'q']

Since substitution is performed in this informed way rather than with random letters, it is preferred and is used 60 percent of the time. Synthetic words generated this way come very close to the real misspellings that people make. Table 2 provides two sample wrong/correct sentence pairs; each synthetic word is generated from the corresponding correct word in the sentence. A total of 12,000 sentence pairs are created in this way.
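The augmentation procedure above can be sketched roughly as follows (a minimal illustration under assumed branch probabilities; only a few rows of Table 1 are included, and all names are ours, not the authors' actual implementation):

```python
import random

# Illustrative subset of the confusion map from Table 1.
CONFUSIONS = {
    'ə': ['e', 'i', 'a', 'ı'],
    'ş': ['w', 's', 'sh', 'i'],
    'ç': ['c', 'ch'],
    'ğ': ['g', 'q'],
}

def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random edit: confusion-based substitution about 60% of
    the time, otherwise insertion, deletion, or transposition."""
    if not word:
        return word
    i = rng.randrange(len(word))
    op = rng.random()
    if op < 0.6 and word[i] in CONFUSIONS:   # substitution via confusion map
        return word[:i] + rng.choice(CONFUSIONS[word[i]]) + word[i + 1:]
    if op < 0.75:                            # insertion (ASCII letters, for brevity)
        return word[:i] + rng.choice('abcdefghijklmnopqrstuvwxyz') + word[i:]
    if op < 0.9 and len(word) > 1:           # deletion
        return word[:i] + word[i + 1:]
    if len(word) > 1:                        # transposition of adjacent letters
        i = min(i, len(word) - 2)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word

def corrupt_sentence(sentence: str, seed: int = 0) -> str:
    """Generate a synthetic 'wrong' sentence from a correct one."""
    rng = random.Random(seed)
    return ' '.join(corrupt_word(w, rng) for w in sentence.split())
```

Pairing each correct sentence with its corrupted output yields wrong/correct training pairs of the kind described above.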
The encoder network can consist of several layers, such as an embedding layer and recurrent layers like the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM). The role of the encoder part is to convert the input sentence (a sequence of

Feed-forward neural networks (A_00, A_01, A_02, A_03) take the previous decoder state and the encoder outputs as input and produce the scores (e_00, e_01, e_02, e_03). Each score is calculated as in (1), where g is an activation function (sigmoid, ReLU, etc.):

e_0i = g(s_0, h_i)    (1)
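As a toy numeric illustration of (1) (our own code and fixed weights, not the trained model; we also apply the softmax normalization that attention mechanisms typically use on these scores):

```python
import math

def attention_weights(s0, hidden_states):
    """Score each encoder hidden state h_i against decoder state s_0
    with a toy scoring function g, then softmax-normalize the scores."""
    def g(s, h):
        # Stand-in for a learned feed-forward net: sigmoid of the
        # sum of all components of s_0 and h_i.
        z = sum(s) + sum(h)
        return 1.0 / (1.0 + math.exp(-z))
    e = [g(s0, h) for h in hidden_states]      # e_0i = g(s_0, h_i)
    total = sum(math.exp(x) for x in e)
    return [math.exp(x) / total for x in e]    # weights sum to 1

# One decoder state scored against three encoder states.
weights = attention_weights([0.1, -0.2], [[0.5, 0.3], [-0.4, 0.2], [1.0, 0.0]])
```

In the real model, the resulting weights multiply the encoder states to form the context vector fed to the decoder.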
Figure 6. Model accuracy by edit distance (x-axis: distance, 0-4; y-axis: accuracy, 60%-80%).

IV. CONCLUSION

In this research, an encoder-decoder model with an attention mechanism is applied to spelling correction for the Azerbaijani language. The training and test data consist of incorrect/correct sentence pairs, where the incorrect sentences in the training data are generated from the correct ones. The model is tested on real-world data, and the overall results are 75% for edit distance 0, 90% for edit distance 1, and 96% for edit distance 2.

REFERENCES

[1] R. Haldar and D. Mukhopadhyay, "Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach", 2011.
[2] G. Hicham, Y. Abdallah, and B. Mostapha, "Introduction of the weight edition errors in the Levenshtein distance", International Journal of Advanced Research in Artificial Intelligence, 1(5), 2012.
[3] G. Neha and M. Pratistha, "Spell Checking Techniques in NLP: A Survey", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 12, December 2012.
[4] F. Damerau, "A technique for computer detection and correction of spelling errors", Communications of the ACM, Volume 7, Issue 3, March 1964, pages 171-176.
[5] S. Mihov and K. U. Schulz, "Fast approximate search in large dictionaries", Association for Computational Linguistics, 2004.
[6] P. G. Jensen and K. G. Larsen, "Data Structure for Compressing and Storing Sets via Prefix Sharing", International Colloquium on Theoretical Aspects of Computing, September 2017.
Conference on Computational Advances in Bio and Medical Sciences, 2019.
[9] A. S. Lhoussain, G. Hicham, and Y. Abdellah, "Adaptating the Levenshtein distance to contextual spelling correction", International Journal of Computer Science and Applications, 2015.
[10] S. Zhang, Y. Hu, and G. Bian, "Research on string similarity algorithm based on Levenshtein distance", IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), 2017.
[11] C. Zhao and S. Sahni, "String correction using the Damerau-Levenshtein distance", 7th IEEE International Conference on Computational Advances in Bio and Medical Sciences, 2017.
[12] D. Jurafsky and J. H. Martin, "N-gram Language Models", Chapter 3, October 2019.
[13] V. Siivola and B. L. Pellom, "Growing an n-gram language model", 2005.
[14] P. Etoori, M. Chinnakotla, and R. Mamidi, "Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning", Proceedings of ACL 2018, Student Research Workshop, pages 146-152.
[15] S. Sooraj, K. Manjusha, M. Anand Kumar, and K. P. Soman, "Deep learning-based spell checker for Malayalam language", Journal of Intelligent & Fuzzy Systems, 34 (2018), pages 1427-1434.
[16] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate", ICLR, 2015.
[17] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bi-Directional Attention Flow for Machine Comprehension", ICLR, 2017.
[18] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A Neural Probabilistic Language Model", Journal of Machine Learning Research, 3, 2003.
[19] M.-T. Luong, H. Pham, and C. D. Manning, "Effective Approaches to Attention-based Neural Machine Translation", EMNLP, 2015.
[20] S. Mammadov, "Neural Spelling Correction for Azerbaijani Language", IEEE, 2019.