Spelling Correction in Agglutinative Languages: Surface

Spelling Correction in Agglutinative Languages
Kemal Oflazer and Cemaleddin G/izey

D e p a r t m e n t of C o m p u t e r E n g i n e e r i n g a n d I n f o r m a t i o n Science
Bilkent University
A n k a r a , 06533, T u r k e y
ko@cs, bilkent, edu. tr
1 Introduction We denote the set of the surface forms of the roots

in the language by R. We denote by X and Y
Spelling correction is an important component of strings formed using the alphabet of the language.
any system for processing text. Agglutinative lan- X will denote the surface form of the incorrect or
guages such as Turkish or Finnish, differ from lan- misspelled string, and Y will typically denote the
guages like English in the way lexical forms are gen- surface string that is a (possibly partial) candidate
erated. Typical nominal or a verbal root may gener- word. Yle~ will denote the lexical form of this candi-
ate thousands (or even millions) of valid forms which date string? We assume the existence of a function,
never appear in the dictionary. For instance, we surface() to generate surface strings from lexical st-
can give the following (rather exaggerated) exam- rings, i.e., surface(Yze~) = Y . The edit distance
ple from Turkish: metric e d ( Z , Y ) (Du and Chang, 1992) measures
uygarla~tzramayabileceklerimizdenmi~sinizcesine 1 how many unit operations (insertion, deletion, re-
whose morpheme breakdown is: placement of single character and transposition of
uygar -lag -tzr -area two adjacent characters) are necessary to convert
civilized +BECOME C A U S +NEG
-yabil -ecek -let -irniz one string X into another Y.
+POT +FUT +3PL +POSS-1SG We would like to abstract the behavior of a mor-
-den -mi~ -siniz -cesine phological generator and analyzer for the given lan-
+ABL +NARR +2PL +AS-IF
Methods developed for spelling correction for language by two finite state automata. A finite state
guages like English (see the review by Kukich (Ku- generator Mg = (P, 6, V, S, F) where P is a set of
kich, 1992)) are not readily applicable to agglutina- states, $ is the state transition function, V is the
tive languages. This poster presents an approach to output alphabet, S is the starting state, and F is
spelling correction in agglutinative languages that a set of final states, generates, all correctly formed
is based on two-level morphology and a dynamic- words of the language. The transitions of Mg are
programming based search algorithm. After an of the form /5(pi) = pj (Pi and pj 6 P), with an
overview of our approach, we present results from output v~ 6 V which denotes the lexical form of
experiments with spelling correction in Turkish. a morpheme in the language and also labels the
transition. It should be noted that it is possible to
2 Spelling correction algorithm go from one state Pi to another pj by more than one
transition, outputting a different morpheme. We say
Our approach comprises two-steps: (1) determining a string Ytez is generated by Mg, if Y)e, is formed
all the roots from the dictionary that can be the root by concatenating, in order, the outputs of the ma-
of the misspelled word, and (2) generating (system- chine as we traverse starting from S to one of the
atically) all the possible words that "resemble" the states in F. We denote by L(Mg) as the set of all
given character string, from these roots. The first lexical strings generated by Mg. We also assume a
step of the problem is relatively easy because of the finite state recognizer Mr which recognizes whether
static structure of the root dictionary. Techniques a given surface string is in the language or not.
developed for spelling correction, say, in English can The spelling correction problem can now be for-
usually be applied here. The second step involves mulated as follows: Given an incorrect word X
producing all the possible words from the selected (rejected by Mr), a!ad an edit distance threshold
roots and requires a generate and test search proce- t, find the solution set of possible correct words
dure. S(X,t) = {Yled(X,Y) < t and Y = surface(Yt~,)
and Y~e, 6 L(Mg)} in viable time and space.
1This is an adverb meaning roughly "(behaving) as
if you were one of those whom we might not be able 2Lexical and surface are used in the two-level mor-
civilize." phology sense.
194
The set of all the possible roots for the incorrect
word X is defined as P R ( X , t ) = {r [ ed(X[i],r) < Table 1: Average number of operations per mis-
spelled word.
t and 1 < i < m and r E R}. 3 We assume that
P R ( X , t ) can be computed by a standard q - g r a m
technique. Using a small number (3 - 5) of 2-grams,
gives satisfactory solutions in our case.
i Ed.
2
is.
IRcl°en L
30.9
108.4
311.2
4462.0
Ed6peDrm" I S°ln" ] %Accuracy
2498.4
20680.4
3.6
52.0
95.1
95.1
Assuming that we have a set of root words found
as described above, we now have to generate words in
the language having these roots, that do not deviate
our algorithm on a set of 141 randomly selected in-
from the given misspelled string by more than the
correct words from Turkish text with edit distances
threshold. The solution requires a generate and test
1 (86%) and 2 (14%) to their correct form. The
probing of the finite-state a u t o m a t o n Mg, starting
morphological analyzer and generator that we used
with the start state S. We now have to find all the
was our two-level specification for Turkish (Oflaser,
paths from this state to one of the final states using 1993), developed using the PC-KIMMO system. It
the roots in PR(X, t), so that when the morphemes is, however, rather slow and can analyze only about
along this p a t h are concatenated and surface string
2 forms per second and can generate about 50 forms
is generated, it is within an edit distance t of X.
a second on Sun Sparcstations. So, instead of using
When the search starts morphemes are concate-
timings, we counted the number of times certain ex-
nated to root and the length of the candidate lexical pensive operations we called. The statistics shown in
string Yle~ increases. After one step of the search,
Table 1 show the average number of morphological
the partial surface string Y is compared with a suit-
recognitions and generations, and the edit distance
able prefix of X. In most of the cases the candidate
operations required, and the number of correct solu-
Y will deviate from these prefixes of X by more than tions offered per misspelled input word. The last col-
the threshold without reaching a final state, so that
umn indicates the percentage of cases the intended
we can no longer get to a viable solution. In such
correct form was found. We also have developed a
cases we do not consider any further transitions from
algorithm for ranking the solutions which offers the
that state. Special attention has to paid when a suf-
intended correct form as the first in 74% of the cases
fixation changes the surface realization of morpheme
when t = 1.
immediately to the left.
4 Conclusions
3 Results from experiments with
spelling correction in Turkish This poster has presented a spelling correction algo-
rithm for agglutinative languages that is based on a
We first present a spelling correction example from
two-level morphological generator and analyzer, and
our implementation where we used bigrams (q = 2),
a intelligent generate and test search procedure.
and we chose the number of bigrams to use for deter-
mining the root, k, as 3 (see Figure 1). We tested 5 Acknowledgement
EXAMPLE The research was supported in part by a NATO Sci-
Misspelled t:word: qal§rnalamyla ence for Stability Grant, T U - L A N G U A G E .
Threshold : 2
Solutions: yam§malamyla yatl§malanyla
on left edge yapl§malarlyla yakl§malarlyla
i¢~,§malamyla qlk1§rnalarlyla References
Candidate lq.oots:46a~qakl 6al qah qam qan 6ap gar qat qatl
qag qak qakl§ qal qah§ ~ap qat qatl§ qav
~av qay M .W. Du and S. C. Chang. 1992 A model and a fast
algorithm for multiple errors spelling correction.
Solutions: 5 Lexical Surface
Acta Informatica, 29:281-302.
Edit distance 1 <jat+H§+mA+lArH+ylA qatl§malanyla
~ap+ H§-b rnA+lArH-bylA qapl§malamyla K. Kukich. 1992 Techniques for automatically cor-
qah§-l-mA+lArH+ylA ~ah§rnalarlyla (corr.) recting words in text. ACM Computing Surveys,
Edit Distance 2 qav+mA+lArH+ylA qavmalamyla
24:377-439.
q'at + m A + I A r H + y l A qatmalamyla
K. Oflazer. 1993 Two-level description of Turkish
morphology. In Proceedings of the Sixth Confer-
Figure 1: Spelling correction example ence of the European Chapter of the Association
for Computational Linguistics, April. A full ver-
sion appear in Literary and Linguistic Computing,
3X[i] denotes the string of the first i characters of X. Vol.9 No.2, 1994.
4The duplicate entries in the list of candidate roots
in fact have different part-of-speech categories andhence
different morphota~=tics.
5A small subset of the whole solution set is given, due
to space limitations.
195

Spelling Correction in Agglutinative Languages: Surface

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Spelling Correction in Agglutinative Languages: Surface

Uploaded by

Copyright:

Available Formats

Spelling Correction in Agglutinative Languages

Kemal Oflazer and Cemaleddin G/izey

1 Introduction We denote the set of the surface forms of the roots

You might also like