Professional Documents
Culture Documents
Stemming is used in information retrieval systems to re- dures that merely remove plurals, past and present parti-
duce variant word forms to common roots in order to im- ciples, to more sophisticated techniques that encompass
prove retrieval effectiveness. As in other languages, there all of the important types of morphological variation.
is a need for an effective stemming algorithm for the index-
ing and retrieval of Malay documents. The Malay stemming Such procedures involve either the removal of the single
algorithm developed by Othman is studied and new ver- longest matching suffix or the iterative removal of several
sions proposed to enhance its performance. The improve- suffixes, and may also require the specification of de-
ments relate to the order in which the dictionary is looked- tailed context-sensitive rules if there is not to be a sig-
up, the order in which the morphological rules are applied,
and the number of rules.
nificant error rate (Lennon, Peirce, Tarry, & Willett,
1981; van Rijsbergen, 1979) . Other reports have detailed
stemmers both for languageswith a similar morphologi-
cal structure to that of English, such as Slovene (Popovic
& Willett, 1992) and French (Savoy, 1993), and for lan-
guagesthat are lesssimilar, such as Finnish (Jappinen &
Introduction Ylilammi, 1986) and Turkish (Oflazer, 1994). How-
Lovins ( 1968) defines a stemming algorithm as “a ever, these languageshave in common the fact that vari-
computational procedure which reduces all words with ant word forms are created by adding suffixes to a basic
the same root (or, if prefixes are left untouched, the same stem, and the removal of just the suffixes by a stemmer
stem) to a common form, usually by stripping each word has thus been found to be sufficient for the purposes of
of its derivational and inflectional suffixes.” For exam- information retrieval (Porter, 1980).
ple, the words group, groups, grouped, grouping, or sub- The usageof affixes in English (and similar languages)
groups are reduced to the root group. A stemming algo- is far less complex than in languages such as Malay
rithm has applications in the fields of information re- (Sembok, Yusoff, & Ahmad, 1994) and Arabic (Al-
trieval and computational linguistics. In information Kharashi & Evens, 1994)) where the stripping of suffixes
retrieval, grouping words having the same root under the alone would not be sufficient for retrieval purposes.
same stem will increase the successwith which docu- Thus, focusing on Malay, the root of the words makanan
ments can be matched against a query (Frakes, 1992; (food), pemakan (eater), dimakan (being eaten), pem-
Harman, 1991). In computational linguistics, there is a akanan (diet), and termakan (accidentally eaten) is ma-
need to identify linguistically correct roots, since the at- kan (eat), and a simple stemmer that removed only
tached affixes provide information about the grammati- suffixes would yield five different stems: makan, pema-
cal function of a word and sometimes to its meaning kan, dimakan, pemakan, and termakan. It is clear that
(e.g., when considering prefixes) and thus help in the it is not possible to stem Malay text effectively without
syntactic analysis of a sentence. considering the removal of prefixes as well as suffixes.
Many different kinds of stemming algorithms for En- To our knowledge, there is only one existing Malay
glish have been proposed, ranging from simple proce- stemming algorithm that has been developed and tested
for applications in information retrieval. This algorithm,
which was developed by Othman ( 1993) as an M.Sc. dis-
* To whom all correspondence should be addressed. sertation project, uses 121 morphological rules, which
Received April 18, 1995; revised September 13, 1995; accepted Sep-
tember 13, 1995.
are arranged and applied to an input word in alphabeti-
cal order, and a dictionary of Malay words that is derived
8 1996 John Wiley & Sons, Inc. from a very large Malaysian dictionary, the Kamus De-
JOURNAL OF THE AMERICAN SOCIETY FOR 1NFORMATlON SCIENCE. 47(12):909-918, 1996 CCC 0002-8231/96/120909-10
wan (DBP, 199 1) . The dictionary plays an important wrong to used the affixes me and an to the word minum
role in the morphological analysis of Malay: Stemming (drink) to form meminuman. In the following sections,
experiments with and without a dictionary have been re- we shall briefly discuss the classesof affixes in the Malay
ported by Sembok et al. ( 1994)) who show that high er- language.
ror rates and some incomprehensible roots are obtained
if a dictionary is not used. In this article, we describe sev-
eral improvements that we have been able to make to Prejixes
Othman’s algorithm to make it better able to handle the There are many prefixes available in the Malay mor-
morphological complexities of Malay text. The next sec- phology: Some of the most important or commonly used
tion provides a brief introduction to these complexities, ones are di, ke, se, beR, meN, teR, peN, and peR. Prefixes
and this is followed by a discussion of Othman’s algo- like di, ke, and se do not change their form when they
rithm and its main characteristics. We then describe the are combined with a root word, and there are also not
modifications we have made to the original algorithm. any changes in the spellings of the root words that are
The various procedures are tested with two very different attached to them. However, prefixes like beR, meN, teR,
data sets, and the article concludes with a summary of peN, and peR will change their forms, specifically the let-
our major findings. ters written in capitals below, depending on the first let-
ters of the roots that are attached to them. The following
examples illustrate the changes that can occur when a
prefix is attached to a root:
660+ 19 19 32 36 22 35 6 6 I9 24 11 24
420- 15 15 18 21 17 20 6 6 9 13 10 13
371* 13 13 15 16 13 15 6 6 8 10 8 10
earlier the rules are grouped into prefix (pr), suffix dictionary will avoid doing stemming on words which
( su ) , prefix-suffix pair ( ps ) , and infix (in) rules. The are already root words. The source data used here was
order of rules within the four classes of rules is arbi- three chapters from the Malay version of the Quran.
trary and order-dependent. We have tested all possi- The first set of six runs used the basic Othman algo-
ble orderings of the four classes, with the exception rithm, in which the dictionary is not first checked before
that the infix rules are always tested last, owing to
their rarity in Malay. The order ofapplication ofthe
the application of the stemmer, with each of the six rule-
rules is thus: orderings defined in the previous section. The only
difference in the secondset of six runs was that an initial
Test- 1: pr-ps-su-in; check was carried out. The dictionary that was used was
Test-2: pr-su-ps-in; a version of the Kamus Dewan (DBP, 1991) that had
Test-3: ps-pr-su-in;
been manually modified to contain a total of 22,393
Test-4: ps-su-pr-in;
word roots.
Test-5 su-pr-ps-in;
Test-6: su-ps-pr-in.
The performance of the various procedureswas mea-
sured by the numbers of words that were stemmedincor-
(3) To evaluatethe relativemerits of the setsof rules rectly as compared to the stemsproduced manually. The
that aredefinedin AppendicesA-C. numbers of words incorrectly stemmed are obtained by
running the various procedureson three different collec-
tions of words from the data sets as follows: All word
occurrences in each chapter/abstract are stemmed; all
Experimental Results and Discussion unique words within each chapter/abstract are
stemmed; and only unique words among all the
chapters/abstracts are stemmed. The total numbers of
Initial Checking of the Dictionary
words stemmed for the three collections are marked with
These experiments were carried out to determine the symbols +, -, and *, respectively,in Tables 1-3 and
the effect of checking each input word against a dictio- 6. Table 1 shows the results obtained from the two
nary of root words before it was processedby the stem- groups of experiments, and it will be seenthat there is a
mer, which in this case was based on the 432 rules significantly lower error rate if an initial dictionary check
listed in Appendix B. The initial checking against the is carried out prior to the invocation of the stemmer. It
TABLE 2. Effect of different rule orderings on numbers of errors. Ten chapters from Quran. Six different rule orderings and other approaches.
757+ 14 17 11 17 23 20 23 132 61
504- 12 13 9 14 18 15 16 90 38
434* I1 12 8 13 17 14 16 76 33
will also be seen that the best results are obtained with yielding the shortest root possible from the removal of a
Test-l and Test-2, i.e., when the prefix rules are used single affix (and conversely for the shortest-match
first. approach ) .
The results obtained are listed in Tables 2 and 3. It
will be seen from these tables that there are only m inor
differences between the six class orderings that can be
used with the set-B rules. However, there are differences
Order OfApplication of the Rule Sets
between the Quran and abstracts data sets and it is not
We have studied the order of application in more de- possibleto say that any one of the approachesis unequiv-
tail using the set-B rules with our two principal data sets, ocally the best. There is, however, no doubt that the
viz 10 chapters of the Quran and 10 research abstracts. other three approaches,i.e., the original algorithm and
For comparison, we have also run Othman’s original al- the shortest (or longest) match algorithms, yield many
gorithm and two versions of the set-B rules, which we more incorrect word roots.
shall refer to as longest match and shortest match. In the The types of error produced by Tests l-6 are shown
longest-match approach, we simply remove the single in Tables 4 and 5, which are for the Quran and research-
longest affix that can be mapped to the input word, thus abstracts data sets, respectively. The errors have been
Test no.
Word Actual root Resulting root Error type where error occurs
Test no.
Word Actual root Resulting root Error type where error occurs
classified into five classes: Overstemming, understem- words that can either be a root word themselvesor be a
m ing, unchanged, spelling exception, and others. The derivative of some other word, although such an error
first three have been identified by Savoy ( 1993) and the can also occur if there are some non-root words in the
other two are introduced to cater specifically to errors dictionary of (supposedly) root words. The wordsfirmu-
encountered in stemming of Malay words. Overstem- lasi, realisasi, and klinikal remained unchanged in the
m ing occurs when more characters have been removed research-abstractsdata set since set-B does not include
from the input word than necessary.On the other hand, any rules for the derivation of modern Malay words from
understemming occurs when too few characters have foreign words that use the suffixes si, asi, and al. The
been removed. Unchangedis in fact a special caseof un- effect of adding such rules was explored in the final set of
derstemming. It occurs when no characters have been experiments.
removed when some should have been removed in order
to get the correct root. Spelling exception occurs when
the first letter of the stem obtained is not correctly re-
coded after the prefix has been removed. Other types of
Use of an Extended Set of Rules
error are classifiedas others.
The largest class of errors is that called others. These The final experiments used the extended set of rules,
are mostly due to the order in which the rules are applied set-C, and applied them to the Quran and research-ab-
when a word is stemmed. The second most common er- stracts data sets, as shown in Table 6. For comparison,
ror type is unchanged. These errors mainly occur with we have also included again the results obtained with
Rules Rules
1548+ 13 19 757+ 11 6
1053- I3 I7 504- 9 5
736+ I2 15 434* 8 5
* Rules of set-A are also included in this set but not listed here.