A Study of Stemming Effects On Information Retrieval in Bahasa Indonesia

A Study of Stemming Eects on Information Retrieval in Bahasa Indonesia
Fadillah Z Tala 0086975
Master of Logic Project Institute for Logic, Language and Computation Universiteit van Amsterdam The Netherlands
Contents
1 Introduction 2 A Purely Rule-based Stemmer for Bahasa Indonesia 2.1 2.2 2.3 Morphological Structure of Bahasa Indonesia Words . . . . . . . . . . . . . . . . . The Porter Stemming Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . Porter Stemmer for Bahasa Indonesia . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 3 3 6 6 6 11 11 12 13 16 16 16 18 18 18 19 19 20 21
3 Evaluation of the Stemming Algorithm 3.1 Stemmer Quality Evaluation 3.1.1 3.1.2 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Paice Evaluation Method . . . . . . . . . . . . . . . . . . . . . . . . . . The Paice Experimental Results . . . . . . . . . . . . . . . . . . . . . . . .
Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 3.2.2 Inectional Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Derivational Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4 Stemmer Performance Evaluation for Information Retrieval 4.1 The Test Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 4.1.2 4.1.3 4.2 4.3 The Document Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Information Requests (Queries) . . . . . . . . . . . . . . . . . . . . . . Relevant Documents for Every Information Request . . . . . . . . . . . . .
The FlexIR System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Performance Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
4.3.1 4.3.2 4.3.3 4.4 4.5
Precision/Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R-Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21 21 21 21 22 23 26 31 32 34 36 37 39
Stoplists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 4.5.2 4.5.3 Statistical Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Detailed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary of the Detailed Analysis . . . . . . . . . . . . . . . . . . . . . . .
5 Conclusions A Derivational Rules of Prex Attachment B The Meaning of Axations C Word Frequency Analysis D A Stoplist for Bahasa Indonesia
iv
List of Figures
2.1 3.1 3.2 4.1 4.2 4.3 4.4 4.5 4.6 The basic design of a Porter stemmer for Bahasa Indonesia. . . . . . . . . . . . . . Illustration of Paice evaluation methods. . . . . . . . . . . . . . . . . . . . . . . . . UI x OI plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Document example: kompas document KOMPAS-HL2001-310101-PRES01. . . . . Query example: query KOMPAS-HL2001-Q-2. . . . . . . . . . . . . . . . . . . . . 7 14 15 19 20
Comparison of Recall-Precision between non stopwords vs. stopwords ltering system. 22 PR-Curves for kompas Collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . PR-Curves for tempo Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Quantile Plots from Non-interpolated average precision values of Nazief for the kompas collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 24 25
List of Tables
2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 3.1 3.2 3.3 3.4 3.5 4.1 4.2 4.3 4.4 4.5 4.6 Illegal conx pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Double prexes order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The rst cluster of rules which covers the inectional particles. . . . . . . . . . . . The second cluster of rules which covers the inectional possessive pronouns. . . . The third cluster of rules which covers the rst order of derivational prexes . . . . The fourth cluster of rules which covers the second order of derivational prexes . The fth cluster of rules which covers the derivational suxes . . . . . . . . . . . . Examples of syllables in Bahasa Indonesia words. . . . . . . . . . . . . . . . . . . . Comparison of two Bahasa Indonesia stemmers. . . . . . . . . . . . . . . . . . . . . Results of stripping inectional suxes. . . . . . . . . . . . . . . . . . . . . . . . . Errors in the inectional sux stripping. . . . . . . . . . . . . . . . . . . . . . . . . Results of derivational prex and sux stripping. . . . . . . . . . . . . . . . . . . . Spelling adjustment errors in stripping suxes. . . . . . . . . . . . . . . . . . . . . Test-Collection Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test-Query Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average Precision and R-Precision results of system without and with stoplist (NoSo and So) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average Precision and R-Precision results over all queries for the three systems . . ANOVA Table for Average Precision Measurement . . . . . . . . . . . . . . . . . . ANOVA Table for R-Precision Measurement . . . . . . . . . . . . . . . . . . . . . . 5 5 7 8 8 8 9 9 15 16 17 17 17 18 20 22 23 26 26 34
A.1 Rules and Variation Forms of Prexes . . . . . . . . . . . . . . . . . . . . . . . . . vi
B.1 The meaning of axations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.1 Most frequently occur words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D.1 Suggested stoplist for Bahasa Indonesia . . . . . . . . . . . . . . . . . . . . . . . . D.2 Most common words in Bahasa Indonesia newspapers . . . . . . . . . . . . . . . .
36 38 39 43
vii
Chapter 1
Introduction
Stemming is a process which provides a mapping of dierent morphological variants of words into their base/common word (stem). This process is also known as conation [10]. Based on the assumption that terms which have a common stem will usually have similar meaning, the stemming process is widely used in Information Retrieval as a way to improve retrieval performance. In addition to its ability to improve the retrieval performance, the stemming process, which is done at indexing time, will also reduce the size of the index le. Various stemming algorithms for European languages have been proposed [10, 16, 17, 24, 28, 29, 31, 32]. The designs of these stemmers range from the simplest technique, such as removing suxes by using a list of frequent suxes, to a more complicated design which uses the morphological structure of the words in the inference process to derive a stem. These algorithms have also been evaluated in order to examine their eect on the retrieval performance. A good summary of these evaluation results can be found in [10, 19]. Results of stemming usage in information retrieval are inconsistent. Harman [12], in her experiments with three sux stripping algorithms for English, reported inconsistent results. Whilst Krovetz [20] and Hull [15] both reported more favorable results of the stemming usage in English, especially for short queries. Popovic and Willett [28] reported a signicant improvement in retrieval precision for Slovene language which is more complex than English [18]. He also reported that his control experiments conrmed the results in [12]. Experiments of stemmer usage for other European languages which are more complex than English, showed an improvement of retrieval precision and recall [13, 19, 27]. These studies support the hyphothesis in [18] and [27] namely, that the eectiveness of stemming in an IR systems also depends on the morphological complexity of the language. In the case of Bahasa Indonesia, so far there is only one stemming algorithm which is developed by Nazief and Adriani [23]. This stemming algorithm was developed using a conx stripping approach with a dictionary look-up. The dictionary is very simple, it consists of a list of lemmas. The stemming process is done by stripping the shortest possible match of axes. The dictionary look-up is performed before each stripping step and the stripping process itself is implemented recursively. However, it is unfortunate that there is no experimental report about the eect of this stemmer on the retrieval performance. The morphological complexity of Bahasa Indonesia can be considered simpler than English because it does not recognize tenses, gender and plural forms. It is interesting to investigate whether the study of stemming eect in Information Retrieval in Bahasa Indonesia
will also support that hyphotesis. These are main reasons that motivated us to evaluate the stemming eect in Bahasa Indonesia. Results of the experiments reported by Ahmad et al. [2] pointed out that dictionary plays an important role in the stemming process for Malay language. Since Bahasa Indonesia and the Malay language are very similar, we assume that dictionary also plays an important role in the stemming process of Bahasa Indonesia. However, based on the fact that resources such as a large digital dictionary for this language are expensive due to the lack of computational linguistics research, clearly, there is a practical need for a stemming algorithm without dictionary involvement. From a scientic point of view, it is also interesting to see whether stemming algorithm without involving a dictionary would also be eective for Bahasa Indonesia, such as it proved to be for Slovene [28] and Dutch [19]. This thesis is about a study of stemming algorithms in Bahasa Indonesia, especially their eect on the information retrieval. We try to evaluate the existing stemmer for Bahasa Indonesia [23] and compare it with a purely rule-based stemmer, which we created for this purpose. This rule-based stemmer is developed based on a study of morphological structure of Bahasa Indonesia words. A summary of the morphological structure of words in Bahasa Indonesia is introduced in Chapter 2. Chapter 2 also includes the design and the implementation of our rule-based stemmer. Since the quality of the stemming algorithm in [23] has never been assessed, we conducted an experiment to evaluate its quality. In this experiment, we chose the Paice evaluation method [25] and results are given in Chapter 3. In this chapter, we also evaluated the eect of dictionary size to the quality of the stemmer. This hopefully will answer to what extent the dictionary-based stemmer can be approximated by a purely rule-based stemmer. It is especially relevant in the case of developing languages such as Bahasa Indonesia where new words are continuously being adopted. The main task of this thesis is discussed in Chapter 4. In this chapter, the evaluation of stemming on the retrieval performance is explained in detail. In this evaluation, we used the traditional Precision/Recall measure. We also performed some detailed evaluations resulting in more concrete results. Finally, Chapter 5 describes the conclusion of our experiments.
Chapter 2
A Purely Rule-based Stemmer for Bahasa Indonesia

The purely rule-based stemmer we developed here is a Porter-like stemmer which is modied for Bahasa Indonesia. The Porter stemmer was chosen based on the consideration that its basic idea seems appropriate for the morphological structure of words in Bahasa Indonesia. First, a brief introduction to the morphological structure of words in Bahasa Indonesia is given, and second we will explain the mechanism of a Porter stemmer. Last, we will explain the modied Porter stemmer for Bahasa Indonesia which we used as the comparison in the retrieval evaluation.
2.1
Morphological Structure of Bahasa Indonesia Words
In this section, we discuss the morphological structure of Bahasa Indonesia words which is based on information in [6, 7, 23, 35]. We also looked at the morphological structure of Malay language words [1], since Bahasa Indonesia is very similar to Malay language. This discussion includes prexes, suxes, and combinations of them (conxes) in derived words. Although inxes do exist in Bahasa Indonesia, the number of derived words from these inxes is very small. Because of this and for the sake of simplicity, inxes will be skipped and ignored. Words that contain an inx will be considered as they are. The morphology of Bahasa Indonesia words can comprise both inectional and derivational structures. Inectional is the simplest structure which is expressed by suxes which do not aect the basic meaning of the underlying root word. These inectional suxes can be divided into two groups: 1. Suxes -lah, -kah, -pun, -tah. These suxes are actually the particles or functional words which have no meaning. Their occurrence in words is for emphasizing, examples: dia (she/he) saya (I) + + kah lah diakah (he/she - with emphasizing for questioning) sayalah (I - with emphasize)
2. Suxes -ku, -mu, -nya. These suxes, which are attached to the words, form the possessive pronouns, examples: 3
tas (bag) sepeda (bicycle)
+ +
mu (you) ku (me)
tasmu (your bag) sepedaku (my bicycle)
Each sux of groups 1 and 2 may occur in the same word. When they are both present, they follow a strict order: suxes of the second group always precede the rst group. This ordering motivates the following denition. Denition 2.1 The morphological structure of an inectional word is: inflectional := (root + possessive_pronouns) | (root + particle) | (root + possessive_pronouns + particle) The attachment of inectional suxes to a word/root will not change the spelling of the word/root in the derived word. In other words, no character in the root/original word is diluted in the derived word. The root/original word can still be located easily in the derived word. Just like the Malay language, derivational structures of Bahasa Indonesia consist of prexes, suxes and a pair of combinations of the two [1, 6, 7, 34, 35]. The most frequent prexes are: ber-, di-, ke-, meng-, peng-, per-, ter- [34, 35]. The following list shows an example of each prex: ber + lari (to run) berlari (to run, running) di + makan (to eat) dimakan (to be eaten - passive form) ke + kasih (to love) kekasih (lover) meng + ambil (to take) mengambil (taking) peng + atur (to arrange) pengatur (arranger) per + lebar (wide) perlebar (to make wider) ter + baca (to read) terbaca (can be read, readable) Some prexes such as ber-, meng-, peng-, per-, ter- may appear in several dierent forms. The form of each of these prexes depends on the rst character of the attached word. Unlike the inectional structure, the spelling of the word may be changed when these prexes are attached. As an example, take the words menyapu (to sweep, sweeping) which is constructed from the prex meng- and the root word sapu (broom, sweeping-brush). The prex meng- is changed to meny- and the rst character of the root word is diluted. Rules for various forms of these prex attachments can be found in Appendix A, Table A.1. Derivational suxes are: -i, -kan, -an. Examples of words with these suxes are: gula (sugar) + i gulai (to put sugar to) makan (to eat) + an makanan (food, something to be eaten) beri (to give) + kan berikan (to give to) In contrast to prexes, the attachment of suxes never changes the spelling of the root in the derived word. As mentioned earlier, the derivational structure also recognizes conxes, where a combination of prex and sux attaches together in a word to derive a new word. For example: 4
per + main (to play) + an permainan (toy, game, thing to be played) ke + menang (to win) + an kemenangan (victory) ber + jatuh (to fall) + an berjatuhan (falling) meng + ambil (to take) + i mengambili (taking repeatedly) Not all combinations of prexes and suxes can be joined together to form a conx. There are some combinations of prex and sux which are not permitted. Table 2.1 shows all of the illegal conxes. Table 2.1: Illegal conx pairs.
Prex ber di ke meng peng ter Sux i an i|kan an i|kan an
A prex/conx can be added to an already conxed/prexed word, which results in a double prex structure. Just like the construction of conxes, not all prexes/conxes can be added to a certain conxed/prexed word to form a double prex. There exist rules which govern the ordering of these double prexes, but there are some exceptions to this rule. Table 2.2 shows these ordering rules. Table 2.2: Double prexes order.
Prex 1 meng di ter ke Prex 2 per ber
Denition 2.2 The morphological structure of a derivational word: derivational where prefixed suffixed confixed double_prefix := := := := prefix + root root + suffix prefix + root + suffix (prefix + prefixed) | (prefix + confixed) | (prefix + prefixed + suffix) := prefixed | suffixed | confixed | double_prefix
The last possibility to derive a new word is by adding the inectional suxes to an already prexed, suxed, conxed and even double prexed word. These forms are the most complex structure in Bahasa Indonesia. Nazief and Adriani in [23] called these structures multiple suxes. From Denition 2.1 and Denition 2.2, the general morphological structure of words in Bahasa Indonesia can be simplied by Denition 2.3. 5
Denition 2.3 The morphological structure of words in Bahasa Indonesia: [prefix1] + [prefix2] + root + [suffix] + [possesive_pronoun] + [particle] where [...] means an optional occurrence.
2.2
The Porter Stemming Algorithm
The Porter stemming algorithm is a conation stemmer which was proposed by Porter [29]. The algorithm is based on the idea that suxes in English are mostly made-up of a combination of smaller and simpler suxes [26]. The stripping process is performed on a series of steps, specically ve steps, which simulates the inectional and derivational process of a word. At each step, a certain sux is removed by means of substitution rules. A substitution rule is applied when a set of conditions/constraints attached to this rule hold. One example of such a condition is the minimal length (the number of vowel-consonant sequences) of the resulting stem. This minimum length is called measure [29]. Other simple conditions on the stem can be whether the stem ends with a consonant, or whether a stem contains a vowel. When all conditions of a certain rule are satised, the rule is applied, which causes the removal of the sux and the control moves to the next step. If the conditions of a certain rule in a current step cannot be met, the conditions of the next rule in that step are tested, until a rule is red or the rules in that step are exhausted. This process continues for all ve steps.
2.3
Porter Stemmer for Bahasa Indonesia
As mentioned at the beginning of this chapter, the Porter stemmer was chosen based on the consideration that its main idea ts the morphological structure of words in Bahasa Indonesia. The morphological structure of words in Bahasa Indonesia consists of a combination of smaller and simpler inectional and derivational structure, where each is made-up of simpler and smaller suxes and/or prexes. This seems to t the basic idea of the Porter algorithm. The series of linear steps in the Porter stemmer, which simulate the inectional and derivational process of words in English also ts the derivational and inectional structure of Bahasa Indonesia (Denition 2.3). These series of linear steps hopefully will reduce a word with complex structure in Bahasa Indonesia to a correct stem. The basic design of the Porter stemmer in Bahasa Indonesia is illustrated in Figure 2.1.
2.3.1
Implementation
Our implementation of the Porter algorithm is based on the English Porter Stemmer developed by Frakes [10]. This version is more readable because Frakes made a clear separation between substitution rules and procedures for testing the attachment conditions. Because English and Bahasa Indonesia come from two dierent class of languages, some modications had to be performed in order to make Porters algorithm suitable for Bahasa Indonesia. The modications consist of modications in the cluster of rules and the measure condition. Since Porters algorithm can only do sux stripping, some additions have to be done also for handling
word
Remove Particle
Remove Possesive Pronoun
Remove 1st Order Prefix fail a rule is fired
Remove 2nd Order Prefix
Remove Suffix a rule is fired
Remove Suffix
Remove 2nd Order Prefix
fail
stem
Figure 2.1: The basic design of a Porter stemmer for Bahasa Indonesia. prex stripping, conx stripping, and also spelling adjustment in the case where dilution of the rst character of the root word had occurred.
Ax-rules Based on the morphological analysis in Section 2.1, ve ax-rule clusters were created for our Porter stemmer for Bahasa Indonesia. These ve clusters are dened by reversing the order in which the axes occur in the word formation process (see Denition 2.3). This means that the inectional suxes, i.e. particles and possessive-pronouns, are removed rst before the derivational axes. The ve ax-rule clusters are shown in Table 2.3, Table 2.4, Table 2.5, Table 2.6 and Table 2.7. Table 2.3: The rst cluster of rules which covers the inectional particles.
Sux kah lah pun Replacement NULL NULL NULL Measure Condition 2 2 2 Additional Condition NULL NULL NULL Examples bukukah buku adalah ada bukupun buku
Measure Condition In order to cope with the spelling of Bahasa Indonesia, the measure condition, which is used in Porters algorithm, is modied. In Bahasa Indonesia, the smallest unit of a word is suku kata 7
Table 2.4: The second cluster of rules which covers the inectional possessive pronouns.
Sux ku mu nya Replacement NULL NULL NULL Measure Condition 2 2 2 Additional Condition NULL NULL NULL Examples bukuku buku bukumu buku bukunya buku
Table 2.5: The third cluster of rules which covers the rst order of derivational prexes
Prex meng meny men mem mem me peng peny pen pem pem di ter ke
Replacement NULL s NULL p NULL NULL NULL s NULL p NULL NULL NULL NULL
Measure Condition 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Additional Condition NULL V. . . NULL V. . . NULL NULL NULL V. . . NULL V. . . NULL NULL NULL NULL
Examples mengukur ukur menyapu sapu menduga duga menuduh uduh memilah pilah membaca baca merusak rusak pengukur ukur penyapu sapu penduga duga penuduh uduh pemilah pilah pembaca baca diukur ukur tersapu sapu kekasih kasih
This notation means that the stem starts with a vowel.
Table 2.6: The fourth cluster of rules which covers the second order of derivational prexes
Prex ber bel be per pel pe
Replacement NULL NULL NULL NULL NULL NULL
Measure Condition 2 2 2 2 2 2
Additional Condition NULL ajar K er. . . NULL ajar NULL
Examples berlari lari belajar ajar bekerja kerja perjelas jelas pelajar ajar pekerja kerja
This notation means that the stem starts with a consonant.
(syllable). A syllable comprises at least one vowel. Some examples of the adapted measure for Bahasa Indonesia can be seen in Table 2.8 The word measure which is designed here cannot capture all the actual measure of words in Bahasa Indonesia. This is because Bahasa Indonesia also recognizes diphthongs, that is a sequence of two vowels which is considered as a non-separable vowel. There are ai, au, oi diphthongs, e.g: pantai (beach) consists of two syllables pan and tai. These diphthong forms are problematic, especially for the diphthongs ai and oi when they occur at the end of a word. It is dicult to separate it automatically from derivational words with sux -i, such as tandai (to give a sign), which consists of three syllables, i.e. tan, da and i. Since the
Table 2.7: The fth cluster of rules which covers the derivational suxes
Sux kan an i Replacement NULL NULL NULL Measure Condition 2 2 2 Additional Condition prex {ke, peng} / prex {di, meng, ter} / V|K. . . c1 c1 , c1 = s, c2 = i and prex {ber, ke, peng} / Examples tarikkan tarik (meng)ambilkan ambil makanan makan (per)janjian janji tandai tanda (men)dapati dapat pantai panta
Table 2.8: Examples of syllables in Bahasa Indonesia words.

Measure 0 1 2 3 Examples kh, ng, ny ma, af, nya, nga maaf, kami, rumpun kompleks mengapa, menggunung, tandai Syllables kh, ng, ny ma, af, nya, nga ma-af, ka-mi, rum-pun, kom-pleks meng-a-pa, meng-gu-nung, tan-da-i
number of words with diphthong is smaller than the number of words with sux -i, diphthongs are ignored. This causes words with diphthongs ai/oi to be treated as derivational words. The last character (-i) will be removed as the result of stemming process. Based on the raw data collected by Nazief [22] and data from our own experiment of stopwords, an analysis on syllables in Bahasa Indonesia had also been conducted. This analysis is performed automatically with manual correction and checking. The result of the analysis showed that most of root words in Bahasa Indonesia consist of minimum of two syllables. This is the reason that the minimum length of the stemmed word is two.
Prex-stripping and Spelling Adjustment Prex stripping is handled by treating it just like sux stripping, with reverse replacement, that is at the beginning of the word. Since the prex attachment might in some cases change the spelling of the attached word, spelling correction/adjustment must be performed. There is a diculty in the implementation of the spelling correction since some rules in the derivational structure of Bahasa Indonesia themselves lead to ambiguity (see Appendix A). For example, take the prex meng-, a derived word mengubah (changing) may originate from ubah (to change) or kubah (dome). Meanwhile the word mengalah (to give up) may originate from kalah (to loose), or alah (to dry out). Therefore the spelling adjustment for these ambiguity rules are neglected. We realize that this may lead to overstemming/understemming errors. The spelling adjustment for non-ambiguous rules are done directly by substituting the prex with the proper character for that prex and its stem. The rules in Table 2.5 and Table 2.6 are ordered in such a way that the spelling adjustment for each prex removal can be accommodated properly.
Conx and Double Prex Stripping The conx stripping case is handled in the main algorithm by arranging a consecutive sequence of prex and sux replacements. The prex stripping is always prior to the sux stripping. An additional condition is added to check the possibility of a sux to form a legal conx combination 9
with the previously removed prex. A sux rule cannot be applied if its additional requirement is not fullled. By neglecting the inectional suxes, there are ve possible forms of a derived word, i.e. prex only, sux only, conx word, prex of an already conxed word, or conx of an already prexed word. The rst three possibilities actually can be handled by a sequence of prex and sux replacement and the additional condition of legal conxes. The last two possibilities are actually double prexes. They can be handled by adding another prex stripping or conx stripping, which is dependent on the previous prex and sux replacement.
10
Chapter 3
Evaluation of the Stemming Algorithm

Before stemming is used for retrieval purposes, we want to evaluate the quality of the two stemming algorithms. The purely rule-based stemmer often yields a stem which cannot be considered to be comprehensible words, especially in Malay [2], while the linguistically-motivated (dictionarybased) stemmer can eliminate most of these errors [2, 17]. Therefore we need to perform an experiment to compare the quality of those two stemmers. This evaluation will hopefully give some information of how good or bad a purely rule-based stemming algorithm is, compared to a linguistically-motivated stemmer.
3.1
Stemmer Quality Evaluation
Out of various methods to evaluate the quality of a stemmer [10, 24, 25], we chose the Paice evaluation method [25]. In this evaluation method, the quality of the stemmer is assessed by counting the number of identiable errors during the stemming process. The input words from various samples of texts have to be semantically grouped. Ideally, a good stemmer will stem all words from the same semantic group to the same stem. But due to the irregularities which are prominent in all natural languages, all stemmers unavoidably make mistakes, including the ones which use vocabulary lists. In other words, we might say that no stemmer can be expected to work perfectly correct. There will always exist error cases where words which ought to be merged will not be merged to the same stem (understemming) or cases where words are merged to the same stem while they are actually distinct (overstemming). A good stemmer should obviously produce as few overstemming and understemming errors as possible. By counting these errors for a sample of texts, we can gain some insight in the functioning of a stemmer. A comparison between two dierent stemmers is then possible.
11
3.1.1
The Paice Evaluation Method
Paice [25] dened three classes of relationship between pairs of words. These classes are dened as follows: Type 0 Two words are identical in forms, they are already conated. By ignoring the possibility of homographs, this kind of word is of no interest. Type 1 Two words are dierent in form, but are semantically equivalent. Type 2 Two words are dierent in form and are semantically distinct. Using this relationship denition, a good stemming algorithm is dened as one which can conate Type 1 pairs as many as possible, whilst conating as few Type 2 pairs as possible. Paice then quantied the understemming and overstemming error using two new parameters called Understemming Index (UI) and Overstemming Index (OI). The Understemming Index (UI) is dened as the proportion of Type 1 pairs which are unsuccessfully merged by the stemming algorithm. The Overstemming Index (OI) is dened as the proportion of Type 2 pairs which are merged by the stemming procedure. If all words from the sample texts are grouped semantically (as demanded by the denition of word relationship) then for a certain semantic group g, the desire to merge all of the words within that group is dened as DM Tg = 0.5 Ng (Ng 1) (3.1) where Ng is the number of words in the group g. For a group which contains only one form, the DM T value for that group is 0 since no pairs can be formed. The desired merge value for all of the groups in the sample texts is called the Global Desired Merge Total and is dened as: GDM T =
ing
DM Tgi
(3.2)
where ng is the total number of semantic groups in the sample texts. After the stemming process, all of the words will have been reduced to their stems. In a non-fully conated group, there will be more than one form of stem within the group. This means that not all of the words in that group are conated to the same stem, the stemming algorithm is unable to merge those words. The inability of a stemmer to merge all of the words in a certain semantic group g to the same stem is quantied by a parameter which is called the Unachieved Merge Total, U M Tg = 0.5
i[1..fg ]
ngi (Ng ngi ),
(3.3)
where fg is the number of distinct stems in the semantic group g, and ngi is the number of words in that group which are reduced to stem i. The unachieved merge total value for all groups in the sample text is called the Global Unachieved Merge Total, GU M T = U M Tgi (3.4)
ing
Using Eq. 3.2 and Eq. 3.4, the Understemming Index (UI) can be redened as follow: UI = GU M T GDM T 12 (3.5)
A stemmer might transform many pairs of words which originated from dierent semantic groups into identical stems. Every stem now denes a stem group whose members might be derived from a number of dierent semantic groups. If all items of a certain stem group were derived from the same original semantic group, then the stem group contains no error; conversely if a certain stem group contains members which are derived from dierent semantic groups, this means that wrongly-merged has occurred. The number of these wrongly-mergeds within a certain stem group s, which contains stems that are derived from fs dierent original semantic groups, is called the Wrongly-Merged Total, W M Ts = 0.5
i[1..fs ]
nsi (Ns nsi ),
(3.6)
where Ns is the total number of items in the stem group s, nsi is the number of stems which are derived from the ith original semantic group. The number of wrongly-merged for all words in the sample texts after the stemming process is called Global Wrongly-Merged Total, GW M T =
ins
W M T si
(3.7)
where ns is the number of stem groups as the results of the stemming. Every word within a certain semantic group has a possibility to be conated (by a stemming algorithm) with words from a dierent semantic group, which should be avoided. For a certain group g, this number is called the Desired Non-merged Total, DN Tg = 0.5 Ng (W Ng ) (3.8)
where W is the total number of words in the sample texts. The possible number for the whole words in the sample texts is called Global Desired Non-Merge Total and dened as: GDN T =
i[1..Ng ]
DN Tg
(3.9)
Just like the Understemming Index (UI), the Overstemming Index (OI) can be redened using Eq. 3.7 and Eq. 3.9 as: GW M T OI = (3.10) GDN T The ratio between OI and UI is called the Stemming Weight (SW), which is used as the parameter to measure the strength of a stemmer. This parameter ranges from weak (indicated by a low value) to strong (indicated by a high value). Figure 3.1 illustrates how this evaluation method works.
3.1.2
The Paice Experimental Results
The evaluation used sample texts taken from Kamus Elektronik Bahasa Indonesia (KEBI), an online digital dictionary built by Badan Penelitian dan Pengembangan Teknologi (BPPT), an Indonesian government organization which is responsible for research and technology development (http://nlp.aia.bppt.go.id/kebi/). This dictionary is chosen because it fullls the prerequisite of the Paice evaluation method. In this dictionary, words are linguistically grouped according to their roots, and the assessment of grouping is done manually. This dictionary consists of 8550 root words and 14200 derivational words. Repetitions, e.g. berlari-lari, were removed because 13
Original Semantic Groups Groups g1 Full Words sekolah bersekolah disekolahkan menyekolahkan persekolahan g2 seko
After stemming Stemmed Words seko seko sekolah sekolah sekolah seko
(a) semantically grouped words
(b) after stemming process, UI=0.6
Reordering Stemmed Words Stemmed Words seko seko seko sekolah sekolah sekolah Original Sem. Group g1 g1 g2 g1 g1 g1
(c) reordered stemmed words into stem group, OI=0.4
Figure 3.1: Illustration of Paice evaluation methods. they contain a non-word character (-). Homographs were also removed to full the prerequisite of Paices evaluation method, i.e., the input words do not contain duplicates [25]. For a linguistically-motivated stemmer, the dictionary size plays an important role in the stemming process. Therefore we would also like to know what the eect of the size of the dictionary is compared to the purely rule-based stemmer, especially in the case of a developing language such as Bahasa Indonesia, where new words are continuously being adopted from regional or foreign languages. To see to which extent the size of the dictionary will eect the quality of a dictionary-based stemmer, we have done several experiments by reducing the dictionary size of the Nazief stemmer [23]. Instead of nding some new words, that are not listed in the list of the Nazief stemmer lemmas, we preferred to reduce the size of that list of lemmas. The deleted lemmas are considered as new words, which are not recognized by the dictionary. These deleted lemmas are chosen randomly. The minimum reduction is 10% and the maximum is 90% of the original dictionary size. The results of the experiments can be seen in Table 3.1. As can be expected, the Nazief stemmer [23], which uses a full dictionary list, performed better than the Porter stemmer. And as can
14
be seen from Figure 3.2, the Nazief stemmer resides closer to the origin1 than our Porter stemmer. This is of course an acceptable result, since the Nazief stemmer comes with dictionary with 30528 words, which is larger than the size of the KEBI dictionary. But this size is only 39% of the size of the complete printed dictionary [8]. Table 3.1: Comparison of two Bahasa Indonesia stemmers.
Stemming Algorithm Porter Nazief Nazief (10% reduced) Nazief (20% reduced) Nazief (30% reduced) Nazief (40% reduced) Nazief (50% reduced) Nazief (60% reduced) Nazief (70% reduced) Nazief (80% reduced) Nazief (90% reduced)
1 0.9 0.8 0.7 0.6 50% UI 0.5 40% 0.4 0.3 0.2 10% 0.1 0 0 2e-06 4e-06 OI 6e-06 8e-06 1e-05 0% 30% 20% 90% 80% 70% 60% reduced porter
UI 0.262 0.09 0.165 0.31 0.384 0.47 0.55 0.62 0.72 0.82 0.91
OI106 8.44 3.60 3.92 4.46 4.52 4.73 4.74 4.71 3.52 3.24 2.13
SW106 32.27 40.85 23.75 14.41 11.78 10.17 8.70 7.61 4.89 3.97 2.35
Time (ms) 486 741 715 696 668 642 650 635 583 589 564
Figure 3.2: UI x OI plot Obviously, there are hardly overstemming errors in the linguistically-motivated stemmer (Nazief). The experimental results show that the OI values are still low even when the dictionary is already being reduced. In contrast, our Porter stemmer for Bahasa Indonesia tends to make more overstemming errors. This tendency can be explained by the characteristic of Porter stemmer, i.e., it removes the rst longest matched string at each step. Meanwhile, most of the prexes and suxes forms are substring of each others. For example, the prex me- is a substring of the prex memin the word memakan (to eat). Our Porter for Bahasa Indonesia will remove the prex mem- from that word and leaves the words akan as a stem, although the correct stem is makan (to eat). This will be explained in more detail in the Section 3.2. As the size of the dictionary became smaller, many words were not stemmed by the Nazief stemmer.
1A
good stemmer will lie closely to the origin.
15
Many understemming errors have been made, as shown by the increasing values of UI in Table 3.1. This means that the Nazief stemmer really depends on the completeness of the dictionary. Since Bahasa Indonesia is in its development such that it keeps adapting new words, a complete digital dictionary can be considered as something expensive. Considering this fact and the result of this experiment, the usage of linguistically-motivated stemmer, such as Nazief stemmer, in Bahasa Indonesia for practical purposes is questionable. We still need to nd out whether the linguisticallymotivated stemmer can be useful such that it can improve the retrieval performance.
3.2
Error Analysis
Our error analysis is conducted by analyzing the results of both stemmers for each type of word structures, i.e., the inectional structure and the derivational structure.
3.2.1
Inectional Structure
Both stemmers perform well for stripping the inectional suxes from a word. In most cases, they stripped inectional words correctly. Table 3.2 shows some results of stripping inectional suxes. Table 3.2: Results of stripping inectional suxes.
Porter Words bukunya bukukah bukunyakah dibukukannya bukunya bukukah bukunyakah dibukukannya Stems buku (book) buku buku dibukukan buku (book) buku buku dibukukan Inectional Sux 1 nya nya nya nya nya nya Inectional Sux 2 kah kah -
Nazief
There were some cases when errors emerged in our Porter stemmer. These cases arose because there exist a word w, which comprises of two substrings w1 and w2 . The substring w1 consists of more than two syllables and w2 {inectional suxes}. The stemmer mistakenly stemmed the substring w2 , which is actually part of the root word of w. The most frequent cases especially happened if there was a prex attached to the root word of w. In other word, the substring w1 contains a prex. Examples of these cases are shown in Table 3.3. The Nazief stemmer may also produce the same kind of error stems, although correct stems were listed in its dictionary. Similar with the Porter, these errors occurred when a word comprise of substrings w1 and w2 , where w2 {inectional suxes} and w1 contains prex and a word which is included in the dictionary (see Table 3.3).
3.2.2
Derivational Structure
For this structure, our Porter stemmer produces more errors compared to the Nazief stemmer for the same input words. Some examples are shown in Table 3.4. The causes of these errors can be divided into three categories. 16
Table 3.3: Errors in the inectional sux stripping.

Words Porter Nazief bersekolah (school) majalah (magazine) bersekolah (school) Prex ber ber Stem seko (spy) maja (kind of tree) seko Inectional Sux lah lah lah Actual Root sekolah (school) majalah (magazine) sekolah (school)
The rst error category is occurred if there is a substring w in a root, such that w {prexes} {derivational suxes} and the root consists of more than two syllables. Examples of this type of error are listed in the rst two rows of Porter part in Table 3.4. The second error category is caused by the stripping mechanism, i.e., the removal of the longest possible match. This mechanism causes errors since most of the prexes and suxes are substrings of each other. For example, the prex meng- with its various forms viz. me-, men-, mem-, meny-, and meng-, are substrings of each others. Suxes -kan and -an are substrings one of each other even though one of them is not the various form of the other. The last three rows of Porter part in Table 3.4 show this kind of error. The Nazief stemmer also suers from this kind of error, but it is because of its shortest possible match. In the case of the Nazief stemmer, this case happened especially with the inxes -an and -kan (the last rows of Table 3.4). The last type of errors occurred because of the diculty in the implementation of derivational rules for Bahasa Indonesia, that contain ambiguities. Both stemmers suer from this kind of errors, but of course the Porter stemmer suers more than the Nazief stemmer. Some examples of these errors are shown in Table 3.5. Table 3.4: Results of derivational prex and sux stripping.
Stemmer Porter Words naluri (instinct) perahu (boat) bentrokan (clash) perbaikan (improvement) berkedudukan (located) naluri (instinct) perahu (boat) bentrokan (clash) perbaikan (improvement) berkedudukan (located) Prex per per ber per ber Stem nalur ahu bentro bai kedudu naluri perahu bentrok baik keduduk (a kind of plant) Sux i kan kan kan an an an Actual Root naluri perahu bentrok (to clash) baik (good) duduk (to sit) naluri perahu bentrok baik duduk
Nazief
Table 3.5: Spelling adjustment errors in stripping suxes.

Words mengalahkan (defeating) mengobarkan (to re someone up) mengupas (peeling) Prex meng meng meng Stem alah obar upas (security guard) Derivational Sux kan kan Actual Root kalah (to defeat) kobar (to inspire) kupas (to peel)
17
Chapter 4
Stemmer Performance Evaluation for Information Retrieval

In this chapter we evaluate the performance of the two stemmers introduced before in the setting of Information Retrieval. We used a non-stemming system as the baseline of this evaluation. In the next four sections, we explain the environments of the experiments. Results and an analysis of the evaluation are given in Section 4.5.
4.1
4.1.1
The Test Collections

The Document Collections
Since there is no document collection in Bahasa Indonesia available in standard collections such as the TREC collection and the CLEF collection, we setup our own collections. We took our documents from two sources, Kompas, an Indonesian daily online newspaper (http://www.kompas.com), and Tempo, an Indonesian daily online news (http://www.tempo.com). From these two sources we created two document collections, viz. kompas and tempo. The kompas collection is a two-years headline edition (that is from January 2001 until December 2002). And the tempo collection is also a two-years edition (that is from June 2000 until July 2002). Table 4.1 shows the statistics of each collection. Table 4.1: Test-Collection Statistics
size (MB) of documents avg. docs length (byte) avg. unique words (terms) kompas 27.52 5449 4031.00 326.09 tempo 45.57 22944 1549.59 155.00
Both document collections have been parsed in order to remove all of the HTML tags. These collections have also been transformed into an SGML-like structure. The format follows the overall TREC document structure with two main considerations, viz. easy parsing, so that these documents can easily be used for the purpose of this experiment and for the future expectation, 18
such that these document collections can help further IR research in Bahasa Indonesia. An example of a document from the kompas collection can be seen in Figure 4.1. Manual correction has also been performed to all these document collections.
<DOC> <DOCID> KOMPASHL2001310101PRES01 </DOCID> <TITLE> Presiden Bantah Terlibat </TITLE> <TEXT> Presiden Abdurrahman Wahid membantah terlibat dalam penyelewengan dana Yayasan Bina Sejahtera (Yanatera) Badan Urusan Logistik (Bulog) . . . </TEXT> </DOC>
Figure 4.1: Document example: kompas document KOMPAS-HL2001-310101-PRES01.
4.1.2
The Information Requests (Queries)
Both document collections are accompanied with a set of information requests (queries) that are used for the evaluation purposes. Each query is a description of an information need, which is constructed in natural language. The queries construction was done manually by two University of Amsterdam students whose native language is Bahasa Indonesia. The queries covers widely known events, which had happened in Indonesia during the year each collection covers. Some queries in the kompas and tempo collections are about the same topics. Queries in the kompas are created such that they are longer than queries in the tempo. We also xed the number of queries for each document collection to 35, since this number exceeds the minimum number of topics which are needed for an experiment within the TREC general consensus [5]. The statistics of the queries can be found in Table 4.2. The queries were also written in an SGML-like format to allow easy access for the purpose of the experiments and for the purpose of dening the relevant sets. The format, which includes a clear description of each query, should help the assessors determine the relevant documents. Figure 4.2 shows an example of a query from the kompas collection.
4.1.3
Relevant Documents for Every Information Request
The set of relevant documents for each query is constructed manually by the student that created the query. This manual way is chosen because the collections size are not huge, which makes it possible to do so. The set of these relevant documents are assessed again by the second student which resulted in a double checking relevant sets. In the case that these two assessors have dierent opinions about the relevance of a certain document for a certain query, the document is then considered as
19
<QRY> <QRYID> KOMPAS2001Q2 </QRYID> <TITLE> TKI ilegal di Malaysia </TITLE> <RQST> Masalah Tenaga Kerja Indonesia (TKI/TKW) ilegal di Malaysia </RQST> <DESC> Dokumen berisi berita seputar masalah tenaga kerja Indonesia (TKI/TKW) ilegal di Malaysia yang mencuat karena adanya pemberlakuan hukum baru bagi para tenaga kerja ilegal tersebut. Berita pemulangan tenaga kerja ilegal dan usaha Pemerintah RI dalam hal pemulangan tenaga kerja ilegal asal Indonesia. Termasuk juga catatan pengamat tentang masalah tenaga kerja Indonesia (TKI/TKW), terutama masalah TKI di Malaysia akibat pemberlakuan hukum baru tersebut. </DESC> </QRY>
Figure 4.2: Query example: query KOMPAS-HL2001-Q-2. non-relevant. The statistics of the relevant sets for kompas and tempo collections can be seen in Table 4.2. Table 4.2: Test-Query Statistics
of queries avg. queries length (word) avg. unique words avg. of relevant docs per query kompas 35 8.777 8.63 22.657 tempo 35 5.2 5.17 66.971
4.2
The FlexIR System
All our experiments used the FlexIR information retrieval system. FlexIR is an automatic information retrieval system built at the Universiteit van Amsterdam. This system is based on the vector space model [3] and implemented in Perl [21]. It supports many types of scoring, such as Precision/Recall, Average Precision and R-Precision which were used in this evaluation. The original design of this system is dedicated for Western-European languages such as English. Therefore some modications have been performed in order to use the system for Bahasa Indonesia. The weighting scheme is the Lnu.ltc scheme [4, 33], xing the slope to 0.2 and setting the pivot to the average number of unique words per document as in [21].
20
4.3
4.3.1
Performance Measurements
Precision/Recall
The traditional Average Precision-Recall measure is used because it is the standard measurement and it is used extensively in the literature [3, 10, 30]. Recall is the proportion of relevant items retrieved, while precision is the proportion of retrieved items that are relevant. Equation 4.1 gives the specic denition of these two measurements. recall precision = = Nrr Nrel Nrr Nret
(4.1)
where Nrr is the number of relevant items retrieved, Nrel is the number of relevant items and Nret is the number of retrieved items. The P-R measurement is based on the average precision at certain recall levels. By assuming that a certain recall level must be attained for every query, the best retrieval system is the one that attains this recall level with the fewest number of non-relevant documents (the highest precision). Although there are some critics of using this measurement [3, 14], from our point of view, this measurement is a nice tool for macro-evaluation of the retrieval systems.
4.3.2
Average Precision
The Average Precision is a single value summary. For a certain query, it is computed by averaging all precision values for that query at the relevant document position in the ranking. From its denition, it can be seen that this measurement represents the entire area underneath the recallprecision curve. It is also recommended to be used as a measurement in the general purposes retrieval evaluation [5].
4.3.3
R-Precision
The R-Precision is also a single value summary. It is calculated by computing the precision after R documents have been retrieved, where R is the total number of relevant documents for the current evaluated query. This measurement is used, because our document collections consist of a large variety in the number of relevant documents [19].
4.4
Stoplists
To complete the IR environment, we also propose in this thesis, a new stoplist for Bahasa Indonesia (see Appendix D), because there is no available stoplist for Bahasa Indonesia which can be used. The proposed stoplist is derived from the results of the analysis of word frequencies in Bahasa Indonesia (see Appendix C). It is compared to the result of computational linguistics research in Bahasa Indonesia [22] and with the stoplist in [9].
21
Before using the the proposed stoplist in the evaluation of stemming eect to retrieval performance, we conducted some experiments to evaluate our stoplist. In these experiments, two systems (for each document collection) were evaluated, viz. the IR system without using either stemmer and stoplist (NoSo) and the IR system with stoplist only (So). The results of these experiments are depicted in Figure 4.3 and Table 4.3. For both document collections, the results show that the removal of these stopwords can enhance the precision, especially at low recall levels although not signicant. Therefore we can say that the proposed stoplist can be used in the further retrieval evaluation. Table 4.3: Average Precision and R-Precision results of system without and with stoplist (NoSo and So)
NoSo kompas tempo non-interpolated avg. precision R-Precision 0.7015 0.6542 0.5251 0.5168 So kompas tempo 0.7101 0.6649 0.5329 0.5252
1 0.9 0.8 0.7 Precision
So NoSo
1 0.9 0.8 0.7 Precision 0.6 0.5 0.4 0.3 0.2 0.1
So NoSo
0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 Recal 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5 0.6 Recal
0.7
0.8
0.9
(a) NoSo vs. So for kompas
(b) NoSo vs. So for tempo
Figure 4.3: Comparison of Recall-Precision between non stopwords vs. stopwords ltering system.
4.5
Evaluation Results
This section describes the experiments which we conducted for evaluating the stemmers eect on the retrieval performance in Bahasa Indonesia. We used all the retrieval environments which have been described in the previous sections. In this evaluation, we contrast the two stemmers with a baseline of no stemming at all. Therefore there are three systems which were evaluated, viz. no stemming at all (NoSm), the Nazief stemmer (Nazief), and our Porter stemmer for Bahasa Indonesia (Porter). The result of the experiments (for each document collection) can be seen in Figure 4.4, Figure 4.5, and Table 4.4. As can be examined from those two gures and table, the dierences of the performance values between the three systems are very small and the results for both document collections show an inconsitency between the stemming systems and the non-stemming system. Therefore at this point we cannot make any conclusion based on the dierence of these values. 22
Table 4.4: Average Precision and R-Precision results over all queries for the three systems
NoSm kompas tempo non-interpolated avg. precision R-Precision 0.7101 0.6649 0.5329 0.5252 Porter kompas tempo 0.7086 0.6574 0.5456 0.5403 Nazief kompas tempo 0.7026 0.6563 0.5464 0.5383
In this situation, Hull [15] suggested to perform a statistical testing, which can provide valuable evidence about whether the experimental results have a more general signicant dierences. The statistical testing and its results are described in Sub Section 4.5.1.
1 0.9 0.8 0.7 Precision
Porter NoSm
1 0.9 0.8 0.7 Precision 0.6 0.5 0.4 0.3 0.2 0.1
Nazief NoSm
0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 Recal 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5 0.6 Recal
0.7
0.8
0.9
(a) NoSm vs. Porter
(b) NoSm vs. Nazief
1 0.9 0.8 0.7 Precision
Nazief Porter
1 0.9 0.8 0.7 Precision 0.6 0.5 0.4 0.3 0.2 0.1
Nazief Porter NoSm
0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 Recal 0.6 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5 Recal
0.6
0.7
0.8
0.9
(c) Porter vs. Nazief
(d) NoSm vs. Porter vs. Nazief
Figure 4.4: PR-Curves for kompas Collection.
4.5.1
Statistical Testing
The statistical testing which we conducted here follows the procedure in [14]. The standard statistical model for a certain query and a certain retrieval method is dened as yij = + i + j +
ij
(4.2)
where yij is the observed data corresponds to the retrieval performance for query i and method j, is the true mean performance, i is the query eect, j is the retrieval method eect, and
23
1 0.9 0.8 0.7 Precision
Porter NoSm
1 0.9 0.8 0.7 Precision 0.6 0.5 0.4 0.3 0.2 0.1
Nazief NoSm
0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 Recal 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5 0.6 Recal
0.7
0.8
0.9
(a) NoSm vs. Porter
(b) NoSm vs. Nazief
1 0.9 0.8 0.7 Precision
Nazief Porter
1 0.9 0.8 0.7 Precision 0.6 0.5 0.4 0.3 0.2 0.1
Nazief Porter NoSm
0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 Recal 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5 0.6 Recal
0.7
0.8
0.9
(c) Porter vs. Nazief
(d) NoSm vs. Porter vs. Nazief
Figure 4.5: PR-Curves for tempo Collection

ij is the error. The query eect i and the method eect j are assumed to be independent and additive.
The Null Hypothesis (H0 ), which is tested, is that the observed stemmer methods are in equal performances. If the p-value is very small, then evidence suggests that the observed statistics reect an underlying dierence between the stemmers. Because there were three stemmers to be evaluated, the two-way ANOVA is used [14, 15, 36]. In the ANOVA, the F-test for j = 0 for all j is dened as n F = M Sbet.stem = M Sresidual
i,j (yi,j
(j y) y m1
yi yj + y )2
(4.3)
(n 1)(m 1) where yi,j is the observed data, which corresponds to the retrieval performance of method j = [1 . . . m] for query i = [1 . . . n] (m and n are the number of stemmers and queries respectively). The value yi is the average performance of a query i over all methods, while yj is the average performance of a method j over all queries. M Sbet.stem and M Sresidual are the mean square between stemmers and the residual errors. If the F-test is signicant (identied by the very small p-value), then the ANOVA multiple comparison is used. The ANOVA multiple comparison is Tukeys studentized range test distribution
24
under H0 which is dened as: |k yl | y
qm,v s n
(4.4)
where qm,v is the studentized range statistic for v = (n 1)(m 1) at signicant level and y s = M SE (Mean Squared Error). All mean dierences between method k (k ) with method l (l ) that are greater than the value at the right-hand side of Eq. 4.4 are assumed to be signicant. y
In order to convince ourselves of using the two-way ANOVA, we tested the data by making a quantile plot. One of the example of the quantile plot for data from the Non-interpolated average precision values of the Nazief stemmer can be seen in Figure 4.6(a). The quantile plot shows that the data are skewed. Therefore we transformed the data using the function f (x) = arcsin( x) as suggested in [14]. As can be seen from Figure 4.6(b), the transformed data already follows the normal distribution, therefore these data could be used for the ANOVA analysis.
8 1.2
KOMPAS Non-interpolated avg. prec. with Nazief stemmer
1.0
.8 4
Expected Normal Value
.6
.4
0 .06 .13 .19 .25 .31 .38 .44 .50 .56 .63 .69 .75 .81 .88 .94 1.00
Std. Dev = .23 Mean = .71 N = 35.00
.2 0.0
.2
.4
.6
.8
1.0
1.2
KOMPAS Non-interpolated agv. prec with Nazief Stemmer
Observed Value
(a) The original Non-interpolated average precision values
10 1.6
1.4
1.2 6 1.0
Expected Normal Value
.8
.6
0 .25 .38 .50 .63 .75 .88 1.00 1.13 1.25 1.38 1.50 1.63
Std. Dev = .30 Mean = 1.04 N = 35.00
.4 .2 .2 .4 .6 .8 1.0 1.2 1.4 1.6
Observed Value
(b) The transformed Non-interpolated average precision values
Figure 4.6: Quantile Plots from Non-interpolated average precision values of Nazief for the kompas collection
25
Table 4.5: ANOVA Table for Average Precision Measurement

Doc. Coll kompas Source Stemmer Query Residual Stemmer Query Residual Sum of Sq. 0.0016 8.6742 0.1546 0.0053 6.4746 0.0773 df 2 34 68 2 34 68 Mean of Sq. 0.0008 0.2551 0.0022 0.0027 0.1904 0.0011 F 0.3481 112.2208 2.3298 167.4948 p 0.7072 6.99485E 48 0.1050 1.13734E 53
tempo
Table 4.6: ANOVA Table for R-Precision Measurement

Doc. Coll kompas Source Stemmer Query Residual Stemmer Query Residual Sum of Sq. 0.0021 7.5064 0.1586 0.0054 5.1979 0.0687 df 2 34 68 2 34 68 Mean of Sq. 0.0010 0.2208 0.0023 0.0027 0.1529 0.0010 F 0.4455 94.6643 2.6641 151.2662 p 0.6424 1.93728E 45 0.0769 3.41583E 52
tempo
The ANOVA tables of our experiments for average precision and R-precision performance can be seen in Table 4.5 and Table 4.6. By taking p-value less than 0.05 for rejecting the H0 , from the ANOVA results of the stemmers eect for both document collections, we can say that we accept the H0 . This means we accept that the three systems are equal in the average precision and R-Precision performance. Since the statistical test was unable to detect the signicant dierence, we conducted a detailed analysis by examining a number of individual queries and their stems to gain more information about why the signicance tests failed. This detailed analysis is explained in Sub Section 4.5.2
4.5.2
Detailed Analysis
At this point, it was not clear to us why the statistical test was unable to detect a signicant dierence between the three systems. Therefore we conducted a detailed analysis of some queries by examining their stem results and some of the retrieved documents within those queries. We used the queries from both document collections and discuss several cases where the performance of one system is equals or outperforms the other two systems.
kompas Q.1: Kasus penyalahgunaan dana Yanatera Bulog dan Memorandum DPR The fraud of Yanatera Bulog Budget and Parliaments Memorandum The Nazief stemmer did not recognize that the words penyalahgunaan (fraud), disalahgunakan (misuse - passive form), and menyalahgunakan (misuse) are related. It also did not recognize that these words are also related to the phrase/compound salah gunakan and salah guna. Penyalahgunaan is a compound word. This compound is originated from the phrase/compound salah guna. This kind of compound is a phenomenon that might exist in Bahasa Indonesia, because there is a rule which governs the phrase with both prex and sux to be written unseparately [6]. Since the stem salahguna is not in its lexicon, it did not stem any words in this query, such that its
26
performance is equal to the non-stemming system. Our Porter stemmer converted the derivational compound penyalahgunaan, menyalahgunakan, and disalahgunakan to the same stem salahguna. The usage of this stem made more relevant documents being retrieved so that its performance is better than that of the two other systems. But without further splitting, it also failed to recognize that those compounds are also related to the phrase salah gunakan and salah guna.
kompas Q.2: Kasus kerusuhan antar agama di Poso dan penanganannya Religious conict in Poso and its solution Both Nazief and Porter stemmer converted the word penanganannya (the process of handling) to tangan (hand). This conversion is a bad decision since the word tangan is a common word in Bahasa Indonesia. This word can be combined with various words to form compound words, such as campur tangan (involvement), tanda tangan (signature), which are contained in many documents in this collection. Both stemmers also converted the word kerusuhan (turmoil) to rusuh (restless). This conversion is unhelpful since the resulting word rusuh is an adjective which is considered as non-important term for information retrieval. Therefore the performance of both stemming systems are below the non-stemming system.
kompas Q.3: Pelaksanaan sidang istimewa MPR meminta pertanggungjawaban Presiden Abdurrahman Wahid. Dekrit Presiden The extraordinary session of Peoples Consultative Assembly (MPR) to ask Presidents responsibility. The Decree of President Just in the case of kompas Q.1, the Nazief stemmer failed to recognize that the compound words pertanggungjawaban (responsibility), mempertanggungjawabkan (to account for) and dipertanggungjawabkan (to account for - passive form) are related to the phrase bertanggung jawab (to be responsible) and tanggung jawab (responsibility). Since the stem tanggungjawab is not in its lexicon, it left the word pertanggungjawaban as it is. However, it successfully recognized that the words pelaksanaan (implementation), melaksanakan (to perform), dilaksanakan (being performed), and pelaksana (executor) are related and stemmed them to one stem that is laksana. The usage of this stem could pull out more relevant documents such that its performance is better than the non-stemming systems. The Porter stemmer successfully recognized that the words pertanggungjawaban, mempertanggungjawabkan, dipertanggungjawabkan are related. It stemmed them to the same stem tanggungjawab, but without further splitting process, it failed to recognized that those compounds are also related to the phrase bertanggung jawab and tanggung jawab. Therefore it gained only small benet. Just as the Nazief stemmer, our Porter stemmer also recognized that words pelaksanaan, pelaksana, melaksanakan and dilaksanakan are related and stemmed them to laksana. The usage of these two stems made its performance outperform the other two systems.
kompas Q.4: Konik bersenjata Aceh, Gerakan Aceh Merdeka (GAM) dan penanganannya Armed conict in Aceh, Free Aceh Movement and its solution The Nazief stemmer converted the word gerakan (movement) to gerak (to move), where actually the word gerakan is part of an organization name (proper name) and should not be converted.
27
Similar to the case of kompas Q.2, it converted penanganan to tangan. The Porter stemmer also converted the part of the organization name Gerakan to gera, and merdeka to erdeka which should not be converted. It also converted the word penanganan to tangan. The usage of these stems decreased the performance of both stemmers.
kompas Q.5: Konik antar etnis Madura-Dayak di Kalimantan Madura-Dayak ethnics clashes in Kalimantan All three systems have the same retrieval performance. The Nazief stemmer did not stem any of the words in this query. The Porter stemmer incorrectly stemmed the word Kalimantan (the name of Borneo island) to Kalimant, but since the word Kalimant does not exist in Bahasa Indonesia, this did not hurt its performance.
kompas Q.7: Kasus penyalahgunaan dana nonbudgeter Bulog yang melibatkan Akbar Tandjung The fraud of nonbudgeter budget of Bulog which involves Akbar Tanjung As in the case of kompas Q.1, the Nazief stemmer did not stem the derivational compound word penyalahgunaan, whilst our Porter stemmer stemmed it to salahguna. The Nazief stemmer wisely recognized that the word melibatkan (involving), terlibat (involved) and keterlibatan (involvement) are related and stemmed them to the same stem libat (to involve). Just like Nazief stemmer, our Porter stemmer also recognized that those words except keterlibatan are related. In this case, it suered from understemming error by converting the word keterlibatan to terlibat. But the Nazief stemmer did not gain great benet since the resulting stem libat is a verb. As verbs cannot be considered as important terms, therefore its performance is slightly lower than our Porter stemmer, but it is still better than the non-stemming system.
kompas Q.9: Kasus penculikan dan pembunuhan ketua Presidium Dewan Papua (PDP) Theys Eluay The kidnapping and the murder case of the head of Papua Council Presidium, Theys Eluay Both the Nazief and Porter stemmer relate the word penculikan (kidnapping) with the words menculik (to kidnap), diculik (being kidnapped) and penculik (kidnapper) and stem them to culik (to kidnap) which is a verb. Both stemmers also relate the word pembunuhan (murder) with the words membunuh (to kill), dibunuh (killed), and pembunuh (killer/murderer) and stem them to bunuh (to kill). This conversion turned out to be an unwise decision, due to the fact that both resulting stems are verbs which cannot be considered as important term for information retrieval. Also the document collection contains many stories about murder and kidnapping which were done by the Free Papua Movement (OPM). Therefore the performance of both stemming systems are below the nonstemming system.
kompas Q.18: Kasus pengambilalihan Semen Padang The take over of Semen Padang This is the same case as kompas Q.3, where the word pengambilalihan (process of taking over) is supposed to be converted to ambil alih (to take over) such that it can be related to other words, i.e. 28
mengambil alih (taking over), diambil alih (took over - passive form). The Nazief stemmer could not stem this query, which caused the stemmed query to be equal to the non-stemmed version. The Porter stemmer converted it to ambilalih which is not recognized in Bahasa Indonesia except if it is splitted. Therefore the performance of the three systems are equal.
kompas Q.23: Perubahan Undang Undang Dasar 1945 menyangkut pemilihan presiden langsung The amendment of 1945 State Constitution in the view of direct presidential election act Both Nazief and Porter stemmers converted the word perubahan (alteration) to ubah (to change), which is a very common term. The word perubahan in the phrase Perubahan Undang Undang has a specic meaning in the domain of law, which usually associated to the word amendment. Therefore their performance were lower than the non-stemming system.
kompas Q.25: Kasus penangkapan tiga warga negara Indonesia Tamsil Linrung, Agus Dwikarna dan Abdul Jamal Balfas di Filipina The arrest of three Indonesians, Tamsil Linrung, Agus Dwikarna and Abdul Jamal Balfas, in Philippine All the three systems have the same performance for this query. As we can see, that this query contains many specic names. The usage of specic names might be the cause that the three systems have equal performance.
kompas Q.31: Kasus peledakan bom yang terjadi di Bali The Bali bombing blast Both Nazief and Porter stemmer converted the word peledakan (blast, explotion) to ledak (to blast, to explode). Both stemmers do not get benet from this conversion, since the resulting stem is a verb which cannot be considered as important term for information retrieval. Even, it seems to decrease their performance. The Nazief stemmer also converted the word Bali to bal (ball). Since bal is in its lexicon, it also converts Balkan to the same stem. Therefore some of the retrieved documents also contain the story of Balkan. This worsens its performance.
kompas Q.32: Kasus kontak senjata antara anggota TNI AD dengan Kepolisian di Binjai The armed conict between Indonesian army and Indonesian police force in Binjai The Nazief and our Porter stemmer both converted the word Kepolisian (police force) to polisi (police). The word kepolisian is a specic word which is related to the police organization. This conversion made the performance of Nazief and Porter stemmer both lower than the non-stemming system.
29
kompas Q.34: Kasus sengketa Sipadan dan Ligitan antara Indonesia-Malaysia The dispute between Indonesia and Malaysia over Sipadan and Ligitan islands This is the same case as queries kompas Q.5 and kompas Q.6, except that our Porter stemmer incorrectly converted the word Sipadan to sipad (ear) and the word Ligitan to ligit. But these two terms did not decrease the performance of our Porter stemmer since the word sipad, which comes from Javanese, is rarely used, and sipad is not recognized in Bahasa Indonesia.
tempo Q.1: Kenaikan harga dan subsidi BBM The increase of oil prices and oils subsidy Both Nazief and our Porter stemmer stemmed the word kenaikan (increase) to naik (to increase). This resulting stem is a verb and is a very common term in Bahasa Indonesia. Therefore the usage of this stem made the performance of both Nazief and our Porter stemmer lower than the non-stemming system.
tempo Q.2: Konik bersenjata di Aceh Armed conict in Aceh Both Nazief and our Porter stemmer converted the word bersenjata (armed) to senjata (weapon) which is a noun. The usage of this stem can pulled out many relevant documents such that the performance of both stemming systems are better than the non-stemming system. Compare to the Query kompas Q.4 from the kompas collection, this query is shorter and contains less proper name. The results of both queries show that both stemming systems have better performance for this query than for the Query kompas Q.4.
tempo Q.3: Penyelewengan dana nonbudgeter Bulog The fraud of nonbudgeter budget of Bulog Both Nazief and our Porter stemmer converted the word penyelewengan (fraud) to seleweng (to deceive). This conversion turned out to be an unwise decision, since the documents collection contains many stories about corruption in Indonesia.
tempo Q.4: Kasus Buloggate (dana Yanatera) dan Bruneigate Buloggate (Yanatera budget) and Bruneigate All the three systems have the same retrival performance. Both Nazief and our Porter stemmer did not stem any words in this query.
tempo Q.18: Kecelakaan pesawat Cassa NC-212 di Irian Jaya Airplane Cassa NC-212 crashed in Irian Jaya Both Nazief and our Porter stemmer converted the word kecelakaan (crash, accident) to celaka (unfortunate). This conversion can be considered as unwise decision, since the stem celaka is an adjective which cannot be considered to be helpful in pulling-out more relevant documents. Even it makes the performance of the two stemming systems are below the non-stemming system. 30
4.5.3
Summary of the Detailed Analysis
We found that the linguistically correct stems, which are produced either by the linguisticallymotivated stemmer or by the rule-based stemmer, may not be optimal for retrieval purposes. In this case, the stemming process is harmful. Similar to what happened in English and Dutch [15, 17, 20], we found many examples where the rule-based stemmer, such as the Porter stemmer, produced non-comprehensible words. Because the morphological rules in Bahasa Indonesia contain many ambiguities, the rule-based stemmer without using any additional knowledge might produce many more non-comprehensible words than rule-based stemmers for other languages. Here, our Porter stemmer produced 11.8% noncomprehensible words in stemming all of words in the queries of kompas collection. From the query analysis, we found examples where the linguistically-motivated stemmer, such as the Nazief stemmer, undesireably stems some words to a word with a very dierent meaning, even though it is already accompanied by a lexicon. From our detailed analysis of queries, we found that words which were stemmed have very small number of variations. The average number of derivational variations of a certain word (excluding proper name) from all queries is only about 4.135. This is very small compared to Slovene language [27]. We also found that the number of inectional and derivational axes which are handled with these two stemmers are very small compared to the number of axes which are handled in the Slovone stemmer [28], the Dutch stemmer [17] and the Porter stemmer [29]. Recall to the purpose of stemming as morphological normalization, a stemmer which handles a small number of axes should also gain a small number of benet in retrieval performance. This may be the cause of the non-signicant dierence in performance of both stemming systems compared with the non-stemming system in our experiments. From analysis of the results of stemming all words in the queries of both documents collections, we see that most of the resulting stemmed words are verbs and adjectives. As verbs and adjectives are less important term for index or search keys than nouns in information retrieval, this might also be the cause that our experiment results show non-signicant dierences between stemming systems and the non-stemming system. We can also see that some of the queries consist of derivational compound words.1 These derivational compound words are not recognized by the Nazief stemmer, hence it did not stem them and left them as they were. In this case the performance of the Nazief stemmer equals to that of the non-stemming system. Whilst for the Porter stemmer, although it could stem these words correctly, it could not get any benets from it, unless they were further splitted. Similar to what happened in English [15], stemming process seems to give more benet for short queries. This can be seen from the results of both documents collections. Although the results of the experiments of stemming and non-stemming systems are not statistically signicant, the performance of stemming experiments with the tempo collection which has shorter queries is better than the experiments with the kompas collection. We found that some of the queries do not need to be stemmed at all. For the kompas collection, the number of this kind of queries are about 29% of all queries. We also found that many of the queries consist of many proper names which are left untouched by both stemmers. These also made the performance of all three systems are equal.
1 These derivational compound words are exist in Bahasa Indonesia. As stated before, there is a rule which governs the compound words (phrase) to be written unseparated if there is a prex attached to the rst word and a sux attached to the second word. If only prex or sux attached to the rst or second word, then they should be written separately.
31
Chapter 5
Conclusions
After several evaluations of the eet of stemming on retrieval performance in Bahasa Indonesia, we reach a number of conclusions: 1. Our Porter stemmer for Bahasa Indonesia produces many non-comprehensible words which are caused by the ambiguity in the morphological rules of Bahasa Indonesia. In some cases the errors do not hurt performance, but in other cases they decrease the performance. Extending it with a digital dictionary is somewhat a dilemma since a digital dictionary is expensive. In further research, extending the rule-based stemmer with words co-occurence may give better results. 2. Such as in English, the linguistically-motivated stemmer which is developed by Nazief for Bahasa Indonesia, posses two main problems. First, the ability of the stemmer depends on the size of the dictionary. It cannot stem a word which is not in its lexicon. Second, a linguistically correct stems which is produced by this kind of stemmer does not always optimal for the purpose of information retrieval application. Therefore, if it is to be used for IR, this linguistically motivated stemmer should be enhanced with other tool such as domain linguistic analysis or adding a domain specic lexicon. 3. Derivational compounds in Bahasa Indonesia seems to need special treatment in order to get benet from stemming. Further morphological research needs to be conducted to see whether compound splitting is needed for information retrieval. The derivational compound in Bahasa Indonesia is not as common as in Dutch, and it can only be useful if it is combined with a stemmer. And also a more complex IR system, which recognizes phrases, is required. 4. From our detailed analysis of queries, we found that words which were stemmed have very small number of variations. The average number of derivational variations of a certain word (excluding proper name) from all queries is only about 4.135. This is very small compared to Slovene language [27]. We also found that the number of inectional and derivational axes which are handled with these two stemmers are very small compared to the number of axes which are handled in the Slovone stemmer [28], the Dutch stemmer [17] and the Porter stemmer [29]. Recall to the purpose of stemming as morphological normalization, a stemmer which handles a small number of axes should also gain a small number of benet in retrieval performance. This may be the cause of the non-signicant dierence in performance of both stemming systems compared with the non-stemming system in our experiments. 5. Failure of the statistical signicance test in our experiment to detect signicant dierence does not necessarily mean that there is no dierence between systems [14]. We realized that 32
our corpora are far from perfect due to the fact that these corpora are created and judged only by two persons. We also know that our queries were formulated such that they contain many proper names. Therefore tests on a number of dierent other corpora (collections) are needed to be performed further.
33
Appendix A
Derivational Rules of Prex Attachment

Table A.1: Rules and Variation Forms of Prexes
Prex meng Variation Form meng Rules + Vowel|k|g|h. . . , e.g: ambil (to take) mengambil (taking) embun (vapor) mengembun (to condense) ikat (to tie) mengikat (to tie/to bind) olah (process) mengolah (processing) ukur (to measure) mengukur (measuring) kurus (slim) mengurus (become slimmer) urus (to take care) mengurus (to take care) ganggu (to disturb) mengganggu (disturbing) hilang (to lose) menghilang (to dissapear) + s. . . , e.g: sisir (comb) menyisir (to comb something) + b|f|p. . . , e.g: beku (frozen) membeku (to become frozen) tnah (to accuse) memtnah (accusing) pukul(to hit) memukul (hitting) + c|d|j|t. . . cuci (to wash) mencuci (washing) darat (land) mendarat (landing/docking) jual (to sell) menjual (selling) tukar (to change) menukar (changing) + l|m|n|r|y|w. . . , e.g: lintas (to cross) melintas (crossing) makan (to eat) memakan (eating) nikah (marriage) menikah (to get married) rusak (to break) merusak (breaking) wabah (epidemic) mewabah (outbreak) yakin (sure) meyakin(kan) (to convince someone) + Vowel|k|g|h. . . , e.g: ikat (to tie) pengikat (something that is used to tie) continue to next page
meny mem
men
me
peng
peng
34
continued from previous page Prex Variation Form Rules olah (to process) pengolah (processor) ukur (to measure) pengukur (measurement) urus (to take care) pengurus (person who take cares) ganggu (to disturb) pengganggu (person who disturbs) halus (soft) penghalus (softener) peny + s. . . , e.g: saring (to lter) penyaring (lter) pem + b|f|p. . . , e.g: baca (to read) pembaca (reader) tnah (to accuse) pemtnah (people who accuse) pukul(to hit) pemukul (things that is used to hit) pen + c|d|j|t. . . cuci (to wash) pencuci (laundress/laundryman) datang (to come) pendatang (the comer) jual (to sell) penjual (seller) tukar (to change) penukar (changer) pe + l|m|n|r|y|w. . . , e.g: lintas (to cross) pelintas (passerby) makan (to eat) pemakan (eater) rusak (to break) perusak (destroyer) warna (color) pewarna (dye) ber bel + ajar, eg: ajar (to teach) belajar (to study/to learn) be + r|KVr. . . , e.g: rencana (plan) berencana (to have a plan) kerja (to work) bekerja (working) ber + any word which violates conditions of the alomorphs bel and be tukar (to change) bertukar (to change, changing) per pel + ajar, e.g: ajar (to teach) pelajar (student) pe + r|KVr. . . , e.g: ramal (to predict) peramal (fortune-teller) per + any word which violates conditions of the alomorphs pel and pe kaya (rich) perkaya (to make richer) ter te + r. . . , e.g: rasa (to feel) terasa (to be felt) ter + K|V. . . , where K = r, e.g: atur (to arrange) teratur (to be properly arranged)
35
Appendix B
The Meaning of Axations

Table B.1: The meaning of axations
Ax mengditerpengFunctions verb to verb form noun to verb form verb to passive verb form noun to passive verb form verb to passive accidental verb form noun to passive accidental verb form noun to noun form verb to noun form adjective to noun form verb to active verb form noun to active verb form adjective to active verb form verb to noun form noun to causative verb form adjective to causative verb form adjective to noun form verb to command verb form noun to command verb form adjective to command verb form verb to intensive/repetitive verb form noun to intensive/repetitive verb form adjective to commmand verb form verb to noun form Examples makan (to eat) memakan (to eat, eating) sisir (comb) menyisir (to comb) makan dimakan (to be eaten) sisir disisir (to be combed) makan termakan (to be eaten accidently) paku (nail) terpaku (to get nailed accidently) tani (farm) petani (farmer) baca (to read) pembaca (reader) rusak (damaged, destroyed) perusak (destroyer) main (to play) bermain (to play, playing) sepeda (bicycle) bersepeda (to bike/cycling) gembira (happy) bergembira (to be excited) kerja (to work) pekerja (worker) istri (wife) peristri (to take someone as a wife) halus (soft) perhalus (to make softer) tua (old) ketua (leader) ambil (to take) ambilkan (asking someone to take) sisir sisirkan (asking someone to comb something) jauh (far) jauhkan (asking someone to move something further) ambil ambili (taking something several times) sisir sisiri (combing something several times) jauh jauhi (asking someone to move further from something) makan makanan (food, something to be eaten)
ber-
per-
ke-kan
-i
-an
36
Appendix C
Word Frequency Analysis

Word frequency analysis was conducted by performing experiment on Bahasa Indonesia corpus. This experiment used online Indonesia newspapers as text source. One year editions are collected from Online Kompas, http://www.kompas.com, one of the most widely read newspaper in Indonesia. These editions are taken in a consecutive of every day in a year (started from January 2001 until December 2001) with the total of 3160 documents. These documents are only the daily headlines of the newspaper. The corpus, which is built from this analysis, consists of 50.000 unique words, after removing the names of peoples, cities, organisations, countries, etc. The results of the most frequently occur words can be seen in Table C.1. This list consists of root words and derived words. The number of root words and derived words are still under investigation. The further investigation will be done by using Indonesian Dictionary.
37
Table C.1: Most frequently occur words

No 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Word yang dan itu tidak dengan dari untuk dalam ini akan juga pada ada presiden karena bisa sudah tersebut pemerintah tahun oleh saya atau mereka kepada menjadi harus hari kata sebagai adalah lebih para mengatakan hanya orang telah masih bahwa tetapi namun saat seperti negara sekitar secara lain kami satu baru Freq. 55971 41286 24768 18723 18281 17632 16886 15681 14707 12433 9343 9212 8592 8310 7935 6703 6690 6121 5963 5766 5675 5643 5429 5392 5336 5219 5184 5163 5148 5068 4922 4818 4686 4650 4457 4430 4363 4283 4266 4180 4134 4105 4080 4027 4009 3959 3797 3785 3750 3591 No. 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 Word lalu kita kalau belum terjadi besar terhadap kepala masyarakat sampai sementara politik setelah tak antara lagi ketua melakukan dilakukan saja katanya persen dapat daerah jalan anggota sangat pun hal rumah warga beberapa seorang banyak atas ekonomi agar serta bagi kota kembali ketika hukum selama tiga merupakan sebuah kedua negeri luar Freq. 3495 3467 3438 3422 3417 3346 3284 3243 3211 3211 3197 3197 3183 3177 3149 3145 3093 2989 2897 2894 2866 2858 2839 2820 2798 2796 2785 2756 2749 2736 2676 2639 2618 2613 2593 2546 2525 2517 2467 2439 2410 2394 2369 2367 2347 2340 2306 2277 2256 2225 No. 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 Word dana pukul bukan tetap jika semua sama waktu sejumlah bank polisi memang hingga sejak partai baik sekarang sendiri tim apa menyatakan tentang korban pihak sehingga dunia demikian lainnya masalah rakyat salah kasus tempat kemudian berbagai keamanan harga tengah pertemuan bulan langsung wakil selain membuat pasar malam pertama nasional sebesar bahkan Freq 2191 2184 2174 2169 2128 2122 2110 2109 2086 2083 2073 2062 2042 2019 2017 2012 1991 1965 1959 1955 1951 1930 1890 1889 1860 1856 1855 1847 1815 1807 1803 1796 1793 1792 1790 1788 1785 1767 1763 1755 1748 1696 1664 1652 1640 1623 1623 1619 1612 1607 No 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 Word kali umum ujar terus jelas sedang diri memberikan juta sebelumnya masuk hasil adanya maupun berada per pernah meminta bangsa kini jadi menurut soal segera aksi perlu mulai sebelum bersama termasuk seluruh pusat agung milyar sidang kenaikan akibat melalui rapat setiap empat tanpa pemerintahan begitu pesawat kerja kemarin apakah ujarnya datang Freq 1587 1584 1564 1539 1528 1502 1495 1488 1482 1473 1469 1454 1450 1447 1445 1445 1442 1433 1423 1419 1416 1409 1404 1397 1397 1391 1388 1379 1372 1371 1370 1364 1358 1357 1349 1334 1333 1315 1303 1297 1296 1287 1257 1256 1254 1243 1241 1234 1216 1211
38
Appendix D
A Stoplist for Bahasa Indonesia

Table D.1: Suggested stoplist for Bahasa Indonesia
Word ada adanya adalah adapun agak agaknya agar akan akankah akhirnya aku akulah amat amatlah anda andalah antar diantaranya antara antaranya diantara apa apaan mengapa apabila apakah apalagi apatah atau ataukah ataupun bagai bagaikan Root ada ada adalah adapun agak agak agar akan akan akhir aku aku amat amat anda anda antar antar antara antara antara apa apa apa apabila apakah apalagi apatah atau atau atau bagai bagai Part of Speech verb noun verb particle adverb adverb particle particle particle noun pronomia pronomia adverb adverb noun noun particle verb noun particle verb pronomia pronomia pronomia particle pronomia pronomia pronomia particle particle particle noun particle Word lah lain lainnya melainkan selaku lalu melalui terlalu lama lamanya selama selama-lamanya selamanya lebih terlebih bermacam bermacam-macam macam semacam maka makanya makin malah malahan mampu mampukah mana manakala manalagi masih masihkah semasih masing Root Part of Speech
lah particle lain adjective lain adjective lain verb laku particle lalu verb lalu verb lalu adverb lama adjective lama noun lama noun lama adjective lama adjective lebih adjective lebih adverb macam adjective macam adjective macam noun macam adverb maka particle maka particle makin adverb malah adverb malah adverb mampu adjective mampu adjective mana pronoun manakala particle manalagi particle masih adverb masih adverb masih adverb masing pronomia continue to next page
39
continued from previous page Word Root sebagai bagai sebagainya bagai bagaimana bagaimana bagaimanapun bagaimana sebagaimana bagaimana bagaimanakah bagainamakah bagi bagi bahkan bahkan bahwa bahwa bahwasanya bahwasannya sebaliknya balik banyak banyak sebanyak banyak beberapa beberapa seberapa beberapa begini begini beginian begini beginikah begini beginilah begini sebegini begini begitu begitu begitukah begitu begitulah begitu begitupun begitu sebegitu begitu belum belum belumlah belum sebelum belum sebelumnya belum sebenarnya benar berapa berapa berapakah berapa berapalah berapa berapapun berapa betulkah betul sebetulnya betul biasa biasa biasanya biasa bila bila bilakah bila bisa bisa bisakah bisa sebisanya bisa boleh boleh bolehkah boleh bolehlah boleh buat buat bukan bukan bukankah bukan bukanlah bukan bukannya bukan
Part of Speech particle particle pronomia pronomia particle pronomia particle adverb particle particle adverb adjective numeralia numeralia numeralia pronomia adjective pronomia pronomia numeralia adverb adverb adverb adverb numeralia adverb adverb adverb adverb adverb pronomia pronomia pronomia pronomia adjective adverb adjective adjective particle particle verb verb adverb particle particle particle particle adverb pronomia adverb adverb
Word masing-masing mau maupun semaunya memang mereka merekalah meski meskipun semula mungkin mungkinkah nah namun nanti nantinya nyaris oleh olehnya seorang seseorang pada padanya padahal paling sepanjang pantas sepantasnya sepantasnyalah para pasti pastilah per pernah pula pun merupakan rupanya serupa saat saatnya sesaat saja sajalah saling bersama bersama-sama sama sama-sama sesama sambil
Root Part of Speech masing-masing pronomia mau adverb mau particle mau adverb memang adverb mereka pronomia mereka pronomia meski particle meski particle mula adverb mungkin adverb mungkin adverb nah particle namun particle nanti adverb nanti adverb nyaris adverb oleh particle oleh particle orang noun orang noun pada particle pada particle padahal particle paling adverb panjang noun pantas adjective pantas adjective pantas adjective para particle pasti adjective pasti adjective per paticle pernah adverb pula particle pun particle rupa verb rupa noun rupa verb saat noun saat noun saat noun saja adverb saja adverb saling adverb sama verb sama verb sama adjective sama adjective sama noun sambil particle continue to next page
40
continued from previous page Word Root Part of Speech cuma cuma adverb percuma cuma adverb dahulu dahulu adverb dalam dalam particle dan dan particle dapat dapat adverb dari dari particle daripada daripada particle dekat dekat adjective demi demi particle demikian demikian pronomia demikianlah demikian pronomia sedemikian demikian pronomia dengan dengan particle depan depan noun di di particle dia dia pronomia dialah dia pronomia dini dini adjective diri diri noun dirinya diri noun terdiri diri verb dong dong particle dulu dulu adverb enggak enggak adverb enggaknya enggak adverb entah entah adverb entahlah entah adverb terhadap hadap particle terhadapnya hadap particle hal hal noun hampir hampir adverb hanya hanya adverb hanyalah hanya adverb harus harus adverb haruslah harus adverb harusnya harus adverb seharusnya harus adverb hendak hendak particle hendaklah hendak adverb hendaknya hendak particle hingga hingga particle sehingga hingga particle ia ia pronomia ialah ialah particle ibarat ibarat particle ingin ingin particle inginkah ingin verb inginkan ingin verb ini ini pronomia inikah ini pronomia
Word sampai sana sangat sangatlah saya sayalah se sebab sebabnya sebuah tersebut tersebutlah sedang sedangkan sedikit sedikitnya segala segalanya segera sesegera sejak sejenak sekali sekalian sekalipun sesekali sekaligus sekarang sekarang sekitar sekitarnya sela selain selalu seluruh seluruhnya semakin sementara sempat semua semuanya sendiri sendirinya seolah seolah-olah seperti sepertinya sering seringnya serta siapa
Root Part of Speech sampai verb sana noun sangat adverb sangat adverb saya pronomia saya pronomia se particle sebab particle sebab particle sebuah numeralia sebut verb sebut verb sedang particle sedang particle sedikit adjective sedikit adverb segala adjective segala adjective segera adverb segera adverb sejak particle sejenak noun sekali adverb sekali numeralia sekali particle sekali adverb sekaligus adverb sekarang adverb sekaranglah adverb sekitar noun sekitar noun sela adverb selain particle selalu adverb seluruh numeral seluruh numeral semakin adverb sementara particle sempat adverb semua numeralia semua adverb sendiri adverb sendiri adverb seolah verb seolah adverb seperti particle seperti particle sering adverb sering adverb serta particle siapa pronomia continue to next page
41
continued from previous page Word Root Part of Speech inilah ini pronomia itu itu pronomia itukah itu pronomia itulah itu pronomia jangan jangan particle jangankan jangan particle janganlah jangan particle jika jika particle jikalau jikalau particle juga juga adverb justru justru adverb kala kala noun kalau kalau particle kalaulah kalau particle kalaupun kalau particle berkali-kali kali adverb sekali-kali kali adverb kalian kalian pronomia kami kami pronomia kamilah kami pronomia kamu kamu pronomia kamulah kamu pronomia kan kan particle kapan kapan particle kapankah kapan particle kapanpun kapan particle dikarenakan karena verb karena karena particle karenanya karena particle ke ke particle kecil kecil adjective kemudian kemudian particle kenapa kenapa pronomia kepada kepada particle kepadanya kepadanya particle ketika ketika noun seketika ketika adverb khususnya khusus adverb kini kini adverb kinilah kini adverb kiranya kira adverb sekiranya kira verb kita kita pronomia kitalah kita pronomia kok kok particle lagi lagi adverb lagian lagi adverb
Word siapakah siapapun disini disinilah sini sinilah sesuatu sesuatunya suatu sesudah sesudahnya sudah sudahkah sudahlah supaya tadi tadinya tak tanpa setelah telah tentang tentu tentulah tentunya tertentu seterusnya tapi tetapi setiap tiap setidak-tidaknya setidaknya tidak tidakkah tidaklah toh waduh wah wahai sewaktu walau walaupun wong yaitu yakni yang
Root siapa siapa sini sini sini sini suatu suatu suatu sudah sudah sudah sudah sudah supaya tadi tadi tak tanpa telah telah tentang tentu tentu tentu tentu terus tetapi tetapi tiap tiap tidak tidak tidak tidak tidak toh waduh wah wahai waktu walau walau wong yaitu yakni yang
Part of Speech pronomia pronomia adverb adverb adverb adverb pronomia pronomia pronomia particle particle adverb adverb adverb particle adverb adverb adverb adverb adverb adverb particle adjective adjective adverb adjective adverb particle particle numeralia adjective adverb adverb adverb adverb adverb particle particle particle particle noun particle particle pronomia particle particle particle
42
Table D.2: Most common words in Bahasa Indonesia newspapers

Word berada keadaan akhir akhiri berakhir berakhirlah berakhirnya diakhiri diakhirinya mengakhiri terakhir artinya berarti asal asalkan atas awal awalnya berawal berbagai bagian sebagian baik sebaik sebaik-baiknya sebaiknya bakal bakalan balik terbanyak bapak baru bawah belakang belakangan benar benarkah benarlah beri berikan diberi diberikan diberikannya memberi memberikan besar sebesar betul kebetulan Root ada ada akhir akhir akhir akhir akhir akhir akhir akhir akhir arti arti asal asal atas awal awal awal bagai bagi bagi baik baik baik baik bakal bakal balik banyak bapak baru bawah belakang belakang benar benar benar beri beri beri beri beri beri beri besar besar betul betul Part of Speech verb noun noun verb verb verb noun verb verb verb adjective noun verb particle particle noun noun noun verb verb noun noun adjective adjective adverb adverb adverb verb noun adjective noun adjective noun noun noun adjective adjective adjective verb verb verb verb verb verb verb adjective adjective adjective adverb Word masa semasa masalah masalahnya termasuk semata semata-mata diminta dimintai meminta memintakan minta mirip dimisalkan memisalkan misal misalkan misalnya semisal semisalnya bermula mula mulanya dimulai dimulailah dimulainya memulai mulai mulailah dimungkinkan kemungkinan kemungkinannya memungkinkan menaiki naik menanti menanti-nanti menantikan menyatakan nyatanya ternyata pak panjang dipastikan memastikan penting pentingnya diperlukan diperlukannya Root Part of Speech
masa noun masa adverb masalah noun masalah noun masuk verb mata adverb mata adverb minta verb minta verb minta verb minta verb minta verb mirip adverb misal verb misal verb misal noun misal verb misal noun misal noun misal noun mula verb mula noun mula verb mulai verb mulai verb mulai noun mulai verb mulai verb mulai verb mungkin verb mungkin noun mungkin noun mungkin verb naik verb naik verb nanti verb nanti verb nanti verb nyata verb nyata adjective nyata verb pak pronomia panjang adjective pasti verb pasti verb penting adjective penting adjective perlu verb perlu noun continue to next page
43
continued from previous page Word Root dibuat buat dibuatnya buat diperbuat buat diperbuatnya buat membuat buat memperbuat buat bulan bulan bung bung cara cara caranya cara secara cara cukup cukup cukupkah cukup cukuplah cukup secukupnya cukup terdahulu dahulu didapat dapat mendapat dapat mendapatkan dapat terdapat dapat berdatangan datang datang datang didatangkan datang mendatang datang mendatangi datang mendatangkan datang dua dua kedua dua keduanya dua empat empat seenaknya enak digunakan guna dipergunakan guna guna guna gunakan guna mempergunakan guna menggunakan guna hari hari berkehendak hendak menghendaki hendak diibaratkan ibarat diibaratkannya ibarat ibaratkan ibarat ibaratnya ibarat mengibaratkan ibarat mengibaratkannya ibarat ibu ibu berikut ikut berikutnya ikut ikut ikut diingat ingat
Part of Speech verb verb verb verb verb verb noun noun noun noun particle adjective adjective adjective adjective adverb verb verb verb verb verb verb verb adjective verb verb numeralia numeralia numeralia numeralia adjective verb verb noun verb verb verb noun verb verb verb noun verb particle verb verb noun adjective adjective verb verb
Word memerlukan perlu perlukah perlunya seperlunya pertama pertama-tama memihak pihak pihaknya sepihak pukul dipunyai mempunyai punya merasa rasa rasanya terasa rata berupa disampaikan kesampaian menyampaikan sampai-sampai sampaikan sesampai tersampaikan menyangkut satu disebut disebutkan disebutkannya menyebutkan sebut sebutlah sebutnya keseluruhan keseluruhannya menyeluruh sendirian bersiap bersiap-siap mempersiapkan menyiapkan siap dipersoalkan mempersoalkan persoalan soal soalnya
Root Part of Speech perlu verb perlu adverb perlu adverb perlu noun perlu adverb pertama numeralia pertama adverb pihak verb pihak noun pihak noun pihak noun pukul noun punya verb punya verb punya verb rasa verb rasa noun rasa noun rasa verb rata adverb rupa verb sampai verb sampai verb sampai verb sampai verb sampai verb sampai particle sampai verb sangkut verb satu numeralia sebut verb sebut verb sebut verb sebut verb sebut verb sebut verb sebut verb seluruh noun seluruh noun seluruh verb sendiri pronomia siap verb siap verb siap verb siap verb siap verb soal verb soal verb soal noun soal noun soal noun continue to next page
44
continued from previous page Word Root Part of Speech diingatkan ingat verb ingat ingat verb ingat-ingat ingat verb mengingat ingat verb mengingatkan ingat verb seingat ingat adverb teringat ingat verb teringat-ingat ingat verb berkeinginan ingin verb diinginkan ingin verb keinginan ingin noun menginginkan ingin verb jadi jadi verb jadilah jadi verb jadinya jadi noun menjadi jadi verb terjadi jadi verb terjadilah jadi verb terjadinya jadi noun jauh jauh adjective sejauh jauh noun dijawab jawab verb jawab jawab verb jawaban jawab verb jawabnya jawab verb menjawab jawab verb dijelaskan jelas verb dijelaskannya jelas verb jelas jelas adjective jelaskan jelas verb jelaslah jelas adjective jelasnya jelas verb menjelaskan jelas verb berjumlah jumlah verb jumlah jumlah noun jumlahnya jumlah noun sejumlah jumlah noun sekadar kadar adverb sekadarnya kadar adverb kasus kasus noun berkata kata verb dikatakan kata verb dikatakannya kata noun kata kata verb katakan kata verb katakanlah kata verb katanya kata noun mengatakan kata verb mengatakannya kata verb sekecil kecil adjective keluar keluar verb
Word diketahui diketahuinya mengetahui tahu tahun ditambahkan menambahkan tambah tambahnya tampak tampaknya ditandaskan menandaskan tandas tandasnya bertanya bertanya-tanya dipertanyakan ditanya ditanyai ditanyakan mempertanyakan menanya menanyai menanyakan pertanyaan pertanyakan tanya tanyakan tanyanya ditegaskan menegaskan tegas tegasnya setempat tempat setengah tengah tepat terus tetap setiba setibanya tiba tiba-tiba tiga setinggi tinggi ditujukan menuju tertuju
Root Part of Speech tahu verb tahu noun tahu verb tahu verb tahun noun tambah verb tambah verb tambah verb tambah verb tampak verb tampak verb tandas verb tandas verb tandas adjectice tandas verb tanya verb tanya verb tanya verb tanya verb tanya verb tanya verb tanya verb tanya verb tanya verb tanya verb tanya noun tanya verb tanya verb tanya verb tanya verb tegas verb tegas verb tegas verb tegas verb tempat noun tempat noun tengah numeralia tengah adverb tepat adjective terus adverb tetap adjective tiba particle tiba noun tiba verb tiba-tiba adverb tiga numeralia tinggi adjective tinggi adjective tuju verb tuju verb tuju verb continue to next page
45
continued from previous page Word Root kembali kembali berkenaan kena mengenai kena bekerja kerja dikerjakan kerja mengerjakan kerja dikira kira diperkirakan kira kira kira kira-kira kira memperkirakan kira mengira kira terkira kira kurang kurang sekurang-kurangnya kurang sekurangnya kurang berlainan lain dilakukan laku melakukan laku berlalu lalu dilalui lalu keterlaluan lalu kelamaan lama berlangsung langsung lanjut lanjut lanjutnya lanjut selanjutnya lanjut berlebihan lebih lewat lewat dilihat lihat diperlihatkan lihat kelihatan lihat kelihatannya lihat melihat lihat melihatnya lihat memperlihatkan lihat terlihat lihat kelima lima lima lima luar luar bermaksud maksud dimaksud maksud dimaksudkan maksud dimaksudkannya maksud dimaksudnya maksud semampu mampu semampunya mampu
Part of Speech verb verb particle verb verb verb verb verb noun adverb verb verb verb adverb adverb adverb verb verb verb verb verb adjective adjective verb adjective verb adverb adjective particle verb verb noun noun verb verb verb verb numeralia numeralia noun verb verb verb verb verb adjective adjective
Word ditunjuk ditunjuki ditunjukkan ditunjukkannya ditunjuknya menunjuk menunjuki menunjukkan menunjuknya tunjuk berturut berturut-turut menurut turut bertutur dituturkan dituturkannya menuturkan tutur tuturnya diucapkan diucapkannya mengucapkan mengucapkannya ucap ucapnya berujar ujar ujarnya umum umumnya diungkapkan mengungkapkan ungkap ungkapnya untuk usah seusai usai terutama waktu waktunya meyakini meyakinkan yakin
Root tunjuk tunjuk tunjuk tunjuk tunjuk tunjuk tunjuk tunjuk tunjuk tunjuk turut turut turut turut tutur tutur tutur tutur tutur tutur ucap ucap ucap ucap ucap ucap ujar ujar ujar umum umum ungkap ungkap ungkap ungkap untuk usah usai usai utama waktu waktu yakin yakin yakin
Part of Speech verb verb verb verb verb verb verb verb verb verb adverb adverb particle verb verb verb noun verb verb verb verb verb verb verb verb verb verb noun noun adjective adverb verb verb verb verb particle verb particle verb adverb noun noun verb verb adjective
46
Bibliography
[1] Tata Bahasa Melayu, 1998. URL http://tatabahasabm.tripod.com. [2] F. Ahmad, M. Yuso, and T. M. T. Sembok. Experiments with a Stemming Algorithm for Malay Words. Journal of The American Society for Information Science, 47:909918, 1996. [3] R. Baeza and B. Ribeiro. Modern Information Retrieval. Addison Wesley, 1999. [4] C. Buckley, A. Singhal, M. Mitra, and G. Salton. New Retrieval Approaches using SMART: TREC 4. In D. Harman, editor, Proceedings of the Fourth Text REtrieval Conference (TREC4), pages 2548, 1995. [5] C. Buckley and E. M. Voorhees. Evaluating Evaluation Measure Stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3340, Athena, Greece, 2000. ACM Press. [6] Dept. of Cultural and Education, Republic of Indonesia, editor. Pedoman Umum Ejaan Bahasa Indonesia yang Disempurnakan. Pustaka Setia, 1987. [7] Dept. of Cultural and Education, Republic of Indonesia, editor. Tata Bahasa Baku Bahasa Indonesia. Balai Pustaka, 1988. [8] Dept. of Education, Republic of Indonesia, editor. Kamus Besar Bahasa Indonesia. Balai Pustaka, 2001. [9] C. Fox. Lexical Analysis and Stoplists. In Frakes and Baeza [11], pages 102130. [10] W. B. Frakes. Stemming Algorithms. In Frakes and Baeza [11], pages 131160. [11] W. B. Frakes and R. Baeza, editors. Information Retrieval, Data Structures and Algorithms. Prentice Hall, 1992. [12] D. Harman. How eective is suxing? Journal of The American Society for Information Science, 42:715, 1991. [13] V. Hollink, J. Kamps, C. Monz, and M. de Rijke. Monoligual Document Retrieval for European Languages. 2003. [14] D. A. Hull. Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 329338, Pittsburgh, Pennsylvania, 1993. ACM Press. [15] D. A. Hull. Stemming Algorithms - A Case Study for Detailed Evaluation. Journal of The American Society for Information Science, 47, 1996. [16] M. Kantrowitz, B. Mohit, and V. Mittal. Stemming and Its Eects on TFIDF Ranking. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 357359, Athens, Greece, 2000. ACM Press. 47
[17] W. Kraaij and R. Pohlman. Porters stemming algorithm for Dutch. In L. G. M. Noordman and W. A. M. de Vroomen, editors, Informatiewetenschaap 1994: Wetenschappelijke Bijdragen aan de deede STINFO Conferentie, pages 167180, Tilburg, 1994. [18] W. Kraaij and R. Pohlman. Evaluation of a Dutch Stemming Algorithm. In J. Rowley, editor, The New Review of Document and Text Management, volume 1, pages 2543. Taylor Graham, 1995. [19] W. Kraaij and R. Pohlman. Viewing Stemming as Recall Enhancement. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4048, Zurich, Switzerland, 1996. [20] R. Krovetz. Viewing Morphology as an Inference Process. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191203, Pittsburgh, Pennsylvania, 1993. ACM Press. [21] C. Monz and M. de Rijke. Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, Germany and Italian. In C. Peters, M. Braschler, J. Gonzalo, and M. Kluck, editors, Evaluation of Cross-Language Information Retrieval System. Second Workshop of the Cross-Language Evaluation Form, CLEF 2001, LCNS 2406, pages 357359, Darmstadt, Germany, Sept. 2001. Springer. [22] B. Nazief. Spelling Checker Facility and The Analysis of the Word Frequency. Proceedings of Computer and Arts Conference, 1995. [23] B. Nazief and M. Adriani. Conx Stripping: Approach to Stemming Algorithm for Bahasa Indonesia. Technical report, Faculty of Computer Science, University of Indonesia, Depok, 1996. [24] C. D. Paice. Another Stemmer. ACM SIGIR Forum, 24(3):5661, 1990. [25] C. D. Paice. Method for Evaluation of Stemming Algorithms Based on Error Counting. Journal of The American Society for Information Science, 47(8):632649, Aug. 1996. [26] C. D. Paice. What is Stemming?, 1996. URL http://www.comp.lancs.ac.uk/computing/ research/stemming/general/index.htm. [27] A. Pirkola. Morphological Typology of Languages fo IR. Journal of Documentation, 57: 330348, May 2001. [28] M. Popovic and P. Willett. The Eectiveness of Stemming for Natural-Language Access to Slovene Textual Data. Journal of the American Society for Information Science, 43(5): 384390, June 1992. [29] M. F. Porter. An algorithm for sux stripping. Program, 14(3):130137, July 1980. [30] G. Salton and M. J. McGill, editors. Introduction to Modern Information Retrieval. McGrawHill, 1983. [31] J. Savoy. Stemming of French Words Based on Grammatical Catagories. Journal of the American Society for Information Science, 44:19, Jan. 1993. [32] J. Savoy. Report on CLEF-2001 Experiments. In C. Peters, editor, Results of the CLEF 2001, Cross-Language System Evaluation Campaign, pages 1119, Sophia-Antipolis, 2001. [33] A. Singhal, C. Buckley, and M. Mitra. Pivoted Document Length Normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 1996. ACM Press.
48
[34] S. Y. Tai, C. S. Ong, and N. A. Abdullah. On Designing an Automated Malaysian Stemmer for the Malay Language. In Procedings of the Fifth International Workshop on Information Retrieval with Asian Languages, pages 207208, Hongkong, China, 2000. ACM Press. [35] H. G. Tarigan. Pengajaran Morfologi. Angkasa, Bandung, 1995. [36] R. J. Wonnacott and T. H. Wonacott. Introductory Statistics. John Willey & Son, fourth edition, 1985.
49

A Study of Stemming Effects On Information Retrieval in Bahasa Indonesia

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Study of Stemming Effects On Information Retrieval in Bahasa Indonesia

Uploaded by

Copyright:

Available Formats

A Study of Stemming Eects on Information Retrieval in Bahasa Indonesia

Fadillah Z Tala 0086975

The Paice Evaluation Method . . . . . . . . . . . . . . . . . . . . . . . . . . The Paice Experimental Results . . . . . . . . . . . . . . . . . . . . . . . .

Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 3.2.2 Inectional Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Derivational Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The FlexIR System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Performance Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

4.3.1 4.3.2 4.3.3 4.4 4.5

Precision/Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R-Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

A.1 Rules and Variation Forms of Prexes . . . . . . . . . . . . . . . . . . . . . . . . . vi

A Purely Rule-based Stemmer for Bahasa Indonesia

Morphological Structure of Bahasa Indonesia Words

tas (bag) sepeda (bicycle)

tasmu (your bag) sepedaku (my bicycle)

The Porter Stemming Algorithm

Porter Stemmer for Bahasa Indonesia

Remove Possesive Pronoun

Remove 1st Order Prefix fail a rule is fired

Remove 2nd Order Prefix

Remove Suffix a rule is fired

Remove 2nd Order Prefix

This notation means that the stem starts with a vowel.

Replacement NULL NULL NULL NULL NULL NULL

Additional Condition NULL ajar K er. . . NULL ajar NULL

This notation means that the stem starts with a consonant.

Table 2.8: Examples of syllables in Bahasa Indonesia words.

Evaluation of the Stemming Algorithm

Stemmer Quality Evaluation

The Paice Evaluation Method

ngi (Ng ngi ),

nsi (Ns nsi ),

The Paice Experimental Results

(a) semantically grouped words

(b) after stemming process, UI=0.6

(c) reordered stemmed words into stem group, OI=0.4

good stemmer will lie closely to the origin.

Table 3.3: Errors in the inectional sux stripping.

Table 3.5: Spelling adjustment errors in stripping suxes.

Stemmer Performance Evaluation for Information Retrieval

The Test Collections

Figure 4.1: Document example: kompas document KOMPAS-HL2001-310101-PRES01.

The Information Requests (Queries)

Relevant Documents for Every Information Request

The FlexIR System

1 0.9 0.8 0.7 Precision

0.5 0.6 Recal

(a) NoSo vs. So for kompas

(b) NoSo vs. So for tempo

1 0.9 0.8 0.7 Precision

0.5 0.6 Recal

(a) NoSm vs. Porter

(b) NoSm vs. Nazief

1 0.9 0.8 0.7 Precision

Nazief Porter NoSm

(c) Porter vs. Nazief

(d) NoSm vs. Porter vs. Nazief

Figure 4.4: PR-Curves for kompas Collection.

1 0.9 0.8 0.7 Precision

0.5 0.6 Recal

(a) NoSm vs. Porter

(b) NoSm vs. Nazief

1 0.9 0.8 0.7 Precision

Nazief Porter NoSm

0.5 0.6 Recal

(c) Porter vs. Nazief