The Effectiveness of a Jawi Stemmer for Retrieving Relevant Malay
Documents in Jawi Characters
SULIANA SULAIMAN, Sultan Idris Education University, Malaysia
KHAIRUDDIN OMAR, NAZLIA OMAR, MOHD ZAMRI MURAH, and
HAMDAN ABDUL RAHMAN, Universiti Kebangsaan Malaysia

The Malay language has two types of writing script, known as Rumi and Jawi. Most previous stemmer
results have reported on Malay Rumi characters and only a few have tested Jawi characters. In this article,
a new Jawi stemmer has been proposed and tested for document retrieval. A total of 36 queries and datasets
from the transliterated Jawi Quran were used. The experiment shows that the mean average precision for
a stemmed Jawi document is 8.43%. At the same time, the mean average precision for a nonstemmed
Jawi document is 5.14%. The result from a paired sample t-test showed that the use of a stemmed Jawi
document increased the precision in document retrieval. Further experiments were performed to examine
the precision of the relevant documents that were retrieved at various cutoff points for all 36 queries. The
results for the stemmed Jawi documents differed significantly from those for the nonstemmed
Jawi documents, starting at a cutoff of 40. This result shows the usefulness of a Jawi stemmer for retrieving
relevant documents in the Jawi script.
Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing -
Language models; Language parsing and understanding; Text analysis; H.3.4 [Information Storage and
Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness); H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Linguistic processing
General Terms: Languages, Performance
Additional Key Words and Phrases: Jawi stemmer, Malay stemmer, Jawi document retrieval, stemming
ACM Reference Format:
Sulaiman, S., Omar, K., Omar, N., Murah, M. Z., and Rahman, H. A. 2014. The effectiveness of a Jawi
stemmer for retrieving relevant Malay documents in Jawi characters. ACM Trans. Asian Lang. Inform.
Process. 13, 2, Article 6 (June 2014), 21 pages.
DOI:http://dx.doi.org/10.1145/2540988

1. INTRODUCTION

Stemming in Malay is more complex than in English. The Malay language has two
different types of script: the Jawi script and the Rumi script. Jawi is an Arabic-script-based orthography, and Rumi is a Roman-based script. Jawi
is read from right to left and has different forms of characters. For example, the word
king in Malay can be written as in the Jawi or Raja in the Rumi. The Jawi
script was used as early as 674 [Nasruddin et al. 2008]. It is also used as a writing
system in the Malay Archipelago.
Authors' addresses: S. Suliana (corresponding author), Faculty of Art Computing and Creative Industry,
Sultan Idris Education University, Tanjung Malim, Perak Darul Ridzuan 35900, Malaysia; email:
ssuliana@yahoo.com; K. Omar, N. Omar, M. Z. Murah, and H. A. Rahman, Universiti Kebangsaan Malaysia,
43600 Bangi, Selangor, Malaysia.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned
by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
© 2014 ACM 1530-0226/2014/06-ART6 $15.00
ACM Transactions on Asian Language Information Processing, Vol. 13, No. 2, Article 6, Publication date: June 2014.

Jawi has also been used as an art form to perform Islamic calligraphy. This type
of calligraphy can be seen in architecture, where walls are decorated using the Jawi
script. Historically, the Jawi script was used for writing on inscribed stones and wood
[Moain 1992]. The Jawi script was used as an official written script for communication
between the Malay king and the British king. The Jawi script was very important at
that time; for example, Sultan Muzaffar Syah (King of Perak) embossed his name on
the Perak currency using the Jawi script during his rule [Yatim 1990]. The Jawi script
has been used in approximately 15,000 manuscripts that are kept in the British Library
[Nasruddin et al. 2008].
Over time, the Jawi script was replaced by the Rumi script, which is still in
use today. The official use of the Rumi script started in the early 20th century.
The Rumi script is a romanized transliteration of the Jawi script. During that period
the Malaysian government announced that the official language of Malaysia would use
the Rumi script and that all documents must be written in this script.
The motivation for developing a Jawi stemmer is that Jawi is not just a
writing script. Jawi has been used since the 14th century, when most Malays wrote
their manuscripts in Jawi. Ding Choo Ming emphasized the importance of preserving these manuscripts and making them accessible to scholars, as they are critical to
the study of Malay literature and culture [Ming 1986]. In the 19th century many of
these manuscripts were copied by Europeans [Ming 1986]. To this day, the National
Library of Malaysia and other agencies have been involved in efforts to preserve these
manuscripts and prevent them from being destroyed. One such effort has been to digitize
the manuscripts; a Jawi stemmer is therefore beneficial, especially in helping to search
for the appropriate word or term in the digitized manuscripts. In addition, it
can also be used as one of the components to transliterate the Jawi script into the Rumi
script [Ghani et al. 2009; Yon Hendri 2009]. The Malaysian Ministry of Education is
taking serious measures to preserve the Jawi script and one of their aims is to ensure
that primary-school children can read Jawi.
The difference between Jawi and Rumi can be seen from the characters, the vowels,
the spelling method, and the loan words. Even though the two scripts are different
from each other, the language is purely Malay. Vowels in the Rumi script are represented using six different sounds: [a], [e], [i], [o], [u], and [ə]. However, in the Jawi
script, these six vowel sounds are represented by three characters: <alif>, <wau>, and <ya>.
This difference is one of the reasons why the Rumi and Jawi scripts are distinct even
though they represent the same language. Figure 1 shows vowel representation in the
Jawi and Rumi scripts for the Malay language.
Malay words consist of one or more syllables.
A syllable is a vowel sound that is produced when the word is pronounced. For
example, single-syllable words include <ru>, <cap>, and <bah>; two-syllable
words include <bulan>, which is a combination of two syllables, bu + lan; three-syllable
words are a combination of three single syllables, such as <utara> = u + ta + ra;
four-syllable words are a combination of four single syllables, such as
<sementara> = se + men + ta + ra; five-syllable words are a combination of five syllables,
such as <universiti> = u + ni + ver + si + ti; and six-syllable words are a
combination of six syllables, such as <keanak-anakan> = ke + a + nak + a + na + kan.
These syllables can be divided into open and closed syllables, based on specific
patterns. Open syllables have three patterns, such as the Vowel pattern (Vp), which
is composed of only one vowel sound, the Consonant Vowel pattern (CVp), composed
of one consonant and one vowel sound, and the Consonant Diphthong pattern (CDp),
composed of a consonant sound followed by a diphthong. At the same time, closed
syllables are syllables that end with a consonant character. Closed syllables contain
two patterns: the Vowel Consonant pattern (VCp), which is composed of one vowel and
one consonant, and the Consonant Vowel Consonant pattern (CVCp), which contains
a consonant followed by a vowel followed by a consonant. Most pure Malay words are
disyllabic.

Fig. 1. Vowel representation in the Jawi and Rumi scripts for the Malay language.

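The open and closed syllable patterns described above can be illustrated with a minimal sketch. It works on Rumi syllables rather than Jawi characters, and the digraph list, the diphthong list, and the function names are assumptions made for illustration only; they are not taken from the article.

```python
# Hypothetical helper: classify a Rumi syllable as open or closed.
DIGRAPHS = ("ng", "ny", "sy", "kh", "gh")  # assumed: each counts as one consonant
DIPHTHONGS = ("ai", "au", "oi")            # assumed Malay diphthongs
VOWELS = set("aeiou")

OPEN_PATTERNS = {"V", "CV", "CD"}    # Vp, CVp, CDp
CLOSED_PATTERNS = {"VC", "CVC"}      # VCp, CVCp


def syllable_pattern(syllable):
    """Map a syllable to a string over C (consonant), V (vowel), D (diphthong)."""
    s = syllable.lower()
    pattern = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIPHTHONGS:
            pattern.append("D")
            i += 2
        elif pair in DIGRAPHS:
            pattern.append("C")
            i += 2
        elif s[i] in VOWELS:
            pattern.append("V")
            i += 1
        else:
            pattern.append("C")
            i += 1
    return "".join(pattern)


def classify(syllable):
    """Return (pattern, 'open' | 'closed' | 'unknown') for one syllable."""
    p = syllable_pattern(syllable)
    if p in OPEN_PATTERNS:
        return p, "open"
    if p in CLOSED_PATTERNS:
        return p, "closed"
    return p, "unknown"


if __name__ == "__main__":
    for syl in ("bu", "lan", "u", "bai"):
        print(syl, classify(syl))
```

For <bulan>, the sketch labels bu as an open CV syllable and lan as a closed CVC syllable.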
The aim of this article is to investigate whether the use of a Jawi stemmer can
increase the precision and recall in Jawi document retrieval. An experiment was performed in which the effect of the stemmer was measured by comparing the Mean Average Precision (MAP) of stemmed Jawi documents with that of nonstemmed Jawi documents
(no stemmer was used), to find out whether the stemmer had a positive effect on the
precision and recall. Next, statistical testing was performed to determine whether there
was a significant difference between the MAP of the stemmed Jawi documents and the MAP of the nonstemmed Jawi documents.
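As an illustration of the measure used in this comparison, the following sketch computes average precision per query and MAP over a set of queries using the standard definition. The document identifiers and relevance judgments are invented for the example, and the function names are hypothetical; the article does not show its evaluation code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Made-up rankings for two queries (assumption, not data from the article).
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d5", "d6"], {"d6"}),
    ]
    print(round(mean_average_precision(runs), 4))
```

Running the same function once on rankings from a stemmed index and once on rankings from a nonstemmed index gives the two MAP values that are then compared.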
This article is divided into six sections. Section 2 presents an overview of related
studies. Section 3 describes the Jawi stemmer. Section 4 clarifies the test collection.
Section 5 presents the experiments and results. Finally, Section 6 presents our conclusions and possible directions for future research.
2. RELATED STUDIES

A stemmer is also beneficial for transliteration. Roslan [2009] suggested a new method
for transliterating the Jawi script to the Rumi script using a rule-based system. The affix and the
root word must be separated by a stemmer to produce an easy and fast transliteration
process [Roslan 2009; Yon Hendri 2009].
Stemming was developed to reduce morphological variants of root words [Hull 1996].
A stemmer is used to increase recall and precision in some languages, such as English
[Harman 1991], Swedish [Carlberger et al. 2001], Bengali [Islam et al. 2007], Dutch
[Kraaij and Pohlmann 1996], and French [Savoy 1999]. Abdullah [2006] and Ahmad
[1995] studied the effect of stemmers on Malay documents. Their studies showed
that search engines retrieve more relevant results from stemmed documents than from
nonstemmed documents. However, the experiment was tested only on Rumi Malay
documents, and no results were reported on the effect of stemmers for Jawi documents.
In order to produce the most accurately stemmed words, techniques such as rule
based, n-gram, supervised learning, and dictionaries have been used. Suffix stripping
was an English stemmer introduced by Porter [1980]. Its algorithm is short and fast.

Complex suffixes can be removed using simple steps from the stemmer. The suffix is
removed depending on the remaining word. For this reason, the length of the word is
important [Porter 1980]. To be certain that the stemmer could be used to improve
the retrieval performance, the stemmer was tested using the Cranfield 200 collection [Cleverdon et al. 1966]. The results showed that the stemmer could improve the
retrieval compared with the program used in Cambridge since 1971 [Porter 1980].
Frakes and Baeza-Yates [1992] reported from a previous study that it was still unclear whether stemming is useful for certain languages. Harman [1991] demonstrated
that there was no significant improvement in retrieval with the S-stemmer, the Porter stemmer, or the Lovins stemmer. Flores et al. [2010] evaluated whether the
most accurate stemmers for Portuguese also achieve the best retrieval effectiveness. The results reported
for their experiment showed that a relationship exists but that it is not as strong as
previously estimated [Flores et al. 2010].
A Malay stemmer has been employed on the Malay-English Terminology Retrieval
System by Sembok et al. [2003]. Using the Malay stemmer, Malay science terminology could be retrieved. There are many types of stemmers, such as rule based, n-gram,
and suffix stripping. In 1999, Abu Bakar [1999] proposed the conflation method using a
combination of n-gram string and RAO stemming algorithm and showed the improvement in retrieval effectiveness on Malay documents. Adriani et al. [2007] developed
a confix-stripping approach to stem Indonesian-derived words. The inflectional and
derivational suffixes were removed, followed by derivational prefixes; then, recoding
was started, if possible. Prefix disambiguation was handled using a rule-based method.
If the prefix met the requirement of a rule, then it was returned as an appropriate
result; otherwise, the rule was skipped. For rule precedence, the prefix was removed before the suffix when certain affix pairs were encountered (be-..-lah, be-..-an,
me-..-i, di-..-i, pe-..-i, and te-..-i), to address common ambiguities. However, in rare
instances, the suffix was removed before the prefix. Hyphenated words were treated
using explicit lookup lists. The stemmer was tested using two experiments, namely to
find how accurate the stemmer is and how stemming affects information retrieval from
Indonesian text. The results showed that, even though the confix-stripping approach
gave the best results, it still could not solve all of the stemming problems because ambiguity is inherent in human languages. The results also showed that stemming does not
significantly help the retrieval performance on the Indonesian collection.
According to Malacon [2004], a rule-based approach is a necessary tool for processing
Malay documents. The author proposed a rule-based approach to analyse affixed words in
Malay [Malacon 2004]. The morpho-graphemic problem was solved by affirming that
modifications only affected the form of the base. It could also be solved by hashing
the word using segmentation rules and applying morpho-graphemic rules to build the
citation form of the base. Searching affixes from two directions (left and right) enables
us to identify a circumfix and thus to produce a correct segmentation. The results
show that the accuracy of the Malay morphology analysis was between 92% and 94%
for Malay text and was 89% for Indonesian text.
Othman [1993] developed the first Malay stemmer algorithm and used a dictionary
to stem Malay-derived words into their root words. The dictionary was divided into 26
different files. Another 26 files were used to help search for the roots in the dictionary.
The first Malay stemmer was developed using a rule-based approach. A total of 121
rules were used to ensure that the stemmer could stem as well as possible. These
rules were arranged and applied in alphabetical order to be certain that the computer
program was flexible in accepting changes to its morphological rules.
Derived words were checked using the morphological rules. The rules removed the
affix through binary searching, and the stemmed word was then checked against a
dictionary. According to Othman [1993], affix precedence must follow the order circumfix,
prefix, suffix, and infix. Root words were deemed valid when the stemmed
word was found in the dictionary, otherwise, a spelling exception was performed. The
spelling exception changed the first character of the stemmed word into its corresponding character, based on a rule. The result was then checked using the dictionary. If the
word was found in the dictionary, it was then output as a root word; otherwise, it would
continue onto the next rule, until the end of the rule was reached. If the derived word
failed to match any of the rules, then the word would be returned as a root word. No
results were reported on retrieval performance.
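The control flow described above (apply a rule, check the dictionary, try a spelling exception on the first character, then move on to the next rule) can be sketched as follows. This is a toy illustration in Rumi, not Othman's implementation: the three rules, the recoding map, and the three-word dictionary are assumptions, and the alphabetical rule ordering and the 121 rules of the original are not modelled.

```python
# Toy root-word dictionary (assumption; the real dictionary was split across 26 files).
ROOTS = {"makan", "tulis", "pukul"}

# Each rule: (prefix, suffix, recoding map for the first character of the stem).
RULES = [
    ("di", "", {}),
    ("me", "", {"n": "t", "m": "p"}),  # e.g. menulis -> nulis -> tulis
    ("", "an", {}),
]


def stem(word, rules=RULES, roots=ROOTS):
    """Return the first dictionary-validated stem, or the word itself if no rule fits."""
    for prefix, suffix, recode in rules:
        if not (word.startswith(prefix) and word.endswith(suffix)):
            continue
        candidate = word[len(prefix):len(word) - len(suffix)]
        if not candidate:
            continue
        if candidate in roots:
            return candidate
        # Spelling exception: replace the first character and check again.
        first = candidate[0]
        if first in recode:
            recoded = recode[first] + candidate[1:]
            if recoded in roots:
                return recoded
    # No rule produced a dictionary word: treat the input as a root word.
    return word


if __name__ == "__main__":
    for w in ("dimakan", "menulis", "memukul", "makanan", "komputer"):
        print(w, "->", stem(w))
```

A word that matches no rule, such as komputer in this toy setup, is returned unchanged, which mirrors the fallback described in the text.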
Ahmad [1995] developed a Rule-Application-Order (RAO) stemmer based on
Othman [1993]. Ahmad et al. [1996] performed several experiments to ensure the best
affix precedence in the Malay stemmer. In the RAO stemmer, each input was checked
with a dictionary to confirm whether the word was a root word or not. The three most
frequent affixes (i.e., prefix, suffix, and circumfix) were compared, and an infix was
inserted at the end of each list. It was found from the experiment that the best order
to stem affixation is the following: prefix, circumfix, suffix, and infix. A recoding rule
was performed on words that did not match the root word dictionary. Most of these
cases were implemented on a prefix and circumfix. The first letter of the stemmed
word was replaced with a stemmed character based on the recoding rule and again,
it was checked with the root word dictionary. This same process was used by Othman
[1993], except for the rule precedence and the type of dictionary that was used. The
RAO stemmer was tested with 736 distinct words and produced an accuracy of 98.4%.
Ahmad [1995] also tested the RAO stemmer to gauge the retrieval effectiveness of the
stemming algorithm. The experiment showed that there was a significant performance
difference between an RAO-stemmed document and a nonstemmed document at the
10% level but not at the 5% level.
Abdullah et al. [2009] proposed a Rule-Frequency-Order (RFO) stemmer for Malay, based on
the RAO stemmer. Errors from the RAO stemmer were examined to increase the stemmer performance. The dictionary and rules from Ahmad's set A were
used [Ahmad 1995]. These rules were sorted in decreasing order according to their
frequency. For the test collection, the first two chapters of the Quran were used. To
enhance Ahmad's stemmer [1995], another eight affixes were added to set A, and
several modifications were made to improve the spelling variation rule. Several new
words were added to the root word dictionary, for a total of 22,433 entries.
We can conclude from these results that the list of rules, the spelling variation, and
the root word dictionary affected the performance of the stemming algorithm. The
RFO stemmer produced fewer errors than Ahmad's stemmer [1995].
Abdullah [2006] also tested the RFO stemmer to investigate whether its use improved the retrieval effectiveness. The results show that there was a significant performance difference between an RFO-stemmed document and a nonstemmed document
at the 5% level.
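The idea of sorting rules in decreasing order of frequency can be sketched in a few lines. The rule labels and the observed hits below are invented for illustration, and the function name is hypothetical; the article does not give the actual frequencies.

```python
from collections import Counter


def order_rules_by_frequency(rules, observed_rule_hits):
    """Sort rules so that the most frequently applied rule is tried first.

    sorted() is stable, so rules with equal counts keep their original order.
    """
    counts = Counter(observed_rule_hits)
    return sorted(rules, key=lambda rule: -counts[rule])


if __name__ == "__main__":
    rules = ["ber-", "me-", "-an", "di-"]              # assumed rule labels
    hits = ["-an", "me-", "-an", "di-", "me-", "-an"]  # assumed rule applications
    print(order_rules_by_frequency(rules, hits))
```

With these made-up counts the suffix rule -an is tried first and the unused rule ber- is tried last.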
There are many tests that can be used in statistical testing. Therefore, the
test should be chosen to reflect the test's objective. Ahmad [1995] and Abdullah [2006]
used significance tests to compare the performance of the stemmer on stemmed documents and nonstemmed documents. Smucker et al. [2007] emphasized that a statistical significance test is valuable because it allows the researcher to detect significant
improvements, even when the difference is small. The authors performed a study to
identify the best statistical significance test to use for information retrieval evaluation
[Smucker et al. 2007]. Their results showed that the bootstrap test and the Student's
t-test produced comparable significance values, meaning that these tests would
produce nearly the same p-values for the same experiment. No practical difference was detected between them. However, the authors reported that, using the same dataset, the
Wilcoxon signed rank test and the sign test obtained different p-values, which means
that these two tests could reduce the capability of detecting significance and provide a false result [Smucker et al. 2007]. For this article a paired sample t-test was
chosen for measuring the significance of the difference between the means.
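For reference, the paired sample t statistic is the mean of the per-query differences divided by its standard error. The sketch below computes it with the Python standard library only; the per-query average precision values are invented for illustration and are not results from this article.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample standard deviation
    return mean(diffs) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical average precision per query (assumption).
    stemmed = [0.30, 0.10, 0.25, 0.40]
    nonstemmed = [0.20, 0.05, 0.25, 0.30]
    t, df = paired_t_statistic(stemmed, nonstemmed)
    print(f"t = {t:.3f}, df = {df}")
```

The resulting t value is then compared against the critical value of the t distribution with n - 1 degrees of freedom at the chosen significance level.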
3. JAWI STEMMER
3.1. Spelling in Jawi

There are special rules that must be followed for spelling in Jawi. These rules are
implemented either for root words or for derived words. The first rule is the DERLUNG
rule. This rule is applied to disyllabic words that use [a + a] as vowels in the first and
second syllables. The vowel alif can be present in both syllables under one condition:
the first character in the second syllable is one of {<dal>, <ra>, <lam>, <wau>, <nga>}. Otherwise, alif is
present only in the first syllable [Rahman 1999].
The second rule is the KAFGA rule. This rule applies when the first character in the
second syllable is <kaf> or <ga>. If this requirement is met, then a vowel is present
in the first syllable; otherwise, a vowel is present in both syllables. The third rule is
the slide HAMZA rule. This rule is implemented when the first character in the second
syllable represents the vowel [i] or [u]. If this construct occurs, then the vowel is
present in both syllables, and <hamza> is inserted between those vowels. Otherwise, the vowel
is present in both syllables.
The fourth rule is the distinctive ALIF rule. This rule is used to differentiate the pronunciation of a word. For example, the same Jawi spelling can be read as <buaya> or <buai>.
To distinguish these two words, a distinctive ALIF is used in the spelling of <buaya>.
The last rule is the SEKEDI rule. Here, a diacritic is used to differentiate between
the words <daun> (leaf) and <diawan> (at the cloud).
3.2. Affixes in Jawi

Affixation can be described as a morphological process in which the base possibly expands by one or more affixes. A base can be free or a compound root morpheme, a
complex, a duplication, or a compound form. There are four forms of affixes, known as
the prefix, suffix, circumfix, and infix. Prefixes are found at the beginning of a word,
suffixes are found at the end of a word, infixes are inserted within a word, and circumfixes are found at both the beginning and the end of a word.
There can be three layers of affixation. One- and two-layer constructions of affixation are common in Malay. Examples of one-layer and two-layer constructions are
<seorang> (to be alone) and <keseorangan> (loneliness). Affixation can
be layered, but not more than three times. However, a three-layer construction is rare in Malay: for example, <berkeseorangan> (to suffer loneliness)
[Hassan 1974].
A prefix can appear as {- , - , - , - , - , - , - and - }}. Some of these prefixes can
have variants. Most common suffixes used in Malay can be listed as { -, -, -, -,
-, -, -,
-, -, -, - , and -}}. Circumfixes are a combination of a prefix and
suffix on one base word, to construct a derived word. Table I and Table II show examples of prefix and circumfix variants.
Table I. Example of Prefix Variants

Table II. Example of Circumfix Variants

Basically, when an affix is added to the root word to form a derivative word, the
spelling of the root word is preserved. However, under several conditions, the spelling
changes when the affix is added. This can be seen in the case of the
prefixes <se>, <ke>, and <di>. If the first character of the root word is
<alif>, then the Jawi spelling of the prefixes <se>, <ke>, and <di> takes an extra character. This arrangement is different from Rumi spelling. In Rumi, when
the prefixes se-, ke-, and di- are added to a root word that starts with <a>, no
extra character is added to the prefix. For example, when the prefix di- is added to the
root word ambil, it produces the word diambil. However, in Jawi, the spelling of
<diambil> has an extra character added after the prefix.
Another difficulty related to Jawi prefixes is that the first character of the root word can
be eliminated to obtain the correct derivative word. For example, when the prefix <mem> is added to the root word
<fokus>, the first character of the root word is eliminated to
form the derivative word <memfokus>. This procedure occurs in several prefix
cases. There are many Arabic loan words in Malay. Most of the loan words start with
the character <alif>. When a prefix is added to such a loan word, the character <alif>
is replaced with the character <ya> or <wau>.
A suffix is spelled according to the end sound of the root word. When a root word
ends with the sound [a] represented by the character <alif>, the suffix is spelled <an>.
However, when the [a] sound is not represented by the character <alif>, this suffix
must be spelled together with an additional character, such as <tajaan> = <taj> + <an>.
Several derivative words are based on repeated root words. For example,
<kebudak-budakan> is derived from the word <budak>.

3.3. Rumi Deaffixation Rule

We propose a Jawi stemmer for the Malay language to stem Jawi-derived words
into their root words. The frameworks of Ahmad [1995] and Abdullah [2006] were used as a
baseline for this Jawi stemmer. A book authored by Rahman [1999] was used to understand Jawi spelling methods and how the affixes were added to the root word to create
a derived word.
Before the rules for the Jawi stemmer were created, we tested the rules that were
used in the Rule Application Order (RAO) stemmer proposed by Ahmad and the Rule Frequency
Order (RFO) stemmer proposed by Abdullah to investigate whether these rules were compatible with the Jawi script. The experiment was performed using 104 unique derivative
words taken from online newspapers (Berita Harian¹ and Utusan Melayu²). The
entire set of rules was transliterated directly into Jawi script using TERUJA [2010].
This data was used because no other online Jawi documents were available at that
time. Table III shows the result of this experiment, and Table IV shows details of
the errors.

Table III. The Result of Tested Rule

Types of Error & Accuracy    RAO      RFO
Overstemming                 2        2
Understemming                1        1
Spelling Exception           1        1
Unchanged                    30       29
Others                       -        -
Accuracy (%)                 67.30    68.27

In this experiment, most of the errors produced by the rules of both Ahmad [1995] and Abdullah [2006] fell into the unchanged category. The RAO produced 34 errors compared to the RFO's
33 errors. This happened because an additional rule was developed in the RFO to
overcome some of the errors in the RAO. In this case, one of the words glossed as
(result) in Table IV was
stemmed correctly using the RFO, which reduced the error count of the RFO.
The unchanged error occurred because the rule is not appropriate to stem a Jawi word.
In Rumi, the way that the suffix -an is spelled has only one rule; however, spelling the
same suffix in Jawi involves three different rules. This experiment shows that the rule
produced for Rumi script is not sufficiently appropriate to stem Jawi script. To obtain
a correct root word, the specific rules for the Jawi stemmer were developed based on
the conditions given next.
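The accuracy figures reported for this experiment can be checked with simple arithmetic. The sketch below assumes that accuracy is the share of the 104 test words that produced no error, which is an assumption that happens to agree with the reported figures; the error counts are those reported in Table III.

```python
TOTAL_WORDS = 104

# Error counts per category as reported in Table III.
ERRORS = {
    "RAO": {"overstemming": 2, "understemming": 1, "spelling_exception": 1, "unchanged": 30},
    "RFO": {"overstemming": 2, "understemming": 1, "spelling_exception": 1, "unchanged": 29},
}


def accuracy(error_counts, total=TOTAL_WORDS):
    """Percentage of test words that were stemmed without any error."""
    return 100.0 * (total - sum(error_counts.values())) / total


if __name__ == "__main__":
    for name, counts in ERRORS.items():
        print(name, sum(counts.values()), "errors,", f"{accuracy(counts):.4f}%")
```

This gives 70/104 = 67.3077% for the RAO rules (reported as 67.30, which appears to be truncated rather than rounded) and 71/104 = 68.2692% for the RFO rules (reported as 68.27).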
3.4. Jawi Deaffixation Rule

There are two main components in a Jawi stemmer. The first component is a deaffixation rule, and the second component is a Spelling Error Detector Rule (SEDR). The
deaffixation rule includes affixation and spelling variation rules, while the SEDR rule
is used to check the spelling of the stemmed word. A stemmed word with the correct
spelling will be output as a root word.
Malay contains four different types of affixation. For the prefix rule, several rules
must be emphasized to avoid errors such as understemming, overstemming, and
spelling exceptions. These rules include the prefixes meN- and peN-. These two prefixes contain the variant {- , - , - , - , - , - , - , - , - and - }}. If the first two
characters match the words for the prefix variant, then the prefix must be removed (as
shown in Table V). Table V shows the conditions of the prefix rule (meN- and peN-).
Another important rule for the prefix is {- , - and - }. These prefixes have one
variant, which is {- , and - }. The pattern of the word was checked to avoid
overstemming. Consonant characters are replaced with a C, while vowels are replaced with a V. Because many pure Malay words are disyllabic, we implemented
the pattern of disyllables into this prefix rule. Table VI shows the conditions of prefixes for {- ,- and - }.
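The pattern check described above can be sketched as follows. The sketch treats <alif>, <wau>, and <ya> as vowels and every other character as a consonant, which is a simplification because <wau> and <ya> can also act as consonants. The Jawi spelling used for <berkaki> (has leg), the requirement that the second character be <ra>, and the function names are assumptions made for illustration; they are not the authors' implementation.

```python
# Jawi vowel letters, written as escapes to avoid bidirectional display issues.
ALIF, WAU, YA = "\u0627", "\u0648", "\u064a"
RA = "\u0631"
JAWI_VOWEL_LETTERS = {ALIF, WAU, YA}


def cv_pattern(word):
    """Replace every vowel letter with V and every other character with C."""
    return "".join("V" if ch in JAWI_VOWEL_LETTERS else "C" for ch in word)


def strip_two_char_prefix(word, required_pattern="CCCVCV"):
    """Remove the first two characters only when the whole word fits the pattern."""
    if len(word) > 3 and word[1] == RA and cv_pattern(word) == required_pattern:
        return word[2:]
    return word


if __name__ == "__main__":
    # Assumed Jawi spelling of <berkaki>: ba, ra, kaf, alif, kaf, ya.
    berkaki = "\u0628\u0631\u06a9\u0627\u06a9\u064a"
    print(cv_pattern(berkaki))
    print(strip_two_char_prefix(berkaki) == "\u06a9\u0627\u06a9\u064a")
```

Checking the pattern of the whole word before stripping guards against overstemming: a word that merely begins with the same two characters but does not fit the disyllabic pattern is left unchanged by this rule.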
Suffixes were eliminated cautiously because the use of suffixes in Jawi is more complicated than in Rumi. The removal of the suffix -an must depend on the original root
word's spelling. The characters before the suffix were examined to prevent understemming errors. Table VII shows the conditions for the suffix -an.

1 http://www.bharian.com.my/
2 http://www.utusan.com.my/

Table IV. Details of the Error

Derived word (gloss)    RAO type of error     RFO type of error
(market)                Unchanged             Unchanged
(downfall)              Unchanged             Unchanged
(turbulent)             Overstemming          Overstemming
(analysts)              Understemming         Understemming
(close)                 Unchanged             Unchanged
(increase)              Unchanged             Unchanged
(following)             Unchanged             Unchanged
(enhancement)           Unchanged             Unchanged
(food)                  Unchanged             Unchanged
(tension)               Unchanged             Unchanged
(continued)             Unchanged             Unchanged
(forecast)              Unchanged             Unchanged
(prolonged)             Unchanged             Unchanged
(supply)                Unchanged             Unchanged
(incident)              Unchanged             Unchanged
(expectation)           Unchanged             Unchanged
(about)                 Unchanged             Unchanged
(benefit)               Overstemming          Overstemming
(picture)               Unchanged             Unchanged
(result)                Spelling Exception    Spelling Exception
(possibility)           Unchanged             Unchanged
(planning)              Unchanged             Unchanged
(address)               Unchanged             Unchanged
(pressure)              Unchanged             Unchanged
(ingredient)            Unchanged             Unchanged
(business)              Unchanged             Unchanged
(result)                Unchanged             -
(financial)             Unchanged             Unchanged
(funded)                Unchanged             Unchanged
(through)               Unchanged             Unchanged
(expenses)              Unchanged             Unchanged
(income)                Unchanged             Unchanged
(old)                   Unchanged             Unchanged
(recommendation)        Unchanged             Unchanged

The deaffixation rule for the circumfixes is a combination of the prefix and suffix
rules. Affixes are stemmed from the beginning and the end of the words. Therefore,
the rules for prefix recoding and suffixes must be applied to the circumfix rule. Another
condition is when there is more than one affix present at the same time. For example,
the word <mempelbagaikan> contains a combination of two prefixes
and a suffix. To stem this type of word, we must stem the first prefix and examine
the first character after the second prefix (as shown in Table II).

Table V. Conditions of the Prefix Rule (meN- and peN-)

Prefix: - /-

Rule 1
If Length > 3 & 2nd char =  & 3rd char =  ||  ||
Then Remove the 1st & 2nd char
Else
  If 3rd char !=  ||  || ,
  Then remove the 1st & 2nd char & add  at the beginning of the word.
  End if
End if
Example: (chooser) - (choose); (killer) - (kill)

Rule 2
If Length > 3 & 2nd char =  & 3rd char =  ||  ||  ||  ||  ||  ||
Then Remove the 1st & 2nd char.
Else
  If the 3rd char !=  ||  ||  ||  ||  ||  || ,
  Then remove the 1st & 2nd char & add  at the beginning of the word.
  End if
End if
Example: (writer) - (write); (tailor) - (sew)

Rule 3
If Length > 3 & 2nd char =  & 3rd char =  & 4th char =
Then Remove the 1st & 2nd char & add  at the beginning of the word.
Else
  Remove the 1st & 2nd char & add  at the beginning of the word.
Example: (production) - (out); (remembering) - (remember)

Rule 4
If Length > 3 & 2nd char =  & 3rd char =  ||  ||  ||  ||  ||  ||
Then Remove the 1st & 2nd char.
Else
  Remove the 1st & 2nd char & add  at the beginning of the word.
Example: (connecter) - (connect)

Rule 5
If Length > 3 & 2nd char =
Then Remove the 1st & 2nd char & add  at the beginning of the word.
Else
  Remove the 1st char
Example: (copying) - (copy); (singing) - (sing)

Table VI. The Conditions of Prefixes for {- ,- ,- }

Prefix: - ,- ,-

Rule 1
If Length > 3 & 2nd char =
  If word pattern = CCCVCV
  Then remove the 1st & 2nd char.
  End if
End if
Example: (has leg) - (leg)

Rule 2
If Length > 3 & 2nd char =  & 3rd char =  ||
Then Remove the 1st char
Else
  Remove the 1st & 2nd char
End if
Example: (to feel) - (feel); (together) - (same)

The second process involved in the Jawi stemmer is the Spelling Error Detector Rule
(SEDR). The SEDR was developed as a substitute for the root word dictionary. The root
word dictionary was used by Ahmad [1995] and Abdullah et al. [2009] in their stemmer.
After affixations are eliminated from the derived words, the stemmed words must be

Table VII. The Conditions of the Suffix -.

Rule 1:
If Length > 3 & 2nd last char = & 2nd char = | | & 4th last char = ||
Then Remove the last three characters
Else
Remove the last two characters
End if
Example: (opening) - (open); (renting) - (rental)

Rule 2:
If Length > 3 & 2nd last char = & 2nd char = | | & 4th last char = || || || ||
Then Remove the last two characters
Else
Remove the last three characters
End if
Example: (subsistence) (subsistence)

Rule 3:
If Length > 3 & 2nd last char = & 3rd last char =
Then Remove the last two characters
Else
Remove the last character.
End if
Example: (helping) - (help)

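Table VIII. Details of Malay Disyllable Combinations

Syllable combination | Pattern | Example in Rumi | Example in Jawi
Open syllable + open syllable | v + v | i (v) + a (v) = ia | (it)
Open syllable + open syllable | v + cv | i (v) + tu (cv) = itu | (that)
Open syllable + open syllable | cv + v | du (cv) + a (v) = dua | (two)
Open syllable + open syllable | cv + cv | bo (cv) + la (cv) = bola | (ball)
Open syllable + open syllable | v + cd | a (v) + bai (cd) = abai | (ignore)
Open syllable + open syllable | cv + cd | ba (cv) + loi (cd) = baloi |
Closed syllable + open syllable | vc + cv | an (vc) + da (cv) = anda | (you)
Closed syllable + open syllable | cvc + cv | ban (cvc) + tu (cv) = bantu | (help)
Closed syllable + open syllable | vc + cd | an (vc) + dai (cd) = andai | (if)
Closed syllable + open syllable | cvc + cd | ran (cvc) + tau (cd) = rantau | (corner)
Open syllable + closed syllable | v + vc | a (v) + ur (vc) = aur | (bamboo)
Open syllable + closed syllable | v + cvc | i (v) + kan (cvc) = ikan | (fish)
Open syllable + closed syllable | cv + vc | ma (cv) + in (vc) = main | (play)
Open syllable + closed syllable | cv + cvc | se (cv) + pit (cvc) = sepit | (clip)
Closed syllable + closed syllable | vc + cvc | in (vc) + tan (cvc) = intan | (diamond)
Closed syllable + closed syllable | cvc + cvc | sun (cvc) + tik (cvc) = suntik | (inject)
Closed syllable + closed syllable | cd + cvc | tau (cd) + lan (cvc) = taulan | (friend)

c = consonant; v = vowel; d = diphthong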

checked with the SEDR to be certain that the stemmed word was spelled correctly. The SEDR was constructed based on the word patterns and spelling methods for Jawi. Words in Jawi are created from a syllable or a combination of syllables. Jawi and Rumi differ in how they use vowels within a syllable: every syllable in a Rumi word contains a vowel, whereas some Jawi syllables are spelled using consonants only. There are four types of disyllable combinations in native Malay words. Table VIII shows the details of these combinations.
Table IX. Summary of the Spelling Methods for the Vowel [a] at the End of a Word

Type of syllable: 2nd syllable = open syllable

Pattern for 2nd syllable | Condition | Symbolise | Example
v | None | | Dua / (two)
cv | 1st syllable is open syllable and vowel at the 1st syllable is [ə] | | Kera / (monkey)
cv | 1st syllable is open syllable and vowel at the 1st syllable is [a]. Plus the 1st character in the 2nd syllable is / / / / | | Bawa / (take)
cv | 1st syllable is open syllable and vowel at the 1st syllable is [a]. Plus the 1st character in the 2nd syllable not equal to / / / / | None | Raja / (king)
cv | 1st syllable is open syllable and vowel at the 1st syllable not equal to [a] or [ə] and last character at the 2nd syllable is / | None | Muka / (face)
cv | 1st syllable is open syllable and vowel at the 1st syllable not equal to [a] or [ə] and last character at the 2nd syllable not equal to / | | Kerja / (work)

c = consonant; v = vowel;

Vowels in Jawi are based on six different sounds, which are represented using three different characters. The six sounds are [ə], [a], [i], [e], [u], and [o]. These vowels can be symbolized as , and . Vowels and represent [e]. The SEDR was developed using this rule, and it must follow three conditions for the [ə] pattern. First, a word that begins with [ə] in an open syllable with the pattern (v) must represent the vowel with ; examples are emak / (mother) and enam / (six). Second, the same holds when [ə] begins a word in a closed syllable with the pattern (vc); examples are entah / (know), erti / (mean), and engkau / (you). Third, no vowel character is used for [ə] in the middle of a word, whether in an open syllable (cv) or a closed syllable (cvc); examples are kena / (have) and sumber / (source). If [ə] is present at the end of a word, in an open syllable with the pattern (cv), then it is represented with the character (alif maqsura). For example, the word egoisme (egoism) is spelled .
Next, if the vowel [a] is used at the beginning of a word, in an open syllable with the pattern (v) or a closed syllable with the pattern (vc), it must be symbolized with the character , such as anak / (child). The vowel [a] in open syllables with the pattern (cv) and closed syllables with the pattern (cvc) is represented as . Examples are kami / (we) and sumpah / (curse). The spelling of the vowel [a] at the end of a word is more complicated than in Rumi. Table IX shows a summary of the spelling methods for the vowel [a] at the end of a word.
The vowels [i] and [e] are represented as for open syllables (pattern: v) and closed syllables (pattern: vc), such as ikut / (follow). When these vowels are present in the middle of a word with an open syllable (pattern: cv/v) or a closed syllable (pattern: cvc), they are represented as . An example is ribu / (thousand). However, when used with a closed syllable (pattern: vc) and when the first syllable is an open syllable, the vowel should be represented as the characters , such as in buih / (bubble). If this vowel is present at the end of a word and the second syllable is an open syllable with the pattern (v) or (cv), then the vowel should be represented as . Examples are roti / (bread) and kari / (curry).

Fig. 2. The processes that are involved in the SEDR.

The vowels [u] and [o] are represented as for open syllables (pattern: v) and closed syllables (pattern: vc). If they appear in the middle of a word, in open syllables (cv) or closed syllables (cvc), the vowels are represented as . If they are present at the end of a word, for open syllables, the vowel should be . However, if they are present for closed syllables (pattern: cv), the character will be .
The SEDR was developed using these conditions. The stemmer first applies the circumfix rule (the deaffixation rule) and checks the result against the SEDR. If the spelling is correct, the stemmed word is accepted; otherwise, the stemmer proceeds to the next step in the deaffixation process: it applies the prefix rule and again checks the stemmed word against the SEDR. This process continues until all of the affix precedences have been checked. If no rule matches, the stemmer outputs the word as a root word. Figure 2 shows the processes that are involved in the SEDR.
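The check-and-continue loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three rule functions, the KNOWN_ROOTS set that stands in for the SEDR spelling check, and the Rumi toy words are all hypothetical and chosen only for readability.

```python
from typing import Callable, Iterable, Optional

# A rule takes a word and returns a stemmed candidate, or None if it does not apply.
Rule = Callable[[str], Optional[str]]


def stem(word: str, rules: Iterable[Rule], spelled_correctly: Callable[[str], bool]) -> str:
    """Try each affix rule in precedence order and accept the first candidate
    that passes the spelling check; otherwise return the word unchanged."""
    for rule in rules:
        candidate = rule(word)
        if candidate is not None and spelled_correctly(candidate):
            return candidate
    return word  # no rule produced a valid stem: treat the word as a root word


# Hypothetical toy rules in Rumi (NOT the Jawi rules of Tables V to VII).
def strip_circumfix(word: str) -> Optional[str]:
    if word.startswith("ke") and word.endswith("an") and len(word) > 5:
        return word[2:-2]
    return None


def strip_prefix(word: str) -> Optional[str]:
    return word[3:] if word.startswith("ber") and len(word) > 4 else None


def strip_suffix(word: str) -> Optional[str]:
    return word[:-2] if word.endswith("an") and len(word) > 3 else None


# Hypothetical stand-in for the SEDR: a tiny set of accepted spellings.
KNOWN_ROOTS = {"baik", "jalan", "makan"}


def toy_sedr(word: str) -> bool:
    return word in KNOWN_ROOTS


RULES = [strip_circumfix, strip_prefix, strip_suffix]  # circumfix first, as in the text

for derived in ["kebaikan", "berjalan", "makanan", "rumah"]:
    print(derived, "->", stem(derived, RULES, toy_sedr))
```

Because the spelling check is passed in as a function, the same loop works whether the check is a dictionary lookup or a rule-based detector such as the SEDR.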
The consonant vowel (cv) pattern process transforms the Jawi word into a pattern of consonants and vowels. The identifying syllable pattern process then examines this pattern to generate the possible syllables. Next, the syllable rule is applied during the syllable rule process and, finally, the result is output. For example, to check the word (open), we first make it the input for this process. Next, the stemmed word is transformed into a consonant vowel pattern. Here, characters other than , , and are transformed into c (as a consonant), and , , and are transformed into v (as a vowel). The character is slightly special and is represented as two other characters, namely (c) and (v). Next, the syllable pattern for the stemmed word is identified. For example, the word (open) will generate a cvc pattern. This pattern is only generated under two conditions. The first condition occurs when the vowel [a] is at the end of the word, and the second condition occurs when the spelled word has only one syllable (a closed syllable). The identifying process is narrowed down by the syllable rule, and the result is output as suggested by the rule. In this example, both conditions fulfil the rule, but the first condition is output as the result because most pure Malay words are disyllabic. Therefore, the rule for one syllable is avoided.
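The first step of this process, mapping a Jawi word to its consonant vowel pattern, can be sketched as below. The sketch assumes that the three vowel characters are alif, wau, and ya, and that the word for buka (open) is spelled ba + wau + kaf; both are assumptions made for illustration. The special character that expands into two symbols is not handled.

```python
# Assumption: the three Jawi vowel letters are alif, wau, and ya.
JAWI_VOWELS = {"\u0627", "\u0648", "\u064A"}


def cv_pattern(word: str) -> str:
    """Map each character to 'v' (vowel letter) or 'c' (any other letter)."""
    return "".join("v" if ch in JAWI_VOWELS else "c" for ch in word)


# buka (open), assumed here to be spelled ba + wau + kaf with no final alif
buka = "\u0628\u0648\u06A9"
print(cv_pattern(buka))  # cvc
```

The resulting cvc string is then what the syllable rule has to interpret, either as a disyllable whose final [a] is unwritten or as a single closed syllable.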
Affix precedence is important when developing a Malay stemmer. Othman [1993] suggested that the best way to stem an affix in Rumi is to begin with the circumfix, followed by the prefix, the suffix, and the infix. An experiment by Ahmad [1995] showed that, to reduce RAO stemmer errors, the first affix that must be eliminated is the prefix, followed by the suffix, circumfix, and infix. An experiment was performed to identify the best affix precedence for Jawi. The test dataset comprised 1200 unique derived words in Jawi that were taken from online newspapers. Six affix precedence orders were tested, as listed in Table X.
Table X. List of the Six Tests Covered in this Experiment

Label | Affix Precedence
T1 | Prefix, circumfix, suffix, and infix
T2 | Prefix, suffix, circumfix, and infix
T3 | Circumfix, prefix, suffix, and infix
T4 | Circumfix, suffix, prefix, and infix
T5 | Suffix, prefix, circumfix, and infix
T6 | Suffix, circumfix, prefix, and infix

Fig. 3. The results of the affix precedence tests for Malay.

The T1 sequence was suggested by Ahmad [1995], and the T3 sequence was suggested by Othman [1993]. The stemmer was run for all six tests. For the first test (T1), the prefix rules were loaded first; if a rule matched the derived word, then the affix was stemmed and the result was output as a root word. If no rule was found, then the circumfix rules were loaded and the stemmer attempted to find a rule based on the input. If the rule requirement was met, then the affix was stemmed and the stemmed word was output as a root word; otherwise, the suffix rule was loaded. If the rule requirement was again met, then the affix was stemmed and output as a root word. If no rule was found, then the infix rule was loaded and the word was stemmed based on that rule. The result was then output as a root word. This process was repeated using the sequence in T2 and then the sequences of T3, T4, T5, and T6. The results for the first experiment are shown in Figure 3.
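The six test sequences and the accuracy measure can be sketched as follows. This is only an illustration: the function names are hypothetical, and the accuracy function assumes a gold mapping from each derived word to its expected root, which the text implies but does not describe in detail.

```python
from itertools import permutations


def precedence_orders():
    """Every ordering of prefix, circumfix, and suffix, with infix always last."""
    return [list(p) + ["infix"] for p in permutations(["prefix", "circumfix", "suffix"])]


def accuracy(stem_fn, gold):
    """Fraction of derived words whose stem equals the expected root word."""
    correct = sum(1 for word, root in gold.items() if stem_fn(word) == root)
    return correct / len(gold)


ORDERS = precedence_orders()
for number, order in enumerate(ORDERS, start=1):
    print(f"T{number}: {', '.join(order)}")
```

Running each of the six orders through the same stemmer and comparing the accuracy values is then enough to pick the best precedence.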
The results in Figure 3 show that the highest accuracy for affix precedence was in T3,
which refers to the circumfix, prefix, suffix, and infix [Sulaiman et al. 2011]. However,
this result differs from the results reported in Ahmad [1995].
4. TEST COLLECTIONS

This experiment was divided into two parts: the first part tested the accuracy of the Jawi stemmer, and the second part tested whether the Jawi stemmer is useful for retrieving relevant Jawi documents. For the first experiment, 1200 unique words were derived from online newspapers. Using the Transliteration Engine for Rumi to Jawi [TERUJA 2010], each word was transliterated into the Jawi script. Some minor errors made by TERUJA were corrected manually. The stopwords were eliminated using Abdullah's [2006] stop-word list.

Fig. 4. Accuracy of the stemmers.

The second experiment was a test on document retrieval. The Quran collection, as used by Ahmad [1995] and Abdullah [2006], was used as the corpus. Again, the TERUJA [2010] transliteration engine, with the help of a Jawi expert, was used to transliterate the Rumi corpus and queries into the Jawi script. The verses of each chapter were divided into separate text files, formatted as .txt. For example, Chapter 1 had seven verses. Documents were created from each verse, which made seven documents for the first chapter. Unique names were assigned to each file. For example, document 13 in Chapter 1 was named S1A13, where S1 = Chapter 1 and A13 = verse 13. Table I shows the Quran's chapters and the number of verses that are involved in the test collection.
The corpus included 6236 documents. These documents covered the 114 chapters of the Quran. The queries and relevant set used in this experiment were also based on the work of Ahmad [1995] and Abdullah [2006]. In this experiment, 36 queries were transliterated using TERUJA [2010].
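The file-naming scheme described above can be sketched as a pair of helper functions. The function names are hypothetical; only the S<chapter>A<verse> pattern comes from the text.

```python
def doc_id(chapter: int, verse: int) -> str:
    """Build a document name such as S1A13 (chapter 1, verse 13)."""
    return f"S{chapter}A{verse}"


def parse_doc_id(name: str) -> tuple:
    """Split a document name back into (chapter, verse)."""
    chapter, verse = name[1:].split("A")
    return int(chapter), int(verse)


print(doc_id(1, 13))           # S1A13
print(parse_doc_id("S114A6"))  # (114, 6)
```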
5. EXPERIMENTS AND RESULTS

The first experiment was conducted to investigate the accuracy of the Jawi stemmer in producing the correct root word. A total of 1200 words were used, and the experiment was performed on three different stemmers. These three stemmers used the same deaffixation rule and SEDR rule. Stemmer A is based on the algorithm of Ahmad [1995], stemmer B is based on the algorithm of Abdullah [2006], and stemmer C is the Jawi stemmer algorithm. Figure 4 shows the results of this experiment and Table XI shows the errors of each stemmer.

Table XI. Error of Each Stemmer

Types of Error | Stemmer A (Ahmad) | Stemmer B (Abdullah) | Stemmer C (Jawi Stemmer)
Understemming | 147 | 138 | 34
Overstemming | 30 | 29 | 61
Spelling Exception | 21 | 31 | 40
Others | 58 | 26 | 44

The highest accuracy was produced by stemmer C, the Jawi stemmer. Each stemmer has a different precedence for the deaffixation rule. The first group is based on the Ahmad [1995] precedence rule and the second group follows the RFO (Rule Frequency Order) proposed by Abdullah [2006]. The Jawi stemmer produced more overstemming errors because it tends to stem as many characters as it can; for example, between - and - , the - rule was instantiated first, followed by - . This order is reversed for the first and second groups. These two groups produced more understemming errors because the SEDR rule cannot distinguish between the uses of [ə] and [e] [Sulaiman et al. 2011]. Using the Jawi deaffixation rule and the SEDR rule, Stemmer A and Stemmer B achieved 79% and 81% accuracy, respectively. This experiment shows that the Jawi deaffixation rule is more suitable for stemming Jawi script than the deaffixation rules [Ahmad 1995; Abdullah 2006] that were developed for Rumi script.
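The accuracy figures quoted above can be related to the error counts in Table XI with a short calculation. The sketch assumes that every count in the table is one wrongly stemmed word out of the 1200 test words and that the four error categories do not overlap; the text does not state this explicitly.

```python
TOTAL_WORDS = 1200

# Error counts from Table XI: understemming, overstemming, spelling exception, others.
ERRORS = {
    "Stemmer A (Ahmad)": [147, 30, 21, 58],
    "Stemmer B (Abdullah)": [138, 29, 31, 26],
    "Stemmer C (Jawi Stemmer)": [34, 61, 40, 44],
}


def accuracy_pct(error_counts, total=TOTAL_WORDS):
    """Percentage of words stemmed correctly, assuming one error per word."""
    return 100 * (total - sum(error_counts)) / total


for name, counts in ERRORS.items():
    print(f"{name}: {sum(counts)} errors, {accuracy_pct(counts):.1f}% accuracy")
```

Under this assumption, Stemmers A and B come out at about 78.7% and 81.3%, which agrees with the rounded 79% and 81% reported in the text; the same calculation gives roughly 85% for Stemmer C.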
The stemmer was also tested for document retrieval in terms of precision and recall. The same queries and datasets from Ahmad [1995] and Abdullah [2006] were transliterated to Jawi using TERUJA [2010]. The datasets contained 36 queries and 6236 documents from the Quran. The experiment was divided into two sets: Stemmed Jawi and Nonstemmed Jawi. The relevance set from Ahmad was used. Ahmad determined the relevance of each document manually for the 36 queries by searching through the subject index, concordance, and glossary of the Quran as well as several Islamic books [Ahmad 1995]. According to the author, there are 3440 relevant documents for the 36 queries [Ahmad 1995]. The purpose of this experiment was to test whether the search engine could retrieve more relevant documents using a stemmed query and a stemmed Jawi document. The results are shown in Figure 5.
The interpolated average recall-precision graph was formed using average precision values at 11 recall cutoffs; the remaining values are interpolated. The interpolation was performed based on the following rule [Manning et al. 2008].

\[ P(r) = \max_{r' \geq r} P(r') \]

Under this rule, the interpolated precision at recall level r is the maximum known precision at any recall level r' that is equal to or higher than r. Based on this rule, the precision at recall 0 can be computed.
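The interpolation rule can be illustrated with a small Python sketch. The (recall, precision) points below are made-up values for a single query, used only to show the mechanics; the 11 levels 0.0, 0.1, ..., 1.0 are the usual choice for an 11-point curve and are assumed here.

```python
def interpolated_precision(points, levels=None):
    """points: (recall, precision) pairs observed while walking down a ranked list.
    Returns, for each recall level, the maximum precision at any recall >= that level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Hypothetical observations for one query (not data from the experiment).
points = [(0.25, 1.0), (0.5, 0.5), (0.75, 0.6), (1.0, 0.4)]
print(interpolated_precision(points))
```

Averaging these 11 values over all queries gives one curve per system, which is the form of graph shown in Figure 5.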
In Figure 5, precision declined as recall increased from 0.0 to 1, which implies that the search engine could find all of the relevant documents by the time it reached a recall of 1. The retrieved documents also contained a substantial number of nonrelevant documents. The highest precision value was 50% for stemmed Jawi and 28% for nonstemmed Jawi. The precision of the stemmed Jawi remained constant from a recall of 0.8 onwards, whereas the precision of the nonstemmed Jawi leveled off at a recall of 0.4 because, without the use of the stemmer, the search engine retrieved only documents that contained the exact words of the query, thus cutting short the unrelated documents for the query. The graph shows that the stemmed Jawi performed much better than the nonstemmed Jawi at all recall levels.
Next, the Mean Average Precision (MAP) value was calculated for each test. No interpolation process was involved in this calculation. The results show that the MAP values for the stemmed Jawi and nonstemmed Jawi were 8.432 and 5.137, respectively. There was a difference between these values because the search engine tended to retrieve more relevant documents for stemmed Jawi than for nonstemmed Jawi. Tables XII and XIII show the results of this statistical testing.

Fig. 5. Interpolated average recall-precision graph.

Table XII. Paired Sample Statistics

Pair 1 | Mean | N | Std. Deviation | Std. Error Mean
Stemmed Jawi | 8.432 | 36 | 10.150 | 1.690
Non-Stemmed Jawi | 5.137 | 36 | 7.450 | 1.240

Table XIII. Paired Sample Test

Paired differences for Pair 1 (Stemmed Jawi - Non-Stemmed Jawi):
Mean | Std. Deviation | Std. Error Mean | 95% Confidence Interval of the Difference, Lower | 95% Confidence Interval of the Difference, Upper | t | Df | Sig. (2-tailed)
3.295 | 7.171 | 1.195 | .869 | 5.721 | 2.757 | 35 | .009
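The t statistic and confidence interval in Table XIII can be reproduced from the summary values of the paired differences. The sketch below uses only the standard paired t-test formula; the critical value 2.0301 for 35 degrees of freedom is taken from a standard t table and is hard-coded as an assumption rather than computed.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n):
    """Return (t, standard error) for a paired t-test given summary statistics."""
    se = sd_diff / math.sqrt(n)
    return mean_diff / se, se


MEAN_DIFF, SD_DIFF, N = 3.295, 7.171, 36   # paired differences from Table XIII
T_CRIT_95_DF35 = 2.0301                    # two-tailed 5% critical value, 35 d.f.

t_value, se = paired_t_from_summary(MEAN_DIFF, SD_DIFF, N)
lower = MEAN_DIFF - T_CRIT_95_DF35 * se
upper = MEAN_DIFF + T_CRIT_95_DF35 * se

print(f"SE = {se:.3f}, t({N - 1}) = {t_value:.3f}")
print(f"95% CI = [{lower:.3f}, {upper:.3f}]")
```

With these inputs the sketch gives a standard error of about 1.195, t(35) of about 2.757, and an interval of roughly 0.869 to 5.721, matching the table.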
A paired sample t-test was conducted to compare MAP values for stemmed Jawi and nonstemmed Jawi. There was a significant difference in the scores for the stemmed Jawi (M = 8.432, SD = 10.15) and nonstemmed Jawi (M = 5.137, SD = 7.45) conditions; t(35) = 2.757, p = 0.009. These results suggest that the use of stemmed Jawi documents increased the precision in document retrieval. Another experiment was performed using the Indri [Strohman et al. 2005] search engine to examine the precision of the relevant documents that were retrieved at various cutoff points for all 36 queries. The cutoffs were defined at the positions 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, and 3000. There were no changes after the 3000 cutoff point. The search engine ranked its result list by probability, an approach commonly called a likelihood model. Table XIV shows the paired sample t-test results for each cutoff point.

Table XIV. Paired Sample T-Test

Rank Cut-off Points | Stemmed Jawi Mean | Stemmed Jawi Std. Deviation | Non-Stemmed Jawi Mean | Non-Stemmed Jawi Std. Deviation | Sig. (2-tailed)
Cut-off 10 | 3.716 | 7.636 | 2.014 | 5.074 | .134
Cut-off 20 | 4.684 | 8.732 | 2.455 | 5.588 | .071
Cut-off 30 | 5.098 | 8.717 | 2.689 | 5.824 | .055
Cut-off 40 | 5.585 | 8.780 | 2.949 | 5.801 | .029
Cut-off 50 | 5.990 | 8.790 | 3.220 | 5.827 | .022
Cut-off 60 | 6.336 | 8.941 | 3.462 | 5.942 | .017
Cut-off 70 | 6.635 | 9.158 | 3.651 | 6.044 | .014
Cut-off 80 | 6.930 | 9.290 | 3.777 | 6.134 | .010
Cut-off 90 | 7.269 | 9.759 | 3.881 | 6.244 | .007
Cut-off 100 | 7.475 | 9.976 | 3.968 | 6.326 | .006
Cut-off 200 | 7.815 | 9.868 | 4.432 | 6.532 | .007
Cut-off 300 | 8.002 | 9.930 | 4.629 | 6.623 | .008
Cut-off 400 | 8.173 | 10.044 | 4.780 | 6.823 | .007
Cut-off 500 | 8.146 | 10.143 | 4.826 | 6.872 | .009
Cut-off 600 | 8.285 | 10.071 | 4.863 | 6.883 | .007
Cut-off 700 | 8.322 | 10.071 | 4.881 | 6.890 | .007
Cut-off 800 | 8.317 | 10.077 | 4.884 | 6.889 | .007
Cut-off 900 | 8.320 | 10.077 | 4.892 | 6.893 | .007
Cut-off 1000 | 8.334 | 10.087 | 4.897 | 6.897 | .007
Cut-off 2000 | 8.412 | 10.150 | 4.926 | 6.945 | .006
Cut-off 3000 | 8.432 | 10.150 | 4.926 | 6.945 | .006

Fig. 6. Performance comparison for the interpolated average recall-precision graph between the Jawi stemmer and the Malay stemmer (RAO and RFO).

The hypotheses of this test are described as follows.
H0 = There was no significant difference in the means of the stemmed Jawi and nonstemmed Jawi accuracies.
H1 = There was a significant difference in the means of the stemmed Jawi and nonstemmed Jawi accuracies.
Table XIV shows that there was a difference between the means for the stemmed Jawi and nonstemmed Jawi documents. From a cutoff of 40 and above, the null hypothesis was rejected and the alternative hypothesis was accepted. At a cutoff of 40, there was a significant difference between the mean accuracies of the stemmed Jawi (M = 5.585, SD = 8.78) and the nonstemmed Jawi (M = 2.949, SD = 5.801); t(35) = 2.280, p = 0.029.
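The per-query measures behind these comparisons can be sketched with the standard definitions of precision at a cutoff and average precision. The exact computation used in the experiment is not spelled out in the text, so the optional cutoff on average precision, the function names, and the toy document names below are assumptions for illustration only.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant, cutoff=None):
    """Mean of the precision values at each relevant document's rank,
    optionally looking only at the top `cutoff` results."""
    if not relevant:
        return 0.0
    considered = ranked if cutoff is None else ranked[:cutoff]
    hits, total = 0, 0.0
    for rank, doc in enumerate(considered, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, cutoff=None):
    """runs: list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel, cutoff) for r, rel in runs) / len(runs)


# Hypothetical ranked list and relevance set for one query.
ranked = ["S1A1", "S2A5", "S1A3", "S3A2"]
relevant = {"S1A1", "S1A3"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(average_precision(ranked, relevant, cutoff=1))
```

Computing the chosen measure for each of the 36 queries under both conditions yields the paired samples that a paired t-test compares.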
Sembok [2005] and Abdullah [2006] have conducted experiments to compare the retrieval effectiveness of conflation methods. RAO is a Rumi stemmer proposed by Ahmad et al. [1996], and the RFO stemmer was proposed by Abdullah [2006]. Together, the RAO and RFO stemmers represent the Rumi stemmers. Figure 6 shows the performance comparison for the average recall-precision graph between the Jawi stemmer and the Malay stemmers (RAO and RFO). It shows the average recall-precision values for all 36 queries, and it can be observed that the Jawi stemmer performs better than the others.
6. CONCLUSIONS

In this article, we proposed a Jawi stemmer to stem derived Jawi words into their root words. The deaffixation rule used by Ahmad [1995] and Abdullah [2006] was not efficient enough to stem Jawi-derived words because spelling in Jawi involves many rules. The accuracy of the Ahmad stemmer [1995] and the Abdullah stemmer [2006], when tested on the Jawi script, is 67.3% and 68.3%, respectively. To create a good stemmer, we produced a new Jawi deaffixation rule to stem Jawi-derived words. When the deaffixation rule used in Ahmad [1995] and Abdullah [2006] was replaced by the Jawi deaffixation rule, the accuracy of these two stemmers increased to 79% for that used in Ahmad [1996] and 81% for that used in Abdullah [2006]. The Jawi stemmer shows a higher accuracy than the others. This stemmer was also evaluated using document retrieval. This evaluation method was chosen because we wanted to investigate the effect of the Jawi stemmer on precision and recall. Most stemmers have been proven to help search engines retrieve relevant documents, but some have no effect on document retrieval. Even though Jawi represents the Malay language in the same way as Rumi, their spelling methods are totally different. There is a significant difference in retrieval effectiveness (measured in MAP) between the stemmed Jawi documents and the nonstemmed Jawi documents.
Detailed experiments were conducted: nonstemmed Jawi stabilized at a recall of 0.4 because, without using the stemmer, the search engine retrieved only documents that contained the exact words of the query, thus cutting short the unrelated documents for the query. This trend made the nonstemmed Jawi documents level off at earlier cutoff points compared to the stemmed Jawi documents. In contrast, the stemmed Jawi retrieved all of the documents that contained the stemmed word in the query. As a result, it retrieved more nonrelevant documents than the nonstemmed Jawi, but was still able to attain a higher precision. The Jawi stemmer can be used to improve document retrieval, with a MAP value of 8.432%, and the paired sample t-test showed that there was a significant difference between the stemmed Jawi documents and the nonstemmed Jawi documents. For the precision at each document ranking, the stemmed Jawi showed a significant difference at a cutoff of 40 and above. These results answer the question of whether the Jawi stemmer can be used to improve document retrieval. The peak F-measures for the three stemmers are RAO = 0.20, RFO = 0.21, and Jawi stemmer = 0.26 at recall 20%. The stemmer could also be evaluated using the Paice method to analyze the errors that it produces [Paice 1994].

ACKNOWLEDGMENTS
The authors would like to thank Tn Haji Hamdan Abdul Rahman for checking the spelling of all data and queries.

REFERENCES
Abdullah, M. T. 2006. Monolingual and cross-language information retrieval approaches for Malay and English language documents. Ph.D. thesis, Universiti Putra Malaysia.
Abdullah, M. T., Ahmad, F., and Sembok, T. M. T. 2009. Rules frequency order stemmer for Malay language. Int. J. Comput. Sci. Netw. Secur. 9, 2, 433-438.
Abu Bakar, Z. 1999. Evaluation of retrieval effectiveness of n-gram string similarity matching on Malay documents. Tech. rep., Universiti Kebangsaan Malaysia, Bangi.
Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S. M., and Williams, H. 2007. Stemming Indonesian: A confix-stripping approach. ACM Trans. Asian Lang. Inf. Process. 6, 4, 1330.
Ahmad, F. 1995. A Malay language document retrieval system: An experimental approach and analysis. Ph.D. thesis, Universiti Kebangsaan Malaysia. 1248.
Ahmad, F., Yusoff, M., and Sembok, T. M. T. 1996. Experiments with a stemming algorithm for Malay words. J. Amer. Soc. Inf. Sci. 47, 12, 909-918.
Carlberger, J., Dalianis, H., Hassel, M., and Knutsson, O. 2001. Improving precision in information retrieval for Swedish using stemming. In Proceedings of the 13th Nordic Computational Linguistics Conference (NODALIDA'01). 15.
Cleverdon, C. W., Mills, J., and Keen, M. 1966. Factors determining the performance of indexing systems. Tech. rep., College of Aeronautics, University of Michigan, MI.
Flores, F. N., Moreira, V. P., and Heuser, C. A. 2010. Assessing the impact of stemming accuracy on information retrieval. In Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language (PROPAR'10). 10-20.
Frakes, W. and Baeza-Yates, R. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall.
Ghani, R. A., Zakaria, M. S., and Omar, K. 2009. Jawi-Malay transliteration. In Proceedings of the International Conference on Electrical Engineering and Informatics (ICEEI'09). 154-157.
Harman, D. 1991. How effective is suffixing. J. Amer. Soc. Inf. Sci. 42, 1, 7-15.
Hassan, A. 1974. The Morphology of Malay. Dewan Bahasa Dan Pustaka, Kementerian Pelajaran Malaysia.
Hull, D. 1996. Stemming algorithms: A case study for detailed evaluation. J. Amer. Soc. Inf. Sci. 47, 1, 70-84.
Islam, M. Z., Uddin, M. N., and Khan, M. 2007. A light weight stemmer for Bengali and its use in spelling checker. In Proceedings of the 1st International Conference on Digital Communications and Computer Applications (DCCA'07).
Kraaij, W. and Pohlmann, R. 1996. Viewing stemming as recall enhancement. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96). 40-48.
Malacon, B. R. 2004. Computational analysis of affixed words in Malay language. In Proceedings of the 8th International Symposium on Malay/Indonesia Linguistic (ISMIL'04).
Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press, UK.
Ming, D. C. 1986. Access to Malay manuscripts. In Proceedings of the 32nd International Congress for Asian and North African Studies.
Moain, A. J. 1992. Sejarah Tulisan Jawi. Jurnal Bahasa 35, 11. 1011012.
Nasruddin, M. F., Omar, K., Zakaria, M. S., and Liong, C.-Y. 2008. Handwritten cursive Jawi character recognition: A survey. In Proceedings of the 5th IEEE International Conference on Computer Graphics, Imaging and Visualisation (CGIV'08). 247-249.
Othman, A. 1993. Pengakar perkataan Melayu dan sistem capaian dokumen. Tech. rep., Universiti Kebangsaan.
Paice, C. D. 1994. An evaluation method for stemming algorithms. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94). 42-50.
Porter, M. F. 1980. An algorithm for suffix stripping. Program. Electron. Libr. Inf. Syst. 14, 3, 130-137.
Rahman, H. A. 1999. Panduan Menulis dan Mengeja Jawi. Dewan Bahasa dan Pustaka, Kuala Lumpur.
Roslan, G. 2009. Jawi-Malay transliteration. In Proceedings of the International Conference on Electrical Engineering and Informatics. 154-157.
Savoy, J. 1999. A stemming procedure and stopword list for general French corpora. J. Amer. Soc. Inf. Sci. 50, 10, 944-952.
Sembok, T. M. T., Palasundram, K., Ali, N. M., Yahya, A., and Wook, T. S. M. T. 2003. Istilah sains: A Malay-English terminology retrieval system experiment using stemming and n-grams approach in Malay words. In Proceedings of the 6th International Conference on Asian Digital Libraries: Technology and Management of Indigenous Knowledge for Global Access (ICADL'03). Lecture Notes in Computer Science, Vol. 2911, Springer, 173-177.
Sembok, T. M. T. 2005. Word stemming algorithm and retrieval effectiveness in Malay and Arabic documents retrieval systems. World Acad. Sci. Engin. Techno. 2911, 9597, 173177.
Smucker, M. D., Allan, J., and Carterette, B. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM'07). 623-632.
Smucker, M. D., Allan, J., and Carterette, B. 2009. Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'09). 630-631.
Strohman, T., Metzler, D., Turtle, H., and Croft, W. B. 2005. Indri: A language-model based search engine for complex queries. In Proceedings of the International Conference on Intelligence Analysis.
Sulaiman, S., Omar, K., Nazlia, O., Murah, M. Z., and Abdul Rahman, H. 2011. A Malay stemmer for Jawi character. In Proceedings of the 24th Australasian Joint Conference on Artificial Intelligence (AI'11). 668-676.
Teruja. 2010. Transliteration engine for Rumi to Jawi. http://www.jawi.ukm.my/.
Yatim, O. M. 1990. Epigrafi Islam terawal di Nusantara. Dewan Bahasa dan Pustaka.
Yonhendri. 2009. Transliterasi rumi ke jawi berasaskan petua. Master's thesis, Universiti Kebangsaan Malaysia.
Received March 2013; revised August 2013; accepted October 2013
