
Modern Information Retrieval

Course ID: INSC612

Title: IR Stemming Article Review


By: Lemi Rattu

ID: GSE/3342/13

Submitted to: Dr. Dereje Teferi (Ph.D.)

August 2023

Addis Ababa, Ethiopia

List of Tables
Table 1: Summary of implemented algorithms for six Ethiopian languages

Table of Contents
1. Introduction
2. Purposes of the Article
3. Approach or Methodology
   i. Rule-Based Approach
   ii. Successor Variety Approach
   iii. Hybrid Approach
   iv. Longest Match Approach
4. Discussion
5. Recommendations
6. References

1. Introduction
The article emphasizes the significance of stemming in information retrieval, particularly in systems that use natural language processing. Stems are essential for the topical categorization of texts and for improving the accuracy of search results. By enhancing recall and precision, stemming increases the effectiveness of information retrieval applications.

Stemming is a common preprocessing step in text mining and other natural language processing tasks, and it is central to information retrieval. The paper underlines that each language's unique morphological patterns call for language-specific stemming techniques.

The study explores how specifically designed stemming algorithms handle linguistic complexities
and variances in six Ethiopian languages, which have complicated morphological rules involving
prefixes, suffixes, and infixes.

2. Purposes of the Article


The research focuses on approaches to stemming Ethiopian languages such as Amharic, Afan Oromo, Tigrinya, Wolaita, Kambaata, and Awngi. It reviews and compares several stemming techniques, with an emphasis on effectiveness, adaptability, and stem production for data retrieval systems. The objective is to determine which strategy works best for Ethiopian languages.

3. Approach or Methodology
The authors describe two main text-stemming techniques. The first identifies and removes affixes using context-free analysis. The second, lemmatization, requires in-depth knowledge of a language's grammar and lexicon; it is more complex than stemming and relies on dictionary lookups. Despite this complexity, lemmatization produces more precise results. For instance, the word "better" lemmatizes to "good", a change that traditional stemming methods cannot make without a dictionary.
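The contrast above can be sketched with a toy example. The suffix rules and the lemma dictionary below are illustrative assumptions, not the algorithms reviewed in the article:

```python
# Toy contrast between suffix-stripping stemming and dictionary-based
# lemmatization. The suffix rules and lemma table are made up for
# illustration only.

SUFFIXES = ["ness", "ful", "ing", "ed", "s"]

# A lemmatizer needs a lexicon: irregular forms map directly to a lemma.
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}

def stem(word: str) -> str:
    """Strip the first matching suffix; no lexicon is consulted."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word: str) -> str:
    """Look the word up in the lexicon; fall back to plain stemming."""
    return LEMMA_TABLE.get(word, stem(word))

print(stem("better"))       # -> "better" (no suffix rule applies)
print(lemmatize("better"))  # -> "good" (dictionary lookup succeeds)
```

The sketch shows why a stemmer alone cannot map "better" to "good": without a lexicon, there is no rule to apply.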

Numerous stemming techniques exist for different languages, with varied performance and accuracy. The four stemming strategies described in the study (rule-based, successor variety, hybrid, and longest match) are explored below.

In the areas of information retrieval and natural language processing, the article explains these four approaches as follows:

i. Rule-Based Approach: This approach combines a pattern-driven infix remover with a rule-based lightweight stemmer. The stemmer eliminates prefixes and suffixes by following predetermined rules, while the infix remover eliminates infixes by following predetermined patterns. The use of a stemmer together with an infix remover demonstrates a disciplined, methodical way of dealing with linguistic variation, and the extensive referencing underscores its recognized importance in the academic community.
ii. Successor Variety Approach: According to the author, the successor variety methodology evaluates the diversity of distinct characters that follow a given string within a corpus. Unlike approaches that rely on predetermined suffix lists or rigid removal rules, this method is intended to adapt to various text datasets. It relies on the frequency of letter sequences in the text to aid stemming, as explained in [10]. As more characters are added, the successor variety gradually decreases until it reaches its lowest point just before a segment boundary. At that point it rises noticeably, which greatly helps in locating the stem.
iii. Hybrid Approach: The author's sources [11] and [12] outline hybrid approaches, which combine two or more procedures. For instance, a hybrid approach using a suffix tree can begin with a brute-force lookup table. Notably, this table stores only a small number of "common exceptions", significantly reducing the storage needed.
iv. Longest Match Approach: According to [13], the longest match method strips a word of its longest suffix. For instance, the suffix "fulness" would be dropped from the term "fruitfulness". However, this approach necessitates generating all affix combinations, which consumes considerable computational power and storage space.
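As a rough illustration of the successor variety idea in (ii), the sketch below counts the distinct characters that can follow each prefix of a word in a tiny corpus; the corpus and the word chosen are assumptions for illustration only:

```python
# Sketch of the successor-variety idea: for each prefix of a word, count
# the distinct characters that can follow it in the corpus. The variety
# falls along the stem and jumps back up at a morpheme boundary.
# The corpus below is made up purely for illustration.

CORPUS = ["read", "reads", "reader", "reading", "red", "rope", "ripe"]

def successor_varieties(word, corpus):
    """Return (prefix, variety) pairs for every prefix of `word`."""
    pairs = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        followers = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        pairs.append((prefix, len(followers)))
    return pairs

for prefix, variety in successor_varieties("reading", CORPUS):
    print(f"{prefix!r}: {variety}")
# The variety drops to 1 at 'rea', then jumps to 3 at 'read',
# marking 'read' as a likely stem boundary.
```

The jump in variety after "read" mirrors the rise the article describes at a segment boundary.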
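The longest match strategy in (iv) can likewise be sketched in a few lines; the suffix inventory here is a made-up stand-in for a real affix list, not one from the reviewed work:

```python
# Minimal longest-match suffix stripper: the affix inventory is sorted by
# length so the longest applicable suffix is tried first. The suffix list
# is an illustrative assumption.

SUFFIXES = sorted(["fulness", "ness", "ful", "ing", "s"], key=len, reverse=True)

def longest_match_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(longest_match_stem("fruitfulness"))  # -> "fruit" (strips "fulness", not just "ness")
```

Sorting by length is what makes "fulness" win over the shorter "ness"; a real implementation must enumerate the full affix inventory, which is the storage cost the article notes.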

The article also analyzes these stemming approaches as applied to selected Ethiopian languages. Different researchers have developed and applied the approaches, and while each is designed for specific languages, some observations have been made regarding their effectiveness.

The authors compile this analysis into a comprehensive table detailing the application of the various stemming techniques, with the corresponding accuracy and error rate for each case:

Table 1: Summary of implemented algorithms for six Ethiopian languages

No  Language    Primary Researcher                    Conflation Technique              Sensitive in Context?  Error Rate  Accuracy
1   Amharic     Nega Alemayehu                        Rule-Based (Iterative)            Yes                    4.01%       95.90%
2   Amharic     Atelach Alemu and Lars Asker          Affix Removal & Dictionary-Based  No                     25%         75%
3   Amharic     Genet Mezemir                         Successor Variety                 Yes                    28.2%       71.8%
4   Afan Oromo  Mekonnen Wakshum                      Rule-Based (Longest Match)        Yes                    7.48%       92.52%
5   Afan Oromo  Debela Tesfaye                        Rule-Based (Iterative)            Yes                    4.27%       94.84%
6   Kambaata    Jonathan Samuel                       Rule-Based (Longest Match)        Yes                    3.13%       96.87%
7   Ge’ez       Abebe Belay and Yibeltal Chanie       Rule-Based                        Yes                    17.58%      82.42%
8   Wolaita     Lemma Lessa                           Longest Match                     Yes                    4.01%       95.9%
9   Afan Oromo  Debela Tesfaye                        Hybrid Approach                   Yes                    4.27%       95.73%
10  Kambaata    Jonathan Samuel & Solomon Teferra     Rule-Based                        Yes                    4.01%       95.9%
11  Silt’e      Muzeyn Kedir                          Longest Match                     Yes                    14.28%      85.71%
12  Tigrinya    Omer Osman and Yoshiki Mikami         Hybrid Approach                   Yes                    10.7%       89.3%
13  Awngi       Tsegaye Misikir                       Longest Match                     Yes                    8.59%       91.41%
14  Wolaita     Girma Yohannis Bade and Hussien Seid  Longest Match                     Yes                    8.16%       91.84%
15  Tigrigna    Yonas Fisseha                         Longest Match                     Yes                    13.89%      86.1%

4. Discussion
The table summarizes the analysis of different languages using various natural language processing (NLP) methods. It lists, for each language, the primary researchers, the conflation technique, context sensitivity, and the error rate.
Conflation Techniques:
Conflation techniques are NLP approaches for dealing with word variants, such as inflections, plurals, and other related word forms. The table lists several, including "Rule-Based (Iterative)", "Affix Removal & Dictionary-Based", "Successor Variety", "Rule-Based (Longest Match)", and "Hybrid Approach".
Sensitive in Context:
The "Sensitive in Context" field shows whether the language under analysis is context-sensitive, which suggests that contextual aspects of the language may call for additional processing. This sensitivity can affect how well NLP approaches perform.
Error Rates:
The error rates listed in the table indicate how accurate the applied NLP approach was for each language. Lower error rates are preferred, since they signify better performance and precision.
Interpretation:
According to the table, several conflation approaches have been used to evaluate the different languages. Due to complicated grammatical rules or context-dependent meanings, some languages are more context-sensitive than others.
Comparison:
Comparing error rates across languages and methodologies helps determine which techniques work best for which languages. Notably, error rates vary greatly, from as low as 3.13% to as high as 28.2%.
Researcher and Technique Variability:
The involvement of multiple researchers and a variety of techniques means that different methods were used to analyze the languages, which can lead to differences in the outcomes.
Limitations:
The table gives no details on the size of the dataset used for analysis, the precise NLP tasks carried out, or other external variables that can affect error rates. These factors might have a significant effect on the outcomes.

5. Recommendations
This kind of analysis is crucial for enhancing NLP tools for low-resource languages and for understanding the difficulties presented by different linguistic structures. Building on this work, researchers might develop more precise NLP models for these languages.

6. References
1. International Journal of Advanced Science and Technology, vol. 29, no. 7, 2020, pp. 2532-2536, ISSN 2005-4238.
