Professional Documents
Culture Documents
Abstract:
Information retrieval is a process of retrieving the documents to satisfy the user's need for information. The
user's information need is represented by a query, the retrieval decision is made by comparing the terms of the query
with the terms in the document itself or by estimating the degree of relevance that the document has to the query.
Words in a document may have many morphological variants. These morphological variants of words have similar
semantic interpretations and can be considered as equivalent for the purpose of IR applications. For this reason, a
number of so-called Stemming Algorithms, which reduces the word to its stem or root form have been developed.
Thus, the key terms of a query or document are represented by stems rather than by the original words. Stemming
reduces the size of the index files and also improves the retrieval effectiveness. A stemming algorithm is a
computational procedure which reduces all words with the same root (or, if prefixes are left untouched, the same
stem) to a common form, usually by stripping each word of its derivational and inflectional suffixes. This study
evaluated the performance analysis of the basic three prefix and suffix removal stemming algorithms in the Tamil
language called Light Stemmer, Improved Light Stemmer and Rule Based Iterative Affix stripping Stemmer by
Accuracy and strength of the algorithms.
Keywords — Stemming, Suffix Stripping, Information Retrieval, Text Preprocessing, Morphology.
I. Introduction
Stemming:
In most cases, morphological variants of words have similar semantic interpretations and can be
considered as equivalent for the purpose of IR applications. For this reason, a number of so-called stemming
gAlgorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. Thus, the
key terms of a query or document are represented by stems rather than by the original words. This not only means
that different variants of a term can be conflated to a single representative form – it also reduces the dictionary size,
that is, the number of distinct terms needed for representing a set of documents. A smaller dictionary size results in a
saving of storage space and processing time.
For IR purposes, it doesn't usually matter whether the stems generated are genuine words or not –
thus, "computation" might be stemmed to "comput" – provided that (a) different words with the same 'base meaning'
are conflated to the same form, and (b) words with distinct meanings are kept separate. An algorithm which attempts
to convert a word to its linguistically correct root ("compute" in this case) is sometimes called a lemmatiser.
Light stemmer is one type of rule based stemmer. It works by truncating all possible suffixes form
and produce finite verb. Light stemming is used to find the representative
indexing form of given word by the application of truncation of suffixes. The core objective of light stemmer is to
preserve the word meaning intact and so increases the retrieval performance of an IR system. A Light Stemmer
Algorithm for Tamil Language is projected in Figure 1.1
Figure 1.2 Flow chart for Improved Light Stemmer Algorithm for Tamil Language
As shown in Figure 4.4, first the prefixes are removed followed by the suffixes. Every suffix stripping routine
checks for the length of the string before proceeding and after removing a suffix calls the routine responsible for
fixing the endings. The flow of the algorithm is depicted in the figure 4.4.
Figure 1..4 Flowchart for Rule Based Iterative Affix Stripping Algorithm for Tamil Language
The index compression factor is the fractional reduction in index size achieved through stemming. For
example, a corpus with 50,000 words (n) and 40,000 stems (s), would have an index compression factor of 20%.
Stronger stemmers will tend to have larger index compression factors. This can be calculated by,
ICF=(N-S)\N
Where ICF is Index Compression factor
N is Number of unique words before Stemming
S is Number of unique stems after Stemming
3. The number of words and stems that differ
Stemmers often leave words unchanged. For example, a stemmer might not alter "engineer" because it is
already a root word. Stronger stemmers will change words more often than weaker stemmers.
4.The mean number of characters removed in forming stems
There are various metrics which use the principle that strong stemmers remove more characters from words
than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to
a text collection. Thus, suppose that the nine words "react", "reacts", "reacting", "reacted", "reaction", "reactions",
"reactive", "reactivity" and "reactivities" are all reduced to "react"; the numbers of characters removed are then 0, 1,
3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11.
5.Hamming Distance
The Hamming Distance takes two strings of equal length and counts the number of corresponding positions
where the characters are different. If the strings are of different lengths, we can use the Modified Hamming
Distance, MHD. Thus, suppose the string lengths are P and Q, where P<Q, we use the formula
MHD = HD(1,P) + (Q-P)
where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer,
suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing
"parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every
word in the original sample.
Data set Total Number Total Number of Minimum Length of Maximum Length of
of Unique Words the Word the Word
Words
Training Dataset 3000 847 3 14
Test Dataset I 700 234 3 9
Test Dataset II 1600 580 3 11
Using this data, I get the Average Accuracy result of the Tamil Stemming Algorithm in the below table
1.6 and figure 1.7
Accuracy
Dataset
90.00%
88.00%
86.00%
84.00%
Accuracy
82.00%
80.00%
78.00%
76.00%
74.00%
72.00%
Light Stemmer Improved Light Stemmer Rule Based Iterative Affix
Stripping
3.5
2.5
1.5
0.5
0
Distance Distance Removed Factor Class Size
Hamming Hamming Characters Compression Conflation
Modified Modified Mean Mean
Mean Median
Light Stemmer Improved Light Stemmer Rule Based Iterative Affix Stripping
IV. Conclusion
By the evaluation of the accuracy of the Tamil Affix Stripping Stemming Algorithm, Light Stemmer
Algorithm provides 81.36% of the accuracy, Improved Light Stemmer Algorithm provides 83.28% of the accuracy
and Rule Based Iterative Affix Stripping shows the 84.96% of accuracy in the Dataset I and II which having totally
2300 Tamil words. In the Strength or Performance side, the proposed algorithm that is Rule Based Iterative Affix
Stripping changes more words into their stems than other two stemming algorithms with high Mean characters
removed, Compression factor and Mean Conflation Class size. Based on these measures, we found that our proposed
algorithm, Rule Based Iterative Affix Stripping is the strongest stemmer, with Improved Light Stemmer somewhat
weaker but still quite strong. Light stemmer is considerably weaker than both Rule Based Iterative Affix Stripping
and Improved Light Stemmer.
References
[1] VairaprakashGurusamy, K.nandhini, “Stemming Techniques for Tamil Languages”, International Journal of
Computer Science and Engineering Technolgy, Vol 8, Issue 6, pp 225-231,2017
[2]VairaprakashGurusamy, S Kannan, “Preprocessing Techniques for Text Mining”
https://www.researchgate.net/publication/273127322
[3] VairaprakashGurusamy, S Kannan, K.Nandhini, “Performance Analysisis:Stemming algorithm for English
Language”, International Journal of Scientific Research and Development, Volume-5, Issue-5,2017, pp 1933-38
[4] Lovins J., Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics,11
pp.
22-31, 1968
[5] M. Porter, An algorithm for suffix stripping Program, Volume 14, Number 3, pages 130-137, 1980.
[6] Paice C., Another stemmer, In Proceedings of SIGIR Forum, Vol. 24(3), pp. 56-61, 1990.
[7] P. Majumder, M. Mitra, Swapan K., Parisi, G. Kole, P. Mitra, and K. Datta, ‘YASS Yet another Suffix Stripper‘,
ACM Transactions, 2007.
[8] Wessel Kraaij. Viewing stemming as recall enhancement, In Hans-Peter Frei, DonnaHarman, Peter Schauble and
Ross Wilkinson(editors), Proceedings of the 19th AnnualInternational ACM-SIGIR Conference on Re-search
and Development in Information Retrieval, pages 40-48, Zurich, Switzerland, 18-22 August 1996. ACM
[9] David A. Hull, Stemming Algorithms-A case study for Detailed Evaluation, Pp.1-27, 1995
[10] NegaAlemayehu and Petter Willett, The Effectiveness of Stemming for InformationRetrieval in Amharic for
Information Retrieval, Electronic Library, and Information Systems,Vol.37, num.4, pp.254-259, 2003.
[11] S.Srinivasan and ZP.Thambidurai, Stans Algorithm for Root Word Stemming, Information Technology Journal
5(4): 685-688, India, 2006
[12] Ekmekcioglu C.F., Lynch M.F and Willett P. “ Stemming and N-gram Matching for termconflation in Turkish
Texests”, 1996.
[13] Llia Smirnov “Overview of Stemming Algorithm” Mechanical Translation, 2008
[14] John A. Goldsmith, “Unsupervised learning of the morphology of a natural language”, Computational
Linguistics, 27(2):153-198,2001
[15] Christopher J. Fox and William B. Frakes,“Strength and Similarity of Affix Removal Stemming Algorithms”,
Special Interest Group on Information Retrieval Forum, 37(1):26-30, 2003.
[16] Harman Donna. “How effective is suffixing?” Journal of the American Society for Information Science.1991;
42, 7-15