
International Journal of Engineering and Techniques - Volume 3 Issue 5, Sep - Oct 2017

RESEARCH ARTICLE OPEN ACCESS

Performance Analysis: Stemming Algorithm for the Tamil Language
Vairaprakash Gurusamy¹, S. Kannan², K. Nandhini³
¹,² Department of Computer Applications, School of IT, Madurai Kamaraj University, Madurai
³ Technical Support Engineer, Concentrix India Pvt Ltd, Chennai

Abstract:
Information retrieval (IR) is the process of retrieving documents that satisfy a user's need for information. The
user's information need is represented by a query, and the retrieval decision is made by comparing the terms of the query
with the terms in the document itself or by estimating the degree of relevance that the document has to the query.
Words in a document may have many morphological variants. These morphological variants have similar
semantic interpretations and can be considered equivalent for the purpose of IR applications. For this reason, a
number of so-called stemming algorithms, which reduce a word to its stem or root form, have been developed.
Thus, the key terms of a query or document are represented by stems rather than by the original words. Stemming
reduces the size of the index files and also improves retrieval effectiveness. A stemming algorithm is a
computational procedure which reduces all words with the same root (or, if prefixes are left untouched, the same
stem) to a common form, usually by stripping each word of its derivational and inflectional suffixes. This study
evaluates the performance of three basic prefix and suffix removal stemming algorithms for the Tamil
language, namely Light Stemmer, Improved Light Stemmer and Rule Based Iterative Affix Stripping Stemmer, in terms of
the accuracy and strength of the algorithms.
Keywords — Stemming, Suffix Stripping, Information Retrieval, Text Preprocessing, Morphology.

I. Introduction
Stemming:

In most cases, morphological variants of words have similar semantic interpretations and can be
considered equivalent for the purpose of IR applications. For this reason, a number of so-called stemming
algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. Thus, the
key terms of a query or document are represented by stems rather than by the original words. This not only means
that different variants of a term can be conflated to a single representative form; it also reduces the dictionary size,
that is, the number of distinct terms needed for representing a set of documents. A smaller dictionary size results in a
saving of storage space and processing time.

For IR purposes, it doesn't usually matter whether the stems generated are genuine words or not –
thus, "computation" might be stemmed to "comput" – provided that (a) different words with the same 'base meaning'
are conflated to the same form, and (b) words with distinct meanings are kept separate. An algorithm which attempts
to convert a word to its linguistically correct root ("compute" in this case) is sometimes called a lemmatiser.

1.1 Light Stemmer:

A light stemmer is one type of rule-based stemmer. It works by truncating all possible suffixes from a word
to produce the finite verb. Light stemming is used to find the representative
indexing form of a given word by truncating its suffixes. The core objective of a light stemmer is to
preserve the word's meaning intact and thereby increase the retrieval performance of an IR system. A Light Stemmer
Algorithm for the Tamil Language is shown in Figure 1.1.

ISSN: 2395-1303 http://www.ijetjournal.org Page 231



Figure 1.1 A Light Stemmer Algorithm for Tamil Language
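
The suffix truncation described for the light stemmer can be sketched as follows. This is a minimal illustration and not the algorithm of Figure 1.1: the transliterated suffix list, the minimum stem length and the single-pass, longest-match strategy are all assumptions made for the example.

```python
# Hypothetical, transliterated suffix list for illustration only;
# the paper's actual Tamil suffix list is not reproduced here.
SUFFIXES = ["kal", "aal", "ai", "il"]
MIN_STEM_LEN = 3  # assumed guard against stripping very short words


def light_stem(word, suffixes=SUFFIXES, min_stem_len=MIN_STEM_LEN):
    """Strip the longest matching suffix once, if enough of the word remains."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print(light_stem("veedukal"))  # veedu
print(light_stem("kal"))       # kal (too short to strip)
```

Trying longer suffixes first avoids removing a short suffix that is only the tail of a longer one.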

1.2 Improved Light Stemmer:


A light stemmer works by truncating all possible suffixes from a word to produce the finite verb. Light
stemming is used to find the representative indexing form of a given word by truncating its suffixes.
The core objective of a light stemmer is to preserve the word's meaning intact so that it increases the retrieval
performance of an IR system. The proposed improved light stemming algorithm removes suffixes recursively and a
single prefix non-recursively; this algorithm is called SP. A second variant uses the same list of affixes
but a modified execution order, a Suffix-Prefix-Suffix truncating process; that algorithm is
called SPS.
Note that many Tamil words use the prefix Thiru or Thirumathi as a declarative term (e.g., Thiru.
Dr. A.P.J. Abdul Kalam). We therefore propose two further categories for classifying the designed algorithms:
Without-Thiru (WOTH), in which the stemmer accepts the non-stemmed word after the prefixed Thiru has been removed, and
With-Thiru (WTH), in which the stemmer receives the whole non-stemmed word. This work applies the improved light
stemming concept to handle plural terms, adjectives and tense words. The flowchart and algorithm of the improved light
stemmer are shown in Fig. 1.2 and Fig. 1.3.


Figure 1.2 Flow chart for Improved Light Stemmer Algorithm for Tamil Language

Figure 1.3 Improved Light Stemmer Algorithm for Tamil Language
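
The SP variant and the Without-Thiru / With-Thiru distinction can be sketched roughly as below. This is an illustration under stated assumptions, not the algorithm of Figure 1.3: the affix lists are hypothetical transliterations, the minimum stem length is assumed, and the honorific is treated as a plain leading string.

```python
# All lists below are hypothetical, transliterated examples.
HONORIFICS = ["thirumathi", "thiru"]  # longest first
SUFFIXES = ["kal", "ai", "il"]
PREFIXES = ["maa"]
MIN_STEM_LEN = 3  # assumed guard against over-stripping


def strip_suffixes(word):
    """Remove suffixes repeatedly until none applies (the recursive step)."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def strip_one_prefix(word):
    """Remove at most one prefix (the non-recursive step)."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    return word


def improved_light_stem(word, without_thiru=True):
    """SP order: suffixes first, then a single prefix."""
    if without_thiru:  # WOTH: drop the honorific before stemming
        for title in HONORIFICS:
            if word.startswith(title) and len(word) - len(title) >= MIN_STEM_LEN:
                word = word[len(title):]
                break
    return strip_one_prefix(strip_suffixes(word))


print(improved_light_stem("veedukalil"))                         # veedu
print(improved_light_stem("thirukumaran"))                       # kumaran
print(improved_light_stem("thirukumaran", without_thiru=False))  # thirukumaran
```

Under the same assumptions, the SPS variant would call `strip_suffixes` once more after the prefix step.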


1.3 Rule Based Iterative Affix Stripping Stemmer:
In Tamil, suffixes are used for many purposes, such as marking tense, plurality and person. The suffixes are therefore
grouped into categories, and a routine is defined for each category to handle the removal of the respective suffixes.
After the suffix of each category is removed, a further routine fixes or recodes the ending of the word to make it
consumable by the next routine. Before stripping a suffix, every routine also checks the current length of the string.

As shown in Figure 1.4, the prefixes are removed first, followed by the suffixes. Every suffix stripping routine
checks the length of the string before proceeding and, after removing a suffix, calls the routine responsible for
fixing the endings. The flow of the algorithm is depicted in Figure 1.4.


Figure 1.4 Flowchart for Rule Based Iterative Affix Stripping Algorithm for Tamil Language
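
The flow described in this section (prefix removal, then one routine per suffix category, each with a length check and a call to an ending-fix routine) might be sketched as follows. The categories, transliterated affixes, recoding rule and length thresholds are hypothetical and chosen only to make the example run; they are not the rules of the proposed stemmer.

```python
MIN_WORD_LEN = 4  # assumed: a routine only strips words longer than this
MIN_STEM_LEN = 3  # assumed: at least this many characters must remain

# Hypothetical transliterated rules, grouped by category as the text describes.
PREFIXES = ["maa"]
SUFFIX_CATEGORIES = [
    ("case", ["il", "ai"]),   # outermost suffixes are handled first
    ("plural", ["kal"]),
]
# Hypothetical recoding rule: ending left after stripping -> repaired ending.
ENDING_FIXES = {"ng": "m"}


def fix_ending(word):
    """Recode the word ending so the next routine sees a well-formed string."""
    for ending, repaired in ENDING_FIXES.items():
        if word.endswith(ending):
            return word[: -len(ending)] + repaired
    return word


def strip_prefix(word):
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    return word


def strip_category(word, suffixes):
    """Check the length, strip one suffix of this category, then fix the ending."""
    if len(word) <= MIN_WORD_LEN:
        return word
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return fix_ending(word[: -len(suffix)])
    return word


def rule_based_stem(word):
    word = strip_prefix(word)
    for _category, suffixes in SUFFIX_CATEGORIES:
        word = strip_category(word, suffixes)
    return word


print(rule_based_stem("marangkalil"))  # maram
```

In this example "il" is stripped first, then "kal", and the leftover ending "ng" is recoded to "m", which is the kind of repair the ending-fix routine is meant to perform.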

II. Evaluation of Stemming Algorithms for the Tamil Language


To evaluate stemming algorithms for both the Tamil and English languages, the following two parameters are
essential:
I. Accuracy of the Stemming algorithm
II. Strength or Performance of the Stemming Algorithm
I. Accuracy of the Stemming algorithm
Accuracy is based on the number of words stemmed correctly by the stemming algorithm and the number of unique
words in the given dataset. It is calculated by the formula below:
Accuracy = (Number of Correctly Stemmed Words / Number of Unique Words) * 100
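
As a small worked sketch of the accuracy measure, assuming the result is meant as a percentage (the ratio multiplied by 100), the function name and the sample counts below are illustrative and not taken from the experiments:

```python
def accuracy(correctly_stemmed, unique_words):
    """Percentage of unique words whose stem matches the reference stem."""
    if unique_words <= 0:
        raise ValueError("unique_words must be positive")
    return 100.0 * correctly_stemmed / unique_words


# Illustrative counts: 85 of 100 unique words stemmed correctly.
print(accuracy(85, 100))  # 85.0
```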
II. Strength or Performance of the Stemming Algorithm
To evaluate the strength or performance of the stemming algorithms for both languages, we used the following
five metrics:
1. The mean number of words per conflation class
This is the average number of words that correspond to the same stem for a corpus. For example, if the words
"engineer," "engineered," and "engineering" are stemmed to "engineer," then this conflation class size is three.
Stronger stemmers will tend to have more words per conflation class. If the conflation of 1,000 different words
resulted in 250 distinct stems, then the mean number of words per conflation class would be 4.
This metric is obviously dependent on the number of words processed, but for a word collection of a given
size, a higher value indicates a heavier stemmer. The value is easily calculated as follows:
MWC = N / S
where MWC is the mean number of words per conflation class,
N is the number of unique words before stemming, and
S is the number of unique stems after stemming.
2. Index compression factor


The index compression factor is the fractional reduction in index size achieved through stemming. For
example, a corpus with 50,000 words (N) and 40,000 stems (S) would have an index compression factor of 20%.
Stronger stemmers will tend to have larger index compression factors. This can be calculated by
ICF = (N - S) / N
where ICF is the index compression factor,
N is the number of unique words before stemming, and
S is the number of unique stems after stemming.
3. The number of words and stems that differ
Stemmers often leave words unchanged. For example, a stemmer might not alter "engineer" because it is
already a root word. Stronger stemmers will change words more often than weaker stemmers.
4. The mean number of characters removed in forming stems
There are various metrics which use the principle that strong stemmers remove more characters from words
than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to
a text collection. Thus, suppose that the nine words "react", "reacts", "reacting", "reacted", "reaction", "reactions",
"reactive", "reactivity" and "reactivities" are all reduced to "react"; the numbers of characters removed are then 0, 1,
3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11.
5. Hamming Distance
The Hamming Distance takes two strings of equal length and counts the number of corresponding positions
where the characters are different. If the strings are of different lengths, we can use the Modified Hamming
Distance, MHD. Thus, suppose the string lengths are P and Q, where P<Q, we use the formula
MHD = HD(1,P) + (Q-P)
where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer,
suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing
"parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every
word in the original sample.
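
The five strength measures above can be sketched in a few lines and checked against the worked examples in the text. The function names are illustrative, and each (word, stem) pair is assumed to have a stem no longer than its word.

```python
def mean_words_per_conflation_class(n_words, n_stems):
    """MWC = N / S."""
    return n_words / n_stems


def index_compression_factor(n_words, n_stems):
    """ICF = (N - S) / N."""
    return (n_words - n_stems) / n_words


def words_changed(pairs):
    """Number of (word, stem) pairs in which the stem differs from the word."""
    return sum(1 for word, stem in pairs if word != stem)


def mean_chars_removed(pairs):
    """Average number of characters removed per word."""
    return sum(len(word) - len(stem) for word, stem in pairs) / len(pairs)


def modified_hamming_distance(a, b):
    """MHD = HD over the first P characters + (Q - P), with P <= Q."""
    short, long_ = sorted((a, b), key=len)
    hd = sum(1 for x, y in zip(short, long_) if x != y)  # zip stops at length P
    return hd + (len(long_) - len(short))


words = ["react", "reacts", "reacting", "reacted", "reaction",
         "reactions", "reactive", "reactivity", "reactivities"]
pairs = [(w, "react") for w in words]

print(mean_words_per_conflation_class(1000, 250))     # 4.0
print(index_compression_factor(50000, 40000))         # 0.2
print(words_changed(pairs))                           # 8
print(round(mean_chars_removed(pairs), 2))            # 3.11
print(modified_hamming_distance("parties", "party"))  # 3
```

Averaging `modified_hamming_distance(word, stem)` over all pairs gives the mean MHD for a sample.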

III. Experimental Results


Data Set
Two test datasets are considered for evaluating the Tamil stemming algorithms. Test Dataset I has 700
words collected from a Tamil corpus (Central Institute of Indian Languages). Test Dataset II contains 1600
words collected from the internet. The stems for these words have been defined manually. The training dataset
consists of 3000 words taken from the Tamil corpus. Table 1.4 below summarizes the details of the training and test
datasets.

Data set           Total Number   Total Number of   Minimum Length   Maximum Length
                   of Words       Unique Words      of the Word      of the Word
Training Dataset   3000           847               3                14
Test Dataset I     700            234               3                9
Test Dataset II    1600           580               3                11

Table 1.4 Data Set Generation

1. Accuracy of the Tamil Stemming Algorithm


Accuracy is the parameter used to evaluate the efficiency of the Tamil stemming algorithms. It
is defined in terms of the number of words stemmed correctly. Table 1.5 below shows the computational
results of the three affix stripping algorithms.


Dataset      Number of   Number of      Number of Correctly Stemmed Words by
             Words       Unique Words   Light Stemmer   Improved Light      Rule Based Iterative
                                        algorithm       Stemmer algorithm   Affix Stripping algorithm
Dataset I    200         37             30              32                  32
             400         118            101             99                  101
             600         182            152             160                 161
             700         237            200             213                 213
Dataset II   200         55             44              44                  46
             400         148            117             116                 120
             600         193            145             149                 154
             800         391            290             318                 314
             1000        424            338             336                 340
             1200        486            400             399                 417
             1400        539            403             411                 422
             1600        580            480             487                 508

Table 1.5 Test Data of Tamil Stemming Algorithm

Using this data, we obtain the average accuracy of the Tamil stemming algorithms, shown in Table
1.6 and Fig. 1.7 below.

Dataset      Accuracy
             Light Stemmer   Improved Light Stemmer   Rule Based Iterative Affix Stripping
Dataset I    84.32%          86.73%                   87.52%
Dataset II   78.4%           79.83%                   82.4%
Average      81.36%          83.28%                   84.96%

Table 1.6 Average Accuracy Result of the Tamil Stemming algorithm


[Bar chart: accuracy (72% to 90%) of Light Stemmer, Improved Light Stemmer and Rule Based Iterative Affix Stripping, with bars for Dataset I, Dataset II and Average]

Fig. 1.7 Average Accuracy Result of the Tamil Stemming Algorithm


2. Strength or Performance of the Tamil Stemming Algorithm
In this section we analyze the strengths of the three affix removal stemmers. We looked at the five
measures of stemmer strength described above; the results are displayed in Table 1.7 and Fig. 1.8 below.

Stemmer                  Mean Modified   Median Modified   Mean         Compression   Mean Conflation   Word and Stem
                         Hamming         Hamming           Characters   Factor        Class Size        Different
                         Distance        Distance          Removed
Light Stemmer            1.82            2                 1.6          0.51          1.84              1845 (80.2%)
Improved Light Stemmer   2.14            2                 1.9          0.58          2.31              1941 (84.4%)
Rule Based Iterative     2.76            3                 2.4          0.65          2.88              1991 (86.53%)
Affix Stripping

Table 1.7 Performance of the Tamil Stemming Algorithm


[Bar chart: Mean Modified Hamming Distance, Median Modified Hamming Distance, Mean Characters Removed, Compression Factor and Mean Conflation Class Size for Light Stemmer, Improved Light Stemmer and Rule Based Iterative Affix Stripping]

Fig. 1.8 Performance of the Tamil Stemming Algorithm

IV. Conclusion
In the evaluation of the accuracy of the Tamil affix stripping stemming algorithms, the Light Stemmer
algorithm achieves 81.36% accuracy, the Improved Light Stemmer algorithm achieves 83.28% accuracy,
and Rule Based Iterative Affix Stripping achieves 84.96% accuracy on Datasets I and II, which together contain
2300 Tamil words. In terms of strength or performance, the proposed algorithm, Rule Based Iterative Affix
Stripping, changes more words into their stems than the other two stemming algorithms, with higher mean characters
removed, compression factor and mean conflation class size. Based on these measures, we found that our proposed
algorithm, Rule Based Iterative Affix Stripping, is the strongest stemmer, with Improved Light Stemmer somewhat
weaker but still quite strong. Light Stemmer is considerably weaker than both Rule Based Iterative Affix Stripping
and Improved Light Stemmer.

