
ISTILAH SAINS: A Malay-English Terminology Retrieval System

Tengku Mohd. T. Sembok, Kulothunkan Palasundram, Nazlena Mohd Ali, Aidanismah Yahya
Universiti Kebangsaan Malaysia
Email: tmts@pkrisc.ukm.my

Abstract

The emergence of the information era has increased the importance of translating articles from Malay to English, and vice versa, to assist the dissemination of information into and from South East Asia, mainly Malaysia, Indonesia and Brunei, where the Malay language is widely used. This has created the need for a good computer-aided translation system to facilitate the translation of terms. In this article, the authors describe the design and implementation of a Malay-English and English-Malay scientific terms retrieval software named Istilah Sains UKM. The system was designed and implemented using object-oriented techniques, with user-friendly interfaces and some basic facilities. For example, the database maintainer can add, change or delete terms from the database, and reindex files. The general user can retrieve terms through the Internet using any browser with a JAVA plug-in. The indexing and retrieving strategies are based on stemming and n-gram methods. The system was developed in the JAVA language, is thus Web-enabled, and can be accessed at http://www.ftsm.ukm.my/istilah/.

Introduction

Research in automatic information retrieval started in the 1940s (Frakes 1991). Many information retrieval systems have been developed since then, such as CONIT (Marcus 1983), and SMART and SIRE (Salton & McGill 1983). However, research in information retrieval using the Malay language is still new and scarce (Fatimah 1995). In an attempt to build a good Malay document retrieval system, SISDOM was developed (Belal 1995). SISDOM uses a Malay-English and English-Malay terms translation module called Istilah. The ever-growing need to translate articles from English to Malay and from Malay to English has triggered the need for an independent translation software to maintain and retrieve terms. In this article, the design and implementation of Istilah Sains UKM, a bilingual (Malay and English) scientific terms translation software, is described.

Design Approach

The design of Istilah Sains UKM is based on object-oriented modeling. The architecture (meta module) of Istilah Sains UKM consists of two main modules: the maintenance module and the Internet access module. The maintenance module consists of indexing and retrieving modules. The function of the indexing module is to update the main data file, which keeps the terms, and to create new index files for the system. The process of indexing can be divided into two steps: first, clustering of the terms into root words using a stemming algorithm; second, clustering of the root words into two-character subwords (bigrams). The stemming algorithm used for clustering Malay terms is Fatimah's algorithm (1996). Porter's suffix-stripping stemming algorithm (Porter 1980) is used to cluster the English terms.

The function of the retrieving module is to retrieve words which match the query word. The partial string matching techniques used are stemming and bigram matching, which determine whether two words match each other. For bigram matching, a Dice coefficient value with a threshold of 0.6 is used for partial matching. Figure 1 shows an example of the output for the query "solar energy".

Figure 1: Output Interface

Evaluation Procedure

The experiments are performed to evaluate the retrieval effectiveness of various techniques employed within the framework of n-gram matching, in the context of the automatic query expansion approach as set by Lennon et al. [1981]. This involves calculating a string similarity measure between each unique term in the dictionary and a specified query term, and ranking the terms accordingly.
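The bigram matching and similarity ranking described above can be illustrated with a short Python sketch. It assumes that each word is reduced to the set of its distinct bigrams and uses the standard set-based definitions of the Dice and Overlap coefficients. The function names and the sample terms are illustrative only and are not taken from the Istilah Sains UKM implementation, which was written in JAVA.

```python
def bigrams(word: str) -> set[str]:
    """Return the set of distinct two-character subwords of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * |X & Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a: str, b: str) -> float:
    """Overlap coefficient: |X & Y| / min(|X|, |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def partial_match(query: str, term: str, threshold: float = 0.6) -> bool:
    """Partial matching with the 0.6 Dice threshold mentioned in the text."""
    return dice(query, term) >= threshold


def rank(query: str, dictionary: list[str], coefficient=dice) -> list[tuple[str, float]]:
    """Rank dictionary terms by decreasing similarity to the query."""
    scored = [(term, coefficient(query, term)) for term in dictionary]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical sample terms, chosen only to illustrate the ranking.
    for term, score in rank("tenaga", ["suria", "bertenaga", "tenaga"]):
        print(f"{term}: {score:.3f}")
```

With these definitions, dice("tenaga", "bertenaga") is 10/13 (about 0.77), so the affixed form passes the 0.6 threshold, while overlap gives 1.0 because every bigram of the shorter word also occurs in the longer one.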

For each query term, the terms in the dictionary are ranked based on a similarity matching measure as specified in each experiment. The relevancy of each term in the dictionary to a specified query term is determined manually by exhaustive scanning through the dictionary with the help of the string matching utility "find" in the Microsoft Excel software. Two terms are regarded as relevant to each other if they share a common root.

The effectiveness of retrieval is evaluated using recall and precision measures based on the number of words retrieved and relevant (R&R). The recall, R, is defined as the proportion of relevant terms actually retrieved from the dictionary with respect to a specified query term. The precision, P, is defined as the proportion of retrieved terms that are actually relevant. The measure of van Rijsbergen, E, which is a weighted combination of recall and precision, is also used to calculate the retrieval effectiveness:

$$E = 100 \left( 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R} \right)$$

where β, in the range of 0 to infinity, is used to reflect the relative importance of recall and precision. β = 1.0 reflects attaching equal importance to precision and recall, while β = 2.0 reflects attaching twice as much importance to recall as to precision. In our experiments we used the value β = 2.0. There is an inverse relationship between the E value and the effectiveness: the lower the E value, the more effective the retrieval.

The E values are calculated at cutoff points of 10 to 100 top-ranking dictionary terms, at intervals of 10. This is done for all 84 query terms, and the mean value of E at each cutoff point is calculated. Generally, the average of the mean values of E over all the cutoff points serves as a good indicator of performance; we use the symbol aE to denote this value. The average of the mean values of E at the ten cutoff points is calculated and used to evaluate the effectiveness of each technique.

A statistical sign test is used to measure the degree of significance of the difference between two methods under consideration, with the null hypothesis

$$P[X_i > Y_i] = P[X_i < Y_i] = \tfrac{1}{2}$$

where Xi and Yi are the two scores for a matched pair obtained from the two methods. In applying the sign test, only the direction of the difference between every Xi and Yi is considered, noting whether the sign of the difference is positive or negative. The binomial distribution with p = q = 1/2 is used to calculate the probability, p, associated with the occurrence of a particular number of positives and negatives. If the number of matched pairs, N, is larger than 35, the normal approximation to the binomial distribution is used in terms of the z value [Siegel and Castellan, 1988].

The Results

The commonly used similarity coefficients are the Dice and Overlap coefficients.
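Since every result in this paper is reported as an E value or as its average over cutoffs, aE, a small Python sketch of both computations may help. It follows the E formula of the evaluation procedure with β = 2.0 as the default. The function names, the input layout (a mapping from cutoff to per-query E values) and the sample numbers are assumptions made for illustration, not the authors' code or data.

```python
def e_measure(retrieved: int, relevant_retrieved: int, relevant_total: int,
              beta: float = 2.0) -> float:
    """van Rijsbergen's E (scaled to 0..100); lower values mean better retrieval."""
    if retrieved == 0 or relevant_total == 0 or relevant_retrieved == 0:
        return 100.0  # nothing relevant retrieved: worst possible score
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant_total
    b2 = beta * beta
    return 100.0 * (1.0 - (1.0 + b2) * precision * recall / (b2 * precision + recall))


def average_e(e_by_cutoff: dict[int, list[float]]) -> float:
    """aE: average over cutoffs of the mean E value across all queries."""
    means = [sum(values) / len(values) for values in e_by_cutoff.values()]
    return sum(means) / len(means)


if __name__ == "__main__":
    # Hypothetical query: 10 terms retrieved, 4 of them relevant, 8 relevant in total.
    print(round(e_measure(10, 4, 8), 2))            # beta = 2.0
    print(round(e_measure(10, 4, 8, beta=1.0), 2))  # equal weight on P and R
    # Hypothetical E values for two queries at two cutoffs.
    print(average_e({10: [50.0, 70.0], 20: [80.0, 60.0]}))
```

In the hypothetical query, P = 0.4 and R = 0.5, so with β = 2.0 the fraction is 1.0/2.1 and E is about 52.4; with β = 1.0 it rises to about 55.6 because the lower precision is then weighted as heavily as recall.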

We ran experiments using both coefficients on digrams and trigrams; the results obtained are given in Table_1. We also tried using both the Dice and Overlap coefficients on digram matching, taking the average value of the two, in order to measure the performance of such a fusion of similarity coefficients. The E values of this run are also given in Table_1.

Digrams and Trigrams Matching

Looking at the E values in Table_1, digram matching performs better than trigram matching at all cutoff levels and with both coefficients. In digram matching, the Overlap coefficient gives aE = 72.58 and the Dice coefficient gives aE = 73.27. In trigram matching, the Overlap coefficient gives aE = 75.07 and the Dice coefficient gives aE = 75.41. Table_1 shows that the Overlap coefficient performed better than the Dice coefficient for both digrams and trigrams. This may be due to Malay morphology, which allows long affixes to be attached to a root word; the Overlap coefficient performs better in this condition. However, the difference between the performance of the Overlap and Dice coefficients is not significant, as shown in Table_2.

Generally, the fusion of the two coefficients performs worse than the Overlap coefficient at all the cutoff points except at the cutoff of 10, but it performs better than the Dice coefficient, with aE = 72.64 as compared to 73.27. Using the E value averages as the basis of comparison, we can conclude that the Overlap coefficient performs better than the Dice coefficient and the fusion of the two.

A sign test was performed on the performance of digram and trigram matching using the Overlap coefficient. The result obtained indicates that digrams perform significantly better, as the figures in Table_3 show.

We also tried a fusion of digrams and trigrams, taking the average value of the two using the Overlap coefficient. The results of this fusion are given in Table_6, compared with the approach using digrams only. The fusion approach performs worse at all the cutoff points except at cutoff 10. Thus we conclude that the fusion approach is not worth pursuing further.

Incorporating-stemming Approach

The stemming process is incorporated into the n-gram matching framework in order to investigate its effectiveness. The first approach is to stem the query term before doing the n-gram matching against the unstemmed dictionary. This approach is performed on digrams and trigrams using the Overlap similarity coefficient. The results obtained are given in Table_7 and show that digram matching performs better than trigram matching, with aE = 64.61 as compared to 66.45. It also performs significantly better than the best non-stemmed approach of the last section, i.e. digram matching with the Overlap similarity coefficient with aE = 72.58, as shown by the sign test figures given in Table_4. The second approach is to stem both the query term and the dictionary terms.
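The difference between the two incorporating-stemming approaches can be sketched in Python as follows. The stemmer here is a deliberately tiny, hypothetical stand-in that strips a few common Malay affixes; it is not Fatimah's algorithm, and the affix lists, function names and sample terms are assumptions made only to keep the example self-contained.

```python
def bigrams(word: str) -> set[str]:
    """Return the set of distinct two-character subwords (digrams) of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def overlap(a: str, b: str) -> float:
    """Overlap coefficient over digram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strip at most one listed prefix and one listed suffix."""
    w = word.lower()
    for prefix in ("ber", "men", "di"):
        if w.startswith(prefix) and len(w) - len(prefix) >= 4:
            w = w[len(prefix):]
            break
    for suffix in ("kan", "an", "i"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            w = w[:-len(suffix)]
            break
    return w


def rank(query: str, dictionary: list[str], stem_dictionary: bool) -> list[tuple[str, float]]:
    """Stem the query, optionally stem each dictionary term, then rank by Overlap."""
    q = toy_stem(query)
    scored = [
        (term, overlap(q, toy_stem(term) if stem_dictionary else term))
        for term in dictionary
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    terms = ["tenaga", "makanan", "bertenaga"]
    print(rank("bertenaga", terms, stem_dictionary=False))  # first approach
    print(rank("bertenaga", terms, stem_dictionary=True))   # second approach
```

In this toy example the unrelated term "makanan" still scores 0.2 against the stemmed query when the dictionary is left unstemmed, because of the shared digram "na", and drops to 0.0 once the dictionary term is stemmed to "makan".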

The results obtained using this approach are given in Table_7. The one using digrams performs better than the one using trigrams, with aE = 61.04 as compared to 61.35, and obviously better than the previous approach, in which only the query terms are stemmed. Table_5 shows the results of the sign test, which indicate that digram matching using a stemmed query and a stemmed dictionary performs significantly better than with a stemmed query alone.

Comparison with Stemmed-Boolean Matching

We also ran experiments using conventional stemmed-boolean matching between the stemmed query and the stemmed dictionary, to compare its effectiveness with the approach of incorporating stemming in the n-gram framework. Comparing these two approaches poses some problems: stemmed-boolean matching is based on a boolean match, whereas the latter is based on a ranking approach with a degree of similarity. E values are used to compare the performance, calculated using the following methods:

1) using variable cutoff points based on the number of words retrieved by the stemmed-boolean match: for each query term, if the stemmed-boolean match retrieved x terms then the cutoff point of x is taken for that query. The mean E value obtained for this method is shown in Table_8.

2) using cutoff with weighted transformation, at 5 and 10 cutoff points, on the number of words retrieved by the stemmed-boolean match: if c is the cutoff point and x is the number of words retrieved by the stemmed-boolean match, then do the following transformation: if x ≤ c then add in c-x non-relevant words to the number of words retrieved, else multiply the number of words retrieved and relevant by c/x. The mean E value obtained for this method is given in Table_8.

Table_8 shows that, using the variable cutoff points evaluation, the stemmed-boolean approach performs better than the incorporating-stemming approach. Table_9 shows that the difference in performance is significant. However, this method of evaluation is biased towards the stemmed-boolean approach. On the other hand, using the cutoff with weighted transformation evaluation, the incorporating-stemming approach performs better, but not significantly, as Table_10 shows. Thus, we cannot unequivocally say which performs better. The two methods are based on two different approaches, one on boolean matching and the other on matching with an uncertainty measure, and both have their functional purposes.

Conclusions

From the experiments performed we can conclude that the Overlap coefficient performs better than the Dice coefficient, but not significantly, and that digram matching performs significantly better than trigram matching. The use of stemming prior to digram and trigram matching has significantly enhanced the performance.

The application of stemming on the query terms and the dictionary performs significantly better than the application of stemming on the query terms alone, which in turn performs significantly better than without the application of stemming. However, we cannot unequivocally conclude which is better between the conventional stemmed-boolean approach and the incorporating-stemming in the n-grams approach. Furthermore, the two approaches are based on different paradigms, one on boolean retrieval and the other on best-match ranking retrieval, and they have advantages and disadvantages of their own.

References

Belal Mustafa, Tengku Mohd. T. Sembok & Mohd. Yusoff. 1995. "SISDOM : A Multilingual Document Retrieval System." Asian Libraries, Vol. 4, No. 3. MCB University Press Limited, England.

Fatimah Ahmad. 1995. Satu Sistem Capaian Dokumen Bahasa Melayu : Satu Pendekatan Eksperimen Dan Analisis. Ph.D Thesis, Universiti Kebangsaan Malaysia.

Fatimah Ahmad, Mohammed Yusoff & Tengku Mohd. T. Sembok. 1996. "Experiments with A Malay Stemming Algorithm". Journal of American Society of Information Science.

Frakes, W.B. 1992. Stemming Algorithms. In Frakes, W.B. & Baeza-Yates, R. (ed). Information Retrieval: Data Structures & Algorithms : 131-160. Englewood Cliffs : Prentice Hall.

Marcus, R.S. 1983. "An Experimental Comparison of the Effectiveness of Computers and Humans as Search Intermediaries." Journal of the American Society for Information Science 34(6):381-404.

Porter, M.F. (1980). "An algorithm for suffix stripping". Program, 14, 130-137.

Salton, G. & McGill, M.J. 1983. Introduction to Modern Information Retrieval. New York : McGraw-Hill.

Table_1: Mean values at cutoff points to assess similarity coefficients performance. Rows: number of R&R for digrams (Dice, Overlap) and trigrams (Dice, Overlap); E for digrams (Dice, Overlap, Fusion) and trigrams (Dice, Overlap). Columns: cutoffs 10 to 100 and cutoff mean.

Table_2: Sign test to determine whether Overlap coefficient performs significantly better than Dice coefficient (significance at α = 0.05).

Cutoff   Digrams                                 Trigrams
         N    negatives   p      significant     N    negatives   p      significant
10       27   13          .500   no              18   6           .119   no
20       27   8           .061   yes             15   3           .018   yes
30       23   8           .105   no              11   5           .500   no
40       29   12          .229   no              13   6           .500   no
50       23   9           .202   no              16   7           .400   no
60       22   9           .262   no              15   7           .500   no
70       22   10          .416   no              15   6           .304   no
80       26   13          .577   no              16   8           .598   no
90       26   12          .423   no              13   5           .291   no
100      23   10          .339   no              12   3           .073   no

Table_3: Sign test to determine whether digrams performs significantly better than trigrams (overlap coefficient is used; significance at α = 0.05).

Cutoff   N    negatives   p      significant
10       26   12          .423   no
20       25   6           .007   yes
30       29   5           .001   yes
40       32   6           .000   yes
50       25   0           .000   yes
60       26   2           .000   yes
70       30   3           .000   yes
80       30   5           .000   yes
90       30   6           .001   yes
100      25   3           .000   yes

Table_4: Sign test to determine whether digrams with stemmed-query performs significantly better than digrams with non-stemmed query (overlap coefficient is used; significance at α = 0.05).

Cutoff   N    negatives   p      significant
10       34   2           .000   yes
20       30   2           .000   yes
30       32   3           .000   yes
40       30   2           .000   yes
50       30   2           .000   yes
60       32   2           .000   yes
70       31   2           .000   yes
80       31   2           .000   yes
90       31   2           .000   yes
100      29   1           .000   yes

Table_5: Sign test to determine whether digrams with stemmed-query and stemmed-dictionary performs significantly better than digrams with stemmed-query only (overlap coefficient is used; significance at α = 0.05).

Cutoff   N    negatives   p or z   significant
10       45   5           5.06     yes
20       29   3           .000     yes
30       22   2           .000     yes
40       19   3           .002     yes
50       15   2           .004     yes
60       14   1           .001     yes
70       14   1           .001     yes
80       13   2           .011     yes
90       12   1           .003     yes
100      12   1           .003     yes

Table_6: Mean values at cutoff points to assess the fusion between digrams and trigrams against digrams only (overlap coefficient is used). Rows: number of R&R and E, each for digrams and fusion. Columns: cutoffs 10 to 100 and cutoff mean.

Table_7: Mean values at cutoff points when stemming is incorporated. Rows: number of R&R and E, each for digrams and trigrams with stemmed query only, and for digrams, trigrams and fusion with stemmed query and stemmed dictionary. Columns: cutoffs 10 to 100 and cutoff mean.

Table_8: The mean values to compare the stemmed-boolean and the incorporating-stemming approaches. Rows: number retrieved and relevant, and E, each for digrams (overlap) and stemmed-boolean, both with stemmed query and stemmed dictionary. Columns: variable cutoff; cutoff with weighted transformation c/x at 5 and 10.

Table_9: Sign test using variable cutoff to determine whether stemmed-boolean approach performs significantly better than incorporating-stemming approach (overlap coefficient is used; significance at α = 0.05).

Cutoff     N    negatives   p      significant
Variable   12   0           .000   yes

Table_10: Sign test using cutoff with weighted transformation to determine whether incorporating-stemming approach performs significantly better than stemmed-boolean approach (overlap coefficient is used; significance at α = 0.05).

Cutoff   N    negatives   p      significant
5        11   3           .113   no
10       16   6           .227   no
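The p values in the sign-test tables follow from the binomial distribution with p = q = 1/2 described in the evaluation procedure. The Python sketch below assumes a one-tailed test on the smaller count of signs and, for N larger than 35, a normal approximation with continuity correction in the form z = (2x + 1 - N)/√N. The function names are illustrative, and the choice of continuity correction is an assumption rather than something the paper states.

```python
from math import comb, sqrt


def sign_test_p(n: int, negatives: int) -> float:
    """One-tailed binomial probability of seeing this few (or fewer) of the rarer sign."""
    k = min(negatives, n - negatives)
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


def sign_test_z(n: int, negatives: int) -> float:
    """Absolute z value of the normal approximation with continuity correction."""
    k = min(negatives, n - negatives)
    return abs(2 * k + 1 - n) / sqrt(n)


def sign_test(n: int, negatives: int) -> tuple[str, float]:
    """Return ('z', value) when N > 35, as in the text, otherwise ('p', value)."""
    if n > 35:
        return "z", sign_test_z(n, negatives)
    return "p", sign_test_p(n, negatives)


if __name__ == "__main__":
    # N and negatives taken from Table_10 (cutoffs 5 and 10).
    for n, negatives in [(11, 3), (16, 6)]:
        kind, value = sign_test(n, negatives)
        print(n, negatives, kind, round(value, 3))
```

Under these assumptions the sketch gives p = .113 for N = 11 with 3 negatives and p = .227 for N = 16 with 6 negatives, the values listed in Table_10. Neither is below 0.05, so neither difference is significant.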