Word Sense Discrimination Using Context Vector Similarity
Atelach Alemu Argaw
Stockholm University/KTH Stockholm, Sweden atelach@dsv.su.se
Abstract

This paper presents the application of context vector similarity for the purpose of word sense discrimination during query translation. The random indexing vector space method is used to accumulate the context vectors. Pairwise similarity between the context vectors of ambiguous terms and those of anchor terms indicates the likely correct translation of a query term. Two retrieval experiments were conducted, using the discriminated queries and maximally expanded, weighted queries respectively. The discriminated queries show a substantial increase in retrieval performance.

1 Introduction

In Cross Language Information Retrieval (CLIR), queries are posed in a language different from that of the document collection. Automatic translation of the queries into the language of the document collection is the most commonly used approach, owing to the fact that it is the least computationally expensive and the least resource intensive when compared to the alternatives of document translation and interlingua representation. One of the issues that needs to be addressed during query translation is word sense disambiguation/discrimination, where the task is to pick out the term with the correct sense in the current context from among the possible translations given in the lexical resource utilized. Word sense disambiguation involves sense labeling: classifying or categorizing words into appropriate sense classes. Word sense discrimination (WSD), on the other hand, is not concerned with sense labeling; it aims to group the occurrences of a word into a number of classes by determining, for any two occurrences, whether or not they belong to the same sense. For Machine Readable Dictionary (MRD) based CLIR, we need to automatically determine the translation with the correct sense. In this paper we present an approach based on context vector similarity measures in a word space generated by random indexing.

2 Experiments

In information retrieval, overall performance is affected by a number of factors, implicitly and explicitly. Determining the effect of all factors and tuning parameters universally is a very complicated task. Therefore, our focus is on performing univariate sensitivity tests aimed at optimizing a specific single parameter while keeping the others fixed at reasonable values. In these experiments, our focus is to use context vector similarity to discriminate among potential translations, and to determine the effect of such WSD on retrieval performance. Stemming of the Amharic terms is therefore done manually in order to avoid the inclusion of wrong translations that might potentially misguide the discrimination process. Out-of-dictionary words are assumed to be named entities and are used as fuzzy match terms.

2.1 Data

We used the Cross Language Evaluation Forum (CLEF1) English collection, consisting of LAT94 (56,472 articles) and GH95 (113,005 articles), to generate the word space for sense discrimination as well as to perform the retrieval experiments. 50 Amharic queries from CLEF 2004 were used for the discrimination and retrieval experiments.

1 www.clef-campaign.org/
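To make the word-space approach concrete, the following is a minimal sketch of random indexing with cosine similarity over the accumulated context vectors. It is an illustration only, not the GSDM implementation used in the paper; the dimensionality, sparsity, and window size below are assumed toy values chosen for readability.

```python
import random
from collections import defaultdict

DIM = 300      # word-space dimensionality (assumed toy value)
NONZERO = 8    # +1/-1 entries per index vector (assumed toy value)
WINDOW = 2     # context window on each side (assumed toy value)

rng = random.Random(42)

def index_vector():
    """Sparse ternary random 'index vector' assigned to each word."""
    v = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        v[pos] = rng.choice((1.0, -1.0))
    return v

def train_context_vectors(corpus):
    """A word's context vector is the sum of the index vectors of the
    words co-occurring with it within the window (syntagmatic use)."""
    index = defaultdict(index_vector)
    context = defaultdict(lambda: [0.0] * DIM)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    cv, iv = context[word], index[sentence[j]]
                    for d in range(DIM):
                        cv[d] += iv[d]
    return context

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0
```

Words occurring in similar contexts accumulate similar context vectors, so their cosine similarity is high even though the random index vectors themselves are nearly orthogonal.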
SLTC 2008
2.2 Preprocessing

For ease of use and compatibility purposes, the Amharic queries, originally written in the Amharic script fidel, were transliterated to an ASCII representation using a file conversion utility called g22. Query terms that appear more than 15 times were discarded, as they turned out to be non-content-bearing words that introduce noise when translated. The query translation was done through term lookup in an Amharic-English MRD (Aklilu, 1981) containing 15,000 entries and an online Amharic-English dictionary3 with approximately 18,0004 entries. Bigrams were given precedence over unigrams for lookup. The translated list of terms was further preprocessed by removing stop words and by morphological analysis using the Porter stemming algorithm5. The document collection was also morphologically normalized using the same algorithm.

2.3 Word Sense Discrimination

Given an Amharic query with N terms {wa1, wa2, wa3, …, wan} and corresponding English translations {wa1e1, wa1e2, wa1e3; wa2e1; wa3e1, wa3e2, wa3e3, wa3e4, wa3e5; …; wane1, wane2}, we need to select the correct translation in the given context for the terms {wa1, wa3, …, wan}. The term wa2 is labeled as an 'anchor' term, since it has a single translation. Each possible translation is then paired with the closest anchor term, in this case {(wa1e1, wa2e1), (wa1e2, wa2e1), (wa1e3, wa2e1), …}. In cases where there are two anchor terms at the same distance from a term that needs discrimination, precedence is given to the anchor occurring before the current term in the query; in cases where there are no anchor terms, each unique pairing is taken. We then calculate the similarity of the context vectors for each pair of translated words, and from each group of translations for a single term we pick the translation that has the highest similarity with the nearest anchor term. In the absence of anchors, the pair with the highest similarity is taken. We used GSDM6 (Guile Sparse Distributed Memory) and the morphologically normalized CLEF English document collection to generate the word space model using Random Indexing and to train context vectors for syntagmatic word spaces (Sahlgren, 2006). We calculate the cosine similarity between the context vectors of each pair using the functions provided in GSDM.

2.4 Retrieval Experiments

The retrieval experiments were conducted using Lucene7. Two runs were designed for comparison purposes. Run1 used the discriminated queries, and Run2 used maximally expanded and weighted queries. In Run2, all translations of each word as given in the MRD are taken, but lower weights are assigned to terms with multiple translations: if a term has n possible translations, each of those translations is downscaled by assigning it a weight of 1/n. The results are given in Table 1.

        Relevant Total   Relevant Retrieved   MAP     R-Prec
  Run1  375              163                  11.91   10.76
  Run2  375              129                   5.79    4.83

Table 1: Retrieval Results

3 Concluding Remarks

The results obtained show that there is a substantial increase in retrieval performance when using the discriminated queries. The context vector similarity based discrimination works very well in picking out the correct translation. Further experiments using 150 more queries are currently underway in order to give the results more statistical significance and to draw firmer conclusions.

Acknowledgments

The GSDM tool and the Guile functions were provided by Anders Holst and Magnus Sahlgren.

References

Amsalu Aklilu. 1981. Amharic English Dictionary, volume 1. Mega Publishing Enterprise, Addis Ababa, Ethiopia.

Magnus Sahlgren. 2006. The Word-Space Model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral thesis, Stockholm University.

2 g2 was made available to us through Daniel Yacob of the Ge'ez Frontier Foundation (http://www.ethiopic.org/)
3 http://www.amharicdictionary.com/
4 Personal correspondence
5 http://tartarus.org/~martin/PorterStemmer/index.html
6 GSDM is an open source C-library for Guile, designed specifically for the Random Indexing methodology, written by Anders Holst at the Swedish Institute for Computer Science.
7 Apache Lucene is an open source high-performance, full-featured text search engine library written in Java. http://lucene.apache.org/java/docs/index.html
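As a concrete illustration of the discrimination procedure of Section 2.3, the following self-contained sketch pairs each candidate translation with its nearest anchor and keeps the candidate with the highest cosine similarity. This is a simplified re-implementation for illustration, not the GSDM/Guile code used in the experiments; the no-anchor fallback approximates the paper's highest-similarity-pair rule, and the example vocabulary and two-dimensional vectors are invented.

```python
def cosine(u, v):
    """Cosine similarity between two dense context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_anchor(pos, anchors):
    """Closest anchor position; ties favor the anchor before the term."""
    return min(anchors, key=lambda a: (abs(a - pos), a > pos))

def discriminate(candidates, vectors):
    """candidates: one list of English candidate translations per Amharic
    query term, in query order; a single-candidate list is an anchor.
    vectors: word -> context vector. Returns one translation per term."""
    anchors = [i for i, c in enumerate(candidates) if len(c) == 1]
    chosen = []
    for i, cands in enumerate(candidates):
        if len(cands) == 1:
            chosen.append(cands[0])            # anchor: nothing to decide
        elif anchors:
            anchor_word = candidates[nearest_anchor(i, anchors)][0]
            chosen.append(max(
                cands,
                key=lambda w: cosine(vectors[w], vectors[anchor_word])))
        else:
            # No anchors: approximate the highest-similarity-pair rule by
            # taking the candidate most similar to any other term's candidate.
            chosen.append(max(cands, key=lambda w: max(
                (cosine(vectors[w], vectors[o])
                 for j, others in enumerate(candidates) if j != i
                 for o in others), default=0.0)))
    return chosen
```

For example, given an anchor translation "river" and an ambiguous term with candidates "embankment" and "bank", the candidate whose context vector lies closest to that of "river" is selected.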