
A Classifier-based Text Mining Approach for Evaluating Semantic Relatedness Using Support Vector Machines

Chung-Hong Lee
Department of Electrical Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, TAIWAN
leechung@mail.ee.kuas.edu.tw

Hsin-Chang Yang
Department of Information Management
Chang Jung University
Tainan, TAIWAN
hcyang@mail.cju.edu.tw

Abstract
The quantification of semantic relatedness among texts has been a challenging issue that pervades much of machine learning and natural language processing. This paper presents a hybrid text-mining approach for measuring semantic relatedness among texts. In this work we develop several text classifiers using the Support Vector Machine (SVM) method to support the acquisition of relatedness among texts. First, we utilized our previously developed text mining algorithms, including techniques based on the classification of texts in several text collections. We then employ various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research problems, such as information filtering and re-categorizing the resulting documents of search engine queries.

1. Introduction
The analysis and organization of large document repositories is one of today's great challenges in machine learning, a key issue being the quantitative assessment of document relatedness. A sensible relatedness measure would offer answers to questions like: how related are two documents, and which documents match a given query best? As anyone who has done information retrieval or web searches using search engines will attest, it is rather discouraging to get a return of a search stating that the search has found thousands of documents when in fact most of the documents on the first screen (the highest ranked documents) are not relevant to the user. Eliminating the gap between the query results and the documents that satisfy the user's true information needs would enable more research effort to be devoted to further enhancement. Examples of situations in which the acquisition of textual semantic relatedness can be employed are:

Email filtering. The user wishes to establish a personalized automatic junk email filter. In the learning phase the classifier has access to the user's past email files. It interactively brings up past email and asks the user whether the displayed email is junk or not. Based on the user's judgment it brings up another email and queries the user. The process is repeated several times and the result is an email filter tailored to that specific person.

Relevance feedback. The user wishes to sort through an internet search engine or database for items (articles, images, etc.) that are of personal interest, an "I'll know it when I see it" type of search. The search engine displays a list of resulting documents and the user indicates, for each item, whether it is interesting or not. Based on the user's judgments, the search engine brings up another item list from the internet websites. After several iterations, the system has learned to locate documents more precisely with the support of the classifier, and then returns a new list of items that it believes will be of interest to the user.


Both filtering and relevance feedback are classification problems in that documents are assigned to one of two classes (relevant or not). Since which documents are to be considered relevant is user dependent, every user must construct a different training set. Generally speaking, in these cases the acquisition of textual relatedness can be achieved with the support of intelligent classifiers designed to meet specific information requirements (or topics) given by end users. The focus of this work is therefore the development of a novel classifier-based technique that computes the relatedness of documents based on a specific training corpus of text documents, without requiring domain-specific knowledge.

1.1 Techniques applied to acquisition of semantic relatedness


Several terminologies related to this topic are easily confused; in our survey at least three different terms are used by different authors: semantic relatedness, semantic similarity, and semantic distance. The distinction between semantic relatedness and semantic similarity can be described by way of example: cars and gasoline would seem to be more closely related than, say, cars and bicycles, but the latter pair are certainly more similar. Similarity is thus a special case of semantic relatedness, and we adopt this viewpoint in this paper. Among the other relationships that the notion of relatedness encompasses are the various kinds of meronymy, antonymy, functional association, and other non-classical relations. The term semantic distance may cause even more confusion, as it can be used when talking about either just similarity or relatedness in general. In this work, we focus on the issues associated with measuring semantic relatedness among texts.

The majority of approaches to measuring semantic relatedness work through a semantic network, such as WordNet [4], [14]. WordNet is a broad-coverage semantic network established as an attempt to model the lexical knowledge of a native speaker of English [18]. In WordNet, English nouns, verbs, adjectives, and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept, which are interlinked with a variety of relations. A natural way to evaluate semantic relatedness in a WordNet taxonomy, given its graphical representation, is to evaluate the distance between the nodes corresponding to the items being compared: the shorter the path from one node to another, the more similar they are. Given multiple paths, one takes the length of the shortest one [16].
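To make the path-length idea concrete, the following is a minimal sketch using NLTK's WordNet interface; this tooling is our choice for illustration and is not part of the paper's system.

# A minimal sketch of path-length-based relatedness over WordNet,
# using NLTK (requires nltk and the 'wordnet' corpus to be installed).
from nltk.corpus import wordnet as wn

def shortest_path_relatedness(word1, word2):
    # path_similarity = 1 / (1 + shortest path length), so a shorter
    # path between two synsets yields a score closer to 1.
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

print(shortest_path_relatedness('car', 'bicycle'))   # similar concepts
print(shortest_path_relatedness('car', 'gasoline'))  # related, less similar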

Instead of WordNet, Rada's central knowledge source is MeSH (Medical Subject Headings) [15]. The network comprises 15,000 terms in a nine-level hierarchy that includes high-level nodes such as anatomy, organism, and disease, and is based on the BROADER-THAN relationship. The principal assumption put forward by Rada is that the number of edges between terms in the MeSH hierarchy is a measure of the conceptual distance between them.

Despite its apparent simplicity, a widely acknowledged problem with this edge-counting approach is that it relies on the notion that links in the taxonomy represent uniform distances. This is not always true, since there is wide variability in the distance covered by a single taxonomic link, particularly when certain subtaxonomies are much denser than others. In addition, edge-counting approaches must rely on a well-established lexical knowledge base as the source of the semantic network (e.g., WordNet) for computing semantic relatedness, so they are not well suited to applications in specific domains for which standard lexical knowledge bases are not available.

Recent work in computational linguistics suggests that large amounts of semantic information can be extracted automatically from large text corpora on the basis of lexical co-occurrence information. Such semantics has become important and useful as an essential representation of the content of each web page, particularly with the increasing availability of digital documents from all around the world. In this work we attempt to develop a novel algorithmic approach for extracting semantic information from web text corpora. Using a variation of automatic text categorization that applies the Support Vector Machine (SVM), we have conducted several experiments with several text classifiers associated with related topics, based on SVM learning processes. Furthermore, when exposed to the classified texts, we employ a novel algorithm to measure the implicit semantic relatedness among them.
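As a toy illustration of the co-occurrence idea (not the paper's own method, which is the SVM pipeline described next), words can be represented by vectors of co-occurrence counts and compared with the cosine measure; the corpus here is invented.

# Toy co-occurrence-based relatedness: each word is represented by
# the counts of words it co-occurs with in the same sentence, and
# relatedness is the cosine between those count vectors.
import numpy as np
from collections import Counter
from itertools import combinations

corpus = ["the car needs gasoline", "the car passed the bicycle",
          "gasoline prices rose", "he rode the bicycle to work"]

cooc = Counter()
vocab = set()
for sentence in corpus:
    words = set(sentence.split())
    vocab.update(words)
    for w1, w2 in combinations(words, 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

index = {w: i for i, w in enumerate(sorted(vocab))}

def vector(word):
    v = np.zeros(len(index))
    for other, i in index.items():
        v[i] = cooc[(word, other)]
    return v

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vector('car'), vector('gasoline')))  # co-occurrence -> related
print(cosine(vector('car'), vector('bicycle')))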

2. Automatic text categorization based on support vector machines


The Support Vector Machine (SVM) is a relatively new learning technique for data classification. The goal of an SVM is to find a decision surface that separates the training data samples into two classes, and to make decisions based on the support vectors, which are selected as the only effective elements of the training set. For text classification, an SVM makes decisions based on a globally optimized separating hyperplane: it simply finds out on which side of the hyperplane the test


pattern is located (see Figure 1). This characteristic makes SVMs highly competitive with other pattern classification methods in terms of predictive accuracy and efficiency. Various quadratic programming methods have been proposed and extensively studied to solve the SVM problem. In particular, Joachims has done much research on the application of SVMs to text categorization [7].
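The decision step just described amounts to checking the sign of w·x + b for a test vector x; a minimal sketch with hypothetical, hand-picked weights (a trained SVM would learn w and b from data):

# Classify a test vector by the side of the hyperplane w.x + b = 0
# on which it falls (weights here are invented for illustration).
import numpy as np

w = np.array([0.8, -0.4])   # hypothetical learned weight vector
b = -0.1                    # hypothetical learned bias

def classify(x):
    return +1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 0.5])))   # falls on the +1 side
print(classify(np.array([0.1, 0.9])))   # falls on the -1 side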

Figure 1. SVM classifier structure [19]

2.1 How support vector machines work

When an SVM is constructed, two parallel hyperplanes are formed: one passing through one or more examples of the non-relevant vectors, and one passing through one or more examples of the relevant vectors. Vectors lying on these hyperplanes are termed support vectors and in fact define the two hyperplanes. If we define the margin as the orthogonal distance between the two hyperplanes, then an SVM maximizes this margin. Equivalently, the optimal hyperplane is the one for which the distance to the nearest vector is maximum.
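In standard form (after Vapnik [19]), the margin maximization just described can be written as the following optimization problem; this is the textbook formulation, added here for clarity rather than quoted from the paper.

% Hard-margin SVM: the two supporting hyperplanes are
% w.x + b = +1 and w.x + b = -1, and the margin between
% them is 2 / ||w||, so minimizing ||w|| maximizes it.
\begin{align*}
  \min_{\mathbf{w},\,b} \quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
  \text{subject to} \quad & y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,
      \qquad i = 1,\dots,n.
\end{align*}
% Points with y_i (w.x_i + b) = 1 lie exactly on the two hyperplanes;
% they are the support vectors that define the decision surface.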

3. System implementation

In this work we develop an approach that applies a classifier-based technique using the Support Vector Machine (SVM) method to support the acquisition of relatedness among texts. We utilized our previously developed text mining algorithms and platforms [10], [11], [12], [20], including text mining techniques based on Support Vector Machines (SVM) and Self-Organizing Maps (SOM), for performing clustering and classification of texts in several text collections. After that, we employ SVM methods to acquire the relatedness of the target documents of the text mining process, in order to find the semantic connections and relatedness among the mined texts, as shown in Figure 2.

Figure 2. System framework: a text corpus undergoes feature selection of the input texts and is passed to the Support Vector Machine algorithm for SVM classifier-based categorization, yielding the acquisition of textual semantic relatedness

The implementation of the semantic relatedness measures includes two subtasks concerning the preparation of the information sources. Our approach begins with a standard practice in information retrieval: encoding documents as vectors, in which each component corresponds to a different word and its value reflects the frequency of that word's occurrence in the document. Subsequently, we employ the SVM technique to develop the text classifiers.

3.1 The reason for choosing SVM over other classification techniques

In practice, the resulting dimensionality of the vector space is often tremendously large, since the number of dimensions is determined by the number of distinct indexed terms in the corpus. As a result, techniques for controlling the dimensionality of the vector space are often required. Since SVM techniques manage high-dimensional input spaces more effectively than other classification techniques, the need for time-consuming linguistic preprocessing (i.e., reduction of the dimensionality of the feature space) can be largely eliminated. Therefore, in this work we choose SVM methods as the major approach for classification. The performance of SVM classifiers is also compared with that of other classifiers in our experimental work.
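As a concrete illustration of the Figure 2 pipeline, here is a minimal sketch using scikit-learn; the library, the example documents, and the labels are our substitutions, not the authors' original platform.

# Encode documents as word-frequency vectors, then train a binary
# SVM with a Gaussian (RBF) kernel for one topic, as in Figure 2.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

train_docs = ["stocks fell sharply", "the election campaign began",
              "bond markets rallied", "parliament passed the bill"]
train_labels = [1, 0, 1, 0]        # 1 = Finance, 0 = non-relevant

vectorizer = CountVectorizer()     # each component = a word's frequency
X = vectorizer.fit_transform(train_docs)

clf = SVC(kernel='rbf')            # Gaussian kernel, cf. Section 6
clf.fit(X, train_labels)

test = vectorizer.transform(["markets and stocks were volatile"])
print(clf.predict(test))           # expected to land on the Finance side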

4. Acquisition of semantic relatedness using support vector machines


In this section we introduce the implemented algorithms for the acquisition of semantic relatedness from text corpora by means of a machine learning technique, namely the Support Vector Machine. As stated above, the Support Vector Machine is one of the major statistical learning models. It provides a means of text categorization by producing a decision surface that separates the training data samples into two classes (Figure 1). The resulting categories are then capable of grouping semantically related texts, and further of computing the degree of semantic relatedness among the texts by means of our developed algorithm.

Figure 3. Analyzing semantic relatedness among texts using SVM classifiers with the One-Against-All (OAA) technique

The algorithm employed for the acquisition of semantic relatedness among texts is based on multiple categorizing processes using SVM classifiers with the One-Against-All (OAA) method (see Figure 3). Instead of a numerical relatedness or similarity value, each of the measures that we tested returns two classes indicating the related/unrelated judgment required by the algorithm. We therefore set, for each measure, the relatedness threshold at which it separates texts of a higher level of semantic relatedness from those of a lower level. From the results of the measures, the degree of semantic relatedness of the tested texts can be obtained and recorded in order to produce the final report on the acquisition of semantic relatedness among the texts in the evaluated corpora. Figure 4 shows the resulting map of the textual relatedness evaluation using the SVM categorizing process: the most related texts were categorized into the S3 group, and in turn the less related texts were mapped into the S2, S1, and S0 text collections.

Figure 4. Resulting map of textual relatedness evaluation using SVM categorizing process
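The paper does not give the OAA procedure in code, so the following is a speculative sketch of a relatedness-level scheme of this kind using scikit-learn; the topics, threshold, and grouping rule are our assumptions.

# One binary SVM per topic (One-Against-All); counting how many
# topic classifiers judge a text as related yields a relatedness
# level akin to the S0-S3 groups of Figure 4.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

docs = ["stocks fell", "election results", "game score",
        "stock markets", "campaign rally", "team won the match"]
topics = [0, 1, 2, 0, 1, 2]        # 0 = Finance, 1 = Politics, 2 = Sports

vec = CountVectorizer()
X = vec.fit_transform(docs)
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, topics)

# decision_function returns one margin per per-topic classifier;
# each margin above the (hypothetical) threshold counts as 'related'.
scores = ovr.decision_function(vec.transform(["stock markets rally"]))[0]
level = int(sum(score > 0.0 for score in scores))
print(f"S{level}")                 # relatedness group, e.g. S1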

5. Related work
Text mining is a new interdisciplinary field. It combines the disciplines of data mining, information extraction, information retrieval, text categorization, machine learning, and computational linguistics to discover structure, patterns, and knowledge in large textual corpora. With the huge amount of information available online, the World Wide Web has become a fertile area for text mining research. Web content data include unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as tabular data in databases. However, much Web content data is unstructured text. The research on applying knowledge discovery techniques to unstructured text is termed knowledge discovery in texts (KDT) [1], text data mining [5], or text mining. Advances in computational resources and new statistical algorithms for text analysis have helped text mining develop as a field.

Recently, some innovative techniques have been developed for text mining. For example, Feldman uses text category labels (associated with Reuters newswire) to find unexpected patterns among text articles [1], [2], [3]. Text mining using self-organizing map (SOM) techniques has already gained some attention in the knowledge discovery and information retrieval fields. The paper of [13] perhaps marks the first attempt to utilize SOM (unsupervised neural networks) for an information retrieval task; in that paper, however, the document representation is made from 25 manually selected index terms and is thus not really realistic. In addition, among the most influential work we certainly have to mention WEBSOM [6], [8], [9]. That work aims at constructing methods for exploring full-text document collections; WEBSOM started from Honkela's suggestion of using self-organizing semantic maps [17] as a preprocessing stage for encoding documents. Such maps are, in turn, used to automatically organize (i.e., cluster) documents according to the words that they contain. When the documents are organized on a map, following the steps of the preprocessing stage, in such a way that nearby locations contain similar documents, exploration of the collection is facilitated by the intuitive neighborhood relations. Thus, users can easily navigate a word category map and zoom in on groups of documents related to a specific group of words.


6. Experimental results
In this work we develop several text classifiers using Support Vector Machine (SVM) methods to support relatedness measurement among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employ various SVM classifiers for the categorization of the target documents in order to evaluate relatedness. The experiments (Figure 5 and Figure 6) used a random set of relevant and non-relevant documents in a corpus. In Figure 5, we show the recall ratios for five topics (classes): Finance, Politics, Movies, Sports, and Tech. In the testing process, we first assume that topic number one is the relevant topic and all others are non-relevant; then we assume that topic number two is the relevant topic and all others are non-relevant, and so on. Recall is defined as the number of relevant documents actually retrieved, as a function of iteration, divided by the number of relevant documents in the collection. In order to compare the performance of various classifiers on the same topic, we cover SVM-based classifiers with various kernel functions (i.e., Gaussian, Exponential, and Polynomial) as well as other classifiers, including artificial neural network (ANN) and kNN algorithms. For each topic (class), we use a trained classifier to classify documents as relevant or non-relevant. From the results shown in Figure 5, the SVM classifier with a Gaussian kernel was superior to the other classifiers, including SVM classifiers with other kernel functions.
Figure 5. Resulting recall ratios

Class      ANN     kNN     SVM (Gaussian)   SVM (Exponential)   SVM (Polynomial)
Finance    91.13   78.57   99.68            54.43               86.70
Politics   92.37   84.75   97.46            39.24               82.28
Movies     91.74   79.63   98.37            39.87               73.51
Sports     90.02   76.54   94.18            40.56               74.69
Tech.      89.97   86.31   98.86            38.60               75.94

Also, the F1 values (Eq. 1) could be obtained as shown in Figure 6:

F1 = (2 × Precision × Recall) / (Precision + Recall)    (1)

Figure 6. Resulting F1 ratios, plotted against the dimension of the feature space (0-2000) for the Gaussian RBF, Exponential RBF, and Polynomial kernels
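For concreteness, here is a small sketch of how recall and the F1 of Eq. 1 can be computed for one topic, using sklearn.metrics as our choice of library and invented relevance judgments.

# Precision, recall, and F1 (Eq. 1) for one topic's binary judgments.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # gold relevant/non-relevant labels
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]   # classifier output

p = precision_score(y_true, y_pred)  # 4/5 = 0.80
r = recall_score(y_true, y_pred)     # 4/5 = 0.80
print(f1_score(y_true, y_pred))      # 0.80
print(2 * p * r / (p + r))           # Eq. 1 computed directly: 0.80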

7. Conclusions
This paper presents a hybrid text-mining approach for evaluating relatedness among texts. In this work we develop several text classifiers using Support Vector Machine (SVM) methods to support the acquisition of relatedness among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employ various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research problems, such as information filtering and re-categorizing the resulting documents of search engine queries. Experimental results show that the technique performs well in practice, successfully adapting the classification function of SVM-based classifiers to the acquisition of relatedness among texts.

8. References
[1] R. Feldman and I. Dagan, "KDT - Knowledge Discovery in Texts," in Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[2] R. Feldman, W. Klosgen, and A. Zilberstein, "Visualization Techniques to Explore Data Mining Results for Document Collections," in Proceedings of the Third Annual Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, 1997.
[3] R. Feldman, I. Dagan, and H. Hirsh, "Mining Text Using Keyword Distributions," Journal of Intelligent Information Systems, Vol. 10, pp. 281-300, 1998.
[4] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, MA, 1998.



[5] M.A. Hearst, "Untangling Text Data Mining," in Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[6] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "Newsgroup Exploration with WEBSOM Method and Browsing Interface," Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.
[7] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in Proceedings of the 10th European Conference on Machine Learning (ECML), Lecture Notes in Computer Science, Number 1398, pp. 137-142, Springer Verlag, 1998.
[8] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM - Self-Organizing Maps of Document Collections," Neurocomputing, Vol. 21, pp. 101-117, 1998.
[9] T. Kohonen, "Self-Organization of Very Large Document Collections: State of the Art," in Niklasson, L., Boden, M., and Ziemke, T., editors, Proceedings of ICANN'98, the 8th International Conference on Artificial Neural Networks, Vol. 1, London, pp. 65-74, Springer, 1998.
[10] C.H. Lee and H.C. Yang, "A Web Text Mining Approach Based on Self-Organizing Map," in Proceedings of the ACM CIKM'99 2nd Workshop on Web Information and Data Management (WIDM'99), Kansas City, Missouri, USA, pp. 59-62, 1999.
[11] C.H. Lee and H.C. Yang, "A Text Data Mining Approach Using a Chinese Corpus Based on Self-Organizing Map," in Proceedings of the 4th International Workshop on Information Retrieval with Asian Languages, Taipei, Taiwan, pp. 19-22, 1999.
[12] C.H. Lee and H.C. Yang, "A Multilingual Text Mining Approach Based on Self-Organizing Maps," Applied Intelligence, Vol. 18(3), pp. 295-310, 2003.
[13] X. Lin, D. Soergel, and G. Marchionini, "A Self-Organizing Semantic Map for Information Retrieval," in Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.
[14] G.A. Miller et al., "Five Papers on WordNet," CSL Report 43, Princeton University, 1990; revised August 1993.
[15] R. Rada et al., "Development and Application of a Metric on Semantic Nets," IEEE Transactions on Systems, Man, and Cybernetics, 19(1), pp. 17-30, February 1989.
[16] P. Resnik, "Using Information Content to Evaluate Semantic Similarity," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, pp. 448-453, 1995.

[17] H. Ritter and T. Kohonen, "Self-Organizing Semantic Maps," Biological Cybernetics, Vol. 61, pp. 241-254, 1989.
[18] R. Richardson and A.F. Smeaton, "Using WordNet in a Knowledge-based Approach to Information Retrieval," Working Paper CA-0395, School of Computer Applications, Dublin City University, 1995.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[20] H.C. Yang and C.H. Lee, "Automatic Hypertext Construction through a Text Mining Approach by Self-Organizing Maps," in Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001), Hong Kong, China, pp. 108-113, April 2001.

