
International Journal of Advanced Computer Science, Vol. 2, No. 4, Pp. 163-167, Apr. 2012.

Chinese Question Classification Using Multi-class SVM Based on Dependency Parsing


Zhang Wei, & Chen Junjie
Abstract: Question classification is a key component of a Chinese question answering system. In this paper, Support Vector Machines (SVMs) are used to classify Chinese questions. Four existing algorithms for multi-category support vector machines are presented and their performance is compared. The key to question classification lies in feature selection: four features of questions are selected and discussed in detail, using HowNet and dependency parsing. The four multi-class SVM algorithms are then applied to Chinese question classification in a series of contrast experiments. The results show that the binary-tree algorithm is more effective than the other algorithms for Chinese question classification.

Manuscript
Received: 25 Jun. 2011; Revised: 8 Dec. 2011; Accepted: 30 Mar. 2012; Published: 15 May 2012

Keywords
Support Vector Machine, feature selection, Chinese question classification, dependency relation, Multi-category

Many algorithms have been proposed to extend SVM to multi-class classification; they are called multi-category support vector machines (M-SVMs). In this paper we discuss the algorithms of multi-class SVMs and apply them to Chinese question classification. The feature model plays an important role in question classification; here we combine the sememes of HowNet with Chinese dependency parsing to construct the feature model of the question.

2. The Algorithms of Multi-class SVM and Analysis


The problem of multi-class classification can be stated formally: given n training samples belonging to m classes, (x1, y1), ..., (xn, yn), where xi ∈ R^d, i = 1, ..., n, and yi ∈ {1, ..., m}, construct from these samples a classification function that makes the error probability of classifying an unknown sample x as small as possible. The following algorithms use SVM to solve multi-class classification.

A. 1-v-r SVMs (one-versus-rest)

The earliest strategy for solving multi-class classification with SVM may be the 1-v-r SVMs (one-versus-rest) algorithm [4]. The method uses a two-class SVM classifier to distinguish each class from all the other classes in turn, yielding m classification functions. An unknown sample is assigned to the class whose classification function takes the maximum value. The method is simple and effective: since it trains only m classifiers and produces few classification functions, training time is modest, classification is fast, and the method can be applied to large-scale data. Its drawback is that every classification function is trained on the whole training set, so a large quadratic programming problem must be solved; training slows down dramatically as the number of training samples grows.

B. 1-v-1 SVMs (one-versus-one)

The method trains a classifier between every pair of classes, so there are m(m-1)/2 classification functions for an m-class problem. When an unknown sample is classified, every classifier judges its category and casts a vote for the corresponding class. Finally, the class

1. Introduction
In general, a question answering system is made up of three parts: question analysis, information retrieval and answer extraction [1]. Question classification is an important step in the question analysis module and a key factor in locating the answer and generating the answer extraction strategy [2]. Approaches to question classification fall into two broad classes: rule-based and statistics-based. Early approaches were mainly rule-based, with an accuracy of only 57.57% over seven defined categories. Statistics-based machine learning methods now dominate and show better performance: for instance, Dell Zhang et al. applied SVM to English question classification [2], and Li et al. used the SNoW algorithm with a hierarchical taxonomy [3]. The Support Vector Machine (SVM) was initially applied to two-class classification, where many research results have been obtained, but how to extend these two-class results to multi-class classification remains a problem that needs further investigation. Chinese question classification generally requires multi-class classification, and many algorithms have been proposed for it.
Zhang Wei is with the information center at Shanxi Medical College for Continuing Education. Chen Junjie is with the Department of Computer Science at Taiyuan University of Technology.


with the highest vote count is taken as the category of the unknown sample. This strategy is called the vote law [5], and multi-class SVMs using it are abbreviated as the 1-v-1 SVMs algorithm [6]. The idea is relatively simple, and because every SVM considers only the samples of two classes, each single SVM is easy to train. The method has the following faults, however: too many classification models are produced, which becomes more serious as the number of classes grows and greatly slows down both training and prediction; and a tie of votes may occur at test time, producing regions that are misclassified or rejected, so it can be difficult to decide which category a test sample belongs to.

C. Directed Acyclic Graph SVMs (DAG-SVMs)

Like 1-v-1 SVMs, DAG-SVM classification constructs a classification hyperplane between every two classes during training, i.e. m(m-1)/2 classifiers. During classification, however, the method arranges these classifiers as a rooted binary directed acyclic graph (Fig. 1).
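As a concrete (if simplified) sketch of the two strategies above, the following uses scikit-learn's LinearSVC as the binary SVM; the toy data and all names are illustrative, not from the paper's experiments:

```python
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

# Toy 3-class data standing in for question feature vectors (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 20)

# 1-v-r: m binary classifiers, each separating one class from the rest.
ovr = [LinearSVC(C=1.0).fit(X, (y == c).astype(int)) for c in range(3)]

def predict_1vr(x):
    # Assign the class whose classification function value is maximal.
    return int(np.argmax([clf.decision_function([x])[0] for clf in ovr]))

# 1-v-1: one classifier per class pair -> m*(m-1)/2 = 3 classifiers here.
ovo = {(a, b): LinearSVC(C=1.0).fit(X[(y == a) | (y == b)],
                                    y[(y == a) | (y == b)])
       for a, b in combinations(range(3), 2)}

def predict_vote(x):
    # Vote law: each pairwise classifier casts one ticket; most tickets wins.
    votes = np.zeros(3, dtype=int)
    for clf in ovo.values():
        votes[int(clf.predict([x])[0])] += 1
    return int(np.argmax(votes))
```

Note how the 1-v-r rule compares decision values, while the vote law only counts pairwise winners, which is where voting ties can arise.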

D. Algorithm of Binary Tree

Because SVM is designed for two-class problems, it can be combined with a binary decision tree to construct a multi-category classifier. The basic idea is: for training samples of m classes, train m-1 SVMs. First, SVM1 is trained with the samples of the first class as positive samples and the samples of classes 2, 3, ..., m as negative samples. Continuing by the same rule, the (m-1)th SVM is trained with the samples of class m-1 as positive samples and the samples of class m as negative samples. The number of classifiers to construct can be determined from the properties of the binary tree. The classification diagram is shown in Fig. 2.

Fig. 2 Classification diagram with method of binary-tree.
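The m-1 cascade described above can be sketched as follows; the data and class count are toy stand-ins, not the paper's setup:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 3-class data standing in for question feature vectors (hypothetical).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 20)

# Binary-tree cascade: SVM_i separates class i (positive) from classes i+1..m-1,
# so m-1 = 2 classifiers suffice for m = 3 classes.
cascade = []
for i in range(2):
    mask = y >= i                      # classes already decided are excluded
    cascade.append(LinearSVC(C=1.0).fit(X[mask], (y[mask] == i).astype(int)))

def predict_cascade(x):
    # Walk down the cascade; stop at the first classifier that says "positive".
    for i, clf in enumerate(cascade):
        if clf.predict([x])[0] == 1:
            return i
    return len(cascade)                # fell through every test: last class
```

A test sample uses between 1 and m-1 classifiers, which is the source of the speed advantage discussed in the text.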

Fig. 1 Classification diagram with method of DAG.

In Fig. 1 there are m(m-1)/2 internal nodes and m leaves, with one classifier per internal node. Every node is connected to two nodes (or leaves) in the next layer. To classify an unknown sample, we begin at the root node and, according to each node's classification result, continue with the left or right node of the next layer until we reach a leaf in the bottom layer; the category that the leaf represents is the category of the unknown sample. The advantage of the method is prediction speed: only m-1 SVM decisions are needed to classify a sample, fewer than the m(m-1)/2 required by 1-v-1 SVMs and the m required by 1-v-r SVMs. The following faults remain: too many classification models are generated and training is slow; and the classification accuracy depends on the specific ordering of the categories in the DAG. Different orderings, and in particular different choices of root node, lead to different classification results.
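A common way to implement the DAG traversal is as a candidate-elimination list: each pairwise decision removes one class, so exactly m-1 classifiers are evaluated. This sketch reuses pairwise LinearSVC classifiers on toy data (all names and data are illustrative):

```python
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

# Toy data; in the paper the vectors would be the question features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 20)

# One pairwise classifier per class pair, as in 1-v-1 training.
pair_clfs = {(a, b): LinearSVC(C=1.0).fit(X[(y == a) | (y == b)],
                                          y[(y == a) | (y == b)])
             for a, b in combinations(range(3), 2)}

def predict_dag(x):
    # DAG evaluation: compare the first and last remaining candidates,
    # eliminate the loser, repeat until one class survives.
    remaining = [0, 1, 2]
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = int(pair_clfs[(a, b)].predict([x])[0])
        if winner == a:
            remaining.pop()        # class b eliminated
        else:
            remaining.pop(0)       # class a eliminated
    return remaining[0]
```

The order of `remaining` fixes the DAG structure; permuting it changes which classifiers are consulted, which is exactly the ordering sensitivity the text describes.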

The advantage of the method is its simple structure; it avoids the unclassifiable regions from which the 1-v-1 and 1-v-r algorithms suffer, and for an m-class problem only m-1 binary classifiers need to be constructed, with the scale of the quadratic programming problems decreasing level by level. At test time it is unnecessary to traverse the whole binary tree: classification proceeds from the root node toward the predicted category, the number of SVM classification functions evaluated lies between 1 and the depth of the tree, and classification is relatively fast, saving testing time. The approach has a shortcoming as well: different orderings of the categories generate different binary trees, and the structure of the tree strongly influences the performance of the SVM classifiers and the generalization ability of the whole model. It is therefore important to generate a reasonable tree structure. The four algorithms above are applied to Chinese question classification in this paper. SVM is chosen because, compared with traditional classification methods, it has the following characteristics and advantages: SVM is built on the structural risk minimization principle, which trades off between
International Journal Publishers Group (IJPG)

Wei et al.: Chinese Question Classification Using Multi-class SVM Based on Dependency Parsing.


empirical risk and the confidence interval; it has strong learning capacity and generalization performance; and it can effectively handle problems involving small samples, non-linearity, high dimensionality and local minima, performing classification, regression and density estimation. In research on Chinese question classification in a restricted domain there is no ready-made question base to use, and the question sample set used in our experiments is small; for machine-learning problems with small samples, SVM has an advantage that other machine-learning algorithms lack. Previous results show that SVM has achieved good results in English question classification, which provides an important reference for research on Chinese question classification.

Words and symbols of little classification value, such as punctuation, Arabic numerals and empty words, are collected in the stop word list, and every word appearing in the list is filtered out as a stop word. Because questions are relatively short and carry less word-based information, question classification differs from traditional text classification. In this paper we selected four features of questions and discuss them in detail in the following sections.

A. Interrogative Word

First, an interrogative table Q was created. The given question was then processed by word segmentation and part-of-speech tagging, and the word belonging to table Q was extracted from the sentence as the first feature for question classification.

B. Primary Sememe of First-degree and Second-degree Dependent Words of the Interrogative Word

The main element of the syntactic structure of dependency grammar is the dependency relationship, i.e. a binary relation over word pairs in a sentence: one word is recorded as the head word and the other as the dependent word. This paper used the syntactic parser provided by the Center for Information Retrieval, Harbin Institute of Technology (HIT). The first-degree dependent word of the interrogative word is the word that has a dependency relation with the interrogative word; the second-degree dependent word is the word that has a dependency relation with the first-degree dependent word. Fig. 3 shows an example question: the first-degree dependent word is linked directly to the interrogative word, and the second-degree dependent word is linked to the first-degree dependent word.

3. The Feature Selection of the Question


There is at present no standard for Chinese question classification. Based on the features of objects in the hospital domain, we define a Chinese question classification hierarchy by analyzing the distribution of practical questions, referring also to the English question classification standard of TREC (Text REtrieval Conference) and some existing foreign question classification hierarchies. Table 1 gives the Chinese question classification hierarchy of the hospital domain. The hierarchy contains 6 coarse classes and a number of fine classes; the fine classes allow the question answering system to return more accurate answers.
TABLE 1 CHINESE QUESTION CLASSIFICATION HIERARCHY OF THE HOSPITAL DOMAIN

Coarse Class   Fine Classes
HUMAN          personage introduction, individual, description, group
LOCATION       district, city, address, place
NUMERIC        number, price, distance, age, area, code
TIME           date, time
ENTITY         food, vehicle, technique, instrument, language, term
DESCRIPTION    reason, meaning, abbreviation, definition, manner

[Fig. 3, not reproduced: dependency parse of an example question, with arcs labeled Root, HED, SBV, VOB, ATT and DE.]
Fig. 3 Example of dependency parsing.
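For programmatic use, the hierarchy of Table 1 can be written as a simple mapping from coarse class to fine classes (class names transcribed from the table; the helper function is illustrative, not from the paper):

```python
# Chinese question classification hierarchy of the hospital domain (Table 1),
# as a coarse-class -> fine-classes mapping.
HIERARCHY = {
    "HUMAN":       ["personage introduction", "individual", "description", "group"],
    "LOCATION":    ["district", "city", "address", "place"],
    "NUMERIC":     ["number", "price", "distance", "age", "area", "code"],
    "TIME":        ["date", "time"],
    "ENTITY":      ["food", "vehicle", "technique", "instrument", "language", "term"],
    "DESCRIPTION": ["reason", "meaning", "abbreviation", "definition", "manner"],
}

def coarse_of(fine):
    # Look up the coarse class that a fine class belongs to.
    for coarse, fines in HIERARCHY.items():
        if fine in fines:
            return coarse
    return None
```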

The key to question classification lies in feature selection. The first step is the pretreatment of the question, which includes Chinese word segmentation, part-of-speech tagging and stop word removal. Chinese word segmentation divides a continuous sequence of Chinese characters into individual words; here we use ICTCLAS, an integrated Chinese word segmentation and part-of-speech tagging toolkit developed by the Institute of Computing Technology, Chinese Academy of Sciences. A stop word list was then constructed for the needs of this paper; it collects valueless words and symbols such as punctuation, Arabic numerals and empty words.

In HowNet, the definition of a concept is described in the Knowledge Database Mark-up Language (KDML) through a DEF semantic expression, which describes the detailed semantic characteristics of the word. The first sememe of a word in HowNet is the first sememe appearing in the word's DEF definition. The first sememe is the main description of the word's semantics and best expresses the main semantic information of the concept the word corresponds to. It abstracts the word to some extent, but the abstraction level may be so high that it harms the question classification result. For example, two different time words can share the same first sememe, time|. Although the abstraction is made


well, the two time words can no longer be distinguished, and the classification result of the question suffers. To address this, the paper extracts the primary sememe (from HowNet) of the first-degree and second-degree dependent words of the interrogative word as a classification feature. But a word may have many concepts in HowNet, so selecting the correct concept matters; this involves word sense disambiguation. Since most first-degree and second-degree dependent words are nouns and predicate verbs, which are largely monosemous or have few senses, we propose a simple word sense disambiguation method with the following steps:

Step 1: Word segmentation, POS tagging and dependency parsing are performed to extract the first-degree and second-degree dependent words of the interrogative word, and simple disambiguation is done using POS: HowNet is searched by the POS of word W; if a concept is found, go to Step 2; otherwise HowNet is searched by the word W itself, and if a concept is found, go to Step 2.

Step 2: If exactly one concept is found, disambiguation ends; otherwise proceed to Step 3.

Step 3: Each candidate concept of word W is assigned an initial score. The sememes of each concept of W are compared with the sememes of every concept of the word W1 that has a dependency relation with W; whenever one of the 16 relations defined in HowNet appears, the score of the corresponding concept is increased. Continue to Step 4.
Step 4: If the score of every candidate concept of word W still equals the initial value, the concept whose first sememe appears most often is selected; otherwise, the concept with the highest score is selected as the result of word sense disambiguation.

After these four steps a unique concept is obtained, and the following strategy is used to extract the primary sememe of the concept:

Step 1: If there is only one sememe in the DEF of the concept, that sememe is selected as the primary sememe; otherwise continue to Step 2.

Step 2: If there are two sememes in the DEF of the concept, both are selected as primary sememes; otherwise continue to Step 3.

Step 3: If the second position of the DEF contains a HowNet relation sign such as ^, %, &, ?, *, $, @, #, =, {, }, (, ), !, <, or >, the sememes in the first and third positions are selected as primary sememes; otherwise, the sememes in the first and second positions are selected.
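The three extraction rules can be sketched as follows; the DEF representation (a list of sememe tokens) and the sememe strings are simplified placeholders, not real HowNet entries:

```python
# HowNet relation signs listed in the paper (used in Step 3).
RELATION_SIGNS = set("^%&?*$@#={}()!<>")

def primary_sememes(def_sememes):
    # def_sememes: the sememe tokens of a concept's DEF, in order.
    if len(def_sememes) == 1:
        return list(def_sememes)                # Step 1: single sememe
    if len(def_sememes) == 2:
        return list(def_sememes)                # Step 2: take both
    # Step 3: a relation sign in the second position means the second
    # token is a relational modifier, so take the first and third instead.
    if def_sememes[1] and def_sememes[1][0] in RELATION_SIGNS:
        return [def_sememes[0], def_sememes[2]]
    return [def_sememes[0], def_sememes[1]]
```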

C. Named Entity

A named entity is a noun phrase with an exact meaning in the sentence, including personal names, place names, organization names, proper names, and phrases of time, date or quantity. Every named entity carries strong semantic information closely related to question classification, so named entities are selected as a semantic feature. We used the Chinese named entity recognizer developed by the Center for Information Retrieval, Harbin Institute of Technology (HIT). For Chinese questions, some named entities can harm classification: in the example discussed in the paper, a word tagged as a place name (S-Ns) occurs in a question whose correct class is not LOCATION, so that entity brings noise into classification. We therefore selected only the named entities of the subject, predicate, object and the first-degree and second-degree dependent words as classification features; the feature vector of the example question then contains items such as S-Nh and ResultFrom|.

D. Singular/Plural

The singular/plural feature mainly serves the enumerated classes of the classification hierarchy, such as time enumeration. If the numeral that has a dependency relation with the interrogative word, or with its first-degree or second-degree dependent word, denotes one, then singular is selected as the feature; otherwise plural is selected. For every question there are 1703 feature items altogether: items 1-64 represent the interrogative word, 65-83 the named entities, 84-85 singular/plural, and 86-1703 the sememes. The values of the feature items are Boolean: if the corresponding feature item appears, its value is 1; otherwise it is 0.
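The Boolean feature vector just described can be sketched directly; the specific indices used in the example are illustrative, not taken from the paper's feature tables:

```python
def boolean_vector(active_items, n_features=1703):
    """Boolean question feature vector as described in the paper:
    items 1-64 interrogative word, 65-83 named entity, 84-85
    singular/plural, 86-1703 sememes. `active_items` holds the
    1-based indices of the feature items present in the question."""
    vec = [0] * n_features
    for idx in active_items:
        vec[idx - 1] = 1          # feature present -> 1, absent -> 0
    return vec

# Hypothetical question with an interrogative word (item 3), a named
# entity (item 70), singular (item 84) and one sememe (item 120) active.
v = boolean_vector({3, 70, 84, 120})
```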

4. Experiment
For Chinese question classification there are no unified standard training and testing question datasets. A question set is a collection of sentences whose types are labeled according to the question classification hierarchy. The question datasets in this paper come from two sources: (1) related questions in the hospital domain collected from the Internet and adapted; (2) questions translated from the TREC question datasets and transformed. In all, 600 questions in the hospital domain were established as the question dataset, and all of them were manually labeled with coarse- and fine-grained categories beforehand. We selected 400 questions as the training dataset and 200 questions as the testing dataset. Many studies have shown that Support Vector Machine (SVM) classifiers outperform other classifiers [7], so we applied SVM classifiers to the Chinese question classification experiments. We used the four SVM multi-class classification methods above in the classification experiments in order to compare the


performance of the various SVM multi-class classification algorithms. We used the LIBSVM toolbox developed by Chih-Jen Lin as the classification tool. The radial basis function (RBF) kernel was used as the kernel function of the SVM classifier; it outperformed the linear, polynomial and sigmoid kernels. The specific classification process is shown in Fig. 4.
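As an illustration of this setup (not the paper's exact configuration or data), scikit-learn's SVC wraps LIBSVM and supports the RBF kernel; LIBSVM itself trains one-versus-one classifiers internally:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the question feature vectors and labels.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)

# kernel='rbf' matches the paper's kernel choice; in practice C and gamma
# would be tuned on held-out questions rather than left at defaults.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
pred = clf.predict([[2.0, 2.0]])
```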
[Fig. 4, not reproduced: training and testing datasets each pass through preprocessing and feature extraction to produce feature vectors of questions; the training vectors build the training model, which yields the classification results for the test vectors.]
Fig. 4 Classification process.

Table 2 shows the experimental results of SVM multi-class classification. Among the four SVM multi-class classification methods, the classification accuracy of one-versus-rest, one-versus-one, DAG and binary-tree classification increases in turn; the binary-tree classifier performs best, which confirms the earlier theoretical analysis of SVM multi-class classification.

TABLE 2 COMPARISON OF QUESTION CLASSIFICATION ACCURACY USING SVM MULTI-CLASS CLASSIFICATION

SVM multi-class method   Coarse-class accuracy (%)   Fine-class accuracy (%)
One versus rest          81.3                        77.5
One versus one           82.3                        78.1
DAG SVM                  84.2                        79.3
Binary-tree SVM          85.3                        83.4

References
[1] Z.SH. Fu, L. Ting, & Q. Bing, "Overview of Question-Answering," Journal of Chinese Information Processing, vol. 16, no. 6, pp. 46-52, 2002.
[2] D. Zhang & W. Lee, "Question Classification Using Support Vector Machines," Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Canada, pp. 26-32, 2003.
[3] X. Li & D. Roth, "Learning Question Classifiers," Proceedings of the 19th International Conference on Computational Linguistics, Taiwan, Association for Computational Linguistics, pp. 556-562, 2002.
[4] L. Bottou, C. Cortes, & D. Denker, "Comparison of Classifier Methods: A Case Study in Handwriting Digit Recognition," International Conference on Pattern Recognition, IEEE Computer Society Press, pp. 77-87, 1994.
[5] J.H. Friedman, "Another Approach to Polychotomous Classification," Technical report, Stanford University, 1996. From: http://www-stat.stanford.edu/reports/friedman/poly.ps.
[6] U. Kressel, "Pairwise Classification and Support Vector Machines," in B. Scholkopf, C.J.C. Burges, & A.J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, The MIT Press, Cambridge, MA, pp. 255-268, 1999.
[7] L. Xia, Z. Teng, & F. Ren, "Question Classification in Chinese Restricted-Domain Based on SVM and Domain Dictionary," Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, China, 2008.
5. Conclusion
In general, a Chinese question answering (QA) system includes question classification, information retrieval and answer extraction. Question classification plays an important role in QA. To improve question classification accuracy, we compared four multi-category SVM classification algorithms, discussed their advantages and faults, and applied them to Chinese question classification. Experiments verified the analysis: the binary-tree classification algorithm performed best in Chinese question classification.

Zhang Wei was born in Fenyang, China, in 1968. He received the B.S. degree from Taiyuan University of Technology in 1990, and the M.S. and Ph.D. degrees from Shanxi University in 2003 and Taiyuan University of Technology in 2011, respectively. He is the author or coauthor of more than ten national and international papers and has collaborated in several research projects. Since 1995 he has been with the information center at Shanxi Medical College for Continuing Education, where he is currently an associate professor. His current research interests include question answering systems, natural language processing and data mining.

Chen Junjie was born in Dingzhou, China, in 1956. He received the B.S. degree in computer science from Shanghai Jiaotong University in 1982, and the M.S. and Ph.D. degrees from Shanghai Jiaotong University in 1984 and Beijing Institute of Technology in 2003, respectively. Since 1984 he has been with the Department of Computer Science at Taiyuan University of Technology, where he is currently a professor. His research interests include intelligent information processing, data mining, and databases.
