
Accurate kNN Chinese Text Classification via Multiple Strategies

Xiulan Hao¹, Chenghong Zhang², Xiaopeng Tao¹, Shuyun Wang¹, Yunfa Hu¹
¹ Department of Computing and Information Technology, Fudan University
² School of Management, Fudan University
No. 220, Handan Road, Shanghai, China, 200433
{hxl2221 cn,wang shuyun}@126.com, {chzhang,xptao,yfhu}@fudan.edu.cn

Abstract
Text classification is one of the means to understand text content. It is widely used in information retrieval, spam filtering, monitoring of malicious gossip, and blocking of pornographic and harmful messages. kNN is widely used in text categorization, but it suffers when the training data set is biased. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training data set is biased, almost all test documents of some rare (smaller) categories are classified into common (larger) ones by a traditional kNN classifier; in this case the performance of text classification cannot satisfy users' requirements. To alleviate this misfortune, we adopt two measures to boost the kNN classifier. First, we optimize the features by removing some candidate features. Second, we modify the traditional decision rules by integrating the number of training samples of each category into them. Exhaustive experiments illustrate that the adapted kNN achieves a significant improvement in classification performance on biased corpora.

1 Introduction

With more and more online documents available, how to retrieve, filter, monitor, and even block specific information accurately from this huge repository has become one of the hot topics in the natural language processing (NLP), pattern recognition, and knowledge discovery in databases (KDD) communities. Text categorization is one of the key techniques they resort to. As one of the main tasks in TREC (Text Retrieval Conference), TDT (Topic Detection and Tracking), and MUC (Message Understanding Conference), various text categorization methods have been evaluated; in turn, this evaluation has greatly promoted the advance of the technique.

As a simple and efficient approach to text categorization, kNN is widely used and obtains good results [1, 10, 11]. Its performance depends greatly on two factors: a suitable similarity function and an appropriate value for the parameter k. A biased distribution of the data set is one of the challenges in text categorization [8]. The main strategies to deal with this problem include feature optimization [2], modification of the traditional decision rules [4, 9], and re-sampling [6]. We take two measures to improve the performance of the kNN classifier. One is to optimize feature selection, which can be generalized to any Chinese text classifier. The other is to adapt the decision functions according to the number of training samples, solving the problem of larger classes overwhelming smaller ones. Experiments indicate that, with the training samples unchanged, the Macro-F1 and Micro-Recall of categorization rise dramatically. Our main contributions are to: put forward a strategy to optimize features; give formal definitions and properties of some concepts related to the CP (Critical Point); and verify their validity by exhaustive experiments on three corpora.

The remainder of the paper is organized as follows: Section 2 introduces feature selection for Chinese categorization. Section 3 gives notations and definitions about the CP. Section 4 explains our kNN categorization method. Section 5 reports the test results using this method. Section 6 concludes the findings.
This work was supported by the Natural Science Foundation of China (NSFC) under grant numbers 70471011 and 60473070. Chenghong Zhang is the corresponding author.

2 Feature Optimization for Chinese Categorization

Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007) 0-7695-2874-0/07 $25.00 2007

Concretely, in Chinese text categorization tasks, the two most important indexing units (feature terms) are the word and the character-bigram. They have quite different characteristics and influence classification performance in different ways. The loss and distortion of semantic information is a disadvantage of character-bigram features compared with word features [5]. Besides, character-bigram features tend to have a high dimensionality. In a kNN classifier, time is mainly spent computing the similarity between test samples and training samples, so reducing the dimensionality of the feature space is very important. Because the word scheme works better at low dimensionality, we choose it as our method of feature selection. According to [5], word segmentation precision impacts classification performance. ICTCLAS is one of the best word segmentation systems (SIGHAN 2003) and reaches a segmentation precision of more than 97%, so we choose it as our scheme for automatic word-indexing of documents. Furthermore, one-character words cover about 7% of words and more than 30% of word occurrences in the Chinese language [5]. Skimming the Contemporary Chinese Dictionary, one finds that one-character Chinese words tend to have more senses. Thus the impact of effective one-character words on classification is not as large as their total frequency suggests, and they are often too common and ambiguous to have good discriminative power. We adopt two methods to optimize the features. First, we remove one-character Chinese words from the features: because single Chinese characters tend to have more senses, i.e., they are usually more ambiguous and too common, they contribute little to distinguishing different categories. Second, English stop words are removed. In Chinese text, especially Chinese papers (which usually include an English abstract), many English stop words may occur, and these English stop words contribute nothing to text categorization.
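As a concrete illustration of the two removal steps above, the filter below drops one-character Chinese words (checked against the CJK Unified Ideographs range) and English stop words. This is a minimal sketch: the tiny stop-word list and the sample segmented document are illustrative stand-ins, not artifacts of the paper.

```python
# Sketch of the two feature-optimization steps, assuming documents have
# already been word-segmented (e.g., by ICTCLAS). The stop-word list
# here is a small illustrative stand-in, not a complete list.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def is_single_cjk_char(term: str) -> bool:
    """True for one-character Chinese words (CJK Unified Ideographs)."""
    return len(term) == 1 and "\u4e00" <= term <= "\u9fff"

def optimize_features(terms):
    """Drop one-character Chinese words and English stop words."""
    return [t for t in terms
            if not is_single_cjk_char(t)
            and t.lower() not in ENGLISH_STOP_WORDS]

# Example: a segmented document mixing Chinese words and English tokens.
doc = ["计算机", "的", "发展", "is", "the", "focus", "人"]
print(optimize_features(doc))  # → ['计算机', '发展', 'focus']
```

Multi-character words pass through untouched, so the filter only prunes the ambiguous one-character terms and English function words discussed above.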

3 Definitions and Notations

The following notations are used:

    Nj : sample size of category j in the training set
    x0 = min(N1, ..., Nm); y0 = max(N1, ..., Nm)
    xi = (x0)^(1/sf_i); yi = (y0)^(1/sf_i)
    Nji = (Nj)^(1/sf_i), j = 1, ..., m
    σi : standard deviation of Nji, j = 1, ..., m
    μi : mean of Nji, j = 1, ..., m
    ρi : ratio of yi to xi, i.e., yi = ρi xi, ρi ≥ 0

The Critical Point (CP) is a concept we proposed in [3]. We give formal definitions and properties of the CP explicitly here.

Definition 1. For a group of positive real numbers N1, N2, ..., Nm, suppose that its standard deviation is σ0 and its minimum value is x0. If x0 < σ0, then the distribution of this group of numbers is called biased (unbalanced or skewed).

Definition 2. For a group of positive real numbers N1, N2, ..., Nm with a biased distribution, exert an exponent operation on each Nj, that is, Nji = (Nj)^(1/sf_i), sf_i > 1. At a definite sf_i, xi will equal σi. We call this definite sf_i the Critical Point (CP).

Definition 3. In Definition 2, when the exponent operation is exerted on each Nj, each Nji will be less than Nj, i.e., for all j, Nji < Nj. Thus sf_i is called the Shrink Factor (SF).

Though there must be an sf_i satisfying xi = σi, it is hard to calculate it precisely. We can use a near-optimal value in its place and estimate the ratio ρi of yi to xi.

Property 1. For a group of positive real numbers N1, N2, ..., Nm with a biased distribution, at the CP the ratio ρi of yi to xi satisfies

    1 + 0.707m/(m+1) ≤ ρi ≤ 1 + 0.707m/(m−1)    (3.1)

Proof: See Section 3 of [3] for more details.

Definition 4. The upper boundary (UB) and lower boundary (LB) of the CP are two positive numbers satisfying: (1) LB ≤ CP < UB; (2) UB − LB = 0.5; (3) the fractional part of LB (and hence of UB) is 0.5 or 0.0.

Using Property 1 and these definitions, CP, UB, and LB can be computed. We have given an evaluation procedure in our previous work; see Algorithm 1 in [3].

4 Adapted kNN Categorization

Assume an n-dimensional vector X = (x1, x2, ..., xn) represents a document, and Ci = (X1^i, X2^i, ..., Xq^i) represents a category (also known as a class) containing q documents. Given a training set D consisting of m categories C1, C2, ..., Cm and a newly arriving document X, the kNN classifier computes the similarity of each document in D to X and finds the k neighbors nearest to X based on that similarity.
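The CP definitions suggest a simple numerical search: increase sf from 1 until the shrunk minimum x_i = x0^(1/sf) catches up with the standard deviation σ_i of the shrunk sizes, then bracket the result with half-integer bounds. The sketch below is our reading of those definitions, not the paper's Algorithm 1 (which is given in [3]); the grid step, the search cap, and the use of the population standard deviation are assumptions.

```python
import math
import statistics

def critical_point(sizes, step=0.01, max_sf=20.0):
    """Grid-search the smallest sf > 1 at which the shrunk minimum
    min(N_j)^(1/sf) reaches the standard deviation of the shrunk sizes
    N_j^(1/sf) (cf. Definition 2). Step size, search cap, and the use of
    the population standard deviation are assumptions of this sketch."""
    sf = 1.0 + step
    while sf < max_sf:
        shrunk = [n ** (1.0 / sf) for n in sizes]
        if min(shrunk) >= statistics.pstdev(shrunk):
            return sf
        sf += step
    raise ValueError("no critical point found below max_sf")

def lb_ub(cp):
    """Half-integer bracket of CP: LB <= CP < UB and UB - LB = 0.5."""
    lb = math.floor(cp * 2) / 2
    return lb, lb + 0.5

# Class sizes of DataSet1 (Table 1): a visibly skewed distribution.
sizes = [530, 669, 622, 502, 1070, 76, 784, 988, 1663, 99, 1329, 866]
cp = critical_point(sizes)
lb, ub = lb_ub(cp)
print(cp, lb, ub)
```

The `lb_ub` rounding mirrors the LB/UB convention of Definition 4; the exact CP value found depends on the step size and the standard-deviation variant chosen.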

4.1 Traditional Decision Rules

If there are ki documents among the k nearest neighbors belonging to category Ci, define two decision functions as follows.

Function 1. Assume fi(X) = ki, i = 1, 2, ..., m, and

    f(X) = arg max_j fj(X), j = 1, 2, ..., m    (4.1)

then X is classified into Cj, i.e., X ∈ Cj.

Function 2. Assume

    gi(X) = Σ_{l=1..ki} sim(X, X_l^i), i = 1, 2, ..., m    (4.2)

    g(X) = arg max_j gj(X), j = 1, 2, ..., m

then X is classified into Cj, i.e., X ∈ Cj, where sim(X, X_l^i) is the similarity between X and training sample X_l^i. The algorithm described by Function (4.2) is labeled Trad2. In our project, we found that the performance of kNN deteriorates when the training set has a biased distribution.

4.2 Modification of Decision Rules

To overcome the defect resulting from biased training samples, we redefine the decision functions by integrating the number of training documents of each category into them. By multiplying by the factor (minTrainNum)^(1/sf) / (k (Nj)^(1/sf)), Functions 4.1 and 4.2 are modified into Functions 4.3 and 4.4, respectively:

    f'(X) = arg max_j [ kj (minTrainNum)^(1/sf) / (k (Nj)^(1/sf)) ]    (4.3)

    g'(X) = arg max_j [ gj(X) (minTrainNum)^(1/sf) / (k (Nj)^(1/sf)) ]    (4.4)

Because minTrainNum ≤ Nj and 0 ≤ kj ≤ k, we have 0 ≤ f'(X) ≤ 1; likewise, because minTrainNum ≤ Nj and 0 ≤ gj(X) ≤ k, we have 0 ≤ g'(X) ≤ 1. Thus f'(X) and g'(X) are normalized to [0, 1]. In Functions 4.3 and 4.4 the term (minTrainNum)^(1/sf) / k is the same for all categories, so they are equivalent to

    f'(X) = arg max_j [ kj / (Nj)^(1/sf) ]    (4.3')

    g'(X) = arg max_j [ gj(X) / (Nj)^(1/sf) ]    (4.4')

The algorithm represented by Function (4.4) is labeled SF = sf_i, where sf_i is the value of sf; for example, if sf takes 2.0, we denote it by SF = 2.0. For convenience, we use the cosine of the angle between two vectors to measure the similarity between two documents, although other similarity measures are possible. Our decision functions (Functions 4.3 and 4.4) happen to have components similar to those of [9], but we focus on finding an optimal exponent according to the distribution of the training data set.

Theorem 1. For a biased training set of text documents, the kNN text classifier balances larger classes and smaller classes and has the optimal discriminative power at sf = LB or sf = UB.

Proof: We prove the soundness of this theorem empirically.

5 Experiments and Evaluations

5.1 Datasets

To verify the validity of our decision functions, we use three data sets. DataSet1 was collected by our research group and is a collection of news and papers. The news was mainly downloaded from http://www.sina.com.cn and http://www.chinadaily.com.cn, and the papers mainly come from http://www.edu.cnki.net. The collection is composed of 13796 documents and can be divided into 12 categories. As can be seen in Table 1, the rare class with the minimum number of training samples is C29-Transport, with only 76 samples, and the standard deviation of DataSet1 is 442.5, so the distribution of DataSet1 is very skewed. The Reuter and TDT2 corpora employed by [9] can be used directly to show that the modification of the decision rules is sound. DataSet1 is also used to validate our feature optimization. To compare with [9], we choose 10,000 features and use Information Gain for feature selection. Three-fold cross validation [3, 9] is executed. Experiments are performed at the discrete points k = 5, ..., 100. For convenience, we use the mean of the 20 test points as comparison values. Because Equation 4.3 vs. Equation 4.1 yields conclusions similar to Equation 4.4 vs. Equation 4.2, we only provide the experimental results of the latter. Using Algorithm 1, we obtain the CP, LB, and UB of the three data sets, as listed in Table 2.

Table 2. Parameters of the 3 data sets

    Name of DataSet   CP    LB    UB
    DataSet1          2.1   2.0   2.5
    Reuter            3.7   3.5   4.0
    TDT2              3.7   3.5   4.0
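To make the decision rules concrete, here is a sketch contrasting Trad2 (Function 4.2) with the simplified adapted rule (Function 4.4'). The neighbor list, similarity values, and class sizes below are invented for illustration; in the paper the similarities are cosine values over the selected features.

```python
from collections import defaultdict

def decide(neighbors, class_sizes, sf=None):
    """Classify a document from its k nearest neighbors.

    neighbors   -- list of (label, similarity) pairs for the k nearest
                   training documents (cosine similarities in the paper).
    class_sizes -- {label: N_j}, the number of training documents.
    sf          -- shrink factor; None reproduces Trad2 (Function 4.2),
                   otherwise each class score g_j(X) is divided by
                   N_j^(1/sf) as in Function 4.4'.
    """
    scores = defaultdict(float)
    for label, sim in neighbors:
        scores[label] += sim  # g_j(X): sum of similarities per class
    if sf is not None:
        for label in scores:
            scores[label] /= class_sizes[label] ** (1.0 / sf)
    return max(scores, key=scores.get)

# A rare class (20 training docs) narrowly outscored by a common one
# (2000 docs) under Trad2 is recovered by the adapted rule.
sizes = {"rare": 20, "common": 2000}
nn = [("common", 0.9), ("common", 0.8), ("rare", 0.85), ("rare", 0.8)]
print(decide(nn, sizes))          # → common (1.70 > 1.65)
print(decide(nn, sizes, sf=2.0))  # → rare (1.65/√20 > 1.70/√2000)
```

The normalizing factor (minTrainNum)^(1/sf)/k is omitted here since, as noted above, it is the same for all categories and does not change the arg max.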

5.2 Evaluation Measures

The category assignments of a binary classifier can be evaluated using a contingency table (Table 3) for each category. Conventional performance measures are defined and computed from these contingency tables. These measures are recall (r), precision (p), and F1:

    r = a / (a + c), if a + c > 0, otherwise undefined;
    p = a / (a + b), if a + b > 0, otherwise undefined;
    F1 = 2rp / (r + p).

Table 3. A contingency table

                   Yes is correct   No is correct
    Assigned Yes   a                b
    Assigned No    c                d

Micro-average and macro-average scores are widely used to evaluate the performance of a classifier. Assume there are |C| categories; then

    Macro-F1 = (Σ_{c∈C} F1_c) / |C|

    Micro-r = (Σ_{c∈C} a_c) / (Σ_{c∈C} (a_c + c_c))

For the evaluation of single-label classifications, the F1-measure, precision, recall, and accuracy [7] have the same value under micro-averaging, so Micro-Recall in the following can be regarded as Micro-r, Micro-p, or Micro-F1.

Table 1. Statistical information of training samples in DataSet1

    Name of Class      No. of samples   Percentage
    C3-Art                  530            5.8
    C5-Education            669            7.3
    C7-History              622            6.8
    C11-Space               502            5.5
    C19-Computer           1070           11.6
    C29-Transport            76            0.8
    C31-Environment         784            8.5
    C32-Agriculture         988           10.7
    C34-Economy            1663           18.1
    C37-Military             99            1.1
    C38-Politics           1329           14.4
    C39-Sports              866            9.4

[Figure 1. Macro-F1 on DataSet1 (y-axis: Macro-F1, 78–87%; x-axis: values of k, 5–95; curves: SF=1.0, 1.5, 2.0, 2.5, 3.0, 4.0, and Trad2)]
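The evaluation measures above can be computed directly from per-category contingency counts. The two toy categories below are illustrative, not taken from the experiments.

```python
def f1(a, b, c):
    """F1 from one category's contingency counts (Table 3 layout):
    a = true positives, b = false positives, c = false negatives."""
    r = a / (a + c)
    p = a / (a + b)
    return 2 * r * p / (r + p)

def macro_f1(tables):
    """Average of per-category F1 over |C| categories."""
    return sum(f1(a, b, c) for a, b, c in tables) / len(tables)

def micro_recall(tables):
    """Pooled recall: sum of a over sum of (a + c)."""
    num = sum(a for a, _, c in tables)
    den = sum(a + c for a, _, c in tables)
    return num / den

# Two toy categories as (a, b, c) = (TP, FP, FN); values are invented.
tables = [(8, 2, 2), (1, 1, 3)]
print(round(macro_f1(tables), 3))     # → 0.567
print(round(micro_recall(tables), 3)) # → 0.643
```

Note how the rare second category drags the macro average down while barely moving the pooled micro score, which is exactly why a biased corpus can look fine on Micro-Recall yet poor on Macro-F1.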

5.3 Experiment 1: Modification of Decision Rules

We perform these experiments only on DataSet1; for Reuter and TDT2, we cite the results of [9] directly.

5.3.1 F1 Measure

As can be seen from Figure 1, when sf = 2.0, F1 reaches a peak of 85.308%; when sf = 2.5, F1 is 84.996%, the second-highest peak. At the same time, Trad2 hits a low of 81.032%. That is, sf = 2.0 beats Trad2 on DataSet1; note that 2.0 is the value of LB for DataSet1. A similar conclusion can be reached from Figure 2. According to [9], when the exponent takes 4.0, their algorithm NWKNN achieves its best results on Reuter and TDT2 and beats kNN by 10% on TDT2. Again, 4.0 is just the value of UB on these two data sets.

5.3.2 Micro-Recall

As indicated by Figures 3 and 4, when sf ≥ LB there are no appreciable differences among the Micro-Recall values. This suggests that our kNN classifier is improved without compromising the per-document average.

5.4 Experiment 2: Feature Optimization

5.4.1 F1 Measure

As can be seen in Figure 2, before feature optimization F1 reaches a peak of 78.662% at sf = 2.0 while Trad2 hits a low of 72.463%. After feature optimization, as indicated by Figure 1, F1 reaches a peak of 85.308% at sf = 2.0 while Trad2 hits a low of 81.032%. That is, the worst F1 score after optimization is better than the highest score before optimization, and the highest F1 after optimization is almost 13 percentage points higher than Trad2 before optimization.

5.4.2 Micro-Recall

As indicated by Figures 3 and 4, before feature optimization the highest Micro-Recall is 82.193%, reached at sf = 2.0, and Trad2, at 80.720%, is the worst. After feature optimization, Micro-Recall is 88.078% at sf = 2.0 while Trad2 is 87.272%. That is, sf = 2.0 after optimization beats Trad2 before optimization by 7.35 percentage points.

[Figure 2. Macro-F1 on DataSet1 before feature optimization (y-axis: Macro-F1, 66–82%; x-axis: values of k, 5–95; curves: SF=1.0–4.0 and Trad2)]

[Figure 3. Micro-Recall on DataSet1 (y-axis: Micro-Recall, 83–89%; x-axis: values of k, 5–95; curves: SF=1.0–4.0 and Trad2)]

[Figure 4. Micro-Recall on DataSet1 before feature optimization (y-axis: Micro-Recall, 76–85%; x-axis: values of k, 5–95; curves: SF=1.0–4.0 and Trad2)]

6 Conclusion

On all three corpora, the F1 measure reaches its peak at sf = LB or sf = UB. Though only DataSet1 gets its highest F1 score at sf = LB, we recommend choosing LB, because Macro-Recall and the recall of rare categories tend to score better at LB than at UB, and rare categories are sometimes the more useful ones. (Note: results on Reuter and TDT2 are quoted from [9].) The modification of the decision rule can also be used in English text classifiers; its effectiveness on Reuter justifies this argument. That both Macro-F1 and Micro-Recall are better after optimization than before testifies that our method is feasible. The feature selection method can be applied to other Chinese text classifiers. [12] showed that after 1-grams are removed from the features, the performance of both kNN and Naive Bayes classifiers improves. Our method differs from theirs in that we extract features after segmenting the text rather than using n-grams without segmentation, and the length of our features is not fixed. The effect of removing single Chinese characters or 1-grams is the same: both remove features that contribute little to categorization and improve classifier performance. The two strategies we propose are easy to implement and the results are encouraging. Combining the two methods, we obtain a dramatic improvement on DataSet1: 7.35 percentage points on Micro-Recall and almost 13 on F1.

Acknowledgement. We would like to thank the anonymous reviewers for their insightful comments.

References

[1] A. Cardoso-Cachopo and A. L. Oliveira. An empirical comparison of text categorization methods. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval, pages 183–196, 2003.
[2] E.-H. S. Han, G. Karypis, and V. Kumar. Text categorization using weight adjusted k-nearest neighbor classification. In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 53–65, 2001.
[3] X. Hao, X. Tao, C. Zhang, and Y. Hu. An effective method to improve kNN text classifier. In Proceedings of the 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007), 2007, to appear.
[4] B. Li, Q. Lu, and S. Yu. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing, 3:408–421, December 2004.
[5] J. Li, M. Sun, and X. Zhang. A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 545–552, 2006.
[6] R. Li and Y. Hu. Noise reduction to text categorization based on density for kNN. In Proceedings of the 2nd International Conference on Machine Learning and Cybernetics, Xi'an, China, November 2003.
[7] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, March 2002.
[8] J.-S. Su, B.-F. Zhang, and X. Xu. Advances in machine learning based text categorization. Journal of Software, 17(9):1848–1859, September 2006.
[9] S. Tan. Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4):667–671, 2005.
[10] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1:69–90, 1999.
[11] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR '99), pages 42–49, 1999.
[12] S. Zhou and J. Guan. Chinese documents classification based on n-grams. In Proceedings of CICLing 2002, LNCS 2276, pages 405–414, 2002.