

A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification
Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, Member, IEEE

Abstract—Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper,
we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document
set are grouped into clusters, based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each
cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired
number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature,
corresponding to a cluster, is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership
functions match closely with and describe properly the real distribution of the training data. Besides, the user need not specify the
number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be
avoided. Experimental results show that our method can run faster and obtain better extracted features than other methods.

Index Terms—Fuzzy similarity, feature clustering, feature extraction, feature reduction, text classification.

1 INTRODUCTION

IN text classification, the dimensionality of the feature vector is usually huge. For example, 20 Newsgroups [1] and Reuters21578 top-10 [2], which are two real-world data sets, both have more than 15,000 features. Such high dimensionality can be a severe obstacle for classification algorithms [3], [4]. To alleviate this difficulty, feature reduction approaches are applied before document classification tasks are performed [5]. Two major approaches, feature selection [6], [7], [8], [9], [10] and feature extraction [11], [12], [13], have been proposed for feature reduction. In general, feature extraction approaches are more effective than feature selection techniques, but are more computationally expensive [11], [12], [14]. Therefore, developing scalable and efficient feature extraction algorithms is highly demanded for dealing with high-dimensional document data sets.

Classical feature extraction methods aim to convert the representation of the original high-dimensional data set into a lower-dimensional data set by a projecting process through algebraic transformations. For example, Principal Component Analysis [15], Linear Discriminant Analysis [16], Maximum Margin Criterion [12], and the Orthogonal Centroid algorithm [17] perform the projection by linear transformations, while Locally Linear Embedding [18], ISOMAP [19], and Laplacian Eigenmaps [20] do feature extraction by nonlinear transformations. In practice, linear algorithms are in wider use due to their efficiency. Several scalable online linear feature extraction algorithms [14], [21], [22], [23] have been proposed to improve the computational complexity. However, the complexity of these approaches is still high. Feature clustering [24], [25], [26], [27], [28], [29] is one of the effective techniques for feature reduction in text classification. The idea of feature clustering is to group the original features into clusters with a high degree of pairwise semantic relatedness. Each cluster is treated as a single new feature, and, thus, feature dimensionality can be drastically reduced.

The first feature extraction method based on feature clustering was proposed by Baker and McCallum [24], which was derived from the "distributional clustering" idea of Pereira et al. [30]. Al-Mubaid and Umair [31] used distributional clustering to generate an efficient representation of documents and applied a learning logic approach for training text classifiers. The Agglomerative Information Bottleneck approach was proposed by Tishby et al. [25], [29]. The divisive information-theoretic feature clustering algorithm was proposed by Dhillon et al. [27], which is an information-theoretic feature clustering approach, and is more effective than other feature clustering methods. In these feature clustering methods, each new feature is generated by combining a subset of the original words. However, difficulties are associated with these methods. A word is exactly assigned to a subset, i.e., hard-clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small. Also, the mean and the variance of a cluster are not considered when similarity with respect to the cluster is computed. Furthermore, these methods require the number of new features to be specified in advance by the user.

We propose a fuzzy similarity-based self-constructing feature clustering algorithm, which is an incremental feature clustering approach to reduce the number of features for the text classification task. The words in the feature vector of a document set are represented as distributions, and processed one after another.

The authors are with the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 804, Taiwan. E-mail: jungyi@water.ee.nsysu.edu.tw, ericliu0715@gmail.com, leesj@mail.ee.nsysu.edu.tw.
Manuscript received 18 Feb. 2009; revised 27 July 2009; accepted 4 Nov. 2009; published online 26 July 2010. Recommended for acceptance by K.-L. Tan. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2009-02-0077. Digital Object Identifier no. 10.1109/TKDE.2010.122.

Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is created for this word. Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. Three ways of weighting, hard, soft, and mixed, are introduced. By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experiments on real-world data sets show that our method can run faster and obtain better extracted features than other methods.

The remainder of this paper is organized as follows: Section 2 gives a brief background about feature reduction. Section 3 presents the proposed fuzzy similarity-based self-constructing feature clustering algorithm. An example, illustrating how the algorithm works, is given in Section 4. Experimental results are presented in Section 5. Finally, we conclude this work in Section 6.

2 BACKGROUND AND RELATED WORK

To process documents, the bag-of-words model [32], [33] is commonly used. Let D = {d_1, d_2, ..., d_n} be a document set of n documents, where d_1, d_2, ..., d_n are individual documents, and each document belongs to one of the classes in the set {c_1, c_2, ..., c_p}. If a document belongs to two or more classes, then two or more copies of the document with different classes are included in D. Let the word set W = {w_1, w_2, ..., w_m} be the feature vector of the document set. Each document d_i, 1 <= i <= n, is represented as d_i = <d_{i1}, d_{i2}, ..., d_{im}>, where each d_{ij} denotes the number of occurrences of w_j in the ith document. The feature reduction task is to find a new word set W' = {w'_1, w'_2, ..., w'_k}, k << m, such that W and W' work equally well for all the desired properties with D. After feature reduction, each document d_i is converted into a new representation d'_i = <d'_{i1}, d'_{i2}, ..., d'_{ik}>, and the converted document set is D' = {d'_1, d'_2, ..., d'_n}. If k is much smaller than m, the computation cost of subsequent operations on D' can be drastically reduced.

2.1 Feature Reduction

In general, there are two ways of doing feature reduction: feature selection and feature extraction. By feature selection approaches, a new feature set W' = {w'_1, w'_2, ..., w'_k} is obtained, which is a subset of the original feature set W. Then W' is used as input for classification tasks. Information Gain (IG) is frequently employed in the feature selection approach [10]. It measures the reduced uncertainty by an information-theoretic measure and gives each word a weight. The weight of a word w_j is calculated as follows:

IG(w_j) = -\sum_{l=1}^{p} P(c_l)\log P(c_l) + P(w_j)\sum_{l=1}^{p} P(c_l|w_j)\log P(c_l|w_j) + P(\bar{w}_j)\sum_{l=1}^{p} P(c_l|\bar{w}_j)\log P(c_l|\bar{w}_j),   (1)

where P(c_l) denotes the prior probability for class c_l, P(w_j) denotes the prior probability for feature w_j, P(\bar{w}_j) is identical to 1 - P(w_j), and P(c_l|w_j) and P(c_l|\bar{w}_j) denote the probability for class c_l with the presence and absence, respectively, of w_j. The words of top k weights in W are selected as the features in W'.
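As a concrete illustration of (1), the following Python sketch (ours, not part of the original paper; NumPy, the treatment of a word's presence or absence as the conditioning event, and the small epsilon guarding the logarithms are all assumptions) computes an IG weight for every word from a document-term count matrix and class labels; the k words with the largest weights would then form W':

    import numpy as np

    def information_gain(X, y, num_classes):
        """Sketch of (1): one IG weight per word, from a document-term
        count matrix X (n documents x m words) and labels y in {0..p-1}."""
        n, m = X.shape
        present = X > 0                                  # whether each document contains the word
        p_w = present.mean(axis=0)                       # P(w_j): fraction of documents containing w_j
        eps = 1e-12                                      # guard against log(0)
        p_c = np.array([(y == c).mean() for c in range(num_classes)])
        entropy = -np.sum(p_c * np.log(p_c + eps))       # -sum_l P(c_l) log P(c_l)
        ig = np.zeros(m)
        for j in range(m):
            for mask, weight in ((present[:, j], p_w[j]), (~present[:, j], 1.0 - p_w[j])):
                if mask.sum() == 0:
                    continue
                p_c_given = np.array([(y[mask] == c).mean() for c in range(num_classes)])
                ig[j] += weight * np.sum(p_c_given * np.log(p_c_given + eps))
        return entropy + ig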
In feature extraction approaches, extracted features are obtained by a projecting process through algebraic transformations. An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. Let a corpus of documents be represented as an m x n matrix X in R^{m x n}, where m is the number of features in the feature set and n is the number of documents in the document set. IOC tries to find an optimal transformation matrix F in R^{m x k}, where k is the desired number of extracted features, according to the following criterion:

F = \arg\max \operatorname{trace}(F^T S_b F),   (2)

where F in R^{m x k} and F^T F = I, and

S_b = \sum_{q=1}^{p} P(c_q)(M_q - M_{all})(M_q - M_{all})^T,   (3)

with P(c_q) being the prior probability for a pattern belonging to class c_q, M_q being the mean vector of class c_q, and M_{all} being the mean vector of all patterns.

2.2 Feature Clustering

Feature clustering is an efficient approach for feature reduction [25], [29], which groups all features into some clusters, where features in a cluster are similar to each other. The feature clustering methods proposed in [24], [25], [27], [29] are "hard" clustering methods, where each word of the original features belongs to exactly one word cluster. Therefore, each word contributes to the synthesis of only one new feature. Each new feature is obtained by summing up the words belonging to one cluster. Let D be the matrix consisting of all the original documents with m features and D' be the matrix consisting of the converted documents with the new k features. The new feature set W' = {w'_1, w'_2, ..., w'_k} corresponds to a partition {W_1, W_2, ..., W_k} of the original feature set W, i.e., W_t \cap W_q = \emptyset for 1 <= q, t <= k and t != q. Note that a cluster corresponds to an element in the partition. Then, the tth feature value of the converted document d'_i is calculated as follows:

d'_{it} = \sum_{w_j \in W_t} d_{ij},   (4)

which is a linear sum of the feature values in W_t.
The divisive information-theoretic feature clustering (DC) algorithm, proposed by Dhillon et al. [27], calculates the distributions of words over classes, P(C|w_j), 1 <= j <= m, where C = {c_1, c_2, ..., c_p}, and uses Kullback-Leibler divergence to measure the dissimilarity between two distributions. The distribution of a cluster W_t is calculated as follows:

P(C|W_t) = \sum_{w_j \in W_t} \frac{P(w_j)}{\sum_{w_j \in W_t} P(w_j)} P(C|w_j).   (5)

The goal of DC is to minimize the following objective function:

\sum_{t=1}^{k} \sum_{w_j \in W_t} P(w_j)\, KL\big(P(C|w_j), P(C|W_t)\big),   (6)

which takes the sum over all the k clusters, where k is specified by the user in advance.
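For reference, the DC objective can be evaluated with a short sketch like the one below (our own illustration, not code from [27]); it assumes the word distributions P(C|w_j), the word priors P(w_j), and a hard partition of the word indices are already available as NumPy arrays:

    import numpy as np

    def dc_objective(P_C_given_w, P_w, partition):
        """Sketch of (5)-(6): evaluate the DC objective for a fixed hard partition.
        P_C_given_w: m x p array, row j is P(C|w_j); P_w: length-m word priors;
        partition: list of k index arrays, one per cluster W_t."""
        eps = 1e-12
        total = 0.0
        for members in partition:
            weights = P_w[members] / P_w[members].sum()          # P(w_j) / sum of P(w_j) over W_t
            P_C_given_cluster = weights @ P_C_given_w[members]   # (5): mixture distribution of the cluster
            for j in members:
                p, q = P_C_given_w[j], P_C_given_cluster
                kl = np.sum(p * np.log((p + eps) / (q + eps)))   # KL(P(C|w_j), P(C|W_t))
                total += P_w[j] * kl                             # (6): weighted sum over words and clusters
        return total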
3 OUR METHOD

There are some issues pertinent to most of the existing feature clustering methods. First, the parameter k, indicating the desired number of extracted features, has to be specified in advance. This gives a burden to the user, since trial-and-error has to be done until the appropriate number of extracted features is found. Second, when calculating similarities, the variance of the underlying cluster is not considered. Intuitively, the distribution of the data in a cluster is an important factor in the calculation of similarity. Third, all words in a cluster have the same degree of contribution to the resulting extracted feature. Sometimes, it may be better if more similar words are allowed to have bigger degrees of contribution. Our feature clustering algorithm is proposed to deal with these issues.

Suppose we are given a document set D of n documents d_1, d_2, ..., d_n, together with the feature vector W of m words w_1, w_2, ..., w_m and p classes c_1, c_2, ..., c_p, as specified in Section 2. We construct one word pattern for each word in W. For word w_i, its word pattern x_i is defined, similarly as in [27], by

x_i = <x_{i1}, x_{i2}, ..., x_{ip}> = <P(c_1|w_i), P(c_2|w_i), ..., P(c_p|w_i)>,   (7)

where

P(c_j|w_i) = \frac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}},   (8)

for 1 <= j <= p. Note that d_{qi} indicates the number of occurrences of w_i in document d_q, as described in Section 2. Also, \delta_{qj} is defined as

\delta_{qj} = \begin{cases} 1, & \text{if document } d_q \text{ belongs to class } c_j, \\ 0, & \text{otherwise.} \end{cases}   (9)

Therefore, we have m word patterns in total. For example, suppose we have four documents d_1, d_2, d_3, and d_4 belonging to c_1, c_1, c_2, and c_2, respectively. Let the occurrences of w_1 in these documents be 1, 2, 3, and 4, respectively. Then, the word pattern x_1 of w_1 is:

P(c_1|w_1) = (1x1 + 2x1 + 3x0 + 4x0) / (1 + 2 + 3 + 4) = 0.3,
P(c_2|w_1) = (1x0 + 2x0 + 3x1 + 4x1) / (1 + 2 + 3 + 4) = 0.7,   (10)
x_1 = <0.3, 0.7>.

It is on these word patterns that our clustering algorithm will work. Our goal is to group the words in W into clusters, based on these word patterns. A cluster contains a certain number of word patterns and is characterized by the product of one-dimensional Gaussian functions. Gaussian functions are adopted because of their superiority over other functions in performance [34], [35]. Let G be a cluster containing q word patterns x_1, x_2, ..., x_q. Let x_j = <x_{j1}, x_{j2}, ..., x_{jp}>, 1 <= j <= q. Then the mean m = <m_1, m_2, ..., m_p> and the deviation \sigma = <\sigma_1, \sigma_2, ..., \sigma_p> of G are defined as

m_i = \frac{\sum_{j=1}^{q} x_{ji}}{|G|},   (11)

\sigma_i = \sqrt{\frac{\sum_{j=1}^{q} (x_{ji} - m_i)^2}{|G|}},   (12)

for 1 <= i <= p, where |G| denotes the size of G, i.e., the number of word patterns contained in G. The fuzzy similarity of a word pattern x = <x_1, x_2, ..., x_p> to cluster G is defined by the following membership function:

\mu_G(x) = \prod_{i=1}^{p} \exp\left[-\left(\frac{x_i - m_i}{\sigma_i}\right)^2\right].   (13)

Notice that 0 <= \mu_G(x) <= 1. A word pattern close to the mean of a cluster is regarded to be very similar to this cluster, i.e., \mu_G(x) is close to 1. On the contrary, a word pattern far distant from a cluster is hardly similar to this cluster, i.e., \mu_G(x) is close to 0. For example, suppose G_1 is an existing cluster with a mean vector m_1 = <0.4, 0.6> and a deviation vector \sigma_1 = <0.2, 0.3>. The fuzzy similarity of the word pattern x_1 shown in (10) to cluster G_1 becomes

\mu_{G_1}(x_1) = \exp\left[-\left(\frac{0.3-0.4}{0.2}\right)^2\right] \times \exp\left[-\left(\frac{0.7-0.6}{0.3}\right)^2\right] = 0.7788 \times 0.8948 = 0.6969.   (14)
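The construction of word patterns in (7)-(8) and the membership function (13) translate into a few lines of Python; the sketch below is our own illustration (NumPy and the function names are assumptions), and its last lines reproduce the small example of (10) and (14):

    import numpy as np

    def word_patterns(X, y, num_classes):
        """Sketch of (7)-(8): one p-dimensional pattern per word.
        X: n x m document-term count matrix; y: class label of each document."""
        class_counts = np.zeros((num_classes, X.shape[1]))
        for c in range(num_classes):
            class_counts[c] = X[y == c].sum(axis=0)   # sum of d_qi over documents of class c_j
        totals = class_counts.sum(axis=0)             # sum of d_qi over all documents
        totals[totals == 0] = 1.0                     # avoid division by zero for unused words
        return (class_counts / totals).T              # m x p, row i is <P(c_1|w_i), ..., P(c_p|w_i)>

    def membership(x, mean, dev):
        """Sketch of (13): fuzzy similarity of word pattern x to a cluster (mean, dev)."""
        return np.prod(np.exp(-((x - mean) / dev) ** 2))

    # Reproducing (10) and (14): four documents, classes c1, c1, c2, c2, counts 1, 2, 3, 4 for w1.
    X = np.array([[1], [2], [3], [4]])
    y = np.array([0, 0, 1, 1])
    x1 = word_patterns(X, y, 2)[0]                    # -> [0.3, 0.7]
    print(x1, membership(x1, np.array([0.4, 0.6]), np.array([0.2, 0.3])))   # about 0.6969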
3.1 Self-Constructing Clustering

Our clustering algorithm is an incremental, self-constructing learning approach. Word patterns are considered one by one. The user does not need to have any idea about the number of clusters in advance. No clusters exist at the beginning, and clusters can be created if necessary. For each word pattern, the similarity of this word pattern to each existing cluster is calculated to decide whether it is combined into an existing cluster or a new cluster is created. Once a new cluster is created, the corresponding membership function should be initialized. On the contrary, when the word pattern is combined into an existing cluster, the membership function of that cluster should be updated accordingly.
Let k be the number of currently existing clusters. The clusters are G_1, G_2, ..., G_k, respectively. Each cluster G_j has mean m_j = <m_{j1}, m_{j2}, ..., m_{jp}> and deviation \sigma_j = <\sigma_{j1}, \sigma_{j2}, ..., \sigma_{jp}>. Let S_j be the size of cluster G_j. Initially, we have k = 0, so no clusters exist at the beginning. For each word pattern x_i = <x_{i1}, x_{i2}, ..., x_{ip}>, 1 <= i <= m, we calculate, according to (13), the similarity of x_i to each existing cluster, i.e.,

\mu_{G_j}(x_i) = \prod_{q=1}^{p} \exp\left[-\left(\frac{x_{iq} - m_{jq}}{\sigma_{jq}}\right)^2\right],   (15)

for 1 <= j <= k. We say that x_i passes the similarity test on cluster G_j if

\mu_{G_j}(x_i) \ge \rho,   (16)

where \rho, 0 <= \rho <= 1, is a predefined threshold. If the user intends to have larger clusters, then he/she can give a smaller threshold; otherwise, a bigger threshold can be given. As the threshold increases, the number of clusters also increases. Note that, as usual, the power in (15) is 2 [34], [35]. Its value has an effect on the number of clusters obtained. A larger value will make the boundaries of the Gaussian function sharper, and more clusters will be obtained for a given threshold. On the contrary, a smaller value will make the boundaries of the Gaussian function smoother, and fewer clusters will be obtained instead.

Two cases may occur. First, there are no existing fuzzy clusters on which x_i has passed the similarity test. For this case, we assume that x_i is not similar enough to any existing cluster, and a new cluster G_h, h = k + 1, is created with

m_h = x_i,   \sigma_h = \sigma_0,   (17)

where \sigma_0 = <\sigma_0, ..., \sigma_0> is a user-defined constant vector. Note that the new cluster G_h contains only one member, the word pattern x_i, at this point. Estimating the deviation of a cluster by (12) is impossible, or inaccurate, if the cluster contains few members. In particular, the deviation of a new cluster is 0, since it contains only one member. We cannot use zero deviation in the calculation of fuzzy similarities. Therefore, we initialize the deviation of a newly created cluster by \sigma_0, as indicated in (17). Of course, the number of clusters is increased by 1, and the size of cluster G_h, S_h, should be initialized, i.e.,

k = k + 1,   S_h = 1.   (18)

Second, if there are existing clusters on which x_i has passed the similarity test, let cluster G_t be the cluster with the largest membership degree, i.e.,

t = \arg\max_{1 \le j \le k} \mu_{G_j}(x_i).   (19)

In this case, we regard x_i to be most similar to cluster G_t, and m_t and \sigma_t of cluster G_t should be modified to include x_i as its member. The modification to cluster G_t is described as follows:

m_{tj} = \frac{S_t m_{tj} + x_{ij}}{S_t + 1},   (20)

\sigma_{tj} = \sqrt{A - B} + \sigma_0,   (21)

A = \frac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t m_{tj}^2 + x_{ij}^2}{S_t},   (22)

B = \frac{S_t + 1}{S_t}\left(\frac{S_t m_{tj} + x_{ij}}{S_t + 1}\right)^2,   (23)

for 1 <= j <= p, and

S_t = S_t + 1.   (24)

Equations (20) and (21) can be derived easily from (11) and (12). Note that m_{tj} in (22) and (23) denotes the mean value already updated by (20), as in the example below, and that k is not changed in this case. Let us give an example, following (10) and (14). Suppose x_1 is most similar to cluster G_1, the size of cluster G_1 is 4, and the initial deviation \sigma_0 is 0.1. Then cluster G_1 is modified as follows:

m_{11} = (4 x 0.4 + 0.3) / (4 + 1) = 0.38,
m_{12} = (4 x 0.6 + 0.7) / (4 + 1) = 0.62,
m_1 = <0.38, 0.62>,
A_{11} = [(4 - 1)(0.2 - 0.1)^2 + 4 x 0.38^2 + 0.3^2] / 4,
B_{11} = [(4 + 1)/4] x [(4 x 0.38 + 0.3)/(4 + 1)]^2,
\sigma_{11} = \sqrt{A_{11} - B_{11}} + 0.1 = 0.1937,
A_{12} = [(4 - 1)(0.3 - 0.1)^2 + 4 x 0.62^2 + 0.7^2] / 4,
B_{12} = [(4 + 1)/4] x [(4 x 0.62 + 0.7)/(4 + 1)]^2,
\sigma_{12} = \sqrt{A_{12} - B_{12}} + 0.1 = 0.2769,
\sigma_1 = <0.1937, 0.2769>,
S_1 = 4 + 1 = 5.

The above process is iterated until all the word patterns have been processed. Consequently, we have k clusters. The whole clustering algorithm can be summarized below.

Initialization:
  # of original word patterns: m
  # of classes: p
  Threshold: \rho
  Initial deviation: \sigma_0
  Initial # of clusters: k = 0
Input:
  x_i = <x_{i1}, x_{i2}, ..., x_{ip}>, 1 <= i <= m
Output:
  Clusters G_1, G_2, ..., G_k
procedure Self-Constructing-Clustering-Algorithm
  for each word pattern x_i, 1 <= i <= m
    temp_W = {G_j | \mu_{G_j}(x_i) >= \rho, 1 <= j <= k};
    if (temp_W is empty)
      A new cluster G_h, h = k + 1, is created by (17)-(18);
    else
      Let G_t in temp_W be the cluster to which x_i is closest by (19);
      Incorporate x_i into G_t by (20)-(24);
    endif;
  endfor;
  return with the created k clusters;
endprocedure

Note that the word patterns in a cluster have a high degree of similarity to each other. Besides, when new training patterns are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of generating the whole set of clusters from scratch.

Note that the order in which the word patterns are fed in influences the clusters obtained. We apply a heuristic to determine the order. We sort all the patterns, in decreasing order, by their largest components. Then the word patterns are fed in this order. In this way, more significant patterns will be fed in first and will likely become the cores of the underlying clusters. For example, let x_1 = <0.1, 0.3, 0.6>, x_2 = <0.3, 0.3, 0.4>, and x_3 = <0.8, 0.1, 0.1> be three word patterns. The largest components in these word patterns are 0.6, 0.4, and 0.8, respectively. The sorted list is 0.8, 0.6, 0.4, so the order of feeding is x_3, x_1, x_2. This heuristic seems to work well.

We discuss briefly here the computational cost of our method and compare it with DC [27], IOC [14], and IG [10]. For an input pattern, we have to calculate the similarity between the input pattern and every existing cluster. Each pattern consists of p components, where p is the number of classes in the document set. Therefore, in the worst case, the time complexity of our method is O(mkp), where m is the number of original features and k is the number of clusters finally obtained. For DC, the complexity is O(mkpt), where t is the number of iterations to be done. The complexity of IG is O(mp + m log m), and the complexity of IOC is O(mkpn), where n is the number of documents involved. Apparently, IG is the quickest one. Our method is better than DC and IOC.
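To make the procedure concrete, the following Python sketch (our illustration, not the authors' MATLAB implementation; all names are assumptions) runs the self-constructing clustering of this section, including the feeding-order heuristic and the update rules (17)-(24); following the worked example above, the updated mean from (20) is the value used inside (22) and (23):

    import numpy as np

    def self_constructing_clustering(patterns, rho, sigma0):
        """patterns: m x p array of word patterns; rho: threshold in (16);
        sigma0: initial deviation used in (17) and (21)."""
        order = np.argsort(-patterns.max(axis=1))        # feed patterns with larger largest components first
        means, devs, sizes = [], [], []
        for i in order:
            x = patterns[i].astype(float)
            sims = np.array([np.prod(np.exp(-((x - m) / s) ** 2))     # (15)
                             for m, s in zip(means, devs)])
            if sims.size == 0 or sims.max() < rho:       # no cluster passes the similarity test (16)
                means.append(x.copy())                   # (17): new cluster centered at x
                devs.append(np.full_like(x, sigma0))
                sizes.append(1)                          # (18)
            else:
                t = int(sims.argmax())                   # (19): most similar cluster
                S, s_old = sizes[t], devs[t]
                m_new = (S * means[t] + x) / (S + 1)     # (20)
                A = ((S - 1) * (s_old - sigma0) ** 2 + S * m_new ** 2 + x ** 2) / S    # (22)
                B = (S + 1) / S * ((S * m_new + x) / (S + 1)) ** 2                     # (23)
                devs[t] = np.sqrt(np.maximum(A - B, 0.0)) + sigma0                     # (21)
                means[t] = m_new
                sizes[t] = S + 1                         # (24)
        return np.array(means), np.array(devs), np.array(sizes)

For the example of Section 4, a call would look like self_constructing_clustering(patterns, rho=0.64, sigma0=0.5), with patterns built as in the earlier word_patterns sketch.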
3.2 Feature Extraction

Formally, feature extraction can be expressed in the following form:

D' = D T,   (25)

where

D = [d_1 d_2 ... d_n]^T,   (26)
D' = [d'_1 d'_2 ... d'_n]^T,   (27)

T = \begin{bmatrix} t_{11} & \cdots & t_{1k} \\ t_{21} & \cdots & t_{2k} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mk} \end{bmatrix},   (28)

with d_i = [d_{i1} d_{i2} ... d_{im}] and d'_i = [d'_{i1} d'_{i2} ... d'_{ik}] for 1 <= i <= n. Clearly, T is a weighting matrix. The goal of feature reduction is achieved by finding an appropriate T such that k is smaller than m. In the divisive information-theoretic feature clustering algorithm [27] described in Section 2.2, the elements of T in (25) are binary and can be defined as follows:

t_{ij} = \begin{cases} 1, & \text{if } w_i \in W_j, \\ 0, & \text{otherwise,} \end{cases}   (29)

where 1 <= i <= m and 1 <= j <= k. That is, if a word w_i belongs to cluster W_j, t_{ij} is 1; otherwise, t_{ij} is 0.

By applying our clustering algorithm, word patterns have been grouped into clusters, and words in the feature vector W are also clustered accordingly. For one cluster, we have one extracted feature. Since we have k clusters, we have k extracted features. The elements of T are derived based on the obtained clusters, and feature extraction will be done. We propose three weighting approaches: hard, soft, and mixed. In the hard-weighting approach, each word is allowed to belong to only one cluster, and so it contributes to only one new extracted feature. In this case, the elements of T in (25) are defined as follows:

t_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_{1 \le \alpha \le k} \mu_{G_\alpha}(x_i), \\ 0, & \text{otherwise.} \end{cases}   (30)

Note that if j is not unique in (30), one of them is chosen randomly. In the soft-weighting approach, each word is allowed to contribute to all new extracted features, with the degrees depending on the values of the membership functions. The elements of T in (25) are defined as follows:

t_{ij} = \mu_{G_j}(x_i).   (31)

The mixed-weighting approach is a combination of the hard-weighting approach and the soft-weighting approach. For this case, the elements of T in (25) are defined as follows:

t_{ij} = \gamma\, t^H_{ij} + (1 - \gamma)\, t^S_{ij},   (32)

where t^H_{ij} is obtained by (30), t^S_{ij} is obtained by (31), and \gamma is a user-defined constant lying between 0 and 1. Note that \gamma is not related to the clustering. It concerns the merge of component features in a cluster into a resulting feature. The merge can be made "hard" or "soft" by setting \gamma to 1 or 0, respectively. By selecting the value of \gamma, we provide flexibility to the user. When the similarity threshold is small, the number of clusters is small, and each cluster covers more training patterns. In this case, a smaller \gamma will favor soft-weighting and get a higher accuracy. On the contrary, when the similarity threshold is large, the number of clusters is large, and each cluster covers fewer training patterns. In this case, a larger \gamma will favor hard-weighting and get a higher accuracy.
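The three weighting schemes (30)-(32) and the reduction (25) can then be sketched as follows (our own illustration; unlike (30), the argmax below breaks ties by taking the first maximal index rather than a random one):

    import numpy as np

    def weighting_matrices(patterns, means, devs, gamma):
        """Sketch of (30)-(32): hard, soft, and mixed weighting matrices T (m x k)."""
        # membership of every word pattern to every cluster, as in (13)/(31)
        T_soft = np.array([[np.prod(np.exp(-((x - m) / s) ** 2)) for m, s in zip(means, devs)]
                           for x in patterns])
        T_hard = np.zeros_like(T_soft)
        T_hard[np.arange(len(patterns)), T_soft.argmax(axis=1)] = 1.0   # (30): winner takes all
        T_mixed = gamma * T_hard + (1.0 - gamma) * T_soft               # (32)
        return T_hard, T_soft, T_mixed

    # Feature extraction (25): D' = D T, reducing an n x m document matrix to n x k.
    # T_hard, T_soft, T_mixed = weighting_matrices(patterns, means, devs, gamma=0.8)
    # D_reduced = X @ T_mixed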
3.3 Text Classification

Given a set D of training documents, text classification can be done as follows: We specify the similarity threshold \rho for (16) and apply our clustering algorithm. Assume that k clusters are obtained for the words in the feature vector W. Then we find the weighting matrix T and convert D to D' by (25). Using D' as training data, a classifier based on support vector machines (SVMs) is built. Note that any classifying technique other than SVM can be applied. Joachims [36] showed that SVM is better than other methods for text categorization. SVM is a kernel method, which finds the maximum-margin hyperplane in feature space separating the images of the training patterns into two groups [37], [38], [39].

To make the method more flexible and robust, some patterns need not be correctly classified by the hyperplane, but the misclassified patterns should be penalized. Therefore, slack variables \xi_i are introduced to account for misclassifications. The objective function and constraints of the classification problem can be formulated as:

\min_{\mathbf{w}, b} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l} \xi_i
\text{s.t. } y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l,   (33)

where l is the number of training patterns, C is a parameter which gives a tradeoff between maximum margin and classification error, and y_i, being +1 or -1, is the target label of pattern x_i. Note that \phi: X -> F is a mapping from the input space to the feature space F, where patterns are more easily separated, and \mathbf{w}^T\phi(\mathbf{x}_i) + b = 0 is the hyperplane to be derived, with w and b being the weight vector and offset, respectively.

An SVM described above can only separate two classes, y_i = +1 and y_i = -1. We follow the idea in [36] to construct an SVM-based classifier. For p classes, we create p SVMs, one SVM for each class. For the SVM of class c_v, 1 <= v <= p, the training patterns of class c_v are treated as having y_i = +1, and the training patterns of the other classes are treated as having y_i = -1. The classifier is then the aggregation of these SVMs. Now we are ready for classifying unknown documents. Suppose d is an unknown document. We first convert d to d' by

d' = d T.   (34)

Then we feed d' to the classifier. We get p values, one from each SVM, and d belongs to those classes with 1 appearing at the outputs of their corresponding SVMs. For example, consider a case of three classes c_1, c_2, and c_3. If the three SVMs output 1, -1, and 1, respectively, then the predicted classes will be c_1 and c_3 for this document. If the three SVMs output -1, 1, and 1, respectively, the predicted classes will be c_2 and c_3.
Based on D0H , D0S , or D0M , a classifier with two SVMs is
1  v  p, the training patterns of class cv are treated as
built. Suppose d is an unknown document, and d ¼
having yi ¼ þ1, and the training patterns of the other
<0; 1; 1; 1; 1; 1; 0; 1; 1; 1> . We first convert d to d0 by (34).
classes are treated as having yi ¼ 1. The classifier is then
Then, the transformed document is obtained as d0H ¼
the aggregation of these SVMs. Now we are ready for
dTH ¼ <2; 4; 2>; d0S ¼ dTS ¼ <2:5591; 4:3478; 3:9964>, o r
classifying unknown documents. Suppose, d is an unknown
document. We first convert d to d0 by d0M ¼ dTM ¼ <2:1118; 4:0696; 2:3993>. Then the trans-
formed unknown document is fed to the classifier. For this
d0 ¼ dT: ð34Þ example, the classifier concludes that d belongs to c2 .
0
Then we feed d to the classifier. We get p values, one from
each SVM. Then d belongs to those classes with 1, 5 EXPERIMENTAL RESULTS
appearing at the outputs of their corresponding SVMs. In this section, we present experimental results to show the
For example, consider a case of three classes c1 , c2 , and c3 . If effectiveness of our fuzzy self-constructing feature clustering
the three SVMs output 1, 1, and 1, respectively, then the method. Three well-known data sets for text classification
predicted classes will be c1 and c3 for this document. If the
three SVMs output 1, 1, and 1, respectively, the predicted
classes will be c2 and c3 . TABLE 2
Word Patterns of W

4 AN EXAMPLE
We give an example here to illustrate how our method
works. Let D be a simple document set, containing

We compare our method with three other feature reduction methods: IG [10], IOC [14], and DC [27]. As described in Section 2, IG is one of the state-of-the-art feature selection approaches, IOC is an incremental feature extraction algorithm, and DC is a feature clustering approach. For convenience, our method is abbreviated as FFC, standing for Fuzzy Feature Clustering, in this section. In our method, \sigma_0 is set to 0.25. The value was determined by investigation, and \sigma_0 is kept at 0.25 for all the following experiments. Furthermore, we use H-FFC, S-FFC, and M-FFC to denote hard-weighting, soft-weighting, and mixed-weighting, respectively. We run a classifier built on support vector machines [42], as described in Section 3.3, on the extracted features obtained by the different methods. We use a computer with an Intel(R) Core(TM)2 Quad CPU Q6600 at 2.40 GHz and 4 GB of RAM to conduct the experiments. The programming language used is MATLAB 7.0.

To compare the classification effectiveness of each method, we adopt performance measures in terms of microaveraged precision (MicroP), microaveraged recall (MicroR), microaveraged F1 (MicroF1), and microaveraged accuracy (MicroAcc) [4], [43], [44], defined as follows:

MicroP = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FP_i)},   (36)

MicroR = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FN_i)},   (37)

MicroF1 = \frac{2\, MicroP \cdot MicroR}{MicroP + MicroR},   (38)

MicroAcc = \frac{\sum_{i=1}^{p} (TP_i + TN_i)}{\sum_{i=1}^{p} (TP_i + TN_i + FP_i + FN_i)},   (39)

where p is the number of classes. TP_i (true positives with respect to c_i) is the number of c_i test documents that are correctly classified to c_i. TN_i (true negatives with respect to c_i) is the number of non-c_i test documents that are classified to non-c_i. FP_i (false positives with respect to c_i) is the number of non-c_i test documents that are incorrectly classified to c_i. FN_i (false negatives with respect to c_i) is the number of c_i test documents that are classified to non-c_i.
and M-FFC to represent hard-weighting, soft-weighting, and positives wrt ci ) is the number of non-ci test documents that
mixed-weighting, respectively. We run a classifier built on are incorrectly classified to ci . F Ni (false negatives wrt ci ) is
support vector machines [42], as described in Section 3.3, on the number of ci test documents that are classified to non-ci .
the extracted features obtained by different methods. We use
a computer with Intel(R) Core(TM)2 Quad CPU Q6600 5.1 Experiment 1: 20 Newsgroups Data Set
2.40 GHz, 4 GB of RAM to conduct the experiments. The The 20 Newsgroups collection contains about 20,000 articles
programming language used is MATLAB 7.0. taken from the Usenet newsgroups. These articles are

TABLE 4
Fuzzy Similarities of Word Patterns to Three Clusters

TABLE 5
Weighting Matrices: Hard TH , Soft TS , and Mixed TM

TABLE 6
Transformed Data Sets: Hard D0H , Soft D0S , and Mixed D0M

Fig. 1. Class distributions of the three data sets. (a) 20 Newsgroups. (b) RCV1. (c) Cade12.

Fig. 2 shows the execution time (in seconds) of the different feature reduction methods on the 20 Newsgroups data set. Since H-FFC, S-FFC, and M-FFC have the same clustering phase, they have the same execution time, and thus we use FFC to denote them in Fig. 2. In this figure, the horizontal axis indicates the number of extracted features. To obtain different numbers of extracted features, different values of \rho are used in FFC; the number of extracted features is identical to the number of clusters. For the other methods, the number of extracted features has to be specified in advance by the user. Table 7 lists values of certain points in Fig. 2, together with the values of \rho used in FFC. Note that for IG, each word is given a weight, and the words with the top k weights in W are selected as the extracted features in W'; therefore, the execution time is basically the same for any value of k. Obviously, our method runs much faster than DC and IOC. For example, our method needs 8.61 seconds for 20 extracted features, while DC requires 88.38 seconds and IOC requires 6,943.40 seconds. For 84 extracted features, our method only needs 17.68 seconds, but DC and IOC require 293.98 and 28,098.05 seconds, respectively. As the number of extracted features increases, DC and IOC become significantly slower. In particular, when the number of extracted features exceeds 280, IOC spends more than 100,000 seconds without finishing, as indicated by dashes in Table 7.

TABLE 7. Sampled Execution Times (Seconds) of Different Methods on 20 Newsgroups Data
Fig. 2. Execution time (seconds) of different methods on 20 Newsgroups data.
Fig. 3. Microaveraged accuracy (percent) of different methods for 20 Newsgroups data.

Fig. 3 shows the MicroAcc (percent) of the 20 Newsgroups data set obtained by the different methods, based on the extracted features previously obtained. The vertical axis indicates the MicroAcc values (percent), and the horizontal axis indicates the number of extracted features. Table 8 lists values of certain points in Fig. 3. Note that \gamma is set to 0.5 for M-FFC. Also, no accuracies are listed for IOC when the number of extracted features exceeds 280. As shown in the figure, IG performs the worst in classification accuracy, especially when the number of extracted features is small. For example, IG only gets 95.79 percent in accuracy for 20 extracted features. S-FFC works very well when the number of extracted features is smaller than 58; for example, it gets 98.46 percent in accuracy for 20 extracted features. H-FFC and M-FFC perform well in accuracy all the time, except for the case of 20 extracted features.

For example, H-FFC and M-FFC get 98.43 percent and 98.54 percent, respectively, in accuracy for 58 extracted features, 98.63 percent and 98.56 percent for 203 extracted features, and 98.65 percent and 98.59 percent for 521 extracted features. Although the number of extracted features is affected by the threshold value, the classification accuracies obtained stay fairly stable. DC and IOC perform a little bit worse in accuracy when the number of extracted features is small; for example, they get 97.73 percent and 97.09 percent, respectively, in accuracy for 20 extracted features. Note that when the whole set of 25,718 features is used for classification, the accuracy is 98.45 percent.

TABLE 8. Microaveraged Accuracy (Percent) of Different Methods for 20 Newsgroups Data

Table 9 lists the MicroP, MicroR, and MicroF1 values (percent) of the 20 Newsgroups data set obtained by the different methods. Note that \gamma is set to 0.5 for M-FFC. From this table, we can see that S-FFC can, in general, get the best results for MicroF1, followed by M-FFC, H-FFC, and DC in order. For MicroP, H-FFC performs a little bit better than S-FFC. But for MicroR, S-FFC performs better than H-FFC by a greater margin than in the MicroP case. M-FFC performs well for MicroP, MicroR, and MicroF1. For MicroF1, M-FFC performs only a little bit worse than S-FFC, with almost identical performance when the number of features is less than 500. For MicroP, M-FFC performs better than S-FFC most of the time. H-FFC performs well when the number of features is less than 200, while it performs worse than DC and M-FFC when the number of features is greater than 200. DC performs well when the number of features is greater than 200, but it performs worse when the number of features is less than 200; for example, DC gets 88.00 percent and M-FFC gets 89.65 percent when the number of features is 20. For MicroR, S-FFC performs best in all cases. M-FFC performs well, especially when the number of features is less than 200. Note that when the whole set of 25,718 features is used for classification, MicroP is 94.53 percent, MicroR is 73.18 percent, and MicroF1 is 82.50 percent.

TABLE 9. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for 20 Newsgroups Data

In summary, IG runs very fast in feature reduction, but the extracted features obtained are not good for classification. FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification. Fig. 4 shows the influence of \gamma values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 20, 203, and 1,453, indicated by M-FFC20, M-FFC203, and M-FFC1453, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of \gamma.

Fig. 4. Microaveraged F1 (percent) of M-FFC with different \gamma values for 20 Newsgroups data.

5.2 Experiment 2: Reuters Corpus Volume 1 (RCV1) Data Set

The RCV1 data set consists of 804,414 news stories produced by Reuters from 20 August 1996 to 19 August 1997. All news stories are in English and have 109 distinct terms per document on average. The documents are divided, by the "LYRL2004" split defined in Lewis et al. [40], into 23,149 training documents and 781,265 testing documents. There are 103 Topic categories, and the distribution of the documents over the classes is shown in Fig. 1b. All 103 categories have one or more positive test examples in the test set, but only 101 of them have one or more positive training examples in the training set. It is these 101 categories that we use in this experiment. After preprocessing, we have 47,152 features for this data set.

Fig. 5 shows the execution time (in seconds) of the different feature reduction methods on the RCV1 data set. Table 10 lists values of certain points in Fig. 5. Clearly, our method runs much faster than DC and IOC. For example, for 18 extracted features, DC requires 609.63 seconds and IOC requires 94,533.88 seconds, but our method only needs 14.42 seconds. For 65 extracted features, our method only needs 37.40 seconds, but DC and IOC require 1,709.77 and over 100,000 seconds, respectively. As the number of extracted features increases, DC and IOC become significantly slower.

TABLE 10. Sampled Execution Times (Seconds) of Different Methods on RCV1 Data
Fig. 5. Execution time (seconds) of different methods on RCV1 data.
Fig. 6. Microaveraged accuracy (percent) of different methods for RCV1 data.

Fig. 6 shows the MicroAcc (percent) of the RCV1 data set obtained by the different methods, based on the extracted features previously obtained. Table 11 lists values of certain points in Fig. 6. Note that \gamma is set to 0.5 for M-FFC. No accuracies are listed for IOC when the number of extracted features exceeds 18. As shown in the figure, IG performs the worst in classification accuracy, especially when the number of extracted features is small. For example, IG only gets 96.95 percent in accuracy for 18 extracted features. H-FFC, S-FFC, and M-FFC perform well in accuracy all the time. For example, H-FFC, S-FFC, and M-FFC get 98.03 percent, 98.26 percent, and 97.85 percent, respectively, in accuracy for 18 extracted features; 98.13 percent, 98.39 percent, and 98.31 percent, respectively, for 65 extracted features; and 98.41 percent, 98.53 percent, and 98.56 percent, respectively, for 421 extracted features. Note that when the whole set of 47,152 features is used for classification, the accuracy is 98.83 percent.

Table 12 lists the MicroP, MicroR, and MicroF1 values (percent) of the RCV1 data set obtained by the different methods. From this table, we can see that S-FFC can, in general, get the best results for MicroF1, followed by M-FFC, DC, and H-FFC in order. S-FFC performs better than the other methods, especially when the number of features is less than 500. For MicroP, all the methods perform about equally well, with values around 80-85 percent; IG gets a high MicroP when the number of features is smaller than 50. M-FFC performs well for MicroP, MicroR, and MicroF1. Note that when the whole set of 47,152 features is used for classification, MicroP is 86.66 percent, MicroR is 75.03 percent, and MicroF1 is 80.43 percent.

In summary, IG runs very fast in feature reduction, but the extracted features obtained are not good for classification. For example, IG gets 96.95 percent in MicroAcc, 91.58 percent in MicroP, 5.80 percent in MicroR, and 10.90 percent in MicroF1 for 18 extracted features, and 97.24 percent in MicroAcc, 68.82 percent in MicroP, 25.72 percent in MicroR, and 37.44 percent in MicroF1 for 65 extracted features. FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification. For example, S-FFC gets 98.26 percent in MicroAcc, 85.18 percent in MicroP, 55.33 percent in MicroR, and 67.08 percent in MicroF1 for 18 extracted features, and 98.39 percent in MicroAcc, 82.34 percent in MicroP, 63.41 percent in MicroR, and 71.65 percent in MicroF1 for 65 extracted features.

TABLE 11. Microaveraged Accuracy (Percent) of Different Methods for RCV1 Data
TABLE 12. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for RCV1 Data

Fig. 7 shows the influence of \gamma values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 18, 202, and 1,700, indicated by M-FFC18, M-FFC202, and M-FFC1700, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of \gamma.

Fig. 7. Microaveraged F1 (percent) of M-FFC with different \gamma values for RCV1 data.

5.3 Experiment 3: Cade12 Data

Cade12 is a set of classified Web pages extracted from the Cadê Web directory [41]. This directory points to Brazilian Web pages that were classified by human experts into 12 classes. The Cade12 collection has a skewed distribution, and the three most popular classes represent more than 50 percent of all documents. A version of this data set, 40,983 documents in total with 122,607 features, was obtained from [45]; two-thirds of the documents, 27,322, are used for training, and the remaining 13,661 documents are used for testing. The distribution of the Cade12 data is shown in Fig. 1c and Table 13.

Fig. 8 shows the execution time (in seconds) of the different feature reduction methods on the Cade12 data set. Clearly, our method runs much faster than DC and IOC. For example, for 22 extracted features, DC requires 689.63 seconds and IOC requires 57,958.10 seconds, but our method only needs 24.02 seconds. For 84 extracted features, our method only needs 47.48 seconds, while DC requires 2,939.35 seconds and IOC has difficulty finishing. As the number of extracted features increases, DC and IOC become significantly slower. In particular, IOC can only finish within 100,000 seconds for 22 extracted features, as shown in Table 14.

Fig. 9 shows the MicroAcc (percent) of the Cade12 data set obtained by the different methods, based on the extracted features previously obtained. Table 15 lists values of certain points in Fig. 9. Table 16 lists the MicroP, MicroR, and MicroF1 values (percent) of the Cade12 data set obtained by the different methods. Note that \gamma is set to 0.5 for M-FFC. Also, no accuracies are listed for IOC when the number of extracted features exceeds 22. As shown in the tables, all the methods work equally well for MicroAcc, but none of the methods work satisfactorily well for MicroP, MicroR, and MicroF1.

In fact, it is hard to get good feature reduction for Cade12, since only loose correlation exists among the original features. IG performs best for MicroP; for example, IG gets 78.21 percent in MicroP when the number of features is 38. However, IG performs worst for MicroR, getting only 7.04 percent when the number of features is 38. S-FFC, M-FFC, and H-FFC are a little bit worse than IG in MicroP, but they are good in MicroR; for example, S-FFC gets 75.83 percent in MicroP and 38.33 percent in MicroR when the number of features is 38. In general, S-FFC gets the best results, followed by M-FFC, H-FFC, and DC in order. Note that when the whole set of 122,607 features is used for classification, MicroAcc is 93.55 percent, MicroP is 69.57 percent, MicroR is 40.11 percent, and MicroF1 is 50.88 percent. Again, this experiment shows that FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification. Fig. 10 shows the influence of \gamma values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 22, 286, and 1,338, indicated by M-FFC22, M-FFC286, and M-FFC1338, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of \gamma.

TABLE 13. The Number of Documents Contained in Each Class of the Cade12 Data Set
TABLE 14. Sampled Execution Times (Seconds) of Different Methods on Cade12 Data
TABLE 15. Microaveraged Accuracy (Percent) of Different Methods for Cade12 Data
Fig. 8. Execution time (seconds) of different methods on Cade12 data.
Fig. 9. Microaveraged accuracy (percent) of different methods for Cade12 data.

6 CONCLUSIONS

We have presented a fuzzy self-constructing feature clustering (FFC) algorithm, which is an incremental clustering approach to reduce the dimensionality of the features in text classification. Features that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is created for this word. Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster.

When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experiments on three real-world data sets have demonstrated that our method can run faster and obtain better extracted features than other methods.

Other projects have been done on text classification, with or without feature reduction, and evaluation results were published. Some methods adopted the same performance measures as we did, i.e., microaveraged accuracy or microaveraged F1. Others adopted unilabeled accuracy (ULAcc). Unilabeled accuracy is basically the same as the normal accuracy, except that a pattern, either training or testing, with m class labels is copied m times, with each copy associated with a distinct class label. We discuss some methods here; the evaluation results given for them are taken directly from the corresponding papers. Joachims [33] applied the Rocchio classifier over a TFIDF-weighted representation and obtained 91.8 percent unilabeled accuracy on the 20 Newsgroups data set. Lewis et al. [40] used the SVM classifier over a form of TFIDF weighting and obtained 81.6 percent microaveraged F1 on the RCV1 data set. Al-Mubaid and Umair [31] applied a classifier, called Lsquare, which combines the distributional clustering of words and a learning logic technique, and achieved 98.00 percent microaveraged accuracy and 86.45 percent microaveraged F1 on the 20 Newsgroups data set. Cardoso-Cachopo [45] applied the SVM classifier over normalized TFIDF term weighting and achieved 52.84 percent unilabeled accuracy on the Cade12 data set and 82.48 percent on the 20 Newsgroups data set. Al-Mubaid and Umair [31] applied AIB [25] for feature reduction, and the number of features was reduced to 120; no feature reduction was applied in the other three methods. The evaluation results published for these methods are summarized in Table 17. Note that, according to the definitions, unilabeled accuracy is usually lower than microaveraged accuracy for multilabel classification. This explains why both Joachims [33] and Cardoso-Cachopo [45] report lower unilabeled accuracies than the microaveraged accuracies shown in Tables 8 and 15. However, Lewis et al. [40] report a higher microaveraged F1 than that shown in Table 11. This may be because Lewis et al. [40] used the TFIDF weighting, which is more elaborate than the TF weighting we used. However, it is hard for us to explain the difference between the microaveraged F1 obtained by Al-Mubaid and Umair [31] and that shown in Table 9, since a kind of sampling was applied in Al-Mubaid and Umair [31].

TABLE 16. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for Cade12 Data
TABLE 17. Published Evaluation Results for Some Other Text Classifiers
Fig. 10. Microaveraged F1 (percent) of M-FFC with different \gamma values for Cade12 data.

Similarity-based clustering is one of the techniques we have developed in our machine learning research. In this paper, we apply this clustering technique to text categorization problems. We are also applying it to other problems, such as image segmentation, data sampling, fuzzy modeling, and web mining. The work of this paper was motivated by the distributional word clustering proposed in [24], [25], [26], [27], [29], [30]. We found that when a document set is transformed to a collection of word patterns, as by (7), the relevance among word patterns can be measured, and the word patterns can be grouped by applying our similarity-based clustering algorithm. Our method is good for text categorization problems due to the suitability of the distributional word clustering concept.

ACKNOWLEDGMENTS

The authors are grateful to the anonymous reviewers for their comments, which were very helpful in improving the quality and presentation of the paper. They would also like to express their thanks to the National Institute of Standards and Technology for the provision of the RCV1 data set. This work was supported by the National Science Council under the grants NSC-95-2221-E-110-055-MY2 and NSC-96-2221-E-110-009.

REFERENCES
[1] http://people.csail.mit.edu/jrennie/20Newsgroups/, 2010.
[2] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, 2010.
[3] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.
[4] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley Longman, 1999.
[6] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, nos. 1-2, pp. 245-271, 1997.
[7] E.F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones, "Introducing a Family of Linear Measures for Feature Selection in Text Categorization," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 9, pp. 1223-1232, Sept. 2005.
[8] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.
[9] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[10] Y. Yang and J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th Int'l Conf. Machine Learning, pp. 412-420, 1997.
[11] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.
[12] H. Li, T. Jiang, and K. Zhang, "Efficient and Robust Feature Extraction by Maximum Margin Criterion," Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf, eds., pp. 97-104, Springer, 2004.
[13] E. Oja, Subspace Methods of Pattern Recognition. Research Studies Press, 1983.
[14] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, "Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 3, pp. 320-333, Mar. 2006.
[15] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[16] A.M. Martinez and A.C. Kak, "PCA versus LDA," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, Feb. 2001.
[17] H. Park, M. Jeon, and J. Rosen, "Lower Dimensional Representation of Text Data Based on Centroids and Least Squares," BIT Numerical Math., vol. 43, pp. 427-448, 2003.
[18] S.T. Roweis and L.K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[19] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[20] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, MIT Press, 2002.
[21] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, and S. Yoshizawa, "Successive Learning of Linear Discriminant Analysis: Sanger-Type Algorithm," Proc. IEEE CS Int'l Conf. Pattern Recognition, pp. 2664-2667, 2000.
[22] J. Weng, Y. Zhang, and W.S. Hwang, "Candid Covariance-Free Incremental Principal Component Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 1034-1040, Aug. 2003.
[23] J. Yan, B.Y. Zhang, S.C. Yan, Z. Chen, W.G. Fan, Q. Yang, W.Y. Ma, and Q.S. Cheng, "IMMC: Incremental Maximum Margin Criterion," Proc. 10th ACM SIGKDD, pp. 725-730, 2004.
[24] L.D. Baker and A. McCallum, "Distributional Clustering of Words for Text Classification," Proc. ACM SIGIR, pp. 96-103, 1998.
[25] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional Word Clusters versus Words for Text Categorization," J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[26] M.C. Dalmau and O.W.M. Flórez, "Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection," Proc. 29th European Conf. IR Research, pp. 678-681, 2007.
[27] I.S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[28] D. Ienco and R. Meo, "Exploration and Reduction of the Feature Space by Hierarchical Clustering," Proc. SIAM Conf. Data Mining, pp. 577-587, 2008.
[29] N. Slonim and N. Tishby, "The Power of Word Clusters for Text Classification," Proc. 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
[30] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Ann. Meeting of the ACL, pp. 183-190, 1993.
[31] H. Al-Mubaid and S.A. Umair, "A New Text Categorization Technique Using Distributional Clustering and Learning Logic," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 9, pp. 1156-1165, Sept. 2006.
[32] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[33] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. 14th Int'l Conf. Machine Learning, pp. 143-151, 1997.
[34] J. Yen and R. Langari, Fuzzy Logic: Intelligence, Control, and Information. Prentice-Hall, 1999.
[35] J.S. Wang and C.S.G. Lee, "Self-Adaptive Neurofuzzy Inference Systems for Classification Applications," IEEE Trans. Fuzzy Systems, vol. 10, no. 6, pp. 790-802, Dec. 2002.
[36] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Technical Report LS-8-23, Univ. of Dortmund, 1998.
[37] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[38] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[39] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[40] D.D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
[41] The Cadê Web directory, http://www.cade.com.br/, 2010.

[42] C.C. Chang and C.J. Lin, "LIBSVM: A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[43] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods," Proc. ACM SIGIR, pp. 42-49, 1999.
[44] G. Tsoumakas, I. Katakis, and I. Vlahavas, "Mining Multi-Label Data," Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, eds., second ed., Springer, 2009.
[45] http://web.ist.utl.pt/~acardoso/datasets/, 2010.

Jung-Yi Jiang received the BS degree from I-Shou University, Taiwan, in 2002, and the MSEE degree from Sun Yat-Sen University, Taiwan, in 2004. He is currently pursuing the PhD degree at the Department of Electrical Engineering, National Sun Yat-Sen University. His main research interests include machine learning, data mining, and information retrieval.

Ren-Jia Liou received the BS degree from National Dong Hwa University, Taiwan, in 2005, and the MSEE degree from Sun Yat-Sen University, Taiwan, in 2009. He is currently a research assistant in the Department of Chemistry, National Sun Yat-Sen University. His main research interests include machine learning, data mining, and web programming.

Shie-Jue Lee received the BS and MS degrees in electrical engineering, in 1977 and 1979, respectively, from National Taiwan University, and the PhD degree in computer science from the University of North Carolina, Chapel Hill, in 1990. He joined the faculty of the Department of Electrical Engineering at National Sun Yat-Sen University, Taiwan, in 1983, where he has been working as a professor since 1994. Now, he is the deputy dean of academic affairs and director of the NSYSU-III Research Center, National Sun Yat-Sen University. His research interests include artificial intelligence, machine learning, data mining, information retrieval, and soft computing. He served as director of the Center for Telecommunications Research and Development, National Sun Yat-Sen University (1997-2000), director of the Southern Telecommunications Research Center, National Science Council (1998-1999), and chair of the Department of Electrical Engineering, National Sun Yat-Sen University (2000-2003). He is a recipient of the Distinguished Teachers Award of the Ministry of Education, Taiwan, 1993. He was awarded by the Chinese Institute of Electrical Engineering for Outstanding MS Thesis Supervision, 1997. He received the Distinguished Paper Award of the Computer Society of the Republic of China, 1998, and the Best Paper Award of the Seventh Conference on Artificial Intelligence and Applications, 2002. He received the Distinguished Research Award of National Sun Yat-Sen University, 1998, and the Distinguished Teaching Award of National Sun Yat-Sen University, 1993 and 2008. He received the Best Paper Award of the International Conference on Machine Learning and Cybernetics, 2008, and the Distinguished Mentor Award of National Sun Yat-Sen University, 2008. He served as the program chair for the International Conference on Artificial Intelligence (TAAI-96), Kaohsiung, Taiwan, December 1996, the International Computer Symposium Workshop on Artificial Intelligence, Tainan, Taiwan, December 1998, and the Sixth Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, November 2001. He is a member of the IEEE Society of Systems, Man, and Cybernetics, the Association for Automated Reasoning, the Institute of Information and Computing Machinery, the Taiwan Fuzzy Systems Association, and the Taiwanese Association of Artificial Intelligence. He is also a member of the IEEE.

