
A Fuzzy Self-Constructing Feature Clustering
Algorithm for Text Classification
Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, Member, IEEE
Abstract—Feature clustering is a powerful method to reduce the di-
mensionality of feature vectors for text classification. In this paper, we
propose a fuzzy similarity-based self-constructing algorithm for feature
clustering. The words in the feature vector of a document set are
grouped into clusters based on a similarity test. Words that are similar
to each other are grouped into the same cluster. Each cluster is charac-
terized by a membership function with statistical mean and deviation.
When all the words have been fed in, a desired number of clusters
are formed automatically. We then have one extracted feature for each
cluster. The extracted feature corresponding to a cluster is a weighted
combination of the words contained in the cluster. By this algorithm, the
derived membership functions match closely with and describe properly
the real distribution of the training data. Besides, the user need not
specify the number of extracted features in advance, and trial-and-error
for determining the appropriate number of extracted features can then
be avoided. Experimental results show that our method can run faster
and obtain better extracted features than other methods.
Index Terms—Fuzzy similarity, feature clustering, feature extraction,
feature reduction, text classification.
1 INTRODUCTION
In text classification, the dimensionality of the feature vector
is usually huge. For example, 20 Newsgroups [1] and
Reuters21578 top-10 [2], which are two real-world datasets,
both have more than 15,000 features. Such high dimensionality
can be a severe obstacle for classification algorithms [3], [4].
To alleviate this difficulty, feature reduction approaches are
applied before document classification tasks are performed [5].
Two major approaches, feature selection [6], [7], [8], [9], [10]
and feature extraction [11], [12], [13], have been proposed for
feature reduction. In general, feature extraction approaches are
more effective than feature selection techniques, but are more
computationally expensive [11], [12], [14]. Therefore, devel-
oping scalable and efficient feature extraction algorithms is
highly demanded for dealing with high-dimensional document
datasets.
Classical feature extraction methods aim to convert the
representation of the original high-dimensional dataset into
a lower-dimensional dataset by a projecting process through
algebraic transformations. For example, Principal Component Analysis [15], Linear Discriminant Analysis [16], Maximum Margin Criterion [12], and the Orthogonal Centroid algorithm [17] perform the projection by linear transformations, while Locally Linear Embedding [18], ISOMAP [19], and Laplacian Eigenmaps [20] do feature extraction by nonlinear transformations. In practice, linear algorithms are in wider use due to their efficiency. Several scalable, online linear feature extraction algorithms [14], [21], [22], [23] have been proposed to improve the computational complexity. However, the complexity of these approaches is still high. Feature clustering [24], [25], [26], [27], [28], [29] is one of the effective techniques for feature reduction in text classification. The idea of feature clustering is to group the original features into clusters with a high degree of pairwise semantic relatedness. Each cluster is treated as a single new feature, and thus the feature dimensionality can be drastically reduced.

(This work was supported by the National Science Council under grants NSC-95-2221-E-110-055-MY2 and NSC-96-2221-E-110-009. Corresponding author: Shie-Jue Lee, leesj@mail.ee.nsysu.edu.tw.)
The first feature extraction method based on feature clus-
tering was proposed by Baker and McCallum [24], which was
derived from the “distributional clustering” idea of Pereira
et al. [30]. Al-Mubaid and Umair [31] used distributional
clustering to generate an efficient representation of documents
and applied a learning logic approach for training text classi-
fiers. The Agglomerative Information Bottleneck approach was
proposed by Tishby et al. [25], [29]. The divisive information-theoretic feature clustering algorithm, proposed by Dhillon et al. [27], has been shown to be more effective than the other feature clustering methods. In these feature clustering methods, each new feature
is generated by combining a subset of the original words.
However, difficulties are associated with these methods. A
word is exactly assigned to a subset, i.e., hard-clustering, based
on the similarity magnitudes between the word and the existing
subsets, even if the differences among these magnitudes are
small. Also, the mean and the variance of a cluster are
not considered when similarity with respect to the cluster is
computed. Furthermore, these methods require the number of
new features be specified in advance by the user.
We propose a fuzzy similarity-based self-constructing fea-
ture clustering algorithm which is an incremental feature
clustering approach to reduce the number of features for the
text classification task. The words in the feature vector of a
document set are represented as distributions and processed
one after another. Words that are similar to each other are
grouped into the same cluster. Each cluster is characterized
by a membership function with statistical mean and deviation.
If a word is not similar to any existing cluster, a new cluster is
created for this word. Similarity between a word and a cluster
is defined by considering both the mean and the variance of
the cluster. When all the words have been fed in, a desired
number of clusters are formed automatically. We then have
one extracted feature for each cluster. The extracted feature
corresponding to a cluster is a weighted combination of the
words contained in the cluster. Three ways of weighting,
hard, soft, and mixed, are introduced. By this algorithm, the
derived membership functions match closely with and describe
properly the real distribution of the training data. Besides,
the user need not specify the number of extracted features
in advance, and trial-and-error for determining the appropriate
number of extracted features can then be avoided. Experiments
on real-world datasets show that our method can run faster and
obtain better extracted features than other methods.
The remainder of this paper is organized as follows. Sec-
tion 2 gives a brief background about feature reduction.
Section 3 presents the proposed fuzzy similarity-based self-
constructing feature clustering algorithm. An example illus-
trating how the algorithm works is given in Section 4. Experi-
mental results are presented in Section 5. Finally, we conclude
this work in Section 6.
2 BACKGROUND AND RELATED WORK
To process documents, the bag-of-words model [32], [33] is commonly used. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a document set of $n$ documents, where $d_1, d_2, \ldots, d_n$ are individual documents and each document belongs to one of the classes in the set $\{c_1, c_2, \ldots, c_p\}$. If a document belongs to two or more classes, then two or more copies of the document with different classes are included in $D$. Let the word set $W = \{w_1, w_2, \ldots, w_m\}$ be the feature vector of the document set. Each document $d_i$, $1 \le i \le n$, is represented as $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{im} \rangle$, where each $d_{ij}$ denotes the number of occurrences of $w_j$ in the $i$th document. The feature reduction task is to find a new word set $W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k \ll m$, such that $W$ and $W'$ work equally well for all the desired properties with $D$. After feature reduction, each document $d_i$ is converted into a new representation $d'_i = \langle d'_{i1}, d'_{i2}, \ldots, d'_{ik} \rangle$ and the converted document set is $D' = \{d'_1, d'_2, \ldots, d'_n\}$. If $k$ is much smaller than $m$, the computation cost of subsequent operations on $D'$ can be drastically reduced.
2.1 Feature Reduction
In general, there are two ways of doing feature reduction: feature selection and feature extraction. By feature selection approaches, a new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ is obtained, which is a subset of the original feature set $W$. Then $W'$ is used as the input for classification tasks. Information Gain (IG) is frequently employed in the feature selection approach [10]. It measures the reduced uncertainty by an information-theoretic measure and gives each word a weight. The weight of a word $w_j$ is calculated as follows:

$$\mathrm{IG}(w_j) = -\sum_{l=1}^{p} P(c_l)\log P(c_l) + P(w_j)\sum_{l=1}^{p} P(c_l|w_j)\log P(c_l|w_j) + P(\bar{w}_j)\sum_{l=1}^{p} P(c_l|\bar{w}_j)\log P(c_l|\bar{w}_j) \qquad (1)$$

where $P(c_l)$ denotes the prior probability for class $c_l$, $P(w_j)$ denotes the prior probability for feature $w_j$, $P(\bar{w}_j)$ is identical to $1 - P(w_j)$, and $P(c_l|w_j)$ and $P(c_l|\bar{w}_j)$ denote the probability for class $c_l$ with the presence and absence, respectively, of $w_j$. The words of top $k$ weights in $W$ are selected as the features in $W'$.
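For concreteness, the following sketch computes the IG weight of Eq.(1) for every word from a document-term count matrix. It is an illustrative Python/NumPy rendering under our own naming (the paper's experiments were implemented in MATLAB); a word is taken as "present" in a document when its count is nonzero, and a small constant guards the logarithms.

```python
import numpy as np

def information_gain(D, labels, p):
    """Compute the IG weight of Eq.(1) for every word.

    D      : (n, m) array, D[q, j] = occurrences of word w_j in document d_q
    labels : length-n array of class indices in {0, ..., p-1}
    Returns a length-m array of IG weights.
    """
    n, m = D.shape
    present = (D > 0)                                     # document-level presence of each word
    P_c = np.array([(labels == l).mean() for l in range(p)])   # P(c_l)
    P_w = present.mean(axis=0)                            # P(w_j): fraction of documents containing w_j
    ig = -np.sum(P_c * np.log(P_c + 1e-12)) * np.ones(m)  # class-entropy term, shared by all words
    for j in range(m):
        for mask, P_side in ((present[:, j], P_w[j]), (~present[:, j], 1.0 - P_w[j])):
            if mask.sum() == 0:
                continue
            P_c_given = np.array([(labels[mask] == l).mean() for l in range(p)])
            ig[j] += P_side * np.sum(P_c_given * np.log(P_c_given + 1e-12))
    return ig

# The words with the top-k IG weights would then be kept as the selected features.
```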
In feature extraction approaches, extracted features are obtained by a projecting process through algebraic transformations. An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. Let a corpus of documents be represented as an $m \times n$ matrix $X \in \mathbb{R}^{m \times n}$, where $m$ is the number of features in the feature set and $n$ is the number of documents in the document set. IOC tries to find an optimal transformation matrix $F^{*} \in \mathbb{R}^{m \times k}$, where $k$ is the desired number of extracted features, according to the following criterion:

$$F^{*} = \arg\max \,\mathrm{trace}(F^{T} S_b F) \qquad (2)$$

where $F \in \mathbb{R}^{m \times k}$ and $F^{T}F = I$, and

$$S_b = \sum_{q=1}^{p} P(c_q)(M_q - M_{all})(M_q - M_{all})^{T} \qquad (3)$$

with $P(c_q)$ being the prior probability for a pattern belonging to class $c_q$, $M_q$ being the mean vector of class $c_q$, and $M_{all}$ being the mean vector of all patterns.
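The criterion of Eqs.(2)-(3) can be illustrated with a small batch sketch: $S_b$ is accumulated from the class means, and an orthonormal $F$ maximizing $\mathrm{trace}(F^{T}S_bF)$ is given by the top-$k$ eigenvectors of $S_b$. This is only a batch illustration of the objective, not the incremental IOC procedure of [14]; the names are ours.

```python
import numpy as np

def orthogonal_centroid_projection(X, labels, k):
    """Batch illustration of the criterion in Eqs.(2)-(3).

    X      : (m, n) matrix, one column per document, m features
    labels : length-n array of class indices
    Returns F of shape (m, k) with orthonormal columns maximizing trace(F^T S_b F).
    """
    m, n = X.shape
    M_all = X.mean(axis=1, keepdims=True)                 # mean vector of all patterns
    S_b = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        P_c = Xc.shape[1] / n                             # prior probability of class c
        M_c = Xc.mean(axis=1, keepdims=True)              # class mean vector
        S_b += P_c * (M_c - M_all) @ (M_c - M_all).T      # Eq.(3)
    eigvals, eigvecs = np.linalg.eigh(S_b)                # S_b is symmetric
    F = eigvecs[:, np.argsort(eigvals)[::-1][:k]]         # top-k eigenvectors maximize the trace
    return F                                              # the reduced documents are (F.T @ X).T
```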
2.2 Feature Clustering
Feature clustering is an efficient approach for feature reduction [25], [29], which groups all features into some clusters, where features in a cluster are similar to each other. The feature clustering methods proposed in [24], [25], [27], [29] are "hard" clustering methods, where each word of the original features belongs to exactly one word cluster. Therefore each word contributes to the synthesis of only one new feature. Each new feature is obtained by summing up the words belonging to one cluster. Let $D$ be the matrix consisting of all the original documents with $m$ features and $D'$ be the matrix consisting of the converted documents with the new $k$ features. The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition $\{W_1, W_2, \ldots, W_k\}$ of the original feature set $W$, i.e., $W_t \cap W_q = \emptyset$, where $1 \le q, t \le k$ and $t \ne q$. Note that a cluster corresponds to an element in the partition. Then, the $t$th feature value of the converted document $d'_i$ is calculated as follows:

$$d'_{it} = \sum_{w_j \in W_t} d_{ij} \qquad (4)$$

which is a linear sum of the feature values in $W_t$.

The divisive information-theoretic feature clustering (DC) algorithm, proposed by Dhillon et al. [27], calculates the distributions of words over classes, $P(C|w_j)$, $1 \le j \le m$, where $C = \{c_1, c_2, \ldots, c_p\}$, and uses Kullback-Leibler divergence to measure the dissimilarity between two distributions. The distribution of a cluster $W_t$ is calculated as follows:

$$P(C|W_t) = \sum_{w_j \in W_t} \frac{P(w_j)}{\sum_{w_j \in W_t} P(w_j)}\, P(C|w_j). \qquad (5)$$
The goal of DC is to minimize the following objective function:

$$\sum_{t=1}^{k} \sum_{w_j \in W_t} P(w_j)\, KL\big(P(C|w_j),\, P(C|W_t)\big) \qquad (6)$$

which takes the sum over all the $k$ clusters, where $k$ is specified by the user in advance.
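Given a hard partition of the words, Eqs.(5)-(6) can be evaluated as in the following sketch. This is an illustration with our own variable names, not the authors' DC implementation, which additionally reassigns words iteratively to minimize the objective.

```python
import numpy as np

def dc_objective(P_w, P_C_given_w, clusters):
    """Evaluate Eqs.(5)-(6) for a given hard partition of the words.

    P_w         : length-m array of word priors P(w_j)
    P_C_given_w : (m, p) array, row j = P(C|w_j)
    clusters    : list of index arrays, one per cluster W_t
    """
    total = 0.0
    for idx in clusters:
        weights = P_w[idx] / P_w[idx].sum()
        P_C_given_Wt = weights @ P_C_given_w[idx]          # Eq.(5): weighted average distribution
        for j in idx:                                      # sum of weighted KL divergences, Eq.(6)
            kl = np.sum(P_C_given_w[j] *
                        np.log((P_C_given_w[j] + 1e-12) / (P_C_given_Wt + 1e-12)))
            total += P_w[j] * kl
    return total
```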
3 OUR METHOD
There are some issues pertinent to most of the existing feature
clustering methods. Firstly, the parameter k, indicating the
desired number of extracted features, has to be specified in
advance. This places a burden on the user, since trial-and-error
has to be done until the appropriate number of extracted
features is found. Secondly, when calculating similarities, the
variance of the underlying cluster is not considered. Intuitively,
the distribution of the data in a cluster is an important factor
in the calculation of similarity. Thirdly, all words in a cluster
have the same degree of contribution to the resulting extracted
feature. Sometimes it may be better if more similar words are
allowed to have bigger degrees of contribution. Our feature
clustering algorithm is proposed to deal with these issues.
Suppose we are given a document set $D$ of $n$ documents $d_1, d_2, \ldots, d_n$, together with the feature vector $W$ of $m$ words $w_1, w_2, \ldots, w_m$ and $p$ classes $c_1, c_2, \ldots, c_p$, as specified in Section 2. We construct one word pattern for each word in $W$. For word $w_i$, its word pattern $x_i$ is defined, similarly as in [27], by

$$x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle = \langle P(c_1|w_i), P(c_2|w_i), \ldots, P(c_p|w_i) \rangle \qquad (7)$$

where

$$P(c_j|w_i) = \frac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}} \qquad (8)$$

for $1 \le j \le p$. Note that $d_{qi}$ indicates the number of occurrences of $w_i$ in document $d_q$, as described in Section 2. Also, $\delta_{qj}$ is defined as

$$\delta_{qj} = \begin{cases} 1, & \text{if document } d_q \text{ belongs to class } c_j; \\ 0, & \text{otherwise.} \end{cases} \qquad (9)$$
Therefore, we have $m$ word patterns in total. For example, suppose we have four documents $d_1$, $d_2$, $d_3$, and $d_4$ belonging to $c_1$, $c_1$, $c_2$, and $c_2$, respectively. Let the occurrences of $w_1$ in these documents be 1, 2, 3, and 4, respectively. Then, the word pattern $x_1$ of $w_1$ is:

$$P(c_1|w_1) = \frac{1\times 1 + 2\times 1 + 3\times 0 + 4\times 0}{1 + 2 + 3 + 4} = 0.3,$$
$$P(c_2|w_1) = \frac{1\times 0 + 2\times 0 + 3\times 1 + 4\times 1}{1 + 2 + 3 + 4} = 0.7,$$
$$x_1 = \langle 0.3, 0.7 \rangle. \qquad (10)$$
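The construction of word patterns by Eqs.(7)-(9) amounts to normalizing class-conditional word counts, as the following sketch shows; it reproduces the $x_1 = \langle 0.3, 0.7 \rangle$ example above. The code and names are our own illustration, not the authors' implementation.

```python
import numpy as np

def word_patterns(D, labels, p):
    """Build the word patterns of Eqs.(7)-(9).

    D[q, i]   : number of occurrences of word w_i in document d_q
    labels[q] : class index of d_q in {0, ..., p-1}
    Returns an (m, p) array whose ith row is x_i = <P(c_1|w_i), ..., P(c_p|w_i)>.
    """
    delta = np.zeros((D.shape[0], p))
    delta[np.arange(D.shape[0]), labels] = 1.0      # Eq.(9)
    counts = D.T @ delta                            # numerator of Eq.(8) for every (word, class) pair
    return counts / counts.sum(axis=1, keepdims=True)

# The example of Eq.(10): w_1 occurs 1, 2, 3, 4 times in documents of classes c_1, c_1, c_2, c_2.
D = np.array([[1], [2], [3], [4]])
print(word_patterns(D, np.array([0, 0, 1, 1]), 2))  # -> [[0.3  0.7]]
```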
It is these word patterns our clustering algorithm will work
on. Our goal is to group the words in W into clusters based
on these word patterns. A cluster contains a certain number
of word patterns, and is characterized by the product of p
one-dimensional Gaussian functions. Gaussian functions are
adopted because of their superiority over other functions in
performance [34], [35]. Let $G$ be a cluster containing $q$ word patterns $x_1, x_2, \ldots, x_q$. Let $x_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp} \rangle$, $1 \le j \le q$. Then the mean $m = \langle m_1, m_2, \ldots, m_p \rangle$ and the deviation $\sigma = \langle \sigma_1, \sigma_2, \ldots, \sigma_p \rangle$ of $G$ are defined as

$$m_i = \frac{\sum_{j=1}^{q} x_{ji}}{|G|} \qquad (11)$$

$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{q} (x_{ji} - m_i)^2}{|G|}} \qquad (12)$$

for $1 \le i \le p$, where $|G|$ denotes the size of $G$, i.e., the number of word patterns contained in $G$. The fuzzy similarity of a word pattern $x = \langle x_1, x_2, \ldots, x_p \rangle$ to cluster $G$ is defined by the following membership function:

$$\mu_G(x) = \prod_{i=1}^{p} \exp\!\left[-\left(\frac{x_i - m_i}{\sigma_i}\right)^2\right]. \qquad (13)$$
Notice that $0 \le \mu_G(x) \le 1$. A word pattern close to the mean of a cluster is regarded to be very similar to this cluster, i.e., $\mu_G(x) \approx 1$. On the contrary, a word pattern far distant from a cluster is hardly similar to this cluster, i.e., $\mu_G(x) \approx 0$. For example, suppose $G_1$ is an existing cluster with a mean vector $m_1 = \langle 0.4, 0.6 \rangle$ and a deviation vector $\sigma_1 = \langle 0.2, 0.3 \rangle$. The fuzzy similarity of the word pattern $x_1$ shown in Eq.(10) to cluster $G_1$ becomes

$$\mu_{G_1}(x_1) = \exp\!\left[-\left(\frac{0.3-0.4}{0.2}\right)^2\right] \times \exp\!\left[-\left(\frac{0.7-0.6}{0.3}\right)^2\right] = 0.7788 \times 0.8948 = 0.6969. \qquad (14)$$
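Eq.(13) is simply the product of $p$ one-dimensional Gaussian factors, and the example of Eq.(14) can be reproduced with a few lines of code (an illustrative sketch; the names are ours).

```python
import numpy as np

def membership(x, mean, dev):
    """Fuzzy similarity of a word pattern x to a cluster, Eq.(13):
    the product of p one-dimensional Gaussian factors."""
    return float(np.prod(np.exp(-((x - mean) / dev) ** 2)))

# The example of Eq.(14): x_1 = <0.3, 0.7>, m_1 = <0.4, 0.6>, sigma_1 = <0.2, 0.3>.
x1 = np.array([0.3, 0.7])
print(round(membership(x1, np.array([0.4, 0.6]), np.array([0.2, 0.3])), 4))  # -> 0.6969
```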
3.1 Self-Constructing Clustering
Our clustering algorithm is an incremental, self-constructing
learning approach. Word patterns are considered one by one.
The user does not need to have any idea about the number
of clusters in advance. No clusters exist at the beginning, and
clusters can be created if necessary. For each word pattern,
the similarity of this word pattern to each existing cluster is
calculated to decide whether it is combined into an existing
cluster or a new cluster is created. Once a new cluster is
created, the corresponding membership function should be
initialized. On the contrary, when the word pattern is combined
into an existing cluster, the membership function of that cluster
should be updated accordingly.
Let $k$ be the number of currently existing clusters. The clusters are $G_1, G_2, \ldots, G_k$, respectively. Each cluster $G_j$ has mean $m_j = \langle m_{j1}, m_{j2}, \ldots, m_{jp} \rangle$ and deviation $\sigma_j = \langle \sigma_{j1}, \sigma_{j2}, \ldots, \sigma_{jp} \rangle$. Let $S_j$ be the size of cluster $G_j$. Initially, we have $k = 0$, so no clusters exist at the beginning. For each word pattern $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle$, $1 \le i \le m$, we calculate, according to Eq.(13), the similarity of $x_i$ to each existing cluster, i.e.,

$$\mu_{G_j}(x_i) = \prod_{q=1}^{p} \exp\!\left[-\left(\frac{x_{iq} - m_{jq}}{\sigma_{jq}}\right)^2\right] \qquad (15)$$
for $1 \le j \le k$. We say that $x_i$ passes the similarity test on cluster $G_j$ if

$$\mu_{G_j}(x_i) \ge \rho \qquad (16)$$
where ρ, 0 ≤ ρ ≤ 1, is a predefined threshold. If the user
intends to have larger clusters, then he/she can give a smaller
threshold. Otherwise, a bigger threshold can be given. As the
threshold increases, the number of clusters also increases. Note
that, as usual, the power in Eq.(15) is two [34], [35]. Its value
has an effect on the number of clusters obtained. A larger value
will make the boundaries of the Gaussian function sharper,
and more clusters will be obtained for a given threshold. On
the contrary, a smaller value will make the boundaries of
the Gaussian function smoother, and fewer clusters will be
obtained instead.
Two cases may occur. Firstly, there are no existing fuzzy clusters on which $x_i$ has passed the similarity test. For this case, we assume that $x_i$ is not similar enough to any existing cluster and a new cluster $G_h$, $h = k + 1$, is created with

$$m_h = x_i, \qquad \sigma_h = \sigma_0 \qquad (17)$$

where $\sigma_0 = \langle \sigma_0, \ldots, \sigma_0 \rangle$ is a user-defined constant vector. Note that the new cluster $G_h$ contains only one member, the word pattern $x_i$, at this point. Estimating the deviation of a cluster by Eq.(12) is impossible or inaccurate if the cluster contains few members. In particular, the deviation of a new cluster is 0, since it contains only one member. We cannot use zero deviation in the calculation of fuzzy similarities. Therefore, we initialize the deviation of a newly created cluster by $\sigma_0$, as indicated in Eq.(17). Of course, the number of clusters is increased by 1 and the size of cluster $G_h$, $S_h$, should be initialized, i.e.,

$$k = k + 1, \qquad S_h = 1. \qquad (18)$$

Secondly, if there are existing clusters on which $x_i$ has passed the similarity test, let cluster $G_t$ be the cluster with the largest membership degree, i.e.,

$$t = \arg\max_{1 \le j \le k} \mu_{G_j}(x_i). \qquad (19)$$
In this case, we regard $x_i$ to be most similar to cluster $G_t$, and $m_t$ and $\sigma_t$ of cluster $G_t$ should be modified to include $x_i$ as its member. The modification to cluster $G_t$ is described as follows:

$$m_{tj} = \frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}, \qquad (20)$$

$$\sigma_{tj} = \sqrt{A - B} + \sigma_0, \qquad (21)$$

$$A = \frac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t \times m_{tj}^2 + x_{ij}^2}{S_t}, \qquad (22)$$

$$B = \frac{S_t + 1}{S_t}\left(\frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}\right)^2, \qquad (23)$$

for $1 \le j \le p$, and

$$S_t = S_t + 1. \qquad (24)$$
Eqs.(20)-(21) can be derived easily from Eqs.(11)-(12). Note
that k is not changed in this case. Let’s give an example,
following Eq.(10) and Eq.(14). Suppose $x_1$ is most similar to cluster $G_1$, the size of cluster $G_1$ is 4, and the initial deviation $\sigma_0$ is 0.1. Then cluster $G_1$ is modified as follows:

$$m_{11} = \frac{4 \times 0.4 + 0.3}{4 + 1} = 0.38, \qquad m_{12} = \frac{4 \times 0.6 + 0.7}{4 + 1} = 0.62, \qquad m_1 = \langle 0.38, 0.62 \rangle,$$
$$A_{11} = \frac{(4 - 1)(0.2 - 0.1)^2 + 4 \times 0.38^2 + 0.3^2}{4}, \qquad B_{11} = \frac{4 + 1}{4}\left(\frac{4 \times 0.38 + 0.3}{4 + 1}\right)^2,$$
$$\sigma_{11} = \sqrt{A_{11} - B_{11}} + 0.1 = 0.1937,$$
$$A_{12} = \frac{(4 - 1)(0.3 - 0.1)^2 + 4 \times 0.62^2 + 0.7^2}{4}, \qquad B_{12} = \frac{4 + 1}{4}\left(\frac{4 \times 0.62 + 0.7}{4 + 1}\right)^2,$$
$$\sigma_{12} = \sqrt{A_{12} - B_{12}} + 0.1 = 0.2769,$$
$$\sigma_1 = \langle 0.1937, 0.2769 \rangle, \qquad S_1 = 4 + 1 = 5.$$
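The update of Eqs.(20)-(24), read as in the worked example above (the updated mean from Eq.(20) is the value used inside $A$ and $B$), can be coded directly; the sketch below reproduces the numbers 0.38, 0.62, 0.1937, and 0.2769. It is an illustration with our own names, not the authors' code.

```python
import numpy as np

def incorporate(mean, dev, size, x, sigma0):
    """Add word pattern x to a cluster, per Eqs.(20)-(24).
    mean, dev, x are length-p arrays; size is the current cluster size S_t."""
    new_mean = (size * mean + x) / (size + 1)                          # Eq.(20)
    A = ((size - 1) * (dev - sigma0) ** 2
         + size * new_mean ** 2 + x ** 2) / size                       # Eq.(22)
    B = (size + 1) / size * ((size * new_mean + x) / (size + 1)) ** 2  # Eq.(23)
    new_dev = np.sqrt(A - B) + sigma0                                  # Eq.(21)
    return new_mean, new_dev, size + 1                                 # Eq.(24)

# Reproduces the example: m_1 = <0.4, 0.6>, sigma_1 = <0.2, 0.3>, S_1 = 4, sigma_0 = 0.1.
m, s, S = incorporate(np.array([0.4, 0.6]), np.array([0.2, 0.3]), 4,
                      np.array([0.3, 0.7]), 0.1)
print(m.round(4), s.round(4), S)   # -> [0.38 0.62] [0.1937 0.2769] 5
```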
The above process is iterated until all the word patterns have
been processed. Consequently, we have k clusters. The whole
clustering algorithm can be summarized below.
Initialization:
  # of original word patterns: m
  # of classes: p
  Threshold: ρ
  Initial deviation: σ_0
  Initial # of clusters: k = 0
Input:
  x_i = <x_{i1}, x_{i2}, ..., x_{ip}>, 1 ≤ i ≤ m
Output:
  Clusters G_1, G_2, ..., G_k
procedure Self-Constructing-Clustering-Algorithm
  for each word pattern x_i, 1 ≤ i ≤ m
    temp_W = { G_j | μ_{G_j}(x_i) ≥ ρ, 1 ≤ j ≤ k };
    if (temp_W == ∅)
      A new cluster G_h, h = k + 1, is created by Eqs.(17)-(18);
    else
      let G_t ∈ temp_W be the cluster to which x_i is closest by Eq.(19);
      Incorporate x_i into G_t by Eqs.(20)-(24);
    endif;
  endfor;
  return with the created k clusters;
endprocedure
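The whole procedure can also be sketched compactly in code. The following is an illustrative Python rendering of the pseudocode above (the authors' experiments used MATLAB); the guard under the square root is our own safeguard against round-off, and the patterns are processed in the order given, with the feeding-order heuristic discussed below applied beforehand.

```python
import numpy as np

def self_constructing_clustering(X, rho, sigma0):
    """Self-constructing clustering over word patterns X (an m x p array).

    rho    : similarity threshold of Eq.(16)
    sigma0 : initial deviation of a new cluster, Eq.(17)
    Returns the cluster means, deviations, and sizes.
    """
    means, devs, sizes = [], [], []
    for x in X:
        # similarity of x to every existing cluster, Eq.(15)
        sims = np.array([np.prod(np.exp(-((x - m) / d) ** 2))
                         for m, d in zip(means, devs)])
        if sims.size == 0 or sims.max() < rho:            # similarity test failed, Eq.(16)
            means.append(x.astype(float).copy())          # new cluster, Eqs.(17)-(18)
            devs.append(np.full(x.shape, float(sigma0)))
            sizes.append(1)
        else:
            t = int(sims.argmax())                        # most similar cluster, Eq.(19)
            S, m = sizes[t], means[t]
            new_m = (S * m + x) / (S + 1)                 # Eq.(20)
            A = ((S - 1) * (devs[t] - sigma0) ** 2
                 + S * new_m ** 2 + x ** 2) / S           # Eq.(22)
            B = (S + 1) / S * ((S * new_m + x) / (S + 1)) ** 2   # Eq.(23)
            # guard against tiny negative values caused by round-off
            devs[t] = np.sqrt(np.maximum(A - B, 0.0)) + sigma0   # Eq.(21)
            means[t] = new_m
            sizes[t] = S + 1                              # Eq.(24)
    return means, devs, sizes
```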
Note that the word patterns in a cluster have a high degree of
similarity to each other. Besides, when new training patterns
are considered, the existing clusters can be adjusted or new
clusters can be created, without the necessity of generating
the whole set of clusters from scratch.
Note that the order in which the word patterns are fed
in influences the clusters obtained. We apply a heuristic to
determine the order. We sort all the patterns, in decreasing
order, by their largest components. Then the word patterns are
fed in this order. In this way, more significant patterns will be
fed in first and likely become the core of the underlying cluster.
For example, let $x_1 = \langle 0.1, 0.3, 0.6 \rangle$, $x_2 = \langle 0.3, 0.3, 0.4 \rangle$, and $x_3 = \langle 0.8, 0.1, 0.1 \rangle$ be three word patterns. The largest components in these word patterns are 0.6, 0.4, and 0.8, respectively. The sorted list is 0.8, 0.6, 0.4. So the order of feeding is $x_3$, $x_1$, $x_2$. This heuristic seems to work well.
We briefly discuss here the computational cost of our method and compare it with those of DC [27], IOC [14], and IG [10]. For an input pattern, we have to calculate the similarity between the input pattern and every existing cluster. Each pattern consists of $p$ components, where $p$ is the number of classes in the document set. Therefore, in the worst case, the time complexity of our method is $O(mkp)$, where $m$ is the number of original features and $k$ is the number of clusters finally obtained. For DC, the complexity is $O(mkpt)$, where $t$ is the number of iterations to be done. The complexity of IG is $O(mp + m\log m)$, and the complexity of IOC is $O(mkpn)$, where $n$ is the number of documents involved. Apparently, IG is the quickest one. Our method is better than DC and IOC.
3.2 Feature Extraction
Formally, feature extraction can be expressed in the following form:

$$D' = D\,T \qquad (25)$$

where

$$D = [d_1\; d_2\; \cdots\; d_n]^T, \qquad (26)$$
$$D' = [d'_1\; d'_2\; \cdots\; d'_n]^T, \qquad (27)$$
$$T = \begin{bmatrix} t_{11} & \ldots & t_{1k} \\ t_{21} & \ldots & t_{2k} \\ \vdots & & \vdots \\ t_{m1} & \ldots & t_{mk} \end{bmatrix}, \qquad (28)$$

with

$$d_i = [d_{i1}\; d_{i2}\; \cdots\; d_{im}], \qquad d'_i = [d'_{i1}\; d'_{i2}\; \cdots\; d'_{ik}]$$

for $1 \le i \le n$. Clearly, $T$ is a weighting matrix. The goal of feature reduction is achieved by finding an appropriate $T$ such that $k$ is smaller than $m$. In the divisive information-theoretic feature clustering algorithm [27] described in Section 2.2, the elements of $T$ in Eq.(25) are binary and can be defined as follows:

$$t_{ij} = \begin{cases} 1, & \text{if } w_i \in W_j; \\ 0, & \text{otherwise} \end{cases} \qquad (29)$$

where $1 \le i \le m$ and $1 \le j \le k$. That is, if a word $w_i$ belongs to cluster $W_j$, $t_{ij}$ is 1; otherwise $t_{ij}$ is 0.
By applying our clustering algorithm, word patterns have been grouped into clusters, and the words in the feature vector $W$ are also clustered accordingly. For one cluster, we have one extracted feature. Since we have $k$ clusters, we have $k$ extracted features. The elements of $T$ are derived based on the obtained clusters, and feature extraction will be done. We propose three weighting approaches: hard, soft, and mixed. In the hard-weighting approach, each word is only allowed to belong to one cluster, and so it only contributes to one new extracted feature. In this case, the elements of $T$ in Eq.(25) are defined as follows:

$$t_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_{1 \le \alpha \le k} \mu_{G_\alpha}(x_i); \\ 0, & \text{otherwise.} \end{cases} \qquad (30)$$

Note that if $j$ is not unique in Eq.(30), one of them is chosen randomly. In the soft-weighting approach, each word is allowed to contribute to all new extracted features, with the degrees depending on the values of the membership functions. The elements of $T$ in Eq.(25) are defined as follows:

$$t_{ij} = \mu_{G_j}(x_i). \qquad (31)$$

The mixed-weighting approach is a combination of the hard-weighting approach and the soft-weighting approach. For this case, the elements of $T$ in Eq.(25) are defined as follows:

$$t_{ij} = \gamma \times t^{H}_{ij} + (1 - \gamma) \times t^{S}_{ij} \qquad (32)$$

where $t^{H}_{ij}$ is obtained by Eq.(30), $t^{S}_{ij}$ is obtained by Eq.(31), and $\gamma$ is a user-defined constant lying between 0 and 1. Note that $\gamma$ is not related to the clustering; it concerns the merging of the component features in a cluster into a resulting feature. The merging can be made 'hard' or 'soft' by setting $\gamma$ to 1 or 0. By selecting the value of $\gamma$, we provide flexibility to the user. When the similarity threshold is small, the number of clusters is small and each cluster covers more training patterns. In this case, a smaller $\gamma$ will favor soft-weighting and get a higher accuracy. On the contrary, when the similarity threshold is large, the number of clusters is large and each cluster covers fewer training patterns. In this case, a larger $\gamma$ will favor hard-weighting and get a higher accuracy.
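The three weighting schemes of Eqs.(30)-(32) can be generated directly from the cluster parameters, as sketched below (our own naming; ties in Eq.(30) are broken here by taking the first maximum rather than a random one).

```python
import numpy as np

def weighting_matrices(X, means, devs, gamma):
    """Build the hard, soft, and mixed weighting matrices of Eqs.(30)-(32).

    X is the (m, p) array of word patterns; means and devs are the cluster
    parameters produced by the clustering step; gamma lies in [0, 1].
    """
    m, k = X.shape[0], len(means)
    T_S = np.empty((m, k))
    for j in range(k):                                    # Eq.(31): t_ij = mu_Gj(x_i)
        T_S[:, j] = np.prod(np.exp(-((X - means[j]) / devs[j]) ** 2), axis=1)
    T_H = np.zeros((m, k))
    T_H[np.arange(m), T_S.argmax(axis=1)] = 1.0           # Eq.(30): winner takes all
    T_M = gamma * T_H + (1 - gamma) * T_S                 # Eq.(32)
    return T_H, T_S, T_M

# The reduced document set is then D' = D @ T for whichever T is chosen (Eq.(25)).
```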
3.3 Text Classification
Given a set D of training documents, text classification can
be done as follows. We specify the similarity threshold ρ
for Eq.(16), and apply our clustering algorithm. Assume that
k clusters are obtained for the words in the feature vector
W. Then we find the weighting matrix T and convert D
to D

by Eq.(25). Using D

as training data, a classifier
based on support vector machines (SVM) is built. Note that
any classifying technique other than SVM can be applied.
Joachims [36] showed that SVM is better than other methods
for text categorization. SVM is a kernel method which finds
the maximum margin hyperplane in feature space separating
the images of the training patterns into two groups [37], [38],
[39]. To make the method more flexible and robust, some
patterns need not be correctly classified by the hyperplane, but
the misclassified patterns should be penalized. Therefore, slack variables $\xi_i$ are introduced to account for misclassifications. The objective function and constraints of the classification problem can be formulated as:

$$\min_{w, b} \;\frac{1}{2}w^{T}w + C\sum_{i=1}^{l} \xi_i \qquad (33)$$
$$\text{s.t.}\;\; y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l
where $l$ is the number of training patterns, $C$ is a parameter which gives a tradeoff between maximum margin and classification error, and $y_i$, being +1 or −1, is the target label of pattern $x_i$. Note that $\phi: X \to F$ is a mapping from the input space to the feature space $F$, where patterns are more easily separated, and $w^{T}\phi(x_i) + b = 0$ is the hyperplane to be derived, with $w$ and $b$ being the weight vector and offset, respectively.

TABLE 3
Three clusters obtained.

cluster   size S   mean m                deviation σ
G_1       3        <1, 0>                <0.5, 0.5>
G_2       5        <0.08, 0.92>          <0.6095, 0.6095>
G_3       2        <0.5833, 0.4167>      <0.6179, 0.6179>
An SVM described above can only separate two classes, $y_i = +1$ and $y_i = -1$. We follow the idea in [36] to construct an SVM-based classifier. For $p$ classes, we create $p$ SVMs, one SVM for each class. For the SVM of class $c_v$, $1 \le v \le p$, the training patterns of class $c_v$ are treated as having $y_i = +1$ and the training patterns of the other classes are treated as having $y_i = -1$. The classifier is then the aggregation of these SVMs. Now we are ready for classifying unknown documents. Suppose $d$ is an unknown document. We first convert $d$ to $d'$ by

$$d' = d\,T. \qquad (34)$$

Then we feed $d'$ to the classifier. We get $p$ values, one from each SVM. Then $d$ belongs to those classes with 1 appearing at the outputs of their corresponding SVMs. For example, consider a case of three classes $c_1$, $c_2$, and $c_3$. If the three SVMs output 1, −1, and 1, respectively, then the predicted classes will be $c_1$ and $c_3$ for this document. If the three SVMs output −1, 1, and 1, respectively, the predicted classes will be $c_2$ and $c_3$.
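As a rough stand-in for the classifier just described, one SVM per class can be obtained with an off-the-shelf one-vs-rest wrapper. The sketch below uses scikit-learn's LinearSVC purely for illustration; the library choice and the untuned parameters are ours, not the setup used by the authors.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_and_classify(D_train, Y_train, T, d_new):
    """One-vs-rest SVM classification on the reduced representation.

    D_train : (n, m) document-term matrix of the training set
    Y_train : (n, p) binary label matrix (a document may belong to several classes)
    T       : (m, k) weighting matrix from the feature clustering step
    d_new   : length-m count vector of an unknown document
    """
    clf = OneVsRestClassifier(LinearSVC(C=1.0))   # p linear SVMs, one per class
    clf.fit(D_train @ T, Y_train)                 # train on D' = D T, Eq.(25)
    d_prime = (d_new @ T).reshape(1, -1)          # Eq.(34)
    return clf.predict(d_prime)[0]                # binary vector of predicted class memberships
```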
4 AN EXAMPLE
We give an example here to illustrate how our method works. Let $D$ be a simple document set, containing 9 documents $d_1, d_2, \ldots, d_9$ of two classes $c_1$ and $c_2$, with 10 words 'office', 'building', ..., 'fridge' in the feature vector $W$, as shown in Table 1. For simplicity, we denote the ten words as $w_1, w_2, \ldots, w_{10}$, respectively.
We calculate the ten word patterns $x_1, x_2, \ldots, x_{10}$ according to Eq.(7) and Eq.(8). For example, $x_6 = \langle P(c_1|w_6), P(c_2|w_6) \rangle$, and $P(c_2|w_6)$ is calculated by Eq.(35). The resulting word patterns are shown in Table 2. Note that each word pattern is a two-dimensional vector, since there are two classes involved in $D$.

We run our self-constructing clustering algorithm, by setting $\sigma_0 = 0.5$ and $\rho = 0.64$, on the word patterns and obtain 3 clusters $G_1$, $G_2$, and $G_3$, which are shown in Table 3. The fuzzy similarity of each word pattern to each cluster is shown in Table 4. The weighting matrices $T^H$, $T^S$, and $T^M$ obtained by hard-weighting, soft-weighting, and mixed-weighting (with $\gamma = 0.8$), respectively, are shown in Table 5. The transformed data sets $D'_H$, $D'_S$, and $D'_M$ obtained by Eq.(25) for the different cases of weighting are shown in Table 6.
Based on $D'_H$, $D'_S$, or $D'_M$, a classifier with two SVMs is built. Suppose $d$ is an unknown document and $d = \langle 0, 1, 1, 1, 1, 1, 0, 1, 1, 1 \rangle$. We first convert $d$ to $d'$ by Eq.(34). Then, the transformed document is obtained as $d'_H = dT^H = \langle 2, 4, 2 \rangle$, $d'_S = dT^S = \langle 2.5591, 4.3478, 3.9964 \rangle$, or $d'_M = dT^M = \langle 2.1118, 4.0696, 2.3993 \rangle$. Then the transformed unknown document is fed to the classifier. For this example, the classifier concludes that $d$ belongs to $c_2$.
5 EXPERIMENTAL RESULTS
In this section, we present experimental results to show the
effectiveness of our fuzzy self-constructing feature clustering
method. Three well-known datasets for text classification
research: 20 Newsgroups [1], RCV1 [40], and Cade12 [41], are
used in the following experiments. We compare our method with three other feature reduction methods: IG [10], IOC [14], and DC [27]. As described in Section 2, IG is one of the state-of-the-art feature selection approaches, IOC is an incremental feature extraction algorithm, and DC is a feature clustering approach. For convenience, our method is abbreviated as FFC, standing for Fuzzy Feature Clustering, in this section. In our method, $\sigma_0$ is set to 0.25. The value was determined by investigation, and $\sigma_0$ is kept at the constant 0.25 for all the following experiments. Furthermore, we use H-FFC, S-FFC, and M-FFC to denote hard-weighting, soft-weighting, and mixed-weighting, respectively. We run a classifier built on support vector machines [42], as described in Section 3.3, on the extracted features obtained by the different methods. We use a computer with an Intel(R) Core(TM)2 Quad CPU Q6600 at 2.40 GHz and 4 GB of RAM to conduct the experiments. The programming language used is MATLAB 7.0.
To compare the classification effectiveness of each method, we adopt performance measures in terms of microaveraged precision (MicroP), microaveraged recall (MicroR), microaveraged F1 (MicroF1), and microaveraged accuracy (MicroAcc) [4], [43], [44], defined as follows:

$$\mathrm{MicroP} = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p}(TP_i + FP_i)} \qquad (36)$$

$$\mathrm{MicroR} = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p}(TP_i + FN_i)} \qquad (37)$$

$$\mathrm{MicroF1} = \frac{2\,\mathrm{MicroP} \times \mathrm{MicroR}}{\mathrm{MicroP} + \mathrm{MicroR}} \qquad (38)$$

$$\mathrm{MicroAcc} = \frac{\sum_{i=1}^{p}(TP_i + TN_i)}{\sum_{i=1}^{p}(TP_i + TN_i + FP_i + FN_i)} \qquad (39)$$

where $p$ is the number of classes. $TP_i$ (true positives with respect to $c_i$) is the number of $c_i$ test documents that are correctly classified to $c_i$. $TN_i$ (true negatives with respect to $c_i$) is the number of non-$c_i$ test documents that are classified to non-$c_i$. $FP_i$ (false positives with respect to $c_i$) is the number of non-$c_i$ test documents that are incorrectly classified to $c_i$. $FN_i$ (false negatives with respect to $c_i$) is the number of $c_i$ test documents that are classified to non-$c_i$.
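Because the micro-averaged measures pool the per-class counts before forming ratios, they can be computed directly from binary membership matrices, as in the following sketch (our own naming, given only for illustration).

```python
import numpy as np

def micro_measures(Y_true, Y_pred):
    """Micro-averaged precision, recall, F1, and accuracy of Eqs.(36)-(39).

    Y_true, Y_pred : (n, p) binary matrices of true and predicted class memberships.
    """
    TP = np.sum((Y_true == 1) & (Y_pred == 1))
    TN = np.sum((Y_true == 0) & (Y_pred == 0))
    FP = np.sum((Y_true == 0) & (Y_pred == 1))
    FN = np.sum((Y_true == 1) & (Y_pred == 0))
    micro_p = TP / (TP + FP)
    micro_r = TP / (TP + FN)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    micro_acc = (TP + TN) / (TP + TN + FP + FN)
    return micro_p, micro_r, micro_f1, micro_acc
```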
5.1 Experiment 1: 20 Newsgroups Dataset
The 20 Newsgroups collection contains about 20,000 articles
taken from the Usenet newsgroups. These articles are evenly
TABLE 1
A simple document set D.

       office  building  line   floor  bedroom  kitchen  apartment  internet  WC     fridge
       (w_1)   (w_2)     (w_3)  (w_4)  (w_5)    (w_6)    (w_7)      (w_8)     (w_9)  (w_10)   class
d_1    0       1         0      0      1        1        0          0         0      1        c_1
d_2    0       0         0      0      0        2        1          1         0      0        c_1
d_3    0       0         0      0      0        0        1          0         0      0        c_1
d_4    0       0         1      0      2        1        2          1         0      1        c_1
d_5    0       0         0      1      0        1        0          0         1      0        c_2
d_6    2       1         1      0      0        1        0          0         1      0        c_2
d_7    3       2         1      3      0        1        0          1         1      0        c_2
d_8    1       0         1      1      0        1        0          0         0      0        c_2
d_9    1       1         1      1      0        0        0          0         0      0        c_2

$$P(c_2|w_6) = \frac{1\times 0 + 2\times 0 + 0\times 0 + 1\times 0 + 1\times 1 + 1\times 1 + 1\times 1 + 1\times 1 + 0\times 1}{1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0} = 0.50. \qquad (35)$$
TABLE 2
Word patterns of W.

             x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8    x_9    x_10
P(c_1|w_i)   0.00   0.20   0.20   0.00   1.00   0.50   1.00   0.67   0.00   1.00
P(c_2|w_i)   1.00   0.80   0.80   1.00   0.00   0.50   0.00   0.33   1.00   0.00
TABLE 4
Fuzzy similarities of word patterns to three clusters.

similarity   x_1     x_2     x_3     x_4     x_5     x_6     x_7     x_8     x_9     x_10
μ_G1(x)      0.0003  0.0060  0.0060  0.0003  1.0000  0.1353  1.0000  0.4111  0.0003  1.0000
μ_G2(x)      0.9661  0.9254  0.9254  0.9661  0.0105  0.3869  0.0105  0.1568  0.9661  0.0105
μ_G3(x)      0.1682  0.4631  0.4631  0.1682  0.4027  0.9643  0.4027  0.9643  0.1682  0.4027
[Figure] Fig. 2. Execution time (sec) of different methods on 20 Newsgroups data. (Log-scale plot of execution time versus the number of extracted features; curves for FFC, DC, IG, and IOC.)
distributed over 20 classes and each class has about 1,000 arti-
cles, as shown in Figure 1(a). In this figure, the x-axis indicates
the class number and the y-axis indicates the fraction of the
articles of each class. We use two-thirds of the documents for
training and the rest for testing. After preprocessing, we have
25,718 features, or words, for this data set.
Figure 2 shows the execution time (sec) of different feature
reduction methods on the 20 Newsgroups dataset. Since H-
FFC, S-FFC, and M-FFC have the same clustering phase, they
have the same execution time and thus we use FFC to denote
them in Figure 2. In this figure, the horizontal axis indicates
the number of extracted features. To obtain different numbers
of extracted features, different values of ρ are used in FFC.
The number of extracted features is identical to the number
of clusters. For the other methods, the number of extracted
features should be specified in advance by the user. Table 7
lists values of certain points in Figure 2. Different values
of ρ are used in FFC and are listed in the table. Note that
for IG, each word is given a weight. The words of top k
weights in W are selected as the extracted features in W′.
Therefore, the execution time is basically the same for any
value of k. Obviously, our method runs much faster than DC
and IOC. For example, our method needs 8.61 seconds for
20 extracted features, while DC requires 88.38 seconds and
IOC requires 6943.40 seconds. For 84 extracted features, our
method only needs 17.68 seconds, but DC and IOC require
293.98 and 28098.05 seconds, respectively. As the number
of extracted features increases, DC and IOC become significantly
slower. In particular, when the number of extracted features
exceeds 280, IOC spends more than 100,000 seconds without
getting finished, as indicated by dashes in Table 7.
Figure 3 shows the MicroAcc (%) of the 20 Newsgroups
data set obtained by different methods, based on the extracted
features previously obtained. The vertical axis indicates the
MicroAcc values (%) and the horizontal axis indicates the
number of extracted features. Table 8 lists values of certain
points in Figure 3. Note that γ is set to 0.5 for M-FFC. Also,
no accuracies are listed for IOC when the number of extracted
features exceeds 280. As shown in the figure, IG performs the
worst in classification accuracy, especially when the number of
extracted features is small. For example, IG only gets 95.79%
in accuracy for 20 extracted features. S-FFC works very well
TABLE 5
Weighting matrices: hard $T^H$, soft $T^S$, and mixed $T^M$.

$$T^H = \begin{bmatrix} 0&1&0\\ 0&1&0\\ 0&1&0\\ 0&1&0\\ 1&0&0\\ 0&0&1\\ 1&0&0\\ 0&0&1\\ 0&1&0\\ 1&0&0 \end{bmatrix},\quad
T^S = \begin{bmatrix} 0.0003&0.9661&0.1682\\ 0.0060&0.9254&0.4631\\ 0.0060&0.9254&0.4631\\ 0.0003&0.9661&0.1682\\ 1.0000&0.0105&0.4027\\ 0.1353&0.3869&0.9643\\ 1.0000&0.0105&0.4027\\ 0.4111&0.1568&0.9643\\ 0.0003&0.9661&0.1682\\ 1.0000&0.0105&0.4027 \end{bmatrix},\quad
T^M = \begin{bmatrix} 0.0001&0.9932&0.0336\\ 0.0012&0.9851&0.0926\\ 0.0012&0.9851&0.0926\\ 0.0001&0.9932&0.0336\\ 1.0000&0.0021&0.0805\\ 0.0271&0.0774&0.9929\\ 1.0000&0.0021&0.0805\\ 0.0822&0.0314&0.9929\\ 0.0001&0.9932&0.0336\\ 1.0000&0.0021&0.0805 \end{bmatrix}.$$
TABLE 6
Transformed data sets: hard $D'_H$, soft $D'_S$, and mixed $D'_M$.

         D'_H                    D'_S                             D'_M
       (w'_1) (w'_2) (w'_3)    (w'_1)   (w'_2)   (w'_3)         (w'_1)   (w'_2)   (w'_3)
d'_1     2      1      1       2.1413   1.3333   2.2327         2.0283   1.0667   1.2465
d'_2     1      0      3       1.6818   0.9411   3.2955         1.1364   0.1882   3.0591
d'_3     1      0      0       1.0000   0.0105   0.4027         1.0000   0.0021   0.0805
d'_4     5      1      2       5.5524   1.5217   4.4051         5.1105   1.1043   2.4810
d'_5     0      2      1       0.1360   2.3192   1.3006         0.0272   2.0638   1.0601
d'_6     0      5      1       0.1483   5.1362   2.3949         0.0297   5.0272   1.2790
d'_7     0     10      2       0.5667  10.0829   4.4950         0.1133  10.0166   2.4990
d'_8     0      3      1       0.1420   3.2446   1.7637         0.0284   3.0489   1.1527
d'_9     0      4      0       0.0126   3.7831   1.2625         0.0025   3.9566   0.2525
[Figure] Fig. 1. Class distributions of three datasets: (a) 20 Newsgroups, (b) RCV1, (c) Cade12. (Each panel plots the ratio of each class to all documents against the class index.)
TABLE 7
Sampled execution times (sec) of different methods on 20 Newsgroups data.
# of extracted features   20       58       84       120      203      280      521      1187     1453
threshold (ρ) (0.01) (0.02) (0.03) (0.06) (0.12) (0.19) (0.23) (0.32) (0.36)
IG 19.98 19.98 19.98 19.98 19.98 19.98 19.98 19.98 19.98
DC 88.38 204.90 293.98 486.57 704.24 972.69 1973.3 3425.04 5012.79
IOC 6943.40 19397.91 28098.05 39243.00 67513.52 93010.60 ——— ——— ———
FFC 8.61 13.95 17.68 23.44 39.36 55.30 79.79 155.99 185.24
when the number of extracted features is smaller than 58.
For example, S-FFC gets 98.46% in accuracy for 20 extracted
features. H-FFC and M-FFC perform well in accuracy all the
time, except for the case of 20 extracted features. For example,
H-FFC and M-FFC get 98.43% and 98.54%, respectively, in
accuracy for 58 extracted features, 98.63% and 98.56% for 203
extracted features, and 98.65% and 98.59% for 521 extracted
features. Although the number of extracted features is affected
by the threshold value, the classification accuracies obtained
keep fairly stable. DC and IOC perform a little bit worse
in accuracy when the number of extracted features is small.
For example, they get 97.73% and 97.09%, respectively, in
accuracy for 20 extracted features. Note that while the whole
25,718 features are used for classification, the accuracy is
98.45%.
Table 9 lists values of the MicroP, MicroR, and MicroF1 (%)
of the 20 Newsgroups dataset obtained by different methods.
Note that γ is set to 0.5 for M-FFC. From this table, we can
see that S-FFC can, in general, get best results for MicroF1,
followed by M-FFC, H-FFC, and DC in order. For MicroP, H-
FFC performs a little bit better than S-FFC. But for MicroR,
S-FFC performs better than H-FFC by a greater margin than in
the MicroP case. M-FFC performs well for MicroP, MicroR,
and MicroF1. For MicroF1, M-FFC performs only a little bit
worse than S-FFC, with almost identical performance when
the number of features is less than 500. For MicroP, M-FFC
TABLE 8
Microaveraged accuracy (%) of different methods for 20 Newsgroups data.
# of extracted features 20 58 84 120 203 280 521 1187 1453
threshold (ρ) (0.01) (0.02) (0.03) (0.06) (0.12) (0.19) (0.23) (0.32) (0.36)
microaveraged IG 95.79 96.20 96.32 96.49 96.81 96.94 97.33 97.86 97.95
accuracy DC 97.73 98.19 98.38 98.36 98.43 98.56 98.28 98.63 98.26
(MicroAcc) IOC 97.09 97.44 97.50 97.51 97.68 97.62 —- —- —-
H-FFC 97.94 98.43 98.44 98.41 98.63 98.59 98.65 98.61 98.40
S-FFC 98.46 98.57 98.62 98.59 98.64 98.69 98.69 98.70 98.72
M-FFC 98.10 98.54 98.57 98.58 98.56 98.60 98.59 98.60 98.62
Full features (25718 features): MicroAcc = 98.45
TABLE 9
Microaveraged Precision, Recall, and F1 (%) of different methods for 20 Newsgroups data.
# of extracted features 20 58 84 120 203 280 521 1187 1453
threshold (ρ) (0.01) (0.02) (0.03) (0.06) (0.12) (0.19) (0.23) (0.32) (0.36)
microaveraged IG 91.91 90.77 89.78 89.07 89.53 90.29 89.57 91.19 91.17
precision DC 88.00 90.97 91.69 91.90 93.07 92.51 93.28 93.82 94.27
(MicroP) IOC 86.56 87.07 87.52 88.41 88.22 89.28 —- —- —-
H-FFC 90.87 92.35 92.55 92.59 91.79 92.01 91.37 92.05 92.90
S-FFC 90.81 91.99 91.80 91.79 91.65 92.25 91.07 91.42 91.80
M-FFC 89.65 91.69 91.92 92.15 92.34 92.71 92.58 92.56 92.90
microaveraged IG 17.75 27.11 30.22 34.16 41.26 43.72 52.96 63.45 65.46
recall DC 62.88 70.80 74.38 73.76 74.26 77.54 77.91 77.70 77.85
(MicroR) IOC 48.92 56.87 57.86 57.48 61.56 59.30 —- —- —-
H-FFC 65.21 74.67 74.96 74.27 79.84 78.73 80.61 78.95 78.43
S-FFC 76.98 78.16 79.40 78.72 80.05 80.58 81.84 81.62 81.70
M-FFC 70.25 77.71 78.32 78.32 77.76 78.17 78.11 78.36 78.42
microaveraged IG 29.76 41.77 45.23 49.39 56.50 58.92 66.57 74.85 76.22
F1 DC 73.36 79.64 82.15 81.85 82.62 84.38 84.91 85.02 85.28
(MicroF1) IOC 62.53 68.82 69.68 67.66 72.53 71.28 —- —- —-
H-FFC 75.93 82.59 82.84 82.44 85.40 84.85 85.65 85.00 85.05
S-FFC 83.34 84.53 85.17 84.77 85.46 86.04 86.22 86.25 86.46
M-FFC 78.77 84.93 85.03 85.54 85.72 85.67 85.59 84.87 85.05
Full features (25718 features): MicroP = 94.53, MicroR = 73.18, MicroF1 = 82.50
[Figure] Fig. 3. Microaveraged accuracy (%) of different methods for 20 Newsgroups data. (Accuracy versus the number of extracted features; curves for H-FFC, S-FFC, M-FFC, DC, IG, and IOC.)
performs better than S-FFC most of the time. H-FFC performs
well when the number of features is less than 200, while
it performs worse than DC and M-FFC when the number
of features is greater than 200. DC performs well when the
number of features is greater than 200, but it performs worse
when the number of features is less than 200. For example,
DC gets 88.00% and M-FFC gets 89.65% when the number
of features is 20. For MicroR, S-FFC performs best for all
the cases. M-FFC performs well, especially when the number
of features is less than 200. Note that while the whole 25,718
features are used for classification, MicroP is 94.53%, MicroR
is 73.18%, and MicroF1 is 82.50%.
[Figure] Fig. 4. Microaveraged F1 (%) of M-FFC with different γ values for 20 Newsgroups data. (MicroF1 versus γ; curves for M-FFC20, M-FFC203, and M-FFC1453.)
In summary, IG runs very fast in feature reduction, but
the extracted features obtained are not good for classifica-
tion. FFC can not only run much faster than DC and IOC
in feature reduction, but also provide comparably good or
better extracted features for classification. Figure 4 shows
the influence of γ values on the performance of M-FFC in
MicroF1 for three numbers of extracted features, 20, 203, and
1,453, indicated by M-FFC20, M-FFC203, and M-FFC1453,
respectively. As shown in the figure, the MircoF1 values do
not vary significantly with different settings of γ.
[Figure] Fig. 5. Execution time (sec) of different methods on RCV1 data. (Log-scale plot of execution time versus the number of extracted features; curves for FFC, DC, IG, and IOC.)
[Figure] Fig. 6. Microaveraged accuracy (%) of different methods for RCV1 data. (Accuracy versus the number of extracted features; curves for H-FFC, S-FFC, M-FFC, DC, IG, and IOC.)
5.2 Experiment 2: REUTERS CORPUS VOLUME 1
(RCV1) Dataset
The RCV1 dataset consists of 804,414 news stories produced
by Reuters from 20 Aug 1996 to 19 Aug 1997. All news stories
are in English and have 109 distinct terms per document on
average. The documents are divided, by the "LYRL2004" split
defined in Lewis et al. [40], into 23,149 training documents
and 781,265 testing documents. There are 103 Topic categories
and the distribution of the documents over the classes is shown
in Figure 1(b). All the 103 categories have one or more positive
test examples in the test set, but only 101 of them have one
or more positive training examples in the training set. It is
these 101 categories that we use in this experiment. After pre-
processing, we have 47,152 features for this data set.
Figure 5 shows the execution time (sec) of different feature
reduction methods on the RCV1 dataset. Table 10 lists values
of certain points in Figure 5. Clearly, our method runs much
faster than DC and IOC. For example, for 18 extracted fea-
tures, DC requires 609.63 seconds and IOC requires 94533.88
seconds, but our method only needs 14.42 seconds. For 65
extracted features, our method only needs 37.40 seconds,
but DC and IOC require 1709.77 and over 100,000 seconds,
respectively. As the number of extracted features increases,
DC and IOC become significantly slower.
Figure 6 shows the MicroAcc (%) of the RCV1 dataset
obtained by different methods, based on the extracted features
previously obtained. Table 11 lists values of certain points
in Figure 6. Note that γ is set to 0.5 for M-FFC. No
accuracies are listed for IOC when the number of extracted
features exceeds 18. As shown in the figure, IG performs the
[Figure] Fig. 7. Microaveraged F1 (%) of M-FFC with different γ values for RCV1 data. (MicroF1 versus γ; curves for M-FFC18, M-FFC202, and M-FFC1700.)
worst in classification accuracy, especially when the number
of extracted features is small. For example, IG only gets
96.95% in accuracy for 18 extracted features. H-FFC, S-
FFC, and M-FFC perform well in accuracy all the time. For
example, H-FFC, S-FFC, and M-FFC get 98.03%, 98.26%,
and 97.85%, respectively, in accuracy for 18 extracted features,
98.13%, 98.39%, and 98.31%, respectively, for 65 extracted
features, and 98.41%, 98.53%, and 98.56%, respectively, for
421 extracted features. Note that while the whole 47,152
features are used for classification, the accuracy is 98.83%.
Table 12 lists values of the MicroP, MicroR, and MicroF1 (%)
of the RCV1 dataset obtained by different methods. From this
table, we can see that S-FFC can, in general, get best results
for MicroF1, followed by M-FFC, DC, and H-FFC in order. S-
FFC performs better than the other methods, especially when
the number of features is less than 500. For MicroP, all the
methods perform about equally well, the values being around
80%-85%. IG gets a high MicroP when the number of features
is smaller than 50. M-FFC performs well for MicroP, MicroR,
and MicroF1. Note that while the whole 47,152 features are
used for classification, MicroP is 86.66%, MicroR is 75.03%,
and MicroF1 is 80.43%.
In summary, IG runs very fast in feature reduction, but the
extracted features obtained are not good for classification. For
example, IG gets 96.95% in MicroAcc, 91.58% in MicroP,
5.80% in MicroR, and 10.90% in MicroF1 for 18 extracted
features, and 97.24% in MicroAcc, 68.82% in MicroP, 25.72%
in MicroR, and 37.44% in MicroF1 for 65 extracted features.
FFC can not only run much faster than DC and IOC in
feature reduction, but also provide comparably good or better
extracted features for classification. For example, S-FFC gets
98.26% in MicroAcc, 85.18% in MicroP, 55.33% in MicroR,
and 67.08% in MicroF1 for 18 extracted features, and 98.39%
in MicroAcc, 82.34% in MicroP, 63.41% in MicroR, and
71.65% in MicroF1 for 65 extracted features. Figure 7 shows
the influence of γ values on the performance of M-FFC in
MicroF1 for three numbers of extracted features, 18, 202, and
1700, indicated by M-FFC18, M-FFC202, and M-FFC1700,
respectively. As shown in the figure, the MicroF1 values do
not vary significantly with different settings of γ.
TABLE 10
Sampled execution times (sec) of different methods on RCV1 data.
# of extracted features 18 42 65 74 202 421 818 1700
threshold (ρ) (0.01) (0.03) (0.07) (0.1) (0.2) (0.3) (0.4) (0.5)
IG 9.44 9.44 9.44 9.44 9.44 9.44 9.44 9.44
DC 609.63 1329.64 1709.77 1980.48 4256.1 8501.36 16592.09 21460.46
IOC 94533.88 ——— ——— ——— ——— ——— ——— ———
FFC 14.42 31.13 37.40 46.57 107.87 181.60 327.04 612.51
TABLE 11
Microaveraged Accuracy (%) of different methods for RCV1 data.
# of extracted features 18 42 65 74 202 421 818 1700
threshold (ρ) (0.01) (0.03) (0.07) (0.1) (0.2) (0.3) (0.4) (0.5)
microaveraged IG 96.95 97.13 97.24 97.26 97.60 98.03 98.37 98.69
accuracy DC 98.03 98.07 98.19 98.29 98.39 98.51 98.58 98.67
(MicroAcc) IOC 96.98 —- —- —- —- —- —- —-
H-FFC 98.03 98.02 98.13 98.14 98.31 98.41 98.49 98.54
S-FFC 98.26 98.37 98.39 98.41 98.58 98.53 98.57 98.58
M-FFC 97.85 98.28 98.31 98.38 98.49 98.56 98.58 98.62
Full features (47152 features): MicroAcc = 98.83
TABLE 12
Microaveraged Precision, Recall and F1 (%) of different methods for RCV1 data.
# of extracted features 18 42 65 74 202 421 818 1700
threshold (ρ) (0.01) (0.03) (0.07) (0.1) (0.2) (0.3) (0.4) (0.5)
microaveraged IG 91.58 92.00 68.82 69.46 76.25 83.74 87.31 88.94
precision DC 79.87 79.96 82.82 83.58 83.75 85.38 86.51 87.57
(MicroP) IOC 61.89 —- —- —- —- —- —- —-
H-FFC 81.91 78.35 79.74 79.68 81.42 83.03 84.42 84.31
S-FFC 85.18 82.96 82.34 82.68 87.28 84.73 85.43 85.52
M-FFC 71.42 82.50 81.58 82.16 87.32 86.78 85.51 86.21
microaveraged IG 5.80 11.82 25.72 26.32 36.66 48.16 57.67 67.78
recall DC 51.72 53.27 54.93 58.02 61.75 64.81 66.23 68.16
(MicroR) IOC 15.68 —- —- —- —- —- —- —-
H-FFC 49.45 53.05 55.92 56.50 61.32 63.49 64.91 66.83
S-FFC 55.33 61.91 63.41 63.76 65.33 66.11 66.81 67.30
M-FFC 55.53 58.85 61.30 63.18 61.86 65.30 67.27 68.00
microaveraged IG 10.90 20.94 37.44 38.17 49.51 61.15 69.46 76.93
F1 DC 62.79 63.94 66.05 68.49 71.08 73.68 75.02 76.66
(MicroF1) IOC 25.01 —- —- —- —- —- —- —-
H-FFC 61.67 63.27 65.74 66.12 69.96 71.96 73.39 74.56
S-FFC 67.08 70.91 71.65 72.00 74.73 74.27 74.98 75.32
M-FFC 62.48 68.69 70.00 71.43 72.42 74.52 75.30 76.03
Full features (47152 features): MicroP = 86.66, MicroR = 75.03, MicroF1 = 80.43
TABLE 13
The number of documents contained in each class of the Cade12 dataset.
class            01–servicos   02–sociedade   03–lazer      04–informatica   05–saude      06–educacao
# of documents   8473          7363           5590          4519             3171          2856
class            07–internet   08–cultura     09–esportes   10–noticias      11–ciencias   12–compras-online
# of documents   2381          2137           1907          1082             879           625
5.3 Experiment 3: Cade12 Data
Cade12 is a set of classified Web pages extracted from the
Cadê Web directory [41]. This directory points to Brazilian
Web pages that were classified by human experts into 12
classes. The Cade12 collection has a skewed distribution and
the three most popular classes represent more than 50% of
all documents. A version of this dataset, 40,983 documents
in total with 122,607 features, was obtained from [45], from
which two-thirds, 27,322 documents, are split for training and
the remaining, 13,661 documents, for testing. The distribution
of Cade12 data is shown in Figure 1(c) and Table 13.
Figure 8 shows the execution time (sec) of different feature
reduction methods on the Cade12 dataset. Table 14 lists values
of certain points in Figure 8. Clearly, our method runs much
faster than DC and IOC. For example, for 22 extracted fea-
tures, DC requires 689.63 seconds and IOC requires 57958.10
seconds, but our method only needs 24.02 seconds. For 84
extracted features, our method only needs 47.48 seconds, while
DC requires 2939.35 seconds and IOC has difficulty finishing.
As the number of extracted features increases, DC
and IOC become significantly slower. In particular, IOC can only
get finished in 100,000 seconds with 22 extracted features, as
TABLE 14
Sampled execution times (sec) of different methods on Cade12 data.
# of extracted features   22       38       59       84       108      286      437      889      1338
threshold (ρ) (0.01) (0.03) (0.07) (0.14) (0.19) (0.28) (0.37) (0.48) (0.54)
IG 105.18 105.18 105.18 105.18 105.18 105.18 105.18 105.18 105.18
DC 689.63 1086.91 1871.67 2939.35 4167.23 7606.41 11593.84 23509.45 30331.91
IOC 57958.10 ——— ——— ——— ——— ——— ——— ——— ———
FFC 24.02 27.61 34.40 47.48 57.55 100.18 145.54 276.47 381.72
[Figure] Fig. 8. Execution time (sec) of different methods on Cade12 data. (Log-scale plot of execution time versus the number of extracted features; curves for FFC, DC, IG, and IOC.)
[Figure] Fig. 9. Microaveraged accuracy (%) of different methods for Cade12 data. (Accuracy versus the number of extracted features; curves for H-FFC, S-FFC, M-FFC, DC, IG, and IOC.)
shown in Table 14.
Figure 9 shows the MicroAcc (%) of the Cade12 dataset
obtained by different methods, based on the extracted features
previously obtained. Table 15 lists values of certain points
in Figure 9. Table 16 lists values of the MicroP, MicroR,
and MicroF1 (%) of the Cade12 dataset obtained by different
methods. Note that γ is set to 0.5 for M-FFC. Also, no
accuracies are listed for IOC when the number of extracted
features exceeds 22. As shown in the tables, all the methods
work equally well for MicroAcc. But none of the methods
work satisfactorily well for MicroP, MicroR, and MicroF1.
In fact, it is hard to achieve effective feature reduction for Cade12, since only
loose correlation exists among the original features. IG performs
best for MicroP. For example, IG gets 78.21% in MicroP
when the number of features is 38. However, IG performs
worst for MicroR, getting only 7.04% in MicroR when the
number of features is 38. S-FFC, M-FFC, and H-FFC are a
little bit worse than IG in MicroP, but they are good in MicroR.
For example, S-FFC gets 75.83% in MicroP and 38.33% in
MicroR when the number of features is 38. In general, S-
[Figure] Fig. 10. Microaveraged F1 (%) of M-FFC with different γ values for Cade12 data. (MicroF1 versus γ; curves for M-FFC22, M-FFC286, and M-FFC1338.)
FFC can get best results, followed by M-FFC, H-FFC, and
DC in order. Note that while the whole 122,607 features
are used for classification, MicroAcc is 93.55%, MicroP is
69.57%, MicroR is 40.11%, and MicroF1 is 50.88%. Again,
this experiment shows that FFC can not only run much faster
than DC and IOC in feature reduction, but also provide
comparably good or better extracted features for classification.
Figure 10 shows the influence of γ values on the performance
of M-FFC in MicroF1 for three numbers of extracted features,
22, 286, and 1,338, indicated by M-FFC22, M-FFC286, and
M-FFC1338, respectively. As shown in the figure, the MicroF1
values do not vary significantly with different settings of γ.
6 CONCLUSIONS
We have presented a fuzzy self-constructing feature clustering
(FFC) algorithm which is an incremental clustering approach
to reduce the dimensionality of the features in text classifica-
tion. Features that are similar to each other are grouped into the
same cluster. Each cluster is characterized by a membership
function with statistical mean and deviation. If a word is not
similar to any existing cluster, a new cluster is created for
this word. Similarity between a word and a cluster is defined
by considering both the mean and the variance of the cluster.
When all the words have been fed in, a desired number of
clusters are formed automatically. We then have one extracted
feature for each cluster. The extracted feature corresponding
to a cluster is a weighted combination of the words contained
in the cluster. By this algorithm, the derived membership
functions match closely with and describe properly the real
distribution of the training data. Besides, the user need not
specify the number of extracted features in advance, and trial-
and-error for determining the appropriate number of extracted
features can then be avoided. Experiments on three real-world
TABLE 15
Microaveraged Accuracy (MicroAcc, %) of different methods for Cade12 data.
# of extracted features  22      38      59      84      108     286     437     889     1338
threshold (ρ)            (0.01)  (0.03)  (0.07)  (0.14)  (0.19)  (0.28)  (0.37)  (0.48)  (0.54)
IG                       91.89   92.09   92.20   92.24   92.28   92.45   92.51   92.80   92.85
DC                       93.18   93.20   93.17   93.30   93.30   93.59   93.63   93.65   93.68
IOC                      92.82   —       —       —       —       —       —       —       —
H-FFC                    93.63   93.51   93.46   93.48   93.51   93.63   93.59   93.64   93.62
S-FFC                    93.79   93.84   93.80   93.83   93.91   93.89   93.87   93.80   93.77
M-FFC                    93.75   93.63   93.54   93.66   93.72   93.89   93.91   93.94   93.95
Full features (122,607 features): MicroAcc = 93.55
TABLE 16
Microaveraged Precision, Recall, and F1 (%) of different methods for Cade12 data.
# of extracted features  22      38      59      84      108     286     437     889     1338
threshold (ρ)            (0.01)  (0.03)  (0.07)  (0.14)  (0.19)  (0.28)  (0.37)  (0.48)  (0.54)
MicroP   IG              76.33   78.21   78.59   78.16   78.41   76.33   76.02   74.10   73.92
         DC              70.79   69.22   68.60   69.56   68.60   72.26   72.35   73.36   73.76
         IOC             76.26   —       —       —       —       —       —       —       —
         H-FFC           71.16   71.31   71.63   72.31   72.93   72.89   72.38   73.67   73.03
         S-FFC           75.10   75.83   74.54   75.60   75.37   73.87   76.07   73.27   75.38
         M-FFC           73.78   72.94   72.58   75.12   75.23   75.86   74.58   74.49   74.47
MicroR   IG              3.89    7.04    8.78    9.59    10.21   13.67   14.71   20.82   21.89
         DC              30.90   33.04   33.18   34.84   36.18   37.48   38.07   37.42   37.52
         IOC             20.06   —       —       —       —       —       —       —       —
         H-FFC           39.59   37.05   35.63   35.33   36.66   37.46   37.29   36.82   37.08
         S-FFC           38.08   38.33   38.85   38.40   39.98   38.82   38.53   40.35   37.51
         M-FFC           38.74   37.37   36.06   35.80   36.71   38.86   41.83   41.47   41.72
MicroF1  IG              7.41    12.92   15.80   17.08   18.07   23.18   24.66   32.50   33.78
         DC              43.02   44.73   44.73   46.42   47.37   49.36   49.89   49.56   49.74
         IOC             31.76   —       —       —       —       —       —       —       —
         H-FFC           50.88   48.77   47.59   47.47   48.51   49.49   49.22   49.10   49.19
         S-FFC           50.53   50.92   51.08   50.93   52.25   52.86   53.02   52.04   50.09
         M-FFC           50.80   49.42   48.18   48.49   49.34   52.36   52.85   53.28   53.48
Full features (122,607 features): MicroP = 69.57, MicroR = 40.11, MicroF1 = 50.88
datasets have demonstrated that our method can run faster and
obtain better extracted features than other methods.
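For readers who prefer an executable summary of this flow, a minimal Python sketch is given below. It keeps a member list, a mean vector, and a deviation vector for each cluster and processes the word patterns one at a time; each pattern is either merged into the most similar existing cluster or used to seed a new one. The identifiers are illustrative, and the statistics of a merged cluster are simply recomputed from its stored members, which is a simplification of the constant-time update derived in the paper.

    import math

    def membership(x, mean, dev):
        # Fuzzy similarity: product of one-dimensional Gaussian-like functions.
        return math.prod(math.exp(-((xi - mi) / si) ** 2)
                         for xi, mi, si in zip(x, mean, dev))

    def cluster_word_patterns(patterns, rho=0.25, sigma0=0.25):
        clusters = []  # each cluster stores its members, mean, and deviation
        for x in patterns:
            sims = [membership(x, c["mean"], c["dev"]) for c in clusters]
            if not clusters or max(sims) < rho:
                # x fails the similarity test on every cluster: seed a new one.
                clusters.append({"members": [x],
                                 "mean": list(x),
                                 "dev": [sigma0] * len(x)})
            else:
                # Merge x into the most similar cluster, then refresh its statistics.
                c = clusters[sims.index(max(sims))]
                c["members"].append(x)
                n = len(c["members"])
                cols = list(zip(*c["members"]))
                c["mean"] = [sum(col) / n for col in cols]
                # sigma0 is added so that the deviation never becomes zero.
                c["dev"] = [math.sqrt(sum((v - m) ** 2 for v in col) / n) + sigma0
                            for col, m in zip(cols, c["mean"])]
        return clusters

    # Example: cluster_word_patterns([(0.3, 0.7), (0.25, 0.75), (0.9, 0.1)])
    # groups the first two patterns together and puts the third in its own cluster.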
Other studies on text classification, with or without feature
reduction, have published evaluation results. Some of them
adopted the same performance measures as we did, i.e.,
microaveraged accuracy or microaveraged F1, while others
adopted uni-labeled accuracy (ULAcc). Uni-labeled accuracy
is basically the same as normal accuracy, except that a
pattern, either training or testing, with m class labels is
copied m times, with each copy associated with a distinct class
label. We discuss some of these methods here; the evaluation
results given for them are taken directly from the corresponding
papers. Joachims [33] applied the Rocchio classifier over TFIDF-
weighted representation and obtained 91.8% uni-labeled accu-
racy on the 20 Newsgroups dataset. Lewis et al. [40] used
the SVM classifier over a form of TFIDF weighting and
obtained 81.6% microaveraged F1 on the RCV1 dataset. Al-
Mubaid & Umair [31] applied a classifier, called Lsquare,
which combines the distributional clustering of words and a
learning logic technique and achieved 98.00% microaveraged
accuracy and 86.45% microaveraged F1 on the 20 Newsgroups
dataset. Cardoso-Cachopo [45] applied the SVM classifier over
the normalized TFIDF term weighting, and achieved 52.84%
uni-labeled accuracy on the Cade12 dataset and 82.48% on
the 20 Newsgroups dataset. Al-Mubaid & Umair [31] applied
AIB [25] for feature reduction and the number of features
TABLE 17
Published evaluation results for some other text classifiers.
method                  dataset  feature red.  evaluation
Joachims [33]           20 NG    no            91.80% (ULAcc)
Lewis et al. [40]       RCV1     no            81.6% (MicroF1)
Al-Mubaid & Umair [31]  20 NG    AIB [25]      98.00% (MicroAcc), 86.45% (MicroF1)
Cardoso-Cachopo [45]    20 NG    no            82.48% (ULAcc)
                        Cade12   no            52.84% (ULAcc)
was reduced to 120. No feature reduction was applied in the
other three methods. The evaluation results published for these
methods are summarized in Table 17. Note that, according
to the definitions, uni-labeled accuracy is usually lower than
microaveraged accuracy for multi-label classification. This
explains why both Joachims [33] and Cardoso-Cachopo [45]
report lower uni-labeled accuracies than the microaveraged
accuracies shown in Table 8 and Table 15. However, Lewis
et al. [40] report a higher microaveraged F1 than that shown
in Table 11. This may be because Lewis et al. [40] used
TFIDF weighting, which is more elaborate than the TF
weighting we used. It is hard, however, to compare the
microaveraged F1 obtained by Al-Mubaid & Umair [31] with
that shown in Table 9, since a form of sampling was applied
in [31].
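To make the uni-labeled accuracy measure discussed above concrete, the following small Python sketch expands each multi-label pattern into one copy per true label and then computes ordinary accuracy over the copies. It assumes the classifier outputs a single predicted label per pattern; the cited works may score multi-label predictions somewhat differently, so the sketch is illustrative only.

    # Illustrative sketch: uni-labeled accuracy (ULAcc).
    def uni_labeled_accuracy(true_label_sets, predicted_labels):
        correct = 0
        total = 0
        for labels, pred in zip(true_label_sets, predicted_labels):
            for label in labels:        # one copy of the pattern per true label
                total += 1
                correct += int(pred == label)
        return correct / total

    # Example: a pattern with true labels {c1, c2} that is predicted as c1
    # contributes one correct copy and one incorrect copy.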
Similarity-based clustering is one of the techniques we have
developed in our machine learning research. In this paper, we
apply this clustering technique to text categorization problems.
We are also applying it to other problems, such as image
segmentation, data sampling, fuzzy modeling, and web mining.
The work of this paper was motivated by distributional word
clustering proposed in [24], [25], [26], [27], [29], [30]. We
found that when a document set is transformed into a collection
of word patterns, as in Eq. (7), the relevance among word
patterns can be measured and the word patterns can be grouped
by applying our similarity-based clustering algorithm. Our
method works well for text categorization problems because it
builds on the distributional word clustering concept.
ACKNOWLEDGMENT
The authors are grateful to the anonymous reviewers for their
comments, which were very helpful in improving the quality
and presentation of the paper. They would also like to thank the
National Institute of Standards & Technology for providing the
RCV1 dataset.
REFERENCES
[1] http://people.csail.mit.edu/jrennie/20Newsgroups/.
[2] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
[3] H. Kim, P. Howland, and H. Park, “Dimension Reduction in Text
Classification with Support Vector Machines,” Journal of Machine
Learning Research, vol. 6, pp. 37-53, 2005.
[4] F. Sebastiani, “Machine Learning in Automated Text Categorization,”
ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval.
Addison Wesley Longman, 1999.
[6] A. L. Blum and P. Langley, “Selection of Relevant Features and
Examples in Machine Learning,” Artificial Intelligence, vol. 97, nos.
1-2, pp. 245-271, 1997.
[7] E. F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones,
“Introducing a Family of Linear Measures for Feature Selection in Text
Categorization,” IEEE Transactions on Knowledge and Data Engineering,
vol. 17, no. 9, pp. 1223-1232, 2005.
[8] D. Koller and M. Sahami, “Toward Optimal Feature Selection,” 13th
International Conference on Machine Learning, pp. 284-292, 1996.
[9] R. Kohavi and G. John, “Wrappers for Feature Subset Selection,”
Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[10] Y. Yang and J. O. Pedersen, “A Comparative Study on Feature Selection
in Text Categorization,” 14th International Conference on Machine
Learning, pp. 412-420, 1997.
[11] D. D. Lewis, “Feature Selection and Feature Extraction for Text Catego-
rization,” Workshop on Speech and Natural Language, pp. 212-217, 1992.
[12] H. Li, T. Jiang, and K. Zhang, “Efficient and Robust Feature Extraction
by Maximum Margin Criterion,” Conference on Advances in Neural
Information Processing Systems, pp. 97-104, 2004.
[13] E. Oja, “Subspace Methods of Pattern Recognition,” Pattern Recognition
and Image Processing Series, vol. 6, 1983.
[14] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi,
and Z. Chen, “Effective and Efficient Dimensionality Reduction for
Large-Scale and Streaming Data Preprocessing,” IEEE Transactions on
Knowledge and Data Engineering, vol. 18, no. 3, pp. 320-331, 2006.
[15] I. T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[16] A. M. Martinez and A. C. Kak, “PCA versus LDA,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-
233, 2001.
[17] H. Park, M. Jeon, and J. Rosen, “Lower Dimensional Representation
of Text Data Based on Centroids and Least Squares,” BIT Numerical
Mathematics, vol. 43, pp. 427-448, 2003.
[18] S. T. Roweis and L. K. Saul, “Nonlinear Dimensionality Reduction by
Locally Linear Embedding,” Science, vol. 290, pp. 2323-2326, 2000.
[19] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A Global Geometric
Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290,
pp. 2319-2323, 2000.
[20] M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Tech-
niques for Embedding and Clustering,” Advances in Neural Information
Processing Systems 14, 2002.
[21] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, and
S. Yoshizawa, “Successive Learning of Linear Discriminant Analysis:
Sanger-Type Algorithm,” 14th International Conference on Pattern
Recognition, pp. 2664-2667, 2000.
[22] J. Weng, Y. Zhang, and W. S. Hwang, “Candid Covariance-Free Incre-
mental Principal Component Analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 25, pp. 1034-1040, 2003.
[23] J. Yan, B. Y. Zhang, S. C. Yan, Z. Chen, W. G. Fan, Q. Yang, W. Y. Ma,
and Q. S. Cheng, “IMMC: Incremental Maximum Margin Criterion,”
10th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 725-730, 2004.
[24] L. D. Baker and A. McCallum, “Distributional Clustering of Words for
Text Classification,” 21st Annual International ACM SIGIR, pp. 96-103,
1998.
[25] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, “Distributional
Word Clusters vs. Words for Text Categorization,” Journal of Machine
Learning Research, vol. 3, pp. 1183-1208, 2003.
[26] M. C. Dalmau and O. W. M. Flórez, “Experimental Results of the Signal
Processing Approach to Distributional Clustering of Terms on Reuters-
21578 Collection,” 29th European Conference on IR Research, pp. 678-
681, 2007.
[27] I. S. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information-
Theoretic Feature Clustering Algorithm for Text Classification,” Journal
of Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[28] D. Ienco and R. Meo, “Exploration and Reduction of the Feature Space
by Hierarchical Clustering,” 2008 SIAM Conference on Data Mining,
pp. 577-587, 2008.
[29] N. Slonim and N. Tishby, “The Power of Word Clusters for Text
Classification,” 23rd European Colloquium on Information Retrieval
Research (ECIR), 2001.
[30] F. Pereira, N. Tishby, and L. Lee, “Distributional Clustering of English
Words,” 31st Annual Meeting of ACL, pp. 183-190, 1993.
[31] H. Al-Mubaid and S. A. Umair, “A New Text Categorization Technique
Using Distributional Clustering and Learning Logic,” IEEE Transactions
on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156-1165,
2006.
[32] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval.
McGraw-Hill Book Company, 1983.
[33] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with
TFIDF for Text Categorization,” 14th International Conference on
Machine Learning, pp. 143-151, 1997.
[34] J. Yen and R. Langari, Fuzzy Logic–Intelligence, Control, and Informa-
tion. Prentice-Hall, Upper Saddle River, NJ, USA, 1999.
[35] J. S. Wang and C. S. G. Lee, “Self-Adaptive Neurofuzzy Inference
Systems for Classification Applications,” IEEE Transactions on Fuzzy
Systems, vol. 10, no. 6, pp. 790-802, 2002.
[36] T. Joachims, “Text Categorization with Support Vector Machines: Learn-
ing with Many Relevant Features,” Technical Report LS-8-23, University
of Dortmund, Computer Science Department, 1998.
[37] C. Cortes and V. Vapnik, “Support-Vector Network,” Machine Learning,
vol. 20, no. 3, pp. 273-297, 1995.
[38] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. MIT Press,
Cambridge, MA, USA, 2001.
[39] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis.
Cambridge University Press, Cambridge, UK, 2004.
[40] D. D. Lewis, Y. Yang, T. Rose, and F. Li, “RCV1: A New
Benchmark Collection for Text Categorization Research,” Jour-
nal of Machine Learning Research, vol. 5, pp. 361-397, 2004
(http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf).
[41] The Cadê Web directory, http://www.cade.com.br/.
[42] C. C. Chang and C. J. Lin, “LIBSVM: A Library for
Support Vector Machines,” 2001, software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[43] Y. Yang and X. Liu, “A Re-examination of Text Categorization Meth-
ods,” ACM SIGIR Conference, pp. 42-49, 1999.
[44] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining Multi-label Data,”
Data Mining and Knowledge Discovery Handbook, O. Maimon and
L. Rokach (Eds.), Springer, 2nd edition, 2009 (draft of preliminary
accepted chapter).
[45] http://web.ist.utl.pt/~acardoso/datasets/.
Jung-Yi Jiang Jung-Yi Jiang was born in
Changhua, Taiwan, ROC., in 1979. He received
the B.S. degree from I-Shou University, Taiwan
in 2002 and M.S.E.E. degree from Sun Yat-
Sen University, Taiwan in 2004. He is currently
pursuing the Ph.D. degree at the Department
of Electrical Engineering, National Sun Yat-Sen
University. His main research interests include
machine learning, data mining, and information
retrieval.
Ren-Jia Liou Ren-Jia Liou was born in Ban-
qiao, Taipei County, Taiwan, ROC., in 1983. He
received the B.S. degree from National Dong
Hwa University, Taiwan in 2005 and M.S.E.E.
degree from Sun Yat-Sen University, Taiwan in
2009. He is currently a research assistant at
the Department of Chemistry, National Sun Yat-
Sen University. His main research interests in-
clude machine learning, data mining and web
programming.
Shie-Jue Lee Shie-Jue Lee was born in Kin-
Men, ROC, on August 15, 1955. He received
the B.S. and M.S. degrees of Electrical Engi-
neering in 1977 and 1979, respectively, from
National Taiwan University, and the Ph.D. de-
gree of Computer Science from the University
of North Carolina, Chapel Hill, USA, in 1990.
Dr. Lee joined the faculty of the Department of
Electrical Engineering at National Sun Yat-Sen
University, Taiwan, in 1983, and has been
a professor of the department since 1994. His
research interests include Artificial Intelligence, Machine Learning, Data
Mining, Information Retrieval, and Soft Computing.
Dr. Lee served as director of the Center for Telecommunications
Research and Development, National Sun Yat-Sen University, 1997-
2000, director of the Southern Telecommunications Research Center,
National Science Council, 1998-1999, and Chair of the Department
of Electrical Engineering, National Sun Yat-Sen University, 2000-2003.
Now he is serving as Deputy Dean of Academic Affairs and director
of NSYSU-III Research Center, National Sun Yat-Sen University. Dr.
Lee obtained the Distinguished Teachers Award of the Ministry of
Education, Taiwan, 1993. He was awarded by the Chinese Institute of
Electrical Engineering for Outstanding M.S. Thesis Supervision, 1997.
He obtained the Distinguished Paper Award of the Computer Society
of the Republic of China, 1998, and Best Paper Award of the 7th
Conference on Artificial Intelligence and Applications, 2002. He obtained
the Distinguished Research Award of National Sun Yat-Sen University,
1998. He obtained the Distinguished Teaching Award of National Sun
Yat-Sen University in 1993 and 2008. He obtained the Best
Paper Award of the International Conference on Machine Learning
and Cybernetics, 2008. He also obtained the Distinguished Mentor
Award of National Sun Yat-Sen University, 2008. He served as the
program chair for the International Conference on Artificial Intelligence
(TAAI-96), Kaohsiung, Taiwan, December 1996, International Computer
Symposium – Workshop on Artificial Intelligence, Tainan, Taiwan, De-
cember 1998, and the 6th Conference on Artificial Intelligence and
Applications, Kaohsiung, Taiwan, November, 2001. Dr. Lee is a member
of the IEEE Society of Systems, Man, and Cybernetics, the Association
for Automated Reasoning, the Institute of Information and Computing
Machinery, Taiwan Fuzzy Systems Association, and the Taiwanese
Association of Artificial Intelligence.

If a document belongs to two or more classes. where P (cl ) denotes the prior probability for class cl . where d1 . [33] is commonly used. . then two or more copies of the document with different classes are included in D. W2 . Then W is used as inputs for classification tasks. each document di is converted into a new representation di = < di1 . c2 . . 1 ≤ j ≤ m. Mq being the mean vector of class cq . The words of top k weights in W are selected as the features in W .1 Feature Reduction In general. hard. Experimental results are presented in Section 5. . We then have one extracted feature for each cluster. proposed by Dhillon et al.. . and p (2) Sb = q=1 P (cq )(Mq − Mall )(Mq − Mall )T (3) 2 BACKGROUND AND RELATED WORK To process documents. is represented as di = < di1 . Then. An example illustrating how the algorithm works is given in Section 4. computation cost with subsequent operations on D can be drastically reduced.This article has been accepted for publication in a future issue of this journal. P (wj ) is identical to 1 − P (wj ). and P (cl |wj ) and P (cl |wj ) denote the probability for class cl with the presence and absence. t ≤ k and t = q. dn are individual documents and each document belongs to one of the classes in the set {c1 . . wk }. The weight of a word wj is calculated as follows: IG(wj ) = − l=1 P (cl )logP (cl ) p +P (wj ) l=1 P (cl |wj )logP (cl |wj ) p +P (wj ) l=1 P (cl |wj )logP (cl |wj ) p with P (cq ) being the prior probability for a pattern belonging to class cq . If k is much smaller than m. . Wk } of the original feature set W. . P (wj ) denotes the prior probability for feature wj . Let D be the matrix consisting of all the original documents with m features and D be the matrix consisting of the converted documents with new k features.. wk } corresponds to a partition {W1 . . An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. . . . It measures the reduced uncertainty by an information-theoretic measure and gives each word a weight.. IOC tries to find an optimal transformation matrix F∗ ∈ Rm×k . . such that W and W work equally well for all the desired properties with D. Let the word set W = {w1 . c2 . which is a subset of the original feature set W. . . . wm } be the feature vector of the document set. Each document di . where m is the number of features in the feature set and n is the number of documents in the document set. and mixed.e. but has not been fully edited. the user need not specify the number of extracted features in advance. dn }. according to the following criterion: F∗ = arg max trace (FT Sb F) where F ∈ Rm×k and FT F = I. where each dij denotes the number of occurrences of wj in the ith document. di2 . Information Gain (IG) is frequently employed in the feature selection approach [10]. we conclude this work in Section 6. where 1 ≤ q.. soft. . di2 . extracted features are obtained by a projecting process through algebraic transformations. feature selection and feature extraction. . . The divisive information-theoretic feature clustering (DC) algorithm. . are introduced. dik > and the converted document set is D = {d1 . IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2 If a word is not similar to any existing cluster. Each new feature is obtained by summing up the words belonging to one cluster. . a new cluster is created for this word. wj ∈Wt P (wj ) (5) . and trial-and-error for determining the appropriate number of extracted features can then be avoided. 
the derived membership functions match closely with and describe properly the real distribution of the training data. . Besides. Finally. . . wk } is obtained. w2 . 2. The feature clustering methods proposed in [24]. By feature selection approaches. d2 . respectively. . . By this algorithm. [29].2 Feature Clustering Feature clustering is an efficient approach for feature reduction [25]. . of wj . dn } be a document set of n documents. 1 ≤ i ≤ n. a desired number of clusters are formed automatically.. Experiments on real-world datasets show that our method can run faster and obtain better extracted features than other methods. After feature reduction. The remainder of this paper is organized as follows. which groups all features into some clusters where features in a cluster are similar to each other. . a new feature set W = {w1 . where k is the desired number of extracted features. [29] are “hard” clustering methods where each word of the original features belongs to exactly one word cluster. [27].. Wt Wq = ∅.. [27] calculates the distributions of words over classes. Section 2 gives a brief background about feature reduction. The distribution of a cluster Wt is calculated as follows: P (C|Wt ) = wj ∈Wt (1) P (wj ) P (C|wj ). Note that a cluster corresponds to an element in the partition. The new feature set W = {w1 . Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. w2 . . . P (C|wj ). w2 . w2 . cp }. Let a corpus of documents be represented as an m × n matrix X ∈ Rm×n . Content may change prior to final publication. When all the words have been fed in.. .. [25]. . i. d2 . . dim >.. the bag-of-words model [32]. Let D = {d1 . . cp }. Three ways of weighting. In feature extraction approaches. . and uses Kullback-Leibler divergence to measure the dissimilarity between two distributions.. . there are two ways of doing feature reduction. . Therefore each word contributes to the synthesis of only one new feature. k << m. . where C = {c1 . the tth feature value of the converted document di is calculated as follows: dit = dij (4) wj ∈Wt which is a linear sum of the feature values in Wt . The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. 2. . .. .. . Section 3 presents the proposed fuzzy similarity-based selfconstructing feature clustering algorithm. . and Mall being the mean vector of all patterns. d2 . The feature reduction task is to find a new word set W = {w1 .

1+2+3+4 1×0 + 2×0 + 3×1 + 4×1 P (c2 |w1 ) = = 0. cp . and 4. . Each cluster Gj has mean mj = < mj1 . . . Secondly. (9) Therefore.3−0. i. as specified in Section 2. . . So. the corresponding membership function should be initialized. μG (x) ≈ 0. c2 .e.. G2 .7−0. Intuitively. i. It is these word patterns our clustering algorithm will work on. no clusters exist at the beginning. similarly as in [27]. mjp > and deviation σ j = < σj1 . . all words in a cluster have the same degree of contribution to the resulting extracted feature. the membership function of that cluster should be updated accordingly. i. 1 ≤ i ≤ m. . x2 . .2. The fuzzy similarity of the word pattern x1 shown in Eq. . .6969. Initially. μG (x) ≈ 1. For word wi . d2 . 1 ≤ j ≤ q. the distribution of the data in a cluster is an important factor in the calculation of similarity. .4. A cluster contains a certain number of word patterns. a word pattern far distant from a cluster is hardly similar to this cluster. The user does not need to have any idea about the number of clusters in advance. . σ2 . On the contrary.. we have m word patterns in total.7.2 × exp − 0. This gives a burden to the user since trial-and-error has to be done until the appropriate number of extracted features is found. Also. σj2 . . the parameter k. . . as described in Section 2. the variance of the underlying cluster is not considered. 3. Note that dqi indicates the number of occurrences of wi in document dq . P (cp |wi ) > n q=1 dqi ×δqj n q=1 dqi for 1 ≤ i ≤ p. . P (C|Wt )) t=1 wj ∈Wt (6) adopted because of their superiority over other functions in performance [34]. xip >. according to Eq.e. suppose we have four documents d1 . when calculating similarities. . x2 . . .e. respectively. w2 . respectively. . σp > of G are defined as mi σi = = q j=1 which takes the sum over all the k clusters.3. . . . and d4 belonging to c1 .7 > . . . the similarity of this word pattern to each existing cluster is calculated to decide whether it is combined into an existing cluster or a new cluster is created. Gk . xji (xji − mji ) |G| 2 |G| q j=1 (11) (12) 3 O UR METHOD There are some issues pertinent to most of the existing feature clustering methods. . we have k = 0. . . . c1 . . Let k be the number of currently existing clusters.8948 = 0. xip > < P (c1 |wi ). (13) Notice that 0 ≤ μG (x) ≤ 1. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 3 The goal of DC is to minimize the following objective function: k P (wj )KL(P (C|wj ). d2 . and is characterized by the product of p one-dimensional Gaussian functions. Let Sj be the size of cluster Gj . xi2 . the number of word patterns contained in G.6 2 0. dn . Then. .1 Self-Constructing Clustering Our clustering algorithm is an incremental. . Gaussian functions are μGj (xi ) = q=1 exp − xiq − mjq σjq 2 (15) . its word pattern xi is defined. . . (14) (8) 3. .(13). together with the feature vector W of m words w1 . 0. where |G| denotes the size of G. . For example. by xi where P (cj |wi ) = = = < xi1 . P (c2 |wi ). has to be specified in advance.7788 × 0. The clusters are G1 .6 > and a deviation vector σ 1 = < 0. and c2 . . .(10) to cluster G1 becomes μG1 (x1 ) = exp − 0. . For each word pattern xi = < xi1 . A word pattern close to the mean of a cluster is regarded to be very similar to this cluster.3. Sometimes it may be better if more similar words are allowed to have bigger degrees of contribution. m2 . xi2 . respectively. the similarity of xi to each existing clusters. δqj is defined as δqj = 1. . For each word pattern. 
we calculate. 0. xj2 .. . On the contrary. .4 2 0. . The fuzzy similarity of a word pattern x = < x1 . . For example. Let G be a cluster containing q word patterns x1 . [35]. xq . wm and p classes c1 . Let the occurrences of w1 in these documents be 1. p for 1 ≤ j ≤ p. Suppose we are given a document set D of n documents d1 . 0. self-constructing learning approach. Thirdly. 1+2+3+4 (10) x1 = < 0. and clusters can be created if necessary. xp > to cluster G is defined by the following membership function: p μG (x) = i=1 exp − xi − mi σi 2 . Our goal is to group the words in W into clusters based on these word patterns. but has not been fully edited. d3 . where k is specified by the user in advance. 2. Our feature clustering algorithm is proposed to deal with these issues. the word pattern x1 of w1 is: P (c1 |w1 ) = 1×1 + 2×1 + 3×0 + 4×0 = 0. Firstly. . Then the mean m = < m1 . . indicating the desired number of extracted features.. . mp > and the deviation σ = < σ1 . Let xj = < xj1 . . i. σjp > . . Content may change prior to final publication. .3 >.e. Once a new cluster is created. Word patterns are considered one by one. otherwise. No clusters exist at the beginning. .This article has been accepted for publication in a future issue of this journal. suppose G1 is an existing cluster with a mean vector m1 = < 0. . c2 . . when the word pattern is combined into an existing cluster. We construct one word pattern for each word in W. if document dq belongs to class cj . . . mj2 . xjp >. . 0.3 (7) = 0.

(12) is impossible or inaccurate if the cluster contains few members. Of course.3 − 0. We cannot use zero deviation in the calculation of fuzzy similarities. Incorporate xi into Gt by Eqs. in decreasing order. Note that k is not changed in this case.1)2 + 4×0. .7 2 B12 = ( ) . let cluster Gt be the cluster with the largest membership degree. and fewer clusters will be obtained instead.(17). Suppose x1 is most similar . (21) σtj = (St − 1)(σtj − σ0 )2 + St ×mtj 2 + xij 2 . i. the existing clusters can be adjusted or new clusters can be created. 1≤j≤k (19) In this case. 0. Note that the new cluster Gh contains only one member. Therefore.38.e. (22) A = St St + 1 St ×mtj + xij 2 ( B = (23) ) . we assume that xi is not similar enough to any existing cluster and a new cluster Gh . As the threshold increases. The whole clustering algorithm can be summarized below. . A12 = 4 4 + 1 4×0. following Eq.This article has been accepted for publication in a future issue of this journal.2769. as usual. at this point. For this case. σ0 > is a user-defined constant vector. 0 ≤ ρ ≤ 1. Firstly. k = k + 1.1 = 0. but has not been fully edited. . Note that the order in which the word patterns are fed in influences the clusters obtained. Sh = 1.4 + 0.38 + 0. and St = St + 1.1937. there are no existing fuzzy clusters on which xi has passed the similarity test. then he/she can give a smaller threshold. more significant patterns will be Secondly.382 + 0.(17)-(18). (24) Eqs. by their largest components. Content may change prior to final publication. return with the created k clusters.(11)-(12).. and mt and σ t of cluster Gt should be modified to include xi as its member. if (temp W == φ) A new cluster Gh .1)2 + 4×0. (18) to cluster G1 . On the contrary. St St + 1 for 1 ≤ j ≤ p. Gk procedure Self-Constructing-Clustering-Algorithm for each word pattern xi .62 + 0.(14). and more clusters will be obtained for a given threshold. Its value has an effect on the number of clusters obtained. The modification to cluster Gt is described as follows: St ×mtj + xij mtj = . t = arg max (μGj (xi )). σ12 = σ1 = < 0. and the size of cluster G1 is 4 and the initial deviation σ0 is 0. a bigger threshold can be given.6 + 0. If the user intends to have larger clusters. endprocedure Note that the word patterns in a cluster have a high degree of similarity to each other.62. as indicated in Eq.1 = 0.(10) and Eq. endfor.2769 >. . xip >. In this way. Note that. 1 ≤ i ≤ m Output: Clusters G1 . We apply a heuristic to determine the order. without the necessity of generating the whole set of clusters from the scratch. is created with m h = xi . Sh . the power in Eq. Initialization: of original word patterns: m of classes: p Threshold: ρ Initial deviation: σ0 Initial of clusters: k = 0 Input: xi = < xi1 . should be initialized. A11 = 4 4 + 1 4×0. σ11 = (4 − 1)(0.(15) is two [34]. We sort all the patterns. we initialize the deviation of a newly created cluster by σ 0 .2 − 0. . . endif. 1 ≤ j ≤ k}. S1 = 4 + 1 = 5. is a predefined threshold.32 . the word pattern xi .. 4 4+1 A12 − B12 + 0. xi2 . The above process is iterated until all the word patterns have been processed. if there are existing clusters on which xi has passed the similarity test. m11 = 4+1 4×0. Then the word patterns are fed in this order.622 + 0. 4+1 m1 = < 0. Then cluster G1 is modified as follows: 4×0. We say that xi passes the similarity test on cluster Gj if μGj (xi ) ≥ ρ (16) where ρ. the number of clusters also increases. .1937. we have k clusters. Besides. 
a smaller value will make the boundaries of the Gaussian function smoother. we regard xi to be most similar to cluster Gt . i. .72 . (4 − 1)(0. 0. the deviation of a new cluster is 0 since it contains only one member.7 m12 = = 0.38. A larger value will make the boundaries of the Gaussian function sharper. Otherwise. 4 4+1 A11 − B11 + 0. Consequently.(20)-(21) can be derived easily from Eqs. is created by Eqs. Two cases may occur.3 = 0. when new training patterns are considered. Estimating the deviation of a cluster by Eq.(19). . else let Gt ∈ temp W be the cluster to which xi is closest by Eq.3 2 B11 = ( ) . the number of clusters is increased by 1 and the size of cluster Gh . h = k + 1. . In particular.62 >. .(20)-(24). [35].1. 1 ≤ i ≤ m temp W = {Gj |μGj (xi ) ≥ ρ. (20) S +1 √ t A − B + σ0 . Let’s give an example.e. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 4 for 1 ≤ j ≤ k. σ h = σ 0 (17) where σ 0 =< σ0 . G2 . . h = k + 1.

. .2. For an input pattern. .1 > be three word patterns. . tij is 1. text classification can be done as follows. 0.6. and IG [10].t.. For one cluster. ⎣ . each word is allowed to contribute to all new extracted features. To make the method more flexible and robust. respectively. feature extraction can be expressed in the following form: D = DT (25) where D = [d1 d2 · · · dn ] . = [di1 di2 · · · dik ] for 1 ≤ i ≤ n. The merge can be ‘hard’ or ‘soft’ by setting γ to 1 or 0. 0. The largest components in these word patterns are 0. By selecting the value of γ.(25) are defined as follows: tij = 1. SVM is a kernel method which finds the maximum margin hyperplane in feature space separating the images of the training patterns into two groups [37].6.3. 0. (30) Note that if j is not unique in Eq. if j = arg max1≤α≤k (μGα (xi )). 0. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 5 fed in first and likely become the core of the underlying cluster. otherwise. Using D as training data. 0. ⎡ t11 . soft.8. if wi ∈ Wj . t2k ⎢ T = ⎢ . . For DC. . . Therefore. IOC [14].3. The elements of T in Eq. For example.4.1. but the misclassified patterns should be penalized. i = 1. word patterns have been grouped into clusters. Note that γ is not related to the clustering.(25) are binary and can be defined as follows: tij = 1. x1 . Since we have k clusters. The elements of T are derived based on the obtained clusters and feature extraction will be done. and apply our clustering algorithm. In the hard-weighting approach.. otherwise tij is 0. ⎦ (27) (28) where tH is obtained by Eq.(30).This article has been accepted for publication in a future issue of this journal. The objective function and constraints of the classification problem can be formulated as: 1 min wT w + C w. the number of clusters is small and each cluster covers more training patterns.4 >. some patterns need not be correctly classified by the hyperplane. one of them is chosen randomly. That is. we have one extracted feature. Then we find the weighting matrix T and convert D to D by Eq.(25) are defined as follows: tij = (γ)×tH + (1 − γ)×tS ij ij (32) (26) ⎤ ⎥ ⎥ ⎥. otherwise (29) where 1 ≤ i ≤ m and 1 ≤ j ≤ k. . and mixed. we have to calculate the similarity between the input pattern and every existing cluster.(31). and the complexity of IOC is O(mkpn) where n is the number of documents involved. It concerns the merge of component features in a cluster into a resulting feature.(25) are defined as follows: tij = μGj (xi ).3 Text Classification Given a set D of training documents.2 Feature Extraction Formally. Clearly. [38]. but has not been fully edited. yi wT φ(xi ) + b ≥ 1 − ξi . Therefore. tmk T T belong to a cluster and so it only contributes to a new extracted feature. For this case. We propose three weighting approaches: hard. in worst case.(30) and tS is obtained by Eq. On the contrary. a smaller γ will favor soft-weighting and get a higher accuracy. t1k ⎢ t21 .8. l . . with the degrees depending on the values of the membership functions. and 0.(25). In this case. Content may change prior to final publication. slack variables ξi are introduced to account for misclassifications. Apparently... Note that any classifying technique other than SVM can be applied. In the soft-weighting approach. if a word wi belongs to cluster Wj . ξi ≥ 0. the number of clusters is large and each cluster covers fewer training patterns. (31) The mixed-weighting approach is a combination of the hardweighting approach and the soft-weighting approach. . x2 . 
When the similarity threshold is small. tm1 . Our method is better than DC and IOC. 0.6 >.b 2 l with di di = [di1 di2 · · · dim ]. x2 = < 0.4. The sorted list is 0. when the similarity threshold is large. We specify the similarity threshold ρ for Eq. and words in the feature vector W are also clustered accordingly. 0. The complexity of IG is O(mp + mlogm). So the order of feeding is x3 .(16). In this case. We discuss briefly here the computational cost of our method and compare it with DC [27]. 3. By applying our clustering algorithm. IG is the quickest one. 0. the elements of T in Eq. The goal of feature reduction is achieved by finding an appropriate T such that k is smaller than m. T is a weighting matrix. D = [d1 d2 · · · dn ] . we provide flexibility to the user. the elements of T in Eq. a larger γ will favor hardweighting and get a higher accuracy. a classifier based on support vector machines (SVM) is built. the time complexity of our method is O(mkp) where m is the number of original features and k is the number of clusters finally obtained. . 3. and x3 = < 0. [39].3. let x1 = < 0. 2. the complexity is O(mkpt) where t is the number of iterations to be done. the elements of T in Eq. In the divisive information-theoretic feature clustering algorithm [27] described in Section 2.1. each word is only allowed to ξi i=1 (33) s.8. Joachims [36] showed that SVM is better than other methods for text categorization. Each pattern consists of p components where p is the number of classes in the document set. In this case. 0. ij ij and γ is a user-defined constant lying between 0 and 1. 0. 0. . This heuristic seems to work well. we have k extracted features. . Assume that k clusters are obtained for the words in the feature vector W.

1 Experiment 1: 20 Newsgroups Dataset The 20 Newsgroups collection contains about 20. Content may change prior to final publication. microaveraged F1 (MicroF1). 1.64. We calculate the ten word patterns x1 . We use a computer with Intel(R) Core(TM)2 Quad CPU Q6600 2.9964 >. For the SVM of class cv . 0. . DS . We first convert d to d by Eq. 1. . soft-weighting. 5. The weighting matrices TH . For p classes. F Pi (false positives wrt ci ) is the number of non-ci test documents that are incorrectly classified to ci . 1 > . . 4. and Cade12 [41]. microaveraged recall (MicroR). 1. 4GB of RAM to conduct the experiments. c2 . An SVM described above can only separate apart two classes. We first convert d to d by d = dT. If the three SVMs output -1. in this section. ‘fridge’ in the feature vector W. 0.5 > < 0. Suppose d is an unknown document. The programming language used is MATLAB7. w10 . 0. [44] defined as follows: M icroP M icroR M icroF 1 = = = Σ p T Pi i=1 p Σi=1 (T Pi + F Pi ) Σp T Pi i=1 Σp (T Pi + F Ni ) i=1 (36) (37) We give an example here to illustrate how our method works.25. and DC is a feature clustering approach. respectively. or dM = dTM = < 2. one from each SVM. x10 according to Eq. To compare classification effectiveness of each method. .0696.4167 > deviation σ < 0. The value was determined by investigation. x2 . C is a parameter which gives a tradeoff between maximum margin and classification error. respectively. DS . 4. 1 ≤ v ≤ p.(34). is the target label of pattern xi . our method is abbreviated as FFC. . F Ni (false negatives wrt ci ) is the number of ci test documents that are classified to non-ci . . IOC [14]. . For example. the predicted classes will be c2 and c3 . and mixed-weighting (with γ = 0. and 1. 1. In our method. .(25) for different cases of weighting are shown in Table 6. Furthermore. The fuzzy similarity of each word pattern to each cluster is shown in Table 4. 0.6179. respectively.(8). . . 4 A N E XAMPLE In this section. respectively.3478. We run a classifier built on support vector machines [42]. we denote the ten words as w1 . the training patterns of class cv are treated as having yi = +1 and the training patterns of the other classes are treated as having yi = −1. . Then the transformed unknown document is fed to the classifier. by setting σ0 = 0. 1. w2 . We compare our method with other three feature reduction methods: IG [10]. P (c2 |w6 ) > and P (c2 |w6 ) is calculated by Eq. and TM obtained by hard-weighting. the classifier concludes that d belongs to c2 . and DC [27].8). (34) Based on DH . The transformed data sets DH . These articles are evenly .5833. IOC is an incremental feature extraction algorithm.25 for all the following experiments.0.40GHz. Note that φ : X → F is a mapping from the input space to the feature space F where patterns are more easily separated. we adopt the performance measures in terms of microaveraged precision (MicroP). one SVM for each class. Note that each word pattern is a two-dimensional vector since there are two classes involved in D. on the word patterns and obtain 3 clusters G1 . with 10 words ‘office’. and microaveraged accuracy (MicroAcc) [4]. 5 E XPERIMENTAL RESULTS Then we feed d to the classifier. respectively.6095 > < 0. ‘building’. as shown in Table 1. are used in the following experiments. a classifier with two SVMs is built. SFFC. being +1 or -1. Note that σ0 is set to the constant 0. For this example. the transformed document is obtained as dH = dTH = < 2. 4.3. dS = dTS = < 2. 3. and 1. We get p values.3993 >. 
T Pi (true positives wrt ci ) is the number of ci test documents that are correctly classified to ci . and G3 which are shown in Table 3.000 articles taken from the Usenet newsgroups. Now we are ready for classifying unknown documents. as described in Section 3. [43]. respectively. Then d belongs to those classes with 1 appearing at the outputs of their corresponding SVMs. 1. and wT φ(xi ) + b = 0 is the hyperplane to be derived. 0 > < 0.08.5. Let D be a simple document set.5591. and d = < 0. soft-weighting. x6 = < P (c1 |w6 ). We run our self-constructing clustering algorithm. For simplicity. If the three SVMs output 1. . IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 6 TABLE 3 Three clusters obtained.6095. 0. cluster G1 G2 G3 size S 3 5 2 mean m < 1. As described in Section 2. The classifier is then the aggregation of these SVMs. -1. we create p SVMs.92 > < 0. IG is one of the state-of-art feature selection approaches. . T Ni (true negatives wrt ci ) is the number of nonci test documents that are classified to non-ci . 2 >. and M-FFC to represent hard-weighting. G2 . with w and b being weight vector and offset. we use H-FFC. but has not been fully edited. or DM . The resulting word patterns are shown in Table 2. d2 . are shown in Table 5. We follow the idea in [36] to construct an SVM-based classifier. d9 of two classes c1 and c2 . and mixed-weighting. 1. standing for Fuzzy Feature Clustering. and DM obtained by Eq.(7) and Eq.6179 > where l is the number of training patterns.5 and ρ = 0.(35). consider a case of three classes c1 . M icroAcc = 2M icroP × M icroR (38) M icroP + M icroR p Σi=1 (T Pi + T Ni ) (39) p Σi=1 (T Pi + T Ni + F Pi + F Ni ) where p is the number of classes. TS . . containing 9 documents d1 . Then. For convenience. we present experimental results to show the effectiveness of our fuzzy self-constructing feature clustering method. yi = +1 and yi = −1. Three well-known datasets for text classification research: 20 Newsgroups [1]. For example. and yi . .1118. 2. Suppose d is an unknown document. and c3 . σ0 is set to 0. 0. . then the predicted classes will be c1 and c3 for this document. 1.This article has been accepted for publication in a future issue of this journal. on the extracted features obtained by different methods. RCV1 [40].

Figure 2 shows the execution time (sec) of different feature reduction methods on the 20 Newsgroups dataset.1568 0.00 x6 0. the x-axis indicates the class number and the y-axis indicates the fraction of the articles of each class.1682 x5 1. especially when the number of extracted features is small. For 84 extracted features. when the number of extracted features exceeds 280. Execution time (sec) of different methods on 20 Newsgroups data.0060 0. Different values of ρ are used in FFC and are listed in the table.00 1. We use two-thirds of the documents for training and the rest for testing.0105 0. or words.0003 0.0003 0.9643 x9 0. our method runs much faster than DC and IOC.38 seconds and IOC requires 6943.0003 0. but DC and IOC require 293.0000 0. 2. IG only gets 95. distributed over 20 classes and each class has about 1. no accuracies are listed for IOC when the number of extracted features exceeds 280.00 x5 1.05 seconds.00 TABLE 4 Fuzzy similarities of word patterns to three clusters.9661 0.3869 0.000 articles. Figure 3 shows the MicroAcc (%) of the 20 Newsgroups data set obtained by different methods. the horizontal axis indicates the number of extracted features. The words of top k weights in W are selected as the extracted features in W . they have the same execution time and thus we use FFC to denote them in Figure 2. The vertical axis indicates the MicroAcc values (%) and the horizontal axis indicates the number of extracted features.00 0.80 x4 0. Obviously.0000 0. different values of ρ are used in FFC.00 0.50. Note that for IG.0105 0. For the other methods.00 x2 0. for this data set.4027 x6 0. Table 7 lists values of certain points in Figure 2.61 seconds for 20 extracted features. S-FFC works very well Execution time (sec) .80 x3 0. but has not been fully edited.1353 0. After preprocessing. To obtain different numbers of extracted features.00 x10 1. The number of extracted features is identical to the number of clusters.40 seconds.98 and 28098.00 x8 0. Therefore.5 for M-FFC.20 0.9254 0. based on the extracted features previously obtained.4111 0. our method only needs 17. S-FFC.4631 x3 0. and M-FFC have the same clustering phase. as indicated by dashes in Table 7. 1+2+0+1+1+1+1+1+0 (35) TABLE 2 Word patterns of W. the number of extracted features should be specified in advance by the user.68 seconds. IG performs the worst in classification accuracy. we have 25.1682 x2 0. Also.9643 x7 1. similarity μG1 (x) μG2 (x) μG3 (x) x1 0. while DC requires 88. In this figure. For example.9661 0.50 0.0060 0.1682 x10 1. DC and IOC run significantly slow.00 1. the execution time is basically the same for any value of k. each word is given a weight.9661 0. our method needs 8.00 0.4027 x8 0.67 0. respectively.4631 x4 0.20 0. In particular.000 seconds without getting finished.79% in accuracy for 20 extracted features.00 1.9254 0.4027 10 5 10 4 FFC DC IG IOC 10 3 10 2 10 1 10 0 20 50 100 200 Number of extracted features 500 1000 1453 Fig.This article has been accepted for publication in a future issue of this journal.0105 0. Since HFFC. As shown in the figure. as shown in Figure 1(a). As the number of extracted features increases.718 features. Note that γ is set to 0. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7 TABLE 1 A simple document set D. Table 8 lists values of ceratin points in Figure 3. x1 0. In this figure. For example. Content may change prior to final publication.33 x9 0.50 x7 1.0000 0. 
office d1 d2 d3 d4 d5 d6 d7 d8 d9 (w1 ) 0 0 0 0 0 2 3 1 1 building (w2 ) 1 0 0 0 0 1 2 0 1 line (w3 ) 0 0 0 1 0 1 1 1 1 floor (w4 ) 0 0 0 0 1 0 3 1 1 bedroom (w5 ) 1 0 0 2 0 0 0 0 0 kitchen (w6 ) 1 2 0 1 1 1 1 1 0 apartment (w7 ) 0 1 1 2 0 0 0 0 0 internet (w8 ) 0 1 0 1 0 0 1 0 0 WC (w9 ) 0 0 0 0 1 1 1 0 0 fridge (w10 ) 1 0 0 1 0 0 0 0 0 class c1 c1 c1 c1 c2 c2 c2 c2 c2 P (c2 |w6 ) = 1×0 + 2×0 + 0×0 + 1×0 + 1×1 + 1×1 + 1×1 + 1×1 + 0×1 = 0. IOC spends more than 100.

0336 0.0489 3. But for MicroR.0003 1.4950 1.1364 1.0000 0. but has not been fully edited.38 6943.9929 0.0601 1.0000 0. Table 9 lists values of the MicroP.65% and 98.44 203 (0.98 88.1682 0.0283 1.08 0.This article has been accepted for publication in a future issue of this journal. H-FFC.9851 0. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 8 TABLE 5 Weighting matrices: hard TH . Class distributions of three datasets. with almost identical performance when the number of features is less than 500.12 Ratio of the class to all Ratio of the class to all 0.2 0. For example. M-FFC performs only a little bit worse than S-FFC.0336 0.12) 19. For MicroF1.15 0.0000 0. M-FFC performs well for MicroP.43% and 98. For MicroP.9932 0.9254 0.0000 0.68 120 (0.24 when the number of extracted features is smaller than 58. and 98.9932 0.3333 0.4027 0.60 55.23) 19.5524 0.79 ——— 185.19) 19. in accuracy for 20 extracted features.0272 10.0000 0.1043 2.05 17.0021 0.0805 0.36) 19.16 0.91 13.718 features are used for classification.1 0.2446 3.1682 0.4631 0. MicroR.98 972.2327 3.5667 0.9661 0.4027 0.24 67513.46% in accuracy for 20 extracted features.02 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Classes of 20 Newsgroups 1 2 3 4 5 6 7 8 Classes of Cade12 9 10 11 12 (a) 20 Newsgroups (b) RCV1 (c) Cade12 Fig. of extracted features threshold (ρ) IG DC IOC FFC 20 (0. TM = ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎦ ⎣ 0.0166 3.3 ——— 79.98 204.00 23.95 84 (0.0001 1.9411 0.73% and 97.0105 0.90 19397.9643 0. and MicroF1 (%) of the 20 Newsgroups dataset obtained by different methods.54%.0336 0.30 521 (0.9929 0.0829 3. and DC in order.25 0. soft TS .0272 0.79 1187 (0.1413 1.0638 5.0021 1.3006 2.98 5012.0667 0.0021 0. For MicroP. Note that while the whole 25. in accuracy for 58 extracted features.1105 0.2625 d1 d2 d3 d4 d5 d6 d7 d8 d9 (w1 ) 2. get best results for MicroF1.0025 (w2 ) 1. respectively.9851 0.56% for 203 extracted features.7831 DS (w3 ) 2.9643 0.6818 1.0060 0. respectively.2525 0.04 0.9661 0. 1.98 1973.40 8.0805 0.1568 0.98 28098.98 704.5217 2.0003 1.0060 0.52 39.01 0.1360 0. Content may change prior to final publication.98 3425.01) 19.09%. From this table.1882 0.45%.1527 0.1133 0. For example.0000 5.14 0.0012 0. soft DS .3192 5. except for the case of 20 extracted features.4111 0.4027 4.0297 0.63% and 98. in general. ⎥ ⎥ ⎥ ⎦ TABLE 6 Transformed data sets: hard DH .5 for M-FFC. and MicroF1.2955 0.0012 0.0001 1.0284 0.4051 1. followed by M-FFC.0001 0.02) 19. H-FFC and M-FFC get 98.4990 1. TS = ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎦ ⎣ 0.1682 0. S-FFC performs better than H-FFC by a greater margin than in the MicroP case. S-FFC gets 98.1362 10.61 58 (0. TABLE 7 Sampled execution times (sec) of different methods on 20 Newsgroups data. they get 97.57 39243.1483 0.4810 1.1353 1.2790 2. DC and IOC perform a little bit worse in accuracy when the number of extracted features is small.05 0.06) 19.59% for 521 extracted features.99 1453 (0.04 ——— 155.36 280 (0. we can see that S-FFC can.0805 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥.3869 0.4631 0. M-FFC . HFFC performs a little bit better than S-FFC. ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ TH = ⎢ ⎢ ⎢ ⎢ ⎣ 0 0 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 ⎤ ⎡ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ .32) 19.69 93010.1 0.0822 0. Note that γ is set to 0.0591 0.03 0.06 0.3949 4.0126 (w2 ) 1. and mixed DM .04 Ratio of the class to all 0 20 40 60 Classes of RCV1 80 100 0.9932 0.0000 0.02 0. 98. Although the number of extracted features is affected by the threshold value. For example.9661 0.0271 1. the accuracy is 98.0000 5. 
and mixed TM .7637 1.9566 DM (w3 ) 1.0314 0.06 0.0105 0.03) 19.0105 0. H-FFC and M-FFC perform well in accuracy all the time. the classification accuracies obtained keep fairly stable.4027 ⎤ ⎡ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ .98 293.0021 0.2465 3.0805 2.05 0.9254 0. d1 d2 d3 d4 d5 d6 d7 d8 d9 (w1 ) 2 1 1 5 0 0 0 0 0 (w2 ) 1 0 0 1 2 5 10 3 4 DH (w3 ) 1 3 0 2 1 1 2 1 0 d1 d2 d3 d4 d5 d6 d7 d8 d9 (w1 ) 2. MicroR.98 486.0926 0.1420 0.0926 0.0105 1.0774 0.0003 0.

Note that while the whole 25.21 74.72 85.22 85.58 52.28 —91.10 98.53 68. 203 280 (0.67 74.38 (MicroR) IOC 48.90 65.39 81.65 92.65 91.91 —85.28 91.86 98. and M-FFC1453.5 97 96.90 88.34 92.80 M-FFC 89.68 98.5 96 95.01 91.43 97.45 120 (0. For MicroR.06) 89.85 85. but has not been fully edited.01) (0.25 92.63 —98. In summary.79 96.85 85.69 (MicroP) IOC 86.62 M-FFC 98.72 98.91 —80.62 78.15 (MicroF1) IOC 62.26 77.07 92.76 78.22 85.50 521 (0.01) (0.96 77. MicroP is 94.81 98.41 98.06) 96.11 66. and F1 (%) of different methods for 20 Newsgroups data.38 72.97 91.62 98.55 S-FFC 90.90 91.37 91.40 84. DC gets 88.51 88.44 S-FFC 98.75 27.59 91.20 96.93 85.18%.59 98. 3.23 F1 DC 73. Recall.8 H−FFC S−FFC M−FFC DC IG IOC 50 100 200 500 Number of extracted features 1000 1500 25718 Fig.46 86.28 —98. .56 97.62 84.4 0.12) (0.36 79.05 91. especially when the number of features is less than 200.56 59.57 Full features (25718 features): MicroAcc = 98.98 78.27 78.77 84.96 S-FFC 76.50 58.69 91.46 98.92 56.07 92.54 61.94 98.32) 97.42 92.58 77.5 0.87 1453 (0.16 73.32) 91.79 92.85 67.36 74.60 1453 (0.82 69.57 93.84 S-FFC 83.12) 96.73 98.77 85.53.29 93. while it performs worse than DC and M-FFC when the number of features is greater than 200.28 85. and MicroF1 is 82. IG runs very fast in feature reduction.65 98.05 86.56 280 (0.99 91.71 41.43 81.03 Full features (25718 features): MicroP = 94. MicroR 120 (0.53 90.16 79.02) (0.84 78.50%. of extracted features 20 58 84 threshold (ρ) (0.70 78. and 1. Fig.95 98.67 MicroF1 = 82.5 98 97.53%.15 34.02) (0.44 97. but it performs worse when the number of features is less than 200.73 80.66 82. of extracted features 20 58 84 threshold (ρ) (0.46 77.453. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 9 TABLE 8 Microaveraged accuracy (%) of different methods for 20 Newsgroups data.86 H-FFC 65.44 84.07 87.82 —92. indicated by M-FFC20.36 97.91 90.25 77. MicroR is 73.87 92.27 —92.53 85. As shown in the figure.00 90.70 98.03) microaveraged IG 95.65 86.09 97. S-FFC performs best for all the cases. DC performs well when the number of features is greater than 200.61 98.93 82.6 Parameter of M−FFC M−FFC20 M−FFC203 M−FFC1453 0.40 M-FFC 70.25 84.64 82. respectively.63 98.56 87.46 85. but the extracted features obtained are not good for classification.72 74.42 76. 4.32 accuracy DC 97.69 98. Content may change prior to final publication. the MircoF1 values do not vary significantly with different settings of γ.7 0.19) 89.53 71.76 41.77 45.03) microaveraged IG 91.718 features are used for classification.5 95 20 90 88 86 Microaveraged Accuracy % Microaveraged F1 % 84 82 80 78 76 74 72 70 0.59 1187 (0.05 99 98. M-FFC performs well.This article has been accepted for publication in a future issue of this journal. 203.56 63.87 57. performs better than S-FFC most of the time.72 78.17 M-FFC 78.69 98.94 98.48 74.3 0.57 84.23) 97.92 microaveraged IG 17. Microaveraged accuracy (%) of different methods for 20 Newsgroups data.00 86.80 74.2 0.18.59 1187 (0.11 30.22 89. FFC can not only run much faster than DC and IOC in feature reduction.04 85.59 98.19 98.41 92.81 91.54 = 73.84 78.00% and M-FFC gets 89.57 98.26 —98.36) 97.59 82.19 93.51 98.19) 96. Microaveraged F1 (%) of M-FFC with different γ values for 20 Newsgroups data.71 78.58 203 (0.36) 91.95 81.40 98.38 (MicroAcc) IOC 97. 20.65% when the number of features is 20.62 TABLE 9 Microaveraged Precision.07 91.64 98.34 84.02 —85.23) 89.28 —85.85 —78. 
Figure 4 shows the influence of γ values on the performance of M-FFC in MicroF1 for three numbers of extracted features.30 79.68 H-FFC 75.17 94.80 92.45 77.32 microaveraged IG 29.76 57.33 98. For example.70 —78.50 H-FFC 97.26 43.60 521 (0.88 70.61 81.32 49.49 98.79 92.54 98.22 recall DC 62.77 89.35 92. M-FFC203.17 56. H-FFC performs well when the number of features is less than 200.78 precision DC 88. but also provide comparably good or better extracted features for classification.05 80.92 82.52 H-FFC 90.43 98.

5.2 Experiment 2: Reuters Corpus Volume 1 (RCV1) Dataset

The RCV1 dataset consists of 804,414 news stories produced by Reuters from 20 Aug 1996 to 19 Aug 1997. All news stories are in English and have 109 distinct terms per document on average. The documents are divided, by the "LYRL2004" split defined in Lewis et al. [40], into 23,149 training documents and 781,265 testing documents. There are 103 Topic categories, and the distribution of the documents over the classes is shown in Figure 1(b). All the 103 categories have one or more positive test examples in the test set, but only 101 of them have one or more positive training examples in the training set. It is these 101 categories that we use in this experiment. After preprocessing, we have 47,152 features for this dataset.

Figure 5 shows the execution time (sec) of different feature reduction methods on the RCV1 dataset, and Table 10 lists values of certain points in Figure 5. IG runs very fast in feature reduction, but the extracted features obtained are not good for classification. DC and IOC run significantly slow. Clearly, our method runs much faster than DC and IOC, especially when the number of extracted features is small. For example, at one sampled point DC and IOC require 1,709.77 seconds and over 100,000 seconds, respectively, and at another IOC requires 94,533.88 seconds, while the corresponding FFC times in Table 10 are far smaller.
Fig. 5. Execution time (sec) of different methods on RCV1 data.

Figure 6 shows the MicroAcc (%) of the RCV1 dataset obtained by different methods, and Table 11 lists values of certain points in Figure 6. IG performs the worst in classification accuracy; for example, IG only gets 96.95% in accuracy for 18 extracted features. S-FFC, H-FFC, DC, and M-FFC perform well in accuracy all the time, especially when the number of extracted features is small. As the number of extracted features increases, all the methods perform about equally well. No accuracies are listed for IOC when the number of extracted features exceeds 18.
Fig. 6. Microaveraged accuracy (%) of different methods for RCV1 data.

Table 12 lists values of the MicroP, MicroR, and MicroF1 (%) of the RCV1 dataset obtained by different methods, based on the extracted features previously obtained. From this table, we can see that S-FFC can, in general, get the best results for MicroF1, followed by M-FFC and H-FFC. S-FFC performs better than the other methods, especially when the number of features is less than 500. M-FFC performs well for MicroP, the values being around 80%-85%. IG gets a high MicroP when the number of features is smaller than 50. Note that while the whole 47,152 features are used for classification, the accuracy is 98.83%, with MicroP around 86%, MicroR around 75%, and MicroF1 around 80%. In summary, FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification.

Figure 7 shows the influence of γ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, indicated by M-FFC18, M-FFC202, and M-FFC1700. As shown in the figure, the MicroF1 values do not vary significantly with different settings of γ. Note that γ is set to 0.5 for M-FFC.
Fig. 7. Microaveraged F1 (%) of M-FFC with different γ values for RCV1 data.
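For reference, the micro-averaged measures reported throughout this section pool the per-category counts over all document-category pairs before each ratio is taken. The sketch below is our own illustration (the function and variable names are not from the paper) of how MicroP, MicroR, MicroF1, and MicroAcc can be computed from binary multi-label predictions.

```python
import numpy as np

def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1, and accuracy for multi-label data.

    y_true, y_pred: (n_documents, n_categories) binary arrays.  Counts are
    pooled over all document-category pairs before the ratios are taken,
    which is what "micro-averaging" refers to."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```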

TABLE 10. Sampled execution times (sec) of different methods on RCV1 data.
TABLE 11. Microaveraged accuracy (%) of different methods for RCV1 data.
TABLE 12. Microaveraged precision, recall, and F1 (%) of different methods for RCV1 data.

5.3 Experiment 3: Cade12 Data

Cade12 is a set of classified Web pages extracted from the Cadê Web directory [41]. This directory points to Brazilian Web pages that were classified by human experts into 12 classes. A version of this dataset was obtained from [45]. The collection contains 40,983 documents in total with 122,607 features, from which two-thirds, 27,322 documents, are split for training and the remaining 13,661 documents for testing. The Cade12 collection has a skewed distribution, and the three most popular classes represent more than 50% of all documents. The distribution of the Cade12 data is shown in Figure 1(c) and Table 13.

TABLE 13. The number of documents contained in each class of the Cade12 dataset.
  01-servicos        8473     07-internet          2381
  02-sociedade       7363     08-cultura           2137
  03-lazer           5590     09-esportes          1907
  04-informatica     4519     10-noticias          1082
  05-saude           3171     11-ciencias           879
  06-educacao        2856     12-compras-online     625

Figure 8 shows the execution time (sec) of different feature reduction methods on the Cade12 dataset, and Table 14 lists values of certain points in Figure 8. Again, DC and IOC run significantly slow, and our method runs much faster than both. For example, our method obtains 22 extracted features in 24.02 seconds and 84 extracted features in 47.48 seconds, whereas DC needs hundreds to thousands of seconds (689.35 and 2,939.10 seconds at sampled points), and IOC can only get finished in 100,000 seconds with 22 extracted features and has a difficulty in completion at the other sampled points, as shown in Table 14.
Fig. 8. Execution time (sec) of different methods on Cade12 data.
TABLE 14. Sampled execution times (sec) of different methods on Cade12 data.
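As a quick consistency check on the Cade12 statistics quoted above (the Table 13 class counts, the two-thirds/one-third split, and the claim that the three largest classes cover more than half of the collection), the following snippet recomputes them from the numbers given in the text.

```python
# Class sizes taken from Table 13 (Cade12).
class_counts = {
    "01-servicos": 8473, "02-sociedade": 7363, "03-lazer": 5590,
    "04-informatica": 4519, "05-saude": 3171, "06-educacao": 2856,
    "07-internet": 2381, "08-cultura": 2137, "09-esportes": 1907,
    "10-noticias": 1082, "11-ciencias": 879, "12-compras-online": 625,
}

total = sum(class_counts.values())               # 40983 documents
top3 = sum(sorted(class_counts.values())[-3:])   # 8473 + 7363 + 5590 = 21426

print(total)                                     # 40983 = 27322 + 13661
print(round(100 * top3 / total))                 # ~52% -> "more than 50%"
print(total * 2 // 3, total - total * 2 // 3)    # 27322 training, 13661 testing
```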

Figure 9 shows the MicroAcc (%) of the Cade12 dataset obtained by different methods, and Table 15 lists values of certain points in Figure 9. As shown in the figure, all the methods work equally well for MicroAcc. Again, no accuracies are listed for IOC when the number of extracted features exceeds 22.
Fig. 9. Microaveraged accuracy (%) of different methods for Cade12 data.
TABLE 15. Microaveraged accuracy (%) of different methods for Cade12 data.

Table 16 lists values of the MicroP, MicroR, and MicroF1 (%) of the Cade12 dataset obtained by different methods, based on the extracted features previously obtained. As shown in the tables, IG performs best for MicroP but worst for MicroR. S-FFC, M-FFC, and H-FFC get a little bit worse than IG in MicroP, but they are good in MicroR. However, none of the methods work satisfactorily well for MicroP. In fact, it is hard to get feature reduction for Cade12; loose correlation exists among the original features. Note that while the whole 122,607 features are used for classification, MicroAcc is around 93%, MicroP around 69%, MicroR around 40%, and MicroF1 around 50% (see the bottom of Tables 15 and 16). Again, this experiment shows that FFC can not only run much faster than DC and IOC in feature reduction, but also provide comparably good or better extracted features for classification.
TABLE 16. Microaveraged precision, recall, and F1 (%) of different methods for Cade12 data.

Figure 10 shows the influence of γ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, indicated by M-FFC22, M-FFC286, and M-FFC1338. Again, the MicroF1 values do not vary significantly with different settings of γ. Note that γ is set to 0.5 for M-FFC.
Fig. 10. Microaveraged F1 (%) of M-FFC with different γ values for Cade12 data.

Other projects were done on text classification, and evaluation results were published. We discuss some of these methods here; the evaluation results given for them are taken directly from the corresponding papers and are summarized in Table 17. Some methods adopted the same performance measures as we did, i.e., microaveraged accuracy or microaveraged F1. Others adopted uni-labeled accuracy (ULAcc). Uni-labeled accuracy is basically the same as the normal accuracy, except that a pattern, either training or testing, with m class labels is copied m times, each copy being associated with a distinct class label. Note that, according to the definitions, uni-labeled accuracy is usually lower than microaveraged accuracy for multi-label classification.

Joachims [33] applied the Rocchio classifier over TFIDF-weighted representation and obtained 91.8% uni-labeled accuracy on the 20 Newsgroups dataset. Lewis et al. [40] used the SVM classifier over a form of TFIDF weighting and obtained 81.6% microaveraged F1 on the RCV1 dataset. Al-Mubaid & Umair [31] applied a classifier, called Lsquare, which combines the distributional clustering of words and a learning logic technique, and achieved 98.00% microaveraged accuracy and 86.45% microaveraged F1 on the 20 Newsgroups dataset; they applied AIB [25] for feature reduction, and the number of features was reduced to 120. Cardoso-Cachopo [45] applied the SVM classifier over the normalized TFIDF term weighting and achieved 82.48% uni-labeled accuracy on the 20 Newsgroups dataset and 52.84% uni-labeled accuracy on the Cade12 dataset. No feature reduction was applied in the other three methods.

TABLE 17. Published evaluation results for some other text classifiers.
  method                   dataset   feature red.   evaluation
  Joachims [33]            20 NG     no             91.80% (ULAcc)
  Lewis et al. [40]        RCV1      no             81.6% (MicroF1)
  Al-Mubaid & Umair [31]   20 NG     AIB [25]       98.00% (MicroAcc), 86.45% (MicroF1)
  Cardoso-Cachopo [45]     20 NG     no             82.48% (ULAcc)
  Cardoso-Cachopo [45]     Cade12    no             52.84% (ULAcc)

This explains why both Joachims [33] and Cardoso-Cachopo [45] have lower uni-labeled accuracies than the microaveraged accuracies shown in Table 8 and Table 15. Lewis et al. [40] has a higher microaveraged F1 than that shown in Table 11; this may be due to the fact that Lewis et al. [40] used TFIDF weighting, which is more elaborate than the TF weighting we used. However, it is hard for us to explain the difference between the microaveraged F1 obtained by Al-Mubaid & Umair [31] and that shown in Table 9, since a kind of sampling was applied in Al-Mubaid & Umair [31].
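To make the ULAcc definition above concrete, here is a small sketch (our own illustration with hypothetical names) that scores single-label predictions against multi-label ground truth in the "copied m times" fashion just described.

```python
from typing import List, Set

def uni_labeled_accuracy(true_labels: List[Set[str]],
                         predicted: List[str]) -> float:
    """Uni-labeled accuracy as described in the text: a document with m
    true labels is counted m times, once per label, and each copy is
    scored against the single predicted label for that document."""
    correct, total = 0, 0
    for labels, pred in zip(true_labels, predicted):
        for label in labels:          # one copy per true label
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0
```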

6 CONCLUSIONS

We have presented a fuzzy self-constructing feature clustering (FFC) algorithm, which is an incremental clustering approach to reduce the dimensionality of the features in text classification. Features that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. If a word is not similar to any existing cluster, a new cluster is created for this word. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experiments on three real-world datasets have demonstrated that our method can run faster and obtain better extracted features than other methods.

Similarity-based clustering is one of the techniques we have developed in our machine learning research. The work of this paper was motivated by distributional word clustering proposed in [24]. We found that when a document set is transformed to a collection of word patterns, as by Eq. (7), the relevance among word patterns can be measured, and the word patterns can be grouped by applying our similarity-based clustering algorithm. Our method is good for text categorization problems due to the suitability of the distributional word clustering concept. In this paper, we have applied this clustering technique to text categorization; we are also applying it to other problems, such as image segmentation, data sampling, fuzzy modeling, and web mining.
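As a rough illustration of the incremental, self-constructing behaviour summarized above, the following sketch clusters word patterns with a Gaussian-style membership test. It is a simplified reading, not the paper's exact formulation: the threshold name rho, the initial deviation sigma0, and the update rules are our own.

```python
import numpy as np

def self_constructing_clustering(word_patterns, rho=0.5, sigma0=0.1):
    """Single incremental pass over word patterns.  Each cluster keeps a
    mean and a per-dimension deviation; a pattern joins the best-matching
    cluster if its membership exceeds rho, otherwise it seeds a new
    cluster.  Illustrative sketch only."""
    means, devs, sizes = [], [], []
    for x in word_patterns:
        x = np.asarray(x, dtype=float)
        if means:
            # Gaussian-like membership of x in every existing cluster.
            memberships = [np.exp(-np.sum(((x - m) / d) ** 2))
                           for m, d in zip(means, devs)]
            best = int(np.argmax(memberships))
        if not means or memberships[best] < rho:
            # Not similar enough to any existing cluster: create a new one.
            means.append(x.copy())
            devs.append(np.full(x.shape, sigma0))
            sizes.append(1)
        else:
            # Fold x into the best cluster: update its mean and deviation.
            n = sizes[best]
            new_mean = (means[best] * n + x) / (n + 1)
            devs[best] = np.sqrt((devs[best] ** 2 * n + (x - new_mean) ** 2) / (n + 1))
            means[best] = new_mean
            sizes[best] = n + 1
    return means, devs, sizes
```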

ACKNOWLEDGMENT

The authors are grateful to the anonymous reviewers for their comments, which were very helpful in improving the quality and presentation of the paper. They would also like to express their thanks to the National Institute of Standards & Technology for the provision of the RCV1 dataset.

REFERENCES

[1] http://people.csail.mit.edu/jrennie/20Newsgroups/.
[2] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
[3] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley Longman, 1999.
[5] D. D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop on Speech and Natural Language, pp. 212-217, 1992.
[6] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[7] A. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, nos. 1-2, pp. 245-271, 1997.
[8] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th International Conference on Machine Learning, pp. 412-420, 1997.
[9] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th International Conference on Machine Learning, pp. 284-292, 1996.
[10] E. F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones, "Introducing a Family of Linear Measures for Feature Selection in Text Categorization," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1223-1232, 2005.
[11] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," Journal of Machine Learning Research, vol. 6, pp. 37-53, 2005.
[12] H. Li, T. Jiang, and K. Zhang, "Efficient and Robust Feature Extraction by Maximum Margin Criterion," Advances in Neural Information Processing Systems, pp. 97-104, 2004.
[13] E. Oja, Subspace Methods of Pattern Recognition. Pattern Recognition and Image Processing Series, 1983.
[14] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, "Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 320-331, 2006.
[15] I. T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[16] A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.
[17] H. Park, M. Jeon, and J. B. Rosen, "Lower Dimensional Representation of Text Data Based on Centroids and Least Squares," BIT Numerical Math, vol. 43, pp. 427-448, 2003.
[18] S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[19] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[20] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems 14, 2002.
[21] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, and S. Yoshizawa, "Successive Learning of Linear Discriminant Analysis: Sanger-Type Algorithm," Proc. 14th International Conference on Pattern Recognition, pp. 2664-2667, 2000.
[22] J. Weng, Y. Zhang, and W.-S. Hwang, "Candid Covariance-Free Incremental Principal Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 1034-1040, 2003.
[23] J. Yan et al., "IMMC: Incremental Maximum Margin Criterion," Proc. 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 725-730, 2004.
[24] L. D. Baker and A. K. McCallum, "Distributional Clustering of Words for Text Classification," Proc. 21st Annual International ACM SIGIR, pp. 96-103, 1998.
[25] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional Word Clusters vs. Words for Text Categorization," Journal of Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[26] M. C. Dalmau and O. W. Márquez Flórez, "Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters21578 Collection," Proc. 29th European Conference on IR Research, pp. 678-681, 2007.
[27] I. S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," Journal of Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[28] D. Ienco and R. Meo, "Exploration and Reduction of the Feature Space by Hierarchical Clustering," Proc. 2008 SIAM Conference on Data Mining, pp. 577-587, 2008.
[29] N. Slonim and N. Tishby, "The Power of Word Clusters for Text Classification," Proc. 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
[30] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Annual Meeting of the ACL, pp. 183-190, 1993.
[31] H. Al-Mubaid and S. A. Umair, "A New Text Categorization Technique Using Distributional Clustering and Learning Logic," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156-1165, 2006.
[32] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[33] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. 14th International Conference on Machine Learning, pp. 143-151, 1997.
[34] J. Yen and R. Langari, Fuzzy Logic: Intelligence, Control, and Information. Prentice-Hall, Upper Saddle River, NJ, 1999.
[35] J.-S. Wang and C. S. G. Lee, "Self-Adaptive Neurofuzzy Inference Systems for Classification Applications," IEEE Transactions on Fuzzy Systems, vol. 10, no. 6, pp. 790-802, 2002.
[36] T. Joachims, "Text Categorization with Support Vector Machine: Learning with Many Relevant Features," Technical Report LS-8-23, Computer Science Department, University of Dortmund, 1998.
[37] C. Cortes and V. Vapnik, "Support-Vector Network," Machine Learning, vol. 20, pp. 273-297, 1995.
[38] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
[39] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[40] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," Journal of Machine Learning Research, vol. 5, pp. 361-397, 2004 (http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf).
[41] The Cadê Web directory, http://www.cade.com.br/.
[42] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[43] Y. Yang and X. Liu, "A Re-examination of Text Categorization Methods," Proc. ACM SIGIR Conference, pp. 42-49, 1999.
[44] G. Tsoumakas, I. Katakis, and I. Vlahavas, "Mining Multi-label Data," Data Mining and Knowledge Discovery Handbook (draft of preliminary accepted chapter), O. Maimon and L. Rokach (eds.), Springer, 2009.
[45] http://web.ist.utl.pt/~acardoso/datasets/.

Jung-Yi Jiang was born in Changhua, Taiwan, ROC, in 1979. He received the B.S. degree from I-Shou University, Taiwan, in 2002 and the M.S. degree from Sun Yat-Sen University, Taiwan, in 2004. He is currently pursuing the Ph.D. degree at the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan. His main research interests include machine learning, data mining, and information retrieval.

Ren-Jia Liou was born in Banqiao, Taipei County, Taiwan. He received the B.S. degree from National Dong Hwa University, Taiwan, in 2005 and the M.S. degree from National Sun Yat-Sen University, Taiwan, in 2009. He is currently a research assistant at the Department of Chemistry, National Sun Yat-Sen University, Kaohsiung, Taiwan. His main research interests include machine learning, data mining, and web programming.

Shie-Jue Lee was born at KinMen, ROC, on August 15, 1955. He received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University in 1977 and 1979, respectively, and the Ph.D. degree in Computer Science from the University of North Carolina, Chapel Hill, USA, in 1990. Dr. Lee joined the faculty of the Department of Electrical Engineering at National Sun Yat-Sen University, Taiwan, in 1983, and has become a professor of the department since 1994. He served as director of the Center for Telecommunications Research and Development, National Sun Yat-Sen University (1997-2000), director of the Southern Telecommunications Research Center, National Science Council (1998-1999), and Chair of the Department of Electrical Engineering, National Sun Yat-Sen University (2000-2003). Now he is serving as Deputy Dean of Academic Affairs and director of the NSYSU-III Research Center, National Sun Yat-Sen University. His research interests include Artificial Intelligence, Machine Learning, Data Mining, Information Retrieval, and Soft Computing.

Dr. Lee obtained the Distinguished Teachers Award of the Ministry of Education, Taiwan, the Distinguished Research Award and the Distinguished Teaching Award of National Sun Yat-Sen University, and the Distinguished Mentor Award of National Sun Yat-Sen University, 2008. He was awarded by the Chinese Institute of Electrical Engineering for Outstanding M.S. Thesis Supervision, and he obtained the Distinguished Paper Award of the Computer Society of the Republic of China, the Best Paper Award of the International Conference on Machine Learning and Cybernetics, and the Best Paper Award of the 7th Conference on Artificial Intelligence and Applications. He served as the program chair for the International Conference on Artificial Intelligence (TAAI-96), Kaohsiung, Taiwan, December 1996; the International Computer Symposium - Workshop on Artificial Intelligence, Tainan, Taiwan, December 1998; and the 6th Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, November 2001. Dr. Lee is a member of the IEEE Society of Systems, Man, and Cybernetics, the Association for Automated Reasoning, the Institute of Information and Computing Machinery, the Taiwan Fuzzy Systems Association, and the Taiwanese Association of Artificial Intelligence.
