Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
contribution to an entropy index based on the data similarity; the individual "feature saliency" is estimated, and an Expectation-Maximization (EM) algorithm using the Minimum Message Length criterion is derived to select the feature subset and the number of clusters (Martin et al., 2002).

While the methods mentioned above are not directly targeted at clustering text documents, in this paper we introduce two novel feature selection methods for text clustering. One is Term Contribution (TC), which ranks each feature by its overall contribution to the documents' similarity.

2.1 Information Gain (IG)

Information gain (Yang & Pedersen, 1997) of a term measures the number of bits of information obtained for category prediction by the presence or absence of the term in a document. Let m be the number of classes. The information gain of a term t is defined as

IG(t) = -\sum_{i=1}^{m} p(c_i)\log p(c_i) + p(t)\sum_{i=1}^{m} p(c_i \mid t)\log p(c_i \mid t) + p(\bar{t})\sum_{i=1}^{m} p(c_i \mid \bar{t})\log p(c_i \mid \bar{t})    (1)
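As a concrete illustration of equation (1), the following is a minimal sketch of how IG could be computed for every term of a labeled corpus represented as bags of terms. The function and variable names are illustrative, not from the paper.

```python
import math
from collections import defaultdict

def information_gain(docs, labels):
    """IG(t) of equation (1) for every term.

    docs   : list of sets of terms, one set per document
    labels : list of class labels, parallel to docs
    """
    n = len(docs)
    class_count = defaultdict(int)
    for c in labels:
        class_count[c] += 1
    # -sum_i p(c_i) log p(c_i)
    base = -sum((m / n) * math.log(m / n) for m in class_count.values())

    df = defaultdict(int)                          # number of documents containing t
    df_c = defaultdict(lambda: defaultdict(int))   # ... broken down by class
    for terms, c in zip(docs, labels):
        for t in terms:
            df[t] += 1
            df_c[t][c] += 1

    ig = {}
    for t, n_t in df.items():
        # sum_i p(c_i | t) log p(c_i | t)
        present = sum((k / n_t) * math.log(k / n_t) for k in df_c[t].values() if k > 0)
        # sum_i p(c_i | t_bar) log p(c_i | t_bar)
        absent = 0.0
        if n - n_t > 0:
            for c, m in class_count.items():
                k = m - df_c[t][c]                 # documents of class c without t
                if k > 0:
                    absent += (k / (n - n_t)) * math.log(k / (n - n_t))
        ig[t] = base + (n_t / n) * present + ((n - n_t) / n) * absent
    return ig
```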
TC(t) = \sum_{i,j \,\wedge\, i \neq j} f(t, d_i) \times f(t, d_j)    (8)

The "ltc" scheme is used to compute each term's tf*idf value: it takes the log of the term's frequency in the document, then multiplies it by the IDF weight of this term, and finally normalizes by the document length.

If the weights of all terms are equal, we simply set f(t, d) = 1 when the term t appears in the document d. Then the value TC(t) can be written as equation (9):

TC(t) = DF(t)(DF(t) - 1)    (9)

Since this transformation is monotonically increasing and DF(t) (the document frequency of the term t) is a positive integer, DF is just a special case of TC.
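Under the definitions above, one way to compute the weights and TC is sketched below. The pairwise sum in equation (8) can be obtained without enumerating document pairs via the identity \sum_{i \neq j} f_i f_j = (\sum_i f_i)^2 - \sum_i f_i^2, which also reproduces equation (9) when all weights are 1. The exact "ltc" variant is not spelled out in this excerpt, so the weighting below is one common reading of that scheme; all names are illustrative.

```python
import math
from collections import defaultdict

def ltc_weights(tf_docs):
    """One common reading of the "ltc" tf*idf scheme:
    (1 + log tf) * log(N / df), cosine-normalized per document.
    tf_docs: list of {term: raw term frequency} dicts."""
    n = len(tf_docs)
    df = defaultdict(int)
    for d in tf_docs:
        for t in d:
            df[t] += 1
    weighted = []
    for d in tf_docs:
        w = {t: (1.0 + math.log(tf)) * math.log(n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

def term_contribution(weighted_docs):
    """TC(t) of equation (8), computed with the pairwise-sum identity."""
    total, total_sq = defaultdict(float), defaultdict(float)
    for d in weighted_docs:
        for t, w in d.items():
            total[t] += w
            total_sq[t] += w * w
    return {t: total[t] ** 2 - total_sq[t] for t in total}
```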
In the EM framework, the expected values of the missing variables are computed based on the current parameter estimation at the E-step. Then, at the M-step, the missing values are replaced by the expected values calculated in the E-step to find a new estimation of the parameters that maximizes the complete-data likelihood function. These two steps are iterated until convergence.

In clustering applications, it is normally assumed that a document is generated by a finite mixture model and that there is a one-to-one correspondence between mixture components and clusters. So the likelihood function p(D | θ), i.e. the probability of all documents D given the model parameter θ, can be written as equation (10):

p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{|C|} p(c_j \mid \theta)\, p(d_i \mid c_j, \theta)    (10)

where c_j is the j-th cluster, |C| is the number of clusters, p(c_j | θ) is the prior distribution of cluster j, and p(d_i | c_j, θ) is the distribution of document i in cluster j. It is further assumed that the terms are conditionally independent given the class label, and then the likelihood function can be rewritten as equation (11):

p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{|C|} \Big( p(c_j \mid \theta) \prod_{t \in d_i} p(t \mid c_j, \theta) \Big)    (11)

where p(t | c_j, θ) is the term distribution for the term t in cluster j. Not all terms are equally relevant to the document, so p(t | c_j, θ) can be treated as the weighted sum of a relevant distribution and an irrelevant distribution, as in equation (12):

p(t \mid c_j, \theta) = z(t)\, p(t \text{ is relevant} \mid c_j, \theta) + (1 - z(t))\, p(t \text{ is irrelevant} \mid \theta)    (12)

where z(t) = p(t is relevant) is defined as the probability that the term t is relevant. In addition, if the term is not relevant, its distribution is assumed to be the same over different clusters and is denoted p(t is irrelevant | θ). Hence, the likelihood function can be written as equation (13):

p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{|C|} \Big( p(c_j \mid \theta) \prod_{t \in d_i} \big( z(t)\, p(t \text{ is relevant} \mid c_j, \theta) + (1 - z(t))\, p(t \text{ is irrelevant} \mid \theta) \big) \Big)    (13)
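Read procedurally, equation (13) amounts to the computation sketched below. This is an unoptimized illustration (no log-space underflow handling), and the parameter containers p_cluster, p_rel, p_irrel and z are placeholders, not notation from the paper.

```python
import math

def log_likelihood(docs, p_cluster, p_rel, p_irrel, z):
    """log p(D | theta) under equation (13).

    docs      : list of documents, each an iterable of terms
    p_cluster : {j: p(c_j | theta)}                 -- cluster priors
    p_rel     : {(t, j): p(t is relevant | c_j)}    -- relevant term distribution
    p_irrel   : {t: p(t is irrelevant | theta)}     -- shared irrelevant distribution
    z         : {t: p(t is relevant)}               -- feature relevance weight
    """
    ll = 0.0
    for d in docs:
        doc_prob = 0.0
        for j, prior in p_cluster.items():
            term_prod = 1.0
            for t in d:
                term_prod *= (z[t] * p_rel[(t, j)]
                              + (1.0 - z[t]) * p_irrel[t])
            doc_prob += prior * term_prod
        ll += math.log(doc_prob)
    return ll
```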
To maximize this likelihood function, the EM algorithm can find a local maximum by iterating the following two steps:

(1) E-step: \hat{z}^{(k+1)} = E(z \mid D, \hat{\theta}^{(k)})    (14)

(2) M-step: \hat{\theta}^{(k+1)} = \arg\max_{\theta} p(D \mid \theta, \hat{z}^{(k+1)})    (15)

The E-step corresponds to calculating the expected feature relevance given the clustering result, and the M-step corresponds to calculating a new maximum-likelihood estimation of the clustering result in the new feature space.

The proposed EM algorithm for text clustering and feature selection is a general framework, and the full implementation is beyond the scope of this paper. Our Iterative Feature Selection method can be fitted into this framework. On the E-step, to approximate the expectation of feature relevance, we use a supervised feature selection algorithm to calculate the relevance score for each term; the probability of term relevance is then simplified to z(t) ∈ {0, 1} according to whether the term's relevance score is larger than a predefined threshold value. So at each iteration, we remove some irrelevant terms based on the calculated relevance of each term. On the M-step, because the K-means algorithm can be described by slightly extending the mathematics of the EM algorithm to the hard-threshold case (Bottou et al., 1995), we use the K-means clustering algorithm to obtain the clustering results based on the selected terms.
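A minimal sketch of the E-step/M-step alternation just described, assuming a generic clustering routine (the M-step, e.g. K-means on the current feature space) and a supervised term-scoring function (the E-step, e.g. the information-gain sketch given earlier, computed against the current cluster assignments). The hard threshold on the relevance score is expressed here as dropping a fixed fraction of the lowest-scoring terms; all names and defaults are illustrative.

```python
def iterative_feature_selection(docs, n_clusters, cluster, score_terms,
                                drop_fraction=0.1, max_iter=12):
    """Alternate clustering (M-step) with supervised term scoring on the
    cluster labels (E-step), removing low-scoring terms each round.

    docs        : list of {term: weight} dicts
    cluster     : callable(docs, n_clusters) -> list of cluster ids
    score_terms : callable(list_of_term_sets, labels) -> {term: score}
    Returns the surviving vocabulary.
    """
    vocab = {t for d in docs for t in d}
    for _ in range(max_iter):
        projected = [{t: w for t, w in d.items() if t in vocab} for d in docs]
        labels = cluster(projected, n_clusters)                     # M-step
        scores = score_terms([set(d) for d in projected], labels)   # E-step
        ranked = sorted(vocab, key=lambda t: scores.get(t, 0.0))
        n_drop = max(1, int(drop_fraction * len(vocab)))
        vocab -= set(ranked[:n_drop])            # z(t) = 0 for the dropped terms
    return vocab
```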
4. Experiments

We first conducted an ideal-case experiment to demonstrate that supervised feature selection methods, including IG and CHI, can significantly improve the clustering performance. Then we evaluated the performance of unsupervised feature selection methods, including DF, TS, TC and En, in the real case. Finally, we evaluated the iterative feature selection algorithm. K-means was chosen as our basic clustering algorithm, and the entropy and precision measures were used to evaluate the clustering performance. Since the K-means clustering algorithm is easily influenced by the selection of initial centroids, we randomly produced 10 sets of initial centroids for each data set and averaged the 10 runs as the final clustering performance. Before performing clustering, tf*idf (with the "ltc" scheme) was used to calculate the weight of each term.
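The averaging over random initializations described above might look like the sketch below. scikit-learn's KMeans is used purely for illustration (the paper does not name an implementation), and entropy_and_precision refers to the helper sketched after equations (16)-(19) below.

```python
import numpy as np
from sklearn.cluster import KMeans

def average_kmeans_performance(X, true_labels, n_clusters, n_runs=10):
    """Run K-means from several random initializations and average the
    entropy/precision measures over the runs.

    X : (n_docs, n_terms) array of tf*idf ("ltc") weights over the
        currently selected terms.
    """
    entropies, precisions = [], []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, init="random", n_init=1, random_state=seed)
        cluster_ids = km.fit_predict(X)
        e, p = entropy_and_precision(cluster_ids, true_labels)  # sketched in 4.1
        entropies.append(e)
        precisions.append(p)
    return float(np.mean(entropies)), float(np.mean(precisions))
```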
4.1 Performance Measures

Two popular measures, entropy and precision, were used to evaluate the clustering performance.

4.1.1 ENTROPY

Entropy measures the uniformity or purity of a cluster. Let G' and G denote the number of obtained clusters and the number of original classes, respectively. Let A denote the set of documents in an obtained cluster, and let the class label of each document d_i ∈ A, i = 1, ..., |A| be denoted label(d_i), which takes a value c_j (j = 1, ..., G). The entropy for all clusters is defined as the weighted sum of the entropies of the individual clusters, as shown in equation (16):

Entropy = -\sum_{k=1}^{G'} \frac{|A_k|}{N} \sum_{j=1}^{G} p_{jk} \log(p_{jk})    (16)

where

p_{jk} = \frac{1}{|A_k|} \left| \{ d_i \mid label(d_i) = c_j \} \right|    (17)

4.1.2 PRECISION

Since entropy is not intuitive, we chose another measure to evaluate the clustering performance, i.e. precision. Each cluster generally consists of documents from several different classes, so we simply choose the class label shared by most documents in the cluster as its final class label. Then, the precision for each cluster is defined as:

Precision(A) = \frac{1}{|A|} \max_{j} \left| \{ d_i \mid label(d_i) = c_j \} \right|    (18)

In order to avoid the possible bias from small clusters, which have very high precision, the final precision is defined as the weighted sum of the precisions of all clusters, as shown in equation (19):

Precision = \sum_{k=1}^{G'} \frac{|A_k|}{N}\, Precision(A_k)    (19)
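A minimal sketch of how the two measures in equations (16)-(19) could be computed from cluster assignments and true class labels; the function and variable names are illustrative, not from the paper.

```python
import math
from collections import Counter, defaultdict

def entropy_and_precision(cluster_ids, true_labels):
    """Clustering entropy (equations 16-17) and precision (equations 18-19),
    each weighted by cluster size |A_k| / N.

    cluster_ids : list of cluster assignments, one per document
    true_labels : list of original class labels, parallel to cluster_ids
    """
    n = len(true_labels)
    clusters = defaultdict(list)
    for cid, lab in zip(cluster_ids, true_labels):
        clusters[cid].append(lab)

    entropy, precision = 0.0, 0.0
    for members in clusters.values():
        counts = Counter(members)
        size = len(members)
        # per-cluster entropy over the class proportions p_jk of equation (17)
        h = -sum((c / size) * math.log(c / size) for c in counts.values())
        entropy += (size / n) * h                                 # equation (16)
        precision += (size / n) * (max(counts.values()) / size)   # equations (18)-(19)
    return entropy, precision
```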
4.2 Data Sets

As reported in past research work (Yang & Pedersen, 1997), text categorization performance varies greatly across datasets. So we chose three different text datasets to evaluate text clustering performance, including two standard labeled datasets, Reuters-21578 (Reuters) and 20 Newsgroups (20NG), and one web directory dataset (Web) collected from the Open Directory Project. There were 21578 documents in total in Reuters, but we only chose the documents that have at least one topic and a Lewis split assignment, and assigned the first topic to each document before the evaluation. The information about these datasets is shown in Table 1.

Table 1. The three datasets' properties
REUTERS 80 10733 18484 40.7 23.6 0.56

Figure 1. Entropy comparison on 3 datasets (supervised)

...Web Directory dataset, there is a significant performance improvement. For example, using the CHI method on the Web Directory dataset, when 98% of the terms are removed, the entropy is reduced from 2.305 to 1.870 (an 18.9% relative entropy reduction) and the precision is improved from 52.9% to 63.0% (a 19.1% relative precision improvement). Certainly, these results are just an upper bound on the clustering performance in the real case, because it is difficult to select discriminative terms without prior class label information.
Figure 3. Entropy comparison on Reuters (unsupervised)
Figure 4. Precision comparison on Reuters (unsupervised)

...very similar to DF, because the removal of a common term more easily causes an entropy reduction than the removal of a rare term. TS is sensitive to the document similarity threshold β. When β is suitable, the performance of TS is better than that of TC; however, the threshold is difficult to tune. In addition, the time complexity of TS (at least O(N^2)) is always higher than that of TC (O(M\bar{N}^2)), because \bar{N} is always much smaller than N (see Table 1). Therefore, TC is the preferred method, with effective performance and low computation cost.

Similar points can be found on the remaining two datasets, 20NG and Web (Figure 5 and Figure 6). Due to the limitations of the paper length, we only drew the entropy measure for these two datasets. As can be seen from these figures, the best performances are also yielded on the Web Directory data: about 8.5% entropy reduction and 8.1% precision improvement are achieved while 96% of the features are removed with the TS method.
From these figures, we found the following points. First, unsupervised feature selection can also improve the clustering performance, with [...] relative improvement while 90% of the terms are removed. Second, unsupervised feature selection can be comparable to supervised feature selection with up to 90% term removal. When more terms are removed, the performances [...]

Figure 6. Entropy comparison on 20NG (unsupervised)
Figure 7. Entropy & precision with IF selection on Reuters
Since IG and CHI have good performance in the ideal case, they were chosen for the iterative feature selection experiment. In order to speed up the iterative process, we removed the terms whose document frequency was lower than 3 before clustering. Then, at each iteration, about 10% of the terms (or 3% of the terms when fewer than 10% of the terms were left) with the lowest ranking score output by IG or CHI were removed. Figure 7 displays the entropy (dashed line) and precision (solid line) results on Reuters. Figure 8 displays the results on the Web Directory dataset. We did not show the results on 20NG [...]

Figure 8. Entropy & precision with IF selection on Web Directory

The performance of iterative feature selection is quite good. It is very close to the ideal case and much better than any of the unsupervised feature selection methods. For example, after 11 iterations with CHI selection on the Web Directory dataset (nearly 98% of the terms removed) [...]