
An Evaluation on Feature Selection for Text Clustering

Tao Liu LTMAILBOX@263.SINA.COM


Department of Information Science, Nankai University, Tianjin 300071, P. R. China

Shengping Liu LSP@IS.PKU.EDU.CN


Department of Information Science, Peking University, Beijing 100871, P. R. China

Zheng Chen ZHENGC@MICROSOFT.COM


Wei-Ying Ma WYMA@MICROSOFT.COM
Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, P. R. China

Abstract

Feature selection methods have been successfully applied to text categorization but are seldom applied to text clustering due to the unavailability of class label information. In this paper, we first give empirical evidence that feature selection methods can improve the efficiency and performance of text clustering algorithms. Then we propose a new feature selection method called "Term Contribution (TC)" and perform a comparative study on a variety of feature selection methods for text clustering, including Document Frequency (DF), Term Strength (TS), Entropy-based (En), Information Gain (IG) and the χ2 statistic (CHI). Finally, we propose an "Iterative Feature Selection (IF)" method that addresses the unavailability of labels by utilizing effective supervised feature selection methods to iteratively select features and perform clustering. Detailed experimental results on Web Directory data are provided in the paper.

1. Introduction

Text clustering is one of the central problems in text mining and information retrieval. The task of text clustering is to group similar documents together. It has been applied in several settings, including improving the retrieval efficiency of information retrieval systems (Kowalski, 1997), organizing the results returned by a search engine in response to a user's query (Zamir et al., 1997), browsing large document collections (Cutting et al., 1992), and generating taxonomies of web documents (Koller & Sahami, 1997).

In text clustering, a text or document is usually represented as a bag of words. This representation raises one severe problem: the high dimensionality of the feature space and the inherent data sparsity. Obviously, a single document has a sparse vector over the set of all terms. The performance of clustering algorithms declines dramatically due to the problems of high dimensionality and data sparseness (Aggrawal & Yu, 2000). Therefore it is highly desirable to reduce the dimensionality of the feature space. There are two commonly used techniques to deal with this problem: feature extraction and feature selection. Feature extraction is a process that extracts a set of new features from the original features through some functional mapping (Wyse et al., 1980), such as principal component analysis (PCA) (Jolliffe, 1986) and word clustering (Slonim & Tishby, 2000). Feature extraction methods have the drawback that the generated new features may not have a clear physical meaning, so the clustering results are difficult to interpret (Dash & Liu, 2000).

Feature selection is a process that chooses a subset of the original feature set according to some criteria. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. Depending on whether class label information is required, feature selection can be either unsupervised or supervised. For supervised methods, the correlation of each feature with the class label is computed by distance, information dependence, or consistency measures (Dash & Liu, 1997). Further theoretical study based on information theory can be found in (Koller & Sahami, 1996), and complete reviews can be found in (Blum & Langley, 1997; Jain et al., 2000; Yang & Pedersen, 1997).

There has been some prior work on feature selection for clustering. Firstly, any traditional feature selection method that does not need class information, such as document frequency (DF) and term strength (TS) (Yang, 1995), can easily be applied to clustering. Secondly, there are some newly proposed methods: for example, the entropy-based feature ranking method (En) of Dash and Liu (2000), in which feature importance is measured by its contribution to an entropy index based on the data similarity, and the method of Martin et al. (2002), in which the individual "feature saliency" is estimated and an Expectation-Maximization (EM) algorithm using a Minimum Message Length criterion is derived to select the feature subset and the number of clusters.

While the methods mentioned above are not directly targeted at clustering text documents, in this paper we introduce two novel feature selection methods for text clustering. One is Term Contribution (TC), which ranks a feature by its overall contribution to the document similarities in a dataset. The other is Iterative Feature Selection (IF), which utilizes successful supervised feature selection methods (Yang & Pedersen, 1997), such as Information Gain (IG) and the χ2 statistic (CHI), to iteratively select features and perform text clustering at the same time.

Another contribution of this paper is a comparative study on feature selection for text clustering. We investigate (a) to what extent feature selection can improve the clustering quality, (b) how much of the document vocabulary can be reduced without losing useful information in text clustering, (c) what the strengths and weaknesses of existing feature selection methods are when applied to text clustering, and (d) what the differences are among the results on different datasets. We try to address these questions with empirical evidence. We first show that feature selection methods can improve the efficiency and performance of text clustering in the ideal case, in which the class label of each document is already known. Then we perform a comparative study on various feature selection methods for text clustering. Finally, we evaluate the performance of the iterative feature selection method based on K-means, using entropy and precision measures.

The rest of this paper is organized as follows. In Section 2, we give a brief introduction to several feature selection methods and propose a new feature selection method, called Term Contribution. In Section 3, we propose a new iterative feature selection method that utilizes supervised feature selection algorithms without the need to know the class information in advance. In Section 4, we conduct several experiments to compare the effectiveness of different feature selection methods in ideal and real cases. Finally, we summarize our major contributions in Section 5.

2. Feature Selection Methods

In this section, we give a brief introduction to several effective feature selection methods, including two supervised methods, IG and CHI, and four unsupervised methods, DF, TS, En and TC. All these methods assign a score to each individual feature and then select the features whose scores are greater than a pre-defined threshold.

In the following, let D denote the document set, M the dimension of the feature space, and N the number of documents in the dataset.

2.1 Information Gain (IG)

Information gain (Yang & Pedersen, 1997) of a term measures the number of bits of information obtained for category prediction by the presence or absence of the term in a document. Let m be the number of classes. The information gain of a term t is defined as

IG(t) = -\sum_{i=1}^{m} p(c_i) \log p(c_i) + p(t) \sum_{i=1}^{m} p(c_i \mid t) \log p(c_i \mid t) + p(\bar{t}) \sum_{i=1}^{m} p(c_i \mid \bar{t}) \log p(c_i \mid \bar{t})    (1)

2.2 χ2 statistic (CHI)

The χ2 statistic measures the association between a term and a category (Galavotti et al., 2000). It is defined as

\chi^2(t, c) = \frac{N \times (p(t, c) \times p(\bar{t}, \bar{c}) - p(t, \bar{c}) \times p(\bar{t}, c))^2}{p(t) \times p(\bar{t}) \times p(c) \times p(\bar{c})}    (2)

\chi^2(t) = \mathrm{avg}_{i=1}^{m} \{ \chi^2(t, c_i) \}    (3)
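To make the two supervised criteria concrete, the following minimal NumPy sketch scores every term by IG (Eq. 1) and by class-averaged χ2 (Eqs. 2-3), starting from a binary term-presence matrix and known class labels. The function name, the eps smoothing constant and the dense-matrix layout are illustrative assumptions; this is not the implementation used in the experiments.

```python
import numpy as np

def supervised_scores(presence, labels):
    """IG and class-averaged chi-square scores for every term.
    presence: (n_docs x n_terms) 0/1 matrix; labels: length-n_docs class ids."""
    n_docs, n_terms = presence.shape
    classes = np.unique(labels)
    eps = 1e-12                                  # guard against log(0) and 0/0

    p_c = np.array([(labels == c).mean() for c in classes])        # p(c_i)
    p_t = presence.mean(axis=0)                                     # p(t)
    absent = 1 - presence
    # p(c_i | t) and p(c_i | not t), shape (n_classes, n_terms)
    p_c_given_t = np.array([presence[labels == c].sum(axis=0) for c in classes])
    p_c_given_t = p_c_given_t / (presence.sum(axis=0) + eps)
    p_c_given_nt = np.array([absent[labels == c].sum(axis=0) for c in classes])
    p_c_given_nt = p_c_given_nt / (absent.sum(axis=0) + eps)

    ig = (-(p_c * np.log(p_c + eps)).sum()
          + p_t * (p_c_given_t * np.log(p_c_given_t + eps)).sum(axis=0)
          + (1 - p_t) * (p_c_given_nt * np.log(p_c_given_nt + eps)).sum(axis=0))

    chi = np.zeros(n_terms)
    for c in classes:                            # chi-square per (term, class)
        in_c = (labels == c)
        p_tc = presence[in_c].sum(axis=0) / n_docs       # p(t, c)
        p_tnc = presence[~in_c].sum(axis=0) / n_docs     # p(t, not c)
        p_ntc = absent[in_c].sum(axis=0) / n_docs        # p(not t, c)
        p_ntnc = absent[~in_c].sum(axis=0) / n_docs      # p(not t, not c)
        num = n_docs * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2
        den = p_t * (1 - p_t) * in_c.mean() * (1 - in_c.mean()) + eps
        chi += num / den
    chi /= len(classes)                          # average over classes, Eq. (3)
    return ig, chi
```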
2.3 Document Frequency (DF)

Document frequency is the number of documents in which a term occurs in a dataset. It is the simplest criterion for term selection and easily scales to large datasets with linear computation complexity. It is a simple but effective feature selection method for text categorization (Yang & Pedersen, 1997).

2.4 Term Strength (TS)

Term strength was originally proposed and evaluated for vocabulary reduction in text retrieval (Wilbur & Sirotkin, 1992), and was later applied to text categorization (Yang, 1995). It is computed from the conditional probability that a term occurs in the second half of a pair of related documents given that it occurs in the first half:

TS(t) = p(t \in d_j \mid t \in d_i), \quad d_i, d_j \in D \;\wedge\; sim(d_i, d_j) > \beta    (4)

where β is the parameter that determines the related pairs. Since we need to calculate the similarity of every document pair, the time complexity of TS is quadratic in the number of documents. Because class label information is not required, this method is also suitable for term reduction in text clustering.

2.5 Entropy-based Ranking (En)

Entropy-based ranking was proposed by Dash and Liu (2000). In this method, a term is measured by the entropy reduction when it is removed. The entropy is defined as

E(t) = -\sum_{i=1}^{N} \sum_{j=1}^{N} \left( S_{i,j} \log S_{i,j} + (1 - S_{i,j}) \log(1 - S_{i,j}) \right)    (5)

where S_{i,j} is the similarity between documents d_i and d_j, defined as

S_{i,j} = e^{-\alpha \times dist_{i,j}}, \quad \alpha = -\frac{\ln(0.5)}{\overline{dist}}    (6)

where dist_{i,j} is the distance between documents d_i and d_j after the term t is removed, and \overline{dist} is the average distance among the documents after the term t is removed.

The most serious problem of this method is its high computation complexity, O(MN^2). It is impractical when there are a large number of documents and terms, and therefore a sampling technique is used in real experiments (Dash & Liu, 2000).

2.6 Term Contribution (TC)

We introduce a new feature selection method called "Term Contribution" that takes the term weight into account. Because a simple method like DF assumes that each term is of the same importance in different documents, it is easily biased by those common terms which have high document frequency but a uniform distribution over different classes. TC is proposed to deal with this problem.

The result of text clustering is highly dependent on the document similarities, so the contribution of a term can be viewed as its contribution to the documents' similarity. The similarity between documents d_i and d_j is computed by the dot product:

sim(d_i, d_j) = \sum_{t} f(t, d_i) \times f(t, d_j)    (7)

where f(t, d) represents the tf*idf (Salton, 1989) weight of term t in document d. We therefore define the contribution of a term in a dataset as its overall contribution to the documents' similarities:

TC(t) = \sum_{i, j;\, i \neq j} f(t, d_i) \times f(t, d_j)    (8)

The "ltc" scheme is used to compute each term's tf*idf value: it takes the log of the term's frequency in the document, multiplies it by the IDF weight of the term, and finally normalizes by the document length.

If the weights of all terms were equal, we could simply set f(t, d) = 1 when the term t appears in document d. Then TC(t) can be written as

TC(t) = DF(t)(DF(t) - 1)    (9)

Since this transformation is monotonically increasing when DF(t) (the document frequency of term t) is a positive integer, DF is just a special case of TC. Using inverted document indexing (Salton, 1989), the time complexity of TC is O(M \bar{N}^2), where \bar{N} is the average number of documents in which a term occurs.
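As an illustration of Eqs. (7)-(9), the sketch below ranks terms by TC computed from a tf*idf matrix. It relies on the identity that the sum over ordered pairs i ≠ j of f(t, d_i) f(t, d_j) equals (Σ_i f(t, d_i))² − Σ_i f(t, d_i)², so no explicit pairwise loop is needed. The use of scikit-learn's TfidfVectorizer with sublinear tf and L2 normalization is only an approximation of the "ltc" scheme described above, and the helper name is our own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def term_contribution(docs):
    """Rank terms by TC(t) = sum_{i != j} f(t, d_i) * f(t, d_j)."""
    vec = TfidfVectorizer(sublinear_tf=True, norm="l2")   # rough stand-in for "ltc"
    X = vec.fit_transform(docs)                           # sparse (n_docs x n_terms)

    col_sum = np.asarray(X.sum(axis=0)).ravel()                 # sum_i f(t, d_i)
    col_sq_sum = np.asarray(X.multiply(X).sum(axis=0)).ravel()  # sum_i f(t, d_i)^2
    # sum over ordered pairs i != j equals (sum_i f)^2 - sum_i f^2
    tc = col_sum ** 2 - col_sq_sum
    order = np.argsort(-tc)
    return np.array(vec.get_feature_names_out())[order], tc[order]

# e.g. keep the top 10% of terms by TC score:
# terms, scores = term_contribution(corpus); selected = terms[: len(terms) // 10]
```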
3. Iterative Feature Selection (IF) Method

Feature selection methods have been applied successfully to text categorization for many years. Feature selection can dramatically improve the efficiency of text categorization algorithms by removing up to 98% of the unique terms, and can even improve the categorization accuracy to some extent (Yang & Pedersen, 1997). It is therefore an interesting idea to apply feature selection methods to the text clustering task to improve the clustering performance. In order to test this idea, several experiments were conducted. As can be seen from Section 4.3.1 and Section 4.3.2, if the class label information is known, supervised feature selection methods such as CHI and IG are much more effective than unsupervised feature selection methods for text clustering: not only can more terms be removed, but better performance is also obtained.

Supervised feature selection methods cannot be directly applied to text clustering because the required class label information is unavailable. Fortunately, we found that clustering and feature selection can reinforce each other. On the one hand, good clustering results provide good class labels from which to select better features for each category; on the other hand, better features help clustering improve its performance and thus provide better class labels. Inspired by the EM algorithm, we propose a novel iterative feature selection method that utilizes supervised feature selection methods for text clustering.

EM is a class of iterative algorithms for maximum likelihood estimation in problems with incomplete data (Dempster et al., 1977). Suppose we have a dataset generated from a model with parameters, and the data has missing values or unobservable (hidden) variables. To find the parameters that maximize the likelihood function, the EM algorithm first calculates the expectation of the hidden variables based on the current parameter estimates at the E-step. Then, at the M-step, the missing values are replaced by the expected values calculated in the E-step to find a new parameter estimate that maximizes the complete-data likelihood function. These two steps are iterated to convergence.

In clustering applications, it is normally assumed that a document is generated by a finite mixture model and that there is a one-to-one correspondence between mixture components and clusters. So the likelihood function p(D | θ), i.e. the probability of all documents D given the model parameter θ, can be written as

p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{|C|} p(c_j \mid \theta)\, p(d_i \mid c_j, \theta)    (10)

where c_j is the j-th cluster, |C| is the number of clusters, p(c_j | θ) is the prior distribution of cluster j, and p(d_i | c_j, θ) is the distribution of document i in cluster j. It is further assumed that the terms are conditionally independent given the class label, so the likelihood function can be rewritten as

p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{|C|} \Big( p(c_j \mid \theta) \prod_{t \in d_i} p(t \mid c_j, \theta) \Big)    (11)

where p(t | c_j, θ) is the distribution of term t in cluster j. Not all terms are equally relevant to the document, so p(t | c_j, θ) can be treated as the weighted sum of a relevant distribution and an irrelevant distribution:

p(t \mid c_j, \theta) = z(t)\, p(t \text{ is relevant} \mid c_j, \theta) + (1 - z(t))\, p(t \text{ is irrelevant} \mid \theta)    (12)

where z(t) = p(t is relevant) is defined as the probability that the term t is relevant. In addition, if a term is not relevant, its distribution is assumed to be the same over different clusters and is denoted p(t is irrelevant | θ). Hence, the likelihood function can be written as

p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{|C|} \Big( p(c_j \mid \theta) \prod_{t \in d_i} \big[ z(t)\, p(t \text{ is relevant} \mid c_j, \theta) + (1 - z(t))\, p(t \text{ is irrelevant} \mid \theta) \big] \Big)    (13)

To maximize this likelihood function, the EM algorithm can find a local maximum by iterating the following two steps:

(1) E-step: \hat{z}^{(k+1)} = E(z \mid D, \hat{\theta}^{(k)})    (14)

(2) M-step: \hat{\theta}^{(k+1)} = \arg\max_{\theta} p(D \mid \theta, \hat{z}^{(k+1)})    (15)

The E-step corresponds to calculating the expected feature relevance given the clustering result, and the M-step corresponds to calculating a new maximum likelihood estimate of the clustering result in the new feature space.

The proposed EM algorithm for text clustering and feature selection is a general framework, and its full implementation is beyond the scope of this paper. Our Iterative Feature Selection method fits into this framework. At the E-step, to approximate the expectation of feature relevance, we use a supervised feature selection algorithm to calculate the relevance score of each term; the probability of term relevance is then simplified to z(t) ∈ {0, 1} according to whether the term's relevance score is larger than a predefined threshold. So at each iteration we remove some irrelevant terms based on the calculated relevance of each term. At the M-step, because the K-means algorithm can be described by slightly extending the mathematics of the EM algorithm to the hard-threshold case (Bottou & Bengio, 1995), we use the K-means clustering algorithm to obtain the cluster results based on the selected terms.
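The following sketch shows one way the iteration just described might be wired together, using scikit-learn's K-means for the clustering step and its chi2 scorer as a stand-in for the CHI ranking of Section 2.2. The fixed drop_frac and iteration count are simplifications of the removal schedule described later in Section 4.3.3, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import chi2

def iterative_feature_selection(X, n_clusters, n_iters=10, drop_frac=0.10, seed=0):
    """Alternate K-means clustering on the surviving terms with supervised
    term scoring against the induced cluster labels, then drop the
    lowest-ranked fraction of terms before the next round.
    X: non-negative tf*idf matrix (dense or sparse), shape (n_docs, n_terms)."""
    keep = np.arange(X.shape[1])          # indices of currently selected terms
    labels = None
    for _ in range(n_iters):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = km.fit_predict(X[:, keep])       # "M-step": cluster on selected terms
        scores, _ = chi2(X[:, keep], labels)      # "E-step": CHI-style relevance scores
        scores = np.nan_to_num(scores)            # constant columns can yield NaN
        n_drop = max(1, int(drop_frac * keep.size))
        keep = keep[np.argsort(scores)[n_drop:]]  # discard the least relevant terms
    return labels, keep
```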
4. Experiments

We first conducted an ideal case experiment to demonstrate that supervised feature selection methods, including IG and CHI, can significantly improve the clustering performance. Then we evaluated the performance of the unsupervised feature selection methods, including DF, TS, TC and En, in the real case. Finally, we evaluated the iterative feature selection algorithm. K-means was chosen as our basic clustering algorithm, and the entropy and precision measures were used to evaluate the clustering performance. Since the K-means clustering algorithm is easily influenced by the choice of initial centroids, we randomly produced 10 sets of initial centroids for each dataset and averaged the performance over the 10 runs as the final clustering performance. Before performing clustering, tf*idf (with the "ltc" scheme) was used to calculate the weight of each term.

4.1 Performance Measures

Two popular measures, entropy and precision, were used to evaluate the clustering performance.

4.1.1 ENTROPY

Entropy measures the uniformity or purity of a cluster. Let G' and G denote the number of obtained clusters and the number of original classes, respectively. Let A denote the set of documents in an obtained cluster; the class label of each document d_i ∈ A, i = 1, ..., |A| is denoted label(d_i), which takes a value c_j (j = 1, ..., G). The overall entropy is defined as the weighted sum of the entropies of all clusters:

Entropy = -\sum_{k=1}^{G'} \frac{|A_k|}{N} \sum_{j=1}^{G} p_{jk} \log p_{jk}    (16)

where

p_{jk} = \frac{1}{|A_k|} \left| \{ d_i \in A_k \mid label(d_i) = c_j \} \right|    (17)

4.1.2 PRECISION

Since entropy is not intuitive, we also use another measure to evaluate the clustering performance, namely precision. Each cluster generally consists of documents from several different classes, so we simply choose the class label shared by most documents in the cluster as its final class label. The precision of a cluster A is then defined as

Precision(A) = \frac{1}{|A|} \max_{j} \left| \{ d_i \in A \mid label(d_i) = c_j \} \right|    (18)

In order to avoid the possible bias from small clusters, which can have very high precision, the final precision is defined as the weighted sum of the precisions of all clusters:

Precision = \sum_{k=1}^{G'} \frac{|A_k|}{N} Precision(A_k)    (19)
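A compact sketch of both measures is given below, assuming two equal-length arrays of cluster assignments and true class labels; the natural logarithm and the function name are our own choices, as the log base is not fixed above.

```python
import numpy as np

def entropy_and_precision(cluster_ids, true_labels):
    """Weighted cluster entropy (Eqs. 16-17) and weighted precision (Eqs. 18-19)."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    n = len(true_labels)
    entropy, precision = 0.0, 0.0
    for k in np.unique(cluster_ids):
        members = true_labels[cluster_ids == k]          # labels inside cluster A_k
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()                        # p_jk over classes present
        entropy += (len(members) / n) * (-(p * np.log(p)).sum())
        precision += (len(members) / n) * (counts.max() / counts.sum())
    return entropy, precision
```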
4.2 Data Sets

As reported in past research (Yang & Pedersen, 1997), text categorization performance varies greatly across datasets. We therefore chose three different text datasets to evaluate text clustering performance: two standard labeled datasets, Reuters-21578 (Reuters, http://www.daviddlewis.com/resources/testcollections/) and 20 Newsgroups (20NG, http://www.ai.mit.edu/people/jrennie/20Newsgroups/), and one web directory dataset (Web) collected from the Open Directory Project (http://dmoz.org/). There are 21578 documents in Reuters in total, but we only chose the documents that have at least one topic and a Lewis split assignment, and assigned the first topic to each document before the evaluation. The properties of these datasets are shown in Table 1.

Table 1. The three datasets' properties

DATA SET   CLASSES   DOCS    TERMS   AVG. TERMS   AVG. DF
           NUM.      NUM.    NUM.    PER DOC      PER TERM
REUTERS    80        10733   18484   40.7         23.6
20NG       20        18828   91652   85.3         17.5
WEB        35        5035    56399   131.9        11.8

4.3 Results and Analysis

4.3.1 SUPERVISED FEATURE SELECTION

First, we conducted an ideal case experiment to see whether good terms can help text clustering. That is, we applied supervised feature selection methods to choose the best terms based on the class label information. Then we executed the text clustering task on these selected terms and compared the clustering results with the baseline system, which clustered the documents on the full feature space.

The entropy comparison over the different datasets is shown in Figure 1, and the precision comparison is shown in Figure 2. As can be seen, at least 90% of the terms can be removed with either an improvement or no loss in clustering performance on any dataset. With more terms removed, the clustering performance on Reuters and 20NG changes little, but on the Web Directory dataset there is a significant performance improvement. For example, using the CHI method on the Web Directory dataset, when 98% of terms are removed, the entropy is reduced from 2.305 to 1.870 (an 18.9% relative entropy reduction) and the precision is improved from 52.9% to 63.0% (a 19.1% relative precision improvement). Certainly, these results are just an upper bound on the clustering performance in the real case, because it is difficult to select discriminative terms without prior class label information.

Figure 1. Entropy comparison on 3 datasets (supervised). [Plot: entropy vs. percentage of selected features for IG and CHI on Reuters, Web and 20NG.]

Figure 2. Precision comparison on 3 datasets (supervised). [Plot: precision vs. percentage of selected features for IG and CHI on Reuters, Web and 20NG.]

Feature selection makes little progress on Reuters and 20NG, while it achieves much improvement on the Web Directory dataset, which motivated us to find out the reason. As reported in past research on feature selection for text categorization, the classification accuracy is almost the same after removing some terms from the Reuters and 20NG datasets with most classifiers, including the Naïve Bayes classifier and the KNN classifier (Yang & Pedersen, 1997). We conclude that most terms in these two datasets still have discriminative value for classification, although few terms per class are enough to achieve acceptable classification accuracy. In other words, there are few noisy terms in these two datasets. Consequently, although feature selection can reduce the dimensionality of the feature space, the clustering performance cannot be improved significantly, because some useful terms are also discarded when terms are removed from these two datasets. However, the situation is different for the Web Directory data, in which there are many more noisy terms. To prove this, we conducted a Naïve Bayes classification experiment on the Web Directory dataset and found that the classification accuracy increased from 49.6% to 57.6% after removing 98% of the terms. Hence, when these noisy terms, such as "site, tool, category", are removed in clustering, the clustering performance can also be significantly improved.

4.3.2 UNSUPERVISED FEATURE SELECTION

The second experiment compares the unsupervised feature selection methods (DF, TS, TC and En) with the supervised feature selection methods (IG and CHI). The entropy and precision results on Reuters are shown in Figure 3 and Figure 4.

Figure 3. Entropy comparison on Reuters (unsupervised). [Plot: entropy vs. percentage of selected features for IG, CHI, DF, TC, TS and En.]

Figure 4. Precision comparison on Reuters (unsupervised). [Plot: precision vs. percentage of selected features for IG, CHI, DF, TC, TS and En.]

From these figures, we draw the following points. First, unsupervised feature selection can also improve the clustering performance when a certain fraction of terms is removed. For example, every unsupervised feature selection method achieves about a 2% relative entropy reduction and a 1% relative precision improvement when 90% of terms are removed. Second, unsupervised feature selection is comparable to supervised feature selection with up to 90% term removal. When more terms are removed, the performance of the supervised methods can still improve, but the performance of the unsupervised methods drops quickly. In order to find out the reason, we compared the terms selected by IG and TC at different removal thresholds. We found that in the beginning, low document frequency terms were removed by both methods, but at the next stage, with more terms removed, the discrimination value between a term and the classes becomes much more important than the document frequency for term selection. Since the discrimination value depends on the class information, the supervised method IG could keep those less common but discriminative terms, such as "OPEC, copper, drill". However, without class information, TC could not decide whether a term was discriminative and still kept common terms such as "April, six, expect". Hence, it is clear that supervised methods are much better than unsupervised methods when more terms are removed.

Finally, compared with the other unsupervised feature selection methods, TC is better than DF and En, but a little worse than TS. For example, as Figure 3 shows, when 96% of terms are removed, the entropies of DF and En are much higher than the baseline value (on the full feature space), but the entropies of TC and TS are still lower than the baseline value. En is very similar to DF because the removal of a common term easily causes a larger entropy reduction than the removal of a rare term. TS is sensitive to the document similarity threshold β. When β is well chosen, the performance of TS is better than TC; however, the threshold is difficult to tune. In addition, the time complexity of TS (at least O(N^2)) is always higher than that of TC (O(M \bar{N}^2)), because \bar{N} is always much smaller than N (see Table 1). Therefore, TC is the preferred method, with effective performance and low computation cost.

Similar observations hold on the other two datasets, 20NG and Web (Figure 5 and Figure 6). Due to space limitations, we only plot the entropy measure for these two datasets. As can be seen from these figures, the best performance is again obtained on the Web Directory data: about an 8.5% entropy reduction and an 8.1% precision improvement are achieved when 96% of features are removed with the TS method.

Figure 5. Entropy comparison on Web Directory (unsupervised). [Plot: entropy vs. percentage of selected features for IG, CHI, DF, TC, TS and En.]
Figure 6. Entropy comparison on 20NG (unsupervised). [Plot: entropy vs. percentage of selected features for IG, CHI, DF, TC, TS and En.]

4.3.3 ITERATIVE FEATURE SELECTION

The third experiment measures the iterative feature selection algorithm proposed in Section 3. Since IG and CHI have good performance in the ideal case, they were chosen for the iterative feature selection experiment. In order to speed up the iterative process, we removed the terms whose document frequency was lower than 3 before clustering. Then, at each iteration, about 10% of the terms (or 3% of the terms when fewer than 10% of the terms were left) with the lowest ranking score output by IG or CHI were removed. Figure 7 displays the entropy (dashed line) and precision (solid line) results on Reuters, and Figure 8 displays the results on the Web Directory dataset. We do not show the results on 20NG due to space limitations.

Figure 7. Entropy & precision with IF selection on Reuters. [Plot: entropy and precision vs. iteration for IG and CHI.]

Figure 8. Entropy & precision with IF selection on Web Directory. [Plot: entropy and precision vs. iteration for IG and CHI.]

As can be seen from Figure 7 and Figure 8, the performance of iterative feature selection is quite good: it is very close to the ideal case and much better than any unsupervised feature selection method. For example, after 11 iterations with CHI selection on the Web Directory dataset (nearly 98% of terms removed), the entropy was reduced from 2.305 to 1.994 (a 13.5% relative entropy reduction) and the precision was improved from 52.9% to 60.6% (a 14.6% relative precision improvement). This is close to the upper bound from the ideal case (an 18.9% entropy reduction and a 19.1% precision improvement, see Section 4.3.1).

In order to verify that iterative feature selection works as intended, we traced the terms that were filtered out at each iteration. First, we assumed that the top 2% of terms ranked by CHI in the ideal case experiment were good terms, and the others were noisy terms. Then we counted how many "good" terms were kept at each iteration. As can be seen from Figure 9, the "good" terms (the two bottom lines) are almost all kept, while most noisy terms (the two top lines) are filtered out after 10 iterations.

Figure 9. Feature tracking on Reuters & Web Directory. [Plot: number of good and noisy terms remaining vs. iteration for both datasets.]
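A small sketch of this tracing step, assuming the ideal-case CHI scores and the per-iteration lists of surviving term indices are available; the names and the handling of the 2% cut-off are illustrative.

```python
import numpy as np

def track_good_terms(ideal_scores, kept_per_iteration, good_frac=0.02):
    """Count how many of the ideal-case top-ranked ("good") terms survive
    each iteration of the iterative feature selection run."""
    n_good = max(1, int(good_frac * len(ideal_scores)))
    good = set(np.argsort(-np.asarray(ideal_scores))[:n_good])   # top 2% by CHI
    return [len(good.intersection(kept)) for kept in kept_per_iteration]
```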
5. Conclusions

In this paper, we first demonstrated that feature selection can improve text clustering efficiency and performance in the ideal case, in which features are selected based on class information. In the real case the class information is unknown, so only unsupervised feature selection can be exploited. In many cases, unsupervised feature selection is much worse than supervised feature selection: not only can fewer terms be removed, but the resulting clustering performance is also much worse. In order to utilize the efficient supervised methods, we proposed an iterative feature selection method that iteratively performs clustering and feature selection in a unified framework. Its performance is found to be close to the ideal case and much better than any unsupervised method. Another contribution of this work is a comparative study of several unsupervised feature selection methods, including DF, TS, En, and the newly proposed method TC. We found that TS and TC are better than DF and En. Since TS has high computation complexity and its parameter is difficult to tune, TC is the preferred unsupervised feature selection method for text clustering.

References

Aggrawal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proc. of SIGMOD'00 (pp. 70-81).

Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2001). On feature distributional clustering for text categorization. Proc. of SIGIR'01 (pp. 146-153).

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.

Bottou, L., & Bengio, Y. (1995). Convergence properties of the k-means algorithms. Advances in Neural Information Processing Systems, 7, 585-592.

Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: A cluster-based approach to browsing large document collections. Proc. of SIGIR'92 (pp. 318-329).

Dash, M., & Liu, H. (1997). Feature selection for classification. International Journal of Intelligent Data Analysis, 1(3), 131-156.

Dash, M., & Liu, H. (2000). Feature selection for clustering. Proc. of PAKDD-00 (pp. 110-121).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.

Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249-266.

Galavotti, L., Sebastiani, F., & Simi, M. (2000). Feature selection and negative evidence in automated text categorization. Proc. of KDD-00.

Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4-37.

Jolliffe, I. T. (1986). Principal Component Analysis. Springer Series in Statistics.

Koller, D., & Sahami, M. (1996). Toward optimal feature selection. Proc. of ICML'96 (pp. 284-292).

Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proc. of ICML-97 (pp. 170-178).

Kowalski, G. (1997). Information Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers.

Martin, H. C. L., Mario, A. T. F., & Jain, A. K. (2002). Feature saliency in unsupervised learning (Technical Report). Michigan State University.

Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley.

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. Proc. of SIGIR'00 (pp. 208-215).

Wilbur, J. W., & Sirotkin, K. (1992). The automatic identification of stop words. Journal of Information Science, 18, 45-55.

Wyse, N., Dubes, R., & Jain, A. K. (1980). A critical evaluation of intrinsic dimensionality algorithms. Pattern Recognition in Practice (pp. 415-425). North-Holland.

Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proc. of SIGIR'95 (pp. 256-263).

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proc. of ICML-97 (pp. 412-420).

Zamir, O., Etzioni, O., Madani, O., & Karp, R. M. (1997). Fast and intuitive clustering of web documents. Proc. of KDD-97 (pp. 287-290).
