1 Introduction
The necessity to extract useful and relevant information from large datasets has
led to an important need for computationally efficient text mining algorithms.
Feature selection and dimensionality reduction techniques have become a real
prerequisite for data mining applications. In both approaches, the goal is to
select a low-dimensional subset of the feature space that covers most of the
information in the original data.
Nearest neighbor search in high dimensional space is an interesting and im-
portant, but difficult, problem. Finding the closest matching object is important
for many applications. However, it has been shown in [1] that in high dimensional
space, all pairs of points are almost equidistant from one another for a wide
range of data distributions and distance functions. In such cases, a nearest
neighbor search is unstable. We therefore need to reduce the original feature
space to a lower-dimensional feature space, using some feature selection
approach, so that nearest neighbor search in the new space is meaningful.
SVM and decision tree classifiers do not use nearest neighbor search. We can
select some relevant features so that we get better computational performance
without significant information loss. In our work we focus on the relationship
between feature selection techniques and the resulting classification accuracy
and clustering purity.

T. Huang et al. (Eds.): ICONIP 2012, Part III, LNCS 7665, pp. 382–389, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Feature Selection for Unsupervised Learning 383
2 Background
2.1 Feature Selection
In text categorization, feature selection (FS) [9] is typically performed by sorting
features according to some weighting measure and then setting a threshold
on the weights, or simply specifying a number or percentage of highly scored
features to be retained. Features with lower weights are discarded as having less
significance. We use embedded feature selection with a Support Vector Machine
(SVM) [2] and a k-nearest-neighbor classifier [5]. We compare the effect of feature
selection based on an SVM [4], where class labels are known, with feature selection
using NMF, where class label information is not available.
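As a small illustration of this threshold-style selection (the weights below are hypothetical, and the function name is ours), keeping the k highest-scored features can be sketched as:

```python
# Sketch: rank features by a weighting measure and keep the top-k.
# The weights stand in for any scoring measure (e.g. information gain).

def select_top_features(weights, k):
    """Return the (sorted) indices of the k highest-weighted features."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])

# Hypothetical per-feature weights for a five-term vocabulary.
weights = [0.1, 0.9, 0.4, 0.7, 0.05]
print(select_top_features(weights, 3))  # → [1, 2, 3]
```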
2.2 Clustering
Non-Negative Matrix Factorization. Non-negative Matrix Factorization
(NMF) [6] is a matrix factorization algorithm that focuses on the analysis of
data matrices whose elements are non-negative. The NMF consists of reduced-
rank non-negative factors W ∈ R^{t×k} and H ∈ R^{k×d}, with k ≪ min{t, d}, that
approximate a given non-negative data matrix A ∈ R^{t×d} as A ≈ WH. The k
basis vectors {Wi}, i = 1, ..., k, can be thought of as the "building blocks" of the
data, and the (d-dimensional) coefficient vector Hi describes how strongly each
building block is present in the measurement vector Ai. The non-linear optimization
problem underlying NMF can generally be stated as

    min_{W,H≥0} f(W, H) = ½ ‖A − WH‖²_F    (1)
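A minimal NumPy sketch of minimizing Eq. (1) with the multiplicative updates of Lee and Seung [6]; the random initialization, fixed iteration count, and small eps guard against division by zero are our choices, not prescribed by the paper:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """Approximate the non-negative matrix A (t x d) as W @ H,
    with W (t x k) and H (k x d), using multiplicative updates
    for the Frobenius-norm objective of Eq. (1)."""
    rng = np.random.default_rng(0)
    t, d = A.shape
    W = rng.random((t, k))
    H = rng.random((k, d))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update coefficients
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H

# Toy non-negative data matrix of rank 2, so k = 2 should fit it well.
A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 0.0, 3.0]])
W, H = nmf(A, k=2)
print(np.linalg.norm(A - W @ H))  # reconstruction error, should be small
```

The updates never change the sign of an entry, so non-negativity of W and H is preserved throughout.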
Locally Consistent Concept Factorization. Locally Consistent Concept
Factorization (LCCF) [3] is a document clustering model which augments Concept
Factorization (CF) [7] with a local geometry term. Using local geometry, the new
objective function is

    min_{W,V≥0} f(W, V) = ½ ‖X − XWV^T‖² + λ Tr(V^T LV)    (3)

where λ ≥ 0 is the regularization parameter, S is defined in Eq. 6, Dii = Σj Sij,
and L = D − S. For non-negative data the multiplicative updating rules minimizing
the above objective function are given as

    wij ← wij (KV)ij / (KWV^T V)ij    (4)
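The graph Laplacian L = D − S in Eq. (3) can be computed directly from the similarity matrix S; a small sketch with hypothetical similarity values:

```python
import numpy as np

def graph_laplacian(S):
    """L = D - S, where D is diagonal with D_ii = sum_j S_ij."""
    D = np.diag(S.sum(axis=1))
    return D - S

# Hypothetical symmetric document-similarity matrix for three documents.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
L = graph_laplacian(S)
print(L)
```

By construction each row of L sums to zero, and L is symmetric whenever S is, which is what makes the regularizer Tr(V^T LV) penalize differences between strongly similar documents.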
3 Our Approach
SVM Feature Selection. We can use feature selection based on SVM [4] when
class labels are known. So we need some feature selection method which gives
selected features without class information. So we use NMF to manually get class
label by clustering training samples into c clusters. Using these cluster labels ci
we can use SVM feature selection method.
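A sketch of deriving such pseudo class labels from a factorization's coefficient matrix, assigning each document to its strongest factor as in Eq. (7) later in the paper; the matrix V below is hypothetical:

```python
import numpy as np

# Hypothetical coefficient matrix V: one row per document,
# one column per cluster (4 documents, c = 2 clusters).
V = np.array([[0.9, 0.1],
              [0.7, 0.2],
              [0.1, 0.8],
              [0.0, 0.6]])

# Assign each document to the cluster with the largest coefficient,
# i.e. x = argmax_c v_ic. These labels can then be fed to SVM
# feature selection in place of true class labels.
labels = V.argmax(axis=1)
print(labels)  # → [0 0 1 1]
```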
The SVM feature selection method selects features only from the support vectors.
These features are discriminative in nature, because it is the support vectors
that separate the classes; features not present in the support vectors are
ignored. We therefore also need a feature selection method that draws features
from the entire feature space, for which we use the NMF feature selection
approach.
NMF Feature Selection. Feature selection using NMF considers all features
and returns those which are actually responsible for classification. NMF feature
selection is given in Algorithm 1.
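Algorithm 1 is not reproduced here; the following is only a plausible sketch of NMF-based feature selection, scoring each feature by its largest weight across the k basis vectors of W. This particular scoring rule, and all values below, are our assumptions for illustration:

```python
import numpy as np

def nmf_feature_scores(W):
    """Assumed scoring rule: score each of the t features by its
    largest weight across the k basis vectors (columns of W)."""
    return W.max(axis=1)

# Hypothetical basis matrix W (t = 5 features, k = 2 basis vectors).
W = np.array([[0.90, 0.00],
              [0.10, 0.10],
              [0.00, 0.80],
              [0.40, 0.50],
              [0.05, 0.02]])
scores = nmf_feature_scores(W)
selected = np.argsort(scores)[::-1][:3]   # keep the 3 highest-scoring features
print(sorted(selected.tolist()))  # → [0, 2, 3]
```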
We can further filter features using both NMF feature selection and SVM
feature selection to obtain features that are both discriminative and descriptive
in nature.
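The text does not spell out how the two selected sets are combined; one simple possibility, shown here purely as an illustration with hypothetical index sets, is to intersect them:

```python
# Hypothetical feature-index sets produced by the two selectors.
svm_selected = {0, 2, 3, 7}   # discriminative (from support vectors)
nmf_selected = {0, 1, 3, 5}   # descriptive (from the whole feature space)

# Keep only features chosen by both methods.
combined = sorted(svm_selected & nmf_selected)
print(combined)  # → [0, 3]
```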
4 Experimental Results
4.1 Data Set
Four data sets are used in our experiments. We remove features with document
frequency less than two, and stop words such as 'the', 'and', etc. We use
the term-frequency vector to represent each document. Let W = {f1, f2, ..., ft}
be the complete vocabulary set of the document corpus. We use the tf-idf
weighting method to account for the global frequency of each term. The
term-frequency vector Ai of document di is defined as

    Ai = [t1i · log(n/idf1), t2i · log(n/idf2), ..., tti · log(n/idft)]^T

where tji, idfj, and n denote the term frequency of word fj ∈ W in document di,
the number of documents containing word fj, and the total number of documents,
respectively. For classification we partition the datasets into 75% training and
25% testing samples.
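The tf-idf representation described above can be sketched on a toy corpus (the corpus and helper names are ours; idf_j here is the document frequency, as in the definition above):

```python
import math

# Toy corpus: each document is a list of terms.
docs = [["data", "mining", "data"],
        ["text", "mining"],
        ["data", "text", "text"]]

vocab = sorted({w for d in docs for w in d})
n = len(docs)
# idf_j in the paper's notation: number of documents containing word f_j.
df = {w: sum(w in d for d in docs) for w in vocab}

def tfidf_vector(doc):
    """Term-frequency vector weighted by log(n / df), per the definition above."""
    return [doc.count(w) * math.log(n / df[w]) for w in vocab]

print(tfidf_vector(docs[0]))
```

Terms absent from a document get weight zero, and terms occurring in every document get idf weight log(1) = 0, so only distinguishing terms contribute.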
    x = argmax_c (vic)    (7)
Table 1. Datasets
4.2 Results
Feature Selection. From Fig. 1 we can see that the performance of the kNN
classifier decreases as the dimensionality increases. In Fig. 1 we compare three
feature selection schemes. Feature selection using NMF always performs better
than feature selection using SVM, because the SVM method selects features only
from the support vectors, which are discriminative in nature, and ignores features
not present in the support vectors. The NMF method, in contrast, selects the
highest-weighted features from the entire feature space, and these are descriptive
in nature. By combining the SVM and NMF feature selection methods we obtain
features with both descriptive and discriminative properties.
[Fig. 1: kNN classification accuracy (0.68–0.88) versus number of features (0–1200).]
Classification
– kNN classifier
From Figs. 2 and 3 we can observe that nearest neighbor search is meaningful in
the range of 100–200 dimensions for the Classic4 dataset and 500–600 dimensions
for the 20Newsgroup dataset. The horizontal lines in Figs. 2 and 3 show the
accuracy of the classifier in the full feature space.
[Figs. 2 and 3: kNN classification accuracy versus number of features, for kNN with selected features and kNN with full features.]
– SVM classifier
From Figs. 4 and 5 we can see an increase in accuracy with an increasing number
of features, but beyond a certain number of features it remains constant. The
horizontal lines in Figs. 4 and 5 show the accuracy of the classifier in the full
feature space. We observe that beyond 10–20% of the features, accuracy remains
constant, so we need to select those useful features using feature selection
approaches.
388 J.R. Adhikary and M. Narasimha Murty
[Figs. 4 and 5: SVM classification accuracy versus number of features, for SVM with selected features and SVM with full features.]
We observe that for fewer than four clusters GNMF performs well, but as the
number of clusters increases our redLCCF approach gives better performance.
Figs. 6 and 7 show the purity and normalized mutual information versus the
number of clusters for the different algorithms on the TDT2 dataset. As can be
seen, our proposed redLCCF algorithm consistently performs better than all the
other algorithms.
[Figs. 6 and 7: clustering purity and NMI (0.4–1.0) versus number of clusters (5–30) for LCCF, redLCCF, CF, and GNMF.]
5 Conclusion
With effective and efficient feature selection methods we can obtain almost the
same quality of performance with less computational time. From our experimental
results we observe that only 10–20% of the original features are actually
responsible for the quality of performance. We use redLCCF, which is applicable
to both positive and negative data values, unlike NMF, which works with positive
data values only. Due to the unconstrained nature of the input data we can also
use kernel methods. Many tasks that use k-nearest neighbor search can be
combined with our efficient feature selection approach to obtain better
performance.
References
1. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is "Nearest Neighbor"
Meaningful? In: Int. Conf. on Database Theory (1999)
2. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recogni-
tion. Data Mining and Knowledge Discovery 2, 121–167 (1998)
3. Cai, D., He, X.F., Han, J.W.: Locally Consistent Concept Factorization for Docu-
ment Clustering. IEEE Trans. on Knowl. and Data Eng. 23, 902–913 (2011)
4. Chapelle, O., Keerthi, S.: Multi-class Feature Selection with Support Vector Ma-
chines. In: Proceedings of the American Statistical Association (2008)
5. Cunningham, P., Delany, S.J.: K-nearest Neighbour Classifiers. Technical Report
(2007)
6. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Ad-
vances in Neural Information Processing Systems, vol. 13. MIT Press (2001)
7. Xu, W., Gong, Y.H.: Document Clustering by Concept Factorization. In: Proceed-
ings of the 27th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR 2004. ACM (2004)
8. Xu, W., Liu, X., Gong, Y.H.: Document Clustering Based on Non-negative Matrix
Factorization. In: Proceedings of the 26th Annual International ACM SIGIR Con-
ference on Research and Development in Information Retrieval, SIGIR 2003. ACM
(2003)
9. Yang, Y.M., Pedersen, J.O.: A Comparative Study on Feature Selection in Text
Categorization. Morgan Kaufmann Publishers (1997)