
Feature Selection for Unsupervised Learning

Jyoti Ranjan Adhikary and M. Narasimha Murty

Computer Science and Automation,
Indian Institute of Science, Bangalore, India
{jyotiranjanadhikary,mnm}@csa.iisc.ernet.in

Abstract. In this paper, we present a methodology for identifying the best
features in a large feature space. In a high dimensional feature space,
nearest neighbor search is meaningless: both its quality and its performance
degrade. Since many data mining algorithms rely on nearest neighbor search,
instead of searching with all the features we need to select the relevant
ones. We propose feature selection using Non-negative Matrix Factorization
(NMF) and its application to nearest neighbor search.
A recent clustering algorithm based on Locally Consistent Concept
Factorization (LCCF) achieves better document clustering by exploiting the
local geometrical and discriminating structure of the data. Using our feature
selection method, we show a further improvement in clustering performance.

Keywords: Feature selection, Non-negative matrix factorization (NMF), Locally consistent concept factorization (LCCF).

1 Introduction
The necessity to extract useful and relevant information from large datasets has
led to an important need to develop computationally efficient text mining algo-
rithms. Feature selection and dimensionality reduction techniques have become
a real prerequisite for data mining applications. In both the approaches, the goal
is to select a low dimensional subset of the feature space that covers most of the
information of the original data.
Nearest neighbor search in high dimensional space is an interesting and im-
portant, but difficult, problem. Finding the closest matching object is important
for many applications. However, it has been shown in [1] that in high dimensional
space all pairs of points are almost equidistant from one another for a wide range
of data distributions and distance functions. In such cases, a nearest neighbor
search is unstable. We therefore need to reduce the original feature space to a
lower dimensional one using a feature selection approach, so that nearest neighbor
search becomes meaningful in the reduced space.
SVM and decision tree classifiers do not use nearest neighbor search; for them,
selecting relevant features yields better computational performance without
significant information loss. In our work we focus on questions
such as the relationship between feature selection techniques and the resulting
classification accuracy and clustering purity.

2 Background
2.1 Feature Selection
In text categorization, feature selection (FS) [9] is typically performed by sorting
features according to some weighting measure and then setting up a threshold
on the weights or simply specifying a number or percentage of highly scored
features to be retained. Features with lower weights are discarded as having less
significance. We use embedded feature selection based on a Support Vector Machine
(SVM) [2] and a k-nearest-neighbor classifier [5]. We compare the effect of
SVM-based feature selection [4], where class labels are known, with feature
selection using NMF, where class label information is not available.
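
As a simple illustration of this weighting-and-thresholding scheme, the following sketch keeps the p highest-scoring features; the scoring function itself (an SVM weight, an NMF basis weight, or any other measure) is left abstract, and the names are illustrative.

```python
# A minimal sketch of score-based feature selection: sort the features by a
# weighting measure and retain the top p.
import numpy as np

def top_p_features(scores, p):
    """Return the indices of the p highest-scoring features."""
    return np.argsort(scores)[::-1][:p]

# Example: keep the 3 best of 6 features scored by some weighting measure.
print(top_p_features(np.array([0.1, 0.7, 0.3, 0.9, 0.0, 0.5]), 3))  # [3 1 5]
```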

2.2 Clustering
Non-Negative Matrix Factorization. Non-negative Matrix Factorization
(NMF) [6] is a matrix factorization algorithm that focuses on the analysis of
data matrices whose elements are non-negative. NMF computes reduced rank
non-negative factors W ∈ R^{t×k} and H ∈ R^{k×d}, with k ≪ min{t, d}, that
approximate a given non-negative data matrix A ∈ R^{t×d} as A ≈ WH. The k
basis vectors {W_i}_{i=1}^{k} (the columns of W) can be thought of as the
“building blocks” of the data, and the coefficient vectors (the columns of H)
describe how strongly each building block is present in the corresponding data
vectors (the columns of A). The non-linear optimization problem underlying NMF
can generally be stated as
\[
\min_{W,H \ge 0} f(W,H) = \tfrac{1}{2}\,\|A - WH\|_F^2 \qquad (1)
\]
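
For reference, a minimal sketch of computing such a factorization with scikit-learn is shown below; the random non-negative matrix and the rank k are purely illustrative, and the paper does not prescribe a particular implementation.

```python
# Sketch: rank-k NMF of a non-negative t x d matrix A into W (t x k) and H (k x d).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((100, 40))            # t x d non-negative data matrix

k = 5                                # target rank, k << min(t, d)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(A)           # t x k basis matrix
H = model.components_                # k x d coefficient matrix

print("Frobenius reconstruction error:", np.linalg.norm(A - W @ H, "fro"))
```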

Concept Factorization. The optimization problem underlying Concept Factorization (CF) [7] can generally be stated as
\[
\min_{W,V \ge 0} f(W,V) = \tfrac{1}{2}\,\|X - XWV^T\|^2 \qquad (2)
\]
Each column of W corresponds to one concept, represented by a weighted combination
of the documents in the dataset. Each row of V is a new representation of the
original document vector, each element of which corresponds to one cluster/concept
defined by W.

Locally Consistent Concept Factorization. NMF and CF perform the factorization in the Euclidean space. Locally Consistent Concept Factorization (LCCF)
[3] is a document clustering model which adds a local geometry term to CF. Using
the local geometry, the new objective function is
\[
\min_{W,V \ge 0} f(W,V) = \tfrac{1}{2}\,\|X - XWV^T\|^2 + \lambda\,\mathrm{Tr}(V^T L V) \qquad (3)
\]

where λ ≥ 0 is the regularization parameter, S is defined in Eq. (6),
D_ii = Σ_j S_ij, and L = D − S. For non-negative data the multiplicative
updating rules minimizing the above objective function are given as

\[
w_{ij} \leftarrow w_{ij}\,\frac{(KV)_{ij}}{(KWV^TV)_{ij}} \qquad (4)
\]
\[
v_{ij} \leftarrow v_{ij}\,\frac{(KW + \lambda SV)_{ij}}{(VW^TKW + \lambda DV)_{ij}} \qquad (5)
\]
where K is the kernel matrix; the standard kernel matrix is X^T X. In addition to
the standard kernel matrix, the normalized-cut weighted form (NCW) suggested by
[8] can be used. The NCW weighting automatically re-weights the samples, which
balances the effect of the clustering algorithm and achieves better results
when dealing with unbalanced data.
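
A compact NumPy sketch of these multiplicative updates is given below. It assumes the standard kernel K = X^T X, random non-negative initialization, and illustrative values for λ and the number of iterations; the small ε is added only to avoid division by zero and is not part of Eqs. (4)-(5).

```python
# Sketch of the LCCF multiplicative updates of Eqs. (4)-(5).
import numpy as np

def lccf_updates(X, S, k, lam=100.0, n_iter=100, eps=1e-10, seed=0):
    """X: t x d term-document matrix, S: d x d edge weight matrix (Eq. 6)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    K = X.T @ X                              # standard kernel matrix
    D = np.diag(S.sum(axis=1))               # degree matrix, D_ii = sum_j S_ij
    W = rng.random((d, k))                   # d x k concept matrix
    V = rng.random((d, k))                   # d x k representation matrix
    for _ in range(n_iter):
        W *= (K @ V) / (K @ W @ V.T @ V + eps)                              # Eq. (4)
        V *= (K @ W + lam * S @ V) / (V @ W.T @ K @ W + lam * D @ V + eps)  # Eq. (5)
    return W, V
```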

3 Our Approach

3.1 Feature Selection Methods


Nearest neighbor search in high dimensional spaces affects the performance and
quality of text categorization; in such spaces, a nearest neighbor is said to be
unstable. We therefore need a method to reduce the dimensionality, and for that
purpose we use feature selection. We use Non-negative Matrix Factorization for
feature selection and compare it with feature selection using an SVM.

SVM Feature Selection. Feature selection based on an SVM [4] can be used when
class labels are known. In the unsupervised setting no labels are available, so
we first obtain pseudo class labels by clustering the training samples into c
clusters with NMF. Using these cluster labels c_i we can then apply the SVM
feature selection method.
The SVM feature selection method selects features only from the support vectors.
These features are discriminative in nature, because the support vectors are what
separate the classes; features that are not present in the support vectors are
ignored. We therefore also need a feature selection method that selects features
from the entire feature space, for which we use the NMF feature selection approach.
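
The combination just described can be sketched as follows. This is a simplified stand-in that ranks features by the magnitude of the linear SVM weights learned on NMF pseudo-labels, rather than the exact support-vector-based selection of [4]; the cluster count c and feature count p are illustrative.

```python
# Sketch: SVM-based feature selection driven by NMF pseudo-labels.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC

def svm_feature_selection(A_docs, c=4, p=200):
    """A_docs: n_documents x n_features non-negative (e.g. tf-idf) matrix."""
    # Pseudo-labels: assign each document to its dominant NMF component.
    H = NMF(n_components=c, random_state=0).fit_transform(A_docs)
    pseudo_labels = H.argmax(axis=1)
    # Linear SVM on the pseudo-labels; rank features by their weight magnitude.
    svm = LinearSVC(C=1.0, max_iter=5000).fit(A_docs, pseudo_labels)
    scores = np.abs(svm.coef_).max(axis=0)   # one weight vector per class
    return np.argsort(scores)[::-1][:p]
```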

NMF Feature Selection. Feature selection using NMF considers all the features and
returns those that are actually responsible for classification. The NMF feature
selection procedure is given in Algorithm 1.

Algorithm 1. Feature Selection using NMF
Require: Training set A ∈ R^{t×d}, target number of features p
Ensure: Selected features
1: Apply NMF on A to obtain W, H
2: Let k be the number of t-dimensional basis vectors (columns) of W
3: From each basis vector w_i, select the p/k features with the highest weights
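
A minimal sketch of Algorithm 1 using scikit-learn's NMF follows; the rank k and the use of a set to merge duplicate indices are implementation choices, not prescribed by the algorithm.

```python
# Sketch of Algorithm 1: NMF-based feature selection.
import numpy as np
from sklearn.decomposition import NMF

def nmf_feature_selection(A, p, k=10):
    """A: t x d term-document matrix; return indices of about p selected features."""
    W = NMF(n_components=k, random_state=0).fit_transform(A)    # t x k basis matrix
    per_basis = max(1, p // k)                                  # p/k features per basis vector
    selected = set()
    for i in range(k):
        top = np.argsort(W[:, i])[::-1][:per_basis]             # highest-weight features in w_i
        selected.update(top.tolist())
    return sorted(selected)
```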

We can further filter features by combining NMF feature selection with SVM
feature selection, to obtain features that are both discriminative and
descriptive in nature.

3.2 Locally Consistent Concept Factorization with Selected Features
LCCF creates a weight matrix by using local geometric structure effectively
modeled through a nearest neighbor graph on a scatter of data points. The edge
weight matrix S is defined as follows:
\[
S_{ij} =
\begin{cases}
\dfrac{x_i^T x_j}{\|x_i\|\,\|x_j\|} & \text{if } x_i \in N_p(x_j) \\[4pt]
0 & \text{otherwise}
\end{cases}
\qquad (6)
\]

where N_p(x_j) denotes the set of p-nearest neighbors of x_j. Constructing a graph
with N vertices, where each vertex corresponds to a document, using nearest
neighbors in the full t-dimensional space may not give the actual weights. So we
first select features from the t-dimensional feature space and then define the
edge weight matrix S using nearest neighbors in the selected feature space, which
gives the required weight matrix. In summary, our data clustering algorithm is
described in Algorithm 2.
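
A sketch of this construction on the selected features is given below; the symmetrization at the end, which makes the neighborhood graph undirected, is a common convention and an assumption on our part rather than something stated in the text.

```python
# Sketch: edge weight matrix S of Eq. (6), built using only the selected features.
import numpy as np

def edge_weight_matrix(X, selected, p=5):
    """X: t x d term-document matrix (documents are columns); p: neighborhood size."""
    Xs = X[selected, :]                                   # keep only the selected features
    Xs = Xs / (np.linalg.norm(Xs, axis=0, keepdims=True) + 1e-12)
    cos = Xs.T @ Xs                                       # d x d cosine similarities
    d = cos.shape[0]
    S = np.zeros((d, d))
    for j in range(d):
        nbrs = [i for i in np.argsort(cos[:, j])[::-1] if i != j][:p]
        S[nbrs, j] = cos[nbrs, j]                         # weights to the p nearest neighbors
    return np.maximum(S, S.T)                             # symmetrize the graph
```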

4 Experimental Results
4.1 Data Set
Four data sets are used in our experiments. We remove features with document
frequency less than two, as well as stop words such as 'the' and 'and'. We use
the term-frequency vector to represent each document. Let W = {f_1, f_2, ..., f_t}
be the complete vocabulary set of the document corpus. We use the tf-idf weighting
method to account for the global frequency of each term. The term-frequency vector
A_i of document d_i is defined as

\[
A_i = [a_{1i}, a_{2i}, \ldots, a_{ti}]^T, \qquad
a_{ji} = tf_{ji} \cdot \log\!\left(\frac{n}{idf_j}\right)
\]

where tf_{ji}, idf_j and n denote the term frequency of word f_j ∈ W in document
d_i, the number of documents containing word f_j, and the total number of
documents, respectively. For classification we partition the datasets into 75%
training samples and 25% testing samples.
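
A sketch of this preprocessing with scikit-learn is given below; the tiny corpus and labels are placeholders, and scikit-learn's tf-idf uses a slightly different smoothing than the formula above.

```python
# Sketch: tf-idf representation (minimum document frequency 2, English stop
# words removed) followed by a 75/25 train/test split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

corpus = [
    "information retrieval and text mining",
    "text mining for document clustering",
    "nearest neighbor search in information retrieval",
    "document clustering with matrix factorization",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer(min_df=2, stop_words="english")
A = vectorizer.fit_transform(corpus)          # documents x terms, tf-idf weighted

A_train, A_test, y_train, y_test = train_test_split(
    A, labels, test_size=0.25, random_state=0)
```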

Algorithm 2. Locally Consistent Concept Factorization with selected features
1: Given a data set, construct the data matrix X in which column i represents the
feature vector of data point i.
2: Using Algorithm 1 select relevant features and using these features construct S and
D as given in Eq.(6)
3: Fixing V , update matrix W to decrease the quadratic form of Eq.(3) using Eq.(4).
4: Fixing W , update matrix V to decrease the quadratic form of Eq.(3) using Eq.(5).
5: Normalize W .
6: Repeat Step 3, 4 and 5 until the result converges.
7: Use matrix V to determine the cluster label of each data point. More precisely,
examine each row i of matrix V . Assign data point i to cluster x if

\[
x = \arg\max_{c}\, v_{ic} \qquad (7)
\]
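
Steps 5-7 of Algorithm 2 can be sketched as follows. Rescaling V together with the column normalization of W, so that the product XWV^T is unchanged, is a common convention and an assumption here rather than something spelled out in the algorithm.

```python
# Sketch of steps 5-7: normalize the columns of W and read cluster labels from V.
import numpy as np

def assign_clusters(W, V):
    norms = np.linalg.norm(W, axis=0, keepdims=True) + 1e-12
    W = W / norms                    # step 5: normalize the columns of W
    V = V * norms                    # rescale V so that X W V^T is unchanged
    labels = V.argmax(axis=1)        # Eq. (7): x = argmax_c v_ic
    return W, V, labels
```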

Table 1. Datasets

                   Classic4   20NG    TDT2    Reuters
No of documents    7094       18745   9394    8067
No of classes      4          20      30      30
No of features     5896       16333   12353   18832

We use the normalized mutual information (NMI) and purity (accuracy) of
clustering as our evaluation metrics [8]. The algorithms that we evaluate are
Gradient Descent with Constrained Least Squares (GNMF), Concept Factorization
(CF), and Locally Consistent Concept Factorization (LCCF).
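
For completeness, a sketch of the two measures is given below: purity is computed as the usual majority-class fraction per cluster (true labels are assumed to be non-negative integers so that np.bincount applies), and NMI is taken from scikit-learn.

```python
# Sketch of the clustering evaluation measures.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_labels):
    """Fraction of documents in the majority true class of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        correct += np.bincount(members).max()     # size of the majority class
    return correct / len(true_labels)

# NMI: normalized_mutual_info_score(true_labels, cluster_labels)
```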

4.2 Results
Feature Selection. From Fig. 1 we can see that as the dimensionality increases,
the performance of the kNN classifier decreases. In Fig. 1 we compare three
feature selection schemes. Feature selection using NMF always gives better
performance than feature selection using SVM, because the SVM feature selection
method selects features only from the support vectors, which are discriminative
in nature, and ignores features that are not present in the support vectors. The
NMF feature selection method, in contrast, selects the highest weighted features
from the entire feature space, which are descriptive in nature. By combining the
SVM and NMF feature selection methods we get features with both descriptive and
discriminative properties.

[Figure: accuracy of the kNN classifier on Classic4 versus the number of features (0 to 1200) for feature selection using SVM, NMF, and SVM+NMF.]

Fig. 1. Comparison of feature selection schemes on the Classic4 dataset

Classification

– kNN classifier

From Figs. 2 and 3 we can observe that nearest neighbor search is meaningful in
the range of 100-200 dimensions for the Classic4 dataset and 500-600 dimensions
for the 20Newsgroups dataset. The horizontal lines in Figs. 2 and 3 show the
accuracy of the classifier in the full feature space.

[Figures: classification accuracy of kNN with selected features versus the number of features, compared with kNN using the full feature set, on Classic4 (left) and 20NG (right).]

Fig. 2. kNN Classifier on Classic4        Fig. 3. kNN Classifier on 20NG

– SVM classifier

From Figs. 4 and 5 we can see that accuracy increases with the number of
features, but beyond a certain number of features it remains constant. The
horizontal lines in Figs. 4 and 5 show the accuracy of the classifier in the
full feature space. We observe that accuracy remains constant after 10-20% of
the features, so we need to select those useful features using feature
selection approaches.

[Figures: classification accuracy of SVM with selected features versus the number of features, compared with SVM using the full feature set, on Classic4 (left) and 20NG (right).]

Fig. 4. SVM Classifier on Classic4        Fig. 5. SVM Classifier on 20NG

Clustering. We show the performance of our proposed method, Locally Consistent
Concept Factorization with selected features (redLCCF), and compare it with
other clustering algorithms. Tables 2 and 3 show the purity and NMI in the full
feature space.
Table 2. Purity of clustering

          Reuters   TDT2
LCCF      0.3463    0.8416
redLCCF   0.3874    0.8610
CF        0.2944    0.7312
GNMF      0.2173    0.6992

Table 3. NMI of clustering

          Reuters   TDT2
LCCF      0.3796    0.8667
redLCCF   0.4024    0.8951
CF        0.3885    0.6763
GNMF      0.3049    0.5971

We observe that for fewer than four clusters GNMF performs well, but as the
number of clusters increases our redLCCF approach gives better performance.
Figs. 6 and 7 show the purity and the normalized mutual information versus the
number of clusters for the different algorithms on the TDT2 dataset. As can be
seen, our proposed redLCCF algorithm consistently performs better than all the
other algorithms.

[Figures: purity (left) and NMI (right) of clustering versus the number of clusters (5 to 30) on TDT2 for LCCF, redLCCF, CF, and GNMF.]

Fig. 6. Purity of clustering        Fig. 7. NMI of clustering



5 Conclusion

With effective and efficient feature selection methods we can obtain almost the
same quality of performance with less computational time. From our experimental
results we observe that only 10-20% of the original features are actually
responsible for the quality of performance. We use redLCCF, which is applicable
to both positive and negative data values, unlike NMF which works with positive
data only. Due to the unconstrained nature of the input data we can also use
kernel methods. Many tasks that use k-nearest neighbor search can be combined
with our efficient feature selection approach to obtain better performance.

References
1. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is "Nearest Neighbor"
Meaningful? In: Int. Conf. on Database Theory (1999)
2. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recogni-
tion. Data Mining and Knowledge Discovery 2, 121–167 (1998)
3. Cai, D., He, X.F., Han, J.W.: Locally Consistent Concept Factorization for Docu-
ment Clustering. IEEE Trans. on Knowl. and Data Eng. 23, 902–913 (2011)
4. Chapelle, O., Keerthi, S.: Multi-class Feature Selection with Support Vector Ma-
chines. In: Proceedings of the American Statistical Association (2008)
5. Cunningham, P., Delany, S.J.: K-nearest Neighbour Classifiers. Technical Report
(2007)
6. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Ad-
vances in Neural Information Processing Systems, vol. 13, MIT Press (2001)
7. Xu, W., Gong, Y.H.: Document Clustering by Concept Factorization. In: Proceed-
ings of the 27th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR 2004. ACM (2004)
8. Xu, W., Liu, X., Gong, Y.H.: Document Clustering Based on Non-negative Matrix
Factorization. In: Proceedings of the 26th Annual International ACM SIGIR Con-
ference on Research and Development in Informaion Retrieval, SIGIR 2003. ACM
(2003)
9. Yang, Y.M., Pedersen, J.O.: A Comparative Study on Feature Selection in Text
Categorization. Morgan Kaufmann Publishers (1997)
