\min_{W,B,E,F} \|W^T X - BE^T\|_F^2 + \lambda \|W\|_{2,1}
\quad \text{s.t.} \quad B^T B = I,\ E^T E = I,\ F = E,\ F \ge 0.   (3)

Relaxing the equality constraint F = E of problem (3) into a penalty term yields

\min_{W,B,E,F} \|W^T X - BE^T\|_F^2 + \lambda \|W\|_{2,1} + \gamma \|F - E\|_F^2
\quad \text{s.t.} \quad B^T B = I,\ E^T E = I,\ F \ge 0,   (4)

where \gamma > 0 is another parameter that controls the degree of equivalence between F and E.
This formulation has two advantages. First, each variable in problem (4) has a single constraint, and the objective function is convex in each variable when the others are fixed, so a simple iterative optimization algorithm applies, as described in the following subsection. Second, since each single constraint on E and B is used directly in the optimization algorithm, E as well as B remains orthogonal throughout the optimization steps. The orthogonality of E is usually more important than its nonnegativity for E to act as an encoding matrix, so this formulation provides a further advantage for clustering performance.
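For concreteness, the objective of (4) can be evaluated directly. Below is a minimal NumPy sketch under the paper's dimensions X ∈ R^{d×n}, W ∈ R^{d×m}, B ∈ R^{m×c}, and E, F ∈ R^{n×c}; the function name is our own, not from the paper.

```python
import numpy as np

def objective(X, W, B, E, F, lam, gamma):
    """Objective of problem (4):
    ||W^T X - B E^T||_F^2 + lam * ||W||_{2,1} + gamma * ||F - E||_F^2."""
    fit = np.linalg.norm(W.T @ X - B @ E.T, "fro") ** 2
    l21 = np.linalg.norm(W, axis=1).sum()  # l2,1 norm: sum of row-wise l2 norms
    reg = np.linalg.norm(F - E, "fro") ** 2
    return fit + lam * l21 + gamma * reg
```

Such a helper is convenient for checking the monotone decrease of the objective claimed at the end of the next subsection.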
4.2. An Iterative Algorithm

W update: W is minimized while fixing B, E, and F. The subproblem that only relates to W is

\min_{W} \|W^T X - BE^T\|_F^2 + \lambda \|W\|_{2,1}.   (5)

Similar to [20], setting the derivative of J(W, B, E, F) with respect to W to zero gives the closed-form update

W = (XX^T + \lambda D)^{-1} X E B^T,   (6)

where D is a diagonal matrix whose i-th diagonal element is 1/(2\|w^i\|_2) and w^i denotes the i-th row of W.
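As an illustration, the W update of (6) and the accompanying re-weighting of D (step 5 of Algorithm 2 below) might be sketched as follows; the function names and the small floor `eps`, which guards against zero rows of W (a case the paper does not discuss), are our own.

```python
import numpy as np

def update_W(X, E, B, D, lam):
    """Closed-form W update of Eq. (6): W = (X X^T + lam * D)^{-1} X E B^T."""
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(X @ X.T + lam * D, X @ E @ B.T)

def update_D(W, eps=1e-8):
    """Diagonal re-weighting matrix with i-th entry 1 / (2 ||w^i||_2)."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
```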
B update: B is minimized while fixing W, E, and F. The corresponding subproblem,

\min_{B} \|W^T X - BE^T\|_F^2 \quad \text{s.t.} \quad B^T B = I,   (7)

is an instance of the orthogonal Procrustes problem [26].

Proposition 1. The problem

\min_{\tilde{T}} \|P\tilde{T} - Q\|_F^2 \quad \text{s.t.} \quad \tilde{T}\tilde{T}^T = I   (8)

has an analytic solution

\tilde{T} = U I V^T,   (9)

where U and V are the left and right singular vectors of P^T Q computed by SVD, and I is the rectangular identity matrix of compatible size.

The solution of (7) is obtained by Proposition 1 with P = E, \tilde{T} = B^T, and Q = X^T W as

B = V_B I_{m,c} U_B^T,   (10)

where U_B and V_B are the left and right singular vectors of E^T X^T W computed by SVD, respectively.
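Both (10) and, later, (13) amount to projecting a matrix onto the set of matrices with orthonormal columns via a thin SVD. A small sketch under that reading (the helper name is ours; it reuses the NumPy import above):

```python
def nearest_orthonormal(M):
    """argmin_T ||T - M||_F  s.t.  T^T T = I, via the thin SVD M = U S V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# B update of Eq. (10): B is the nearest orthonormal matrix to W^T X E,
# the transpose of the matrix E^T X^T W factorized in the paper.
# B = nearest_orthonormal(W.T @ X @ E)
```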
E, F update: E and F are minimized while fixing B and W. E and F are updated successively in an inner iteration, each while the other is fixed. The subproblem that only relates to E is

\min_{E} \|BE^T - W^T X\|_F^2 + \gamma \|E - F\|_F^2 \quad \text{s.t.} \quad E^T E = I.   (11)

The subproblem (11) is rewritten as

\arg\min_{E:\,E^T E = I} \mathrm{tr}(EB^T BE^T - 2X^T WBE^T + X^T WW^T X) + \gamma\,\mathrm{tr}(E^T E - 2E^T F + F^T F)
  = \arg\min_{E:\,E^T E = I} -\mathrm{tr}\big((X^T WB + \gamma F)E^T\big)
  = \arg\min_{E:\,E^T E = I} \|E - (X^T WB + \gamma F)\|_F^2.   (12)

Then the solution of this subproblem is obtained by Proposition 1 with P = I, \tilde{T} = E^T, and Q = B^T W^T X + \gamma F^T as

E = V_E I_{n,c} U_E^T,   (13)

where U_E and V_E are the left and right singular vectors of B^T W^T X + \gamma F^T computed by SVD, respectively.
The subproblem that only relates to F is

\min_{F} \|F - E\|_F^2 \quad \text{s.t.} \quad F \ge 0.   (14)

The solution of the subproblem is easily obtained as

F = \frac{1}{2}(E + |E|).   (15)
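A sketch of one inner pass over the E and F updates, combining (13) and (15) with the `nearest_orthonormal` helper above. The function name and the fixed inner iteration count are our assumptions; the paper stops on a gradient criterion, cf. Algorithm 1 below.

```python
def update_E_F(X, W, B, F, gamma, n_inner=30):
    """Alternate the E update (13) and the F update (15) with W, B fixed."""
    for _ in range(n_inner):
        # Eq. (12)/(13): E is the nearest orthonormal matrix to X^T W B + gamma F.
        E = nearest_orthonormal(X.T @ W @ B + gamma * F)
        # Eq. (15): F keeps the nonnegative part of E.
        F = 0.5 * (E + np.abs(E))
    return E, F
```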
We summarize the E and F update rules of the proposed optimization algorithm in Algorithm 1. The overall proposed optimization algorithm is also presented in Algorithm 2. Following the update rules (6), (10), (13), and (15), the objective function monotonically decreases.

Algorithm 1: E, F update algorithm
  Input: F_t, W_t and B_t; Parameter: γ
  Initialization: s = 0 and F'_0 = F_t
  1 repeat
  2   Update E'_{s+1} = V_E I_{n,c} U_E^T by (13), where B_t^T W_t^T X + γ F'^T_s = U_E Σ_E V_E^T;
  3   Update F'_{s+1} = (E'_{s+1} + |E'_{s+1}|)/2 by (15);
  4   s = s + 1;
  5 until ‖∇J_{EF}(E'_s, F'_s)‖ ≤ ε or s ≥ S;
  Output: E_{t+1} = E'_s, F_{t+1} = F'_s

Algorithm 2: SOCFS
  Input: Data matrix X ∈ R^{d×n}; Parameters: λ, γ
  Initialization: t = 0, D_t = I, and B_t, E_t
  1 repeat
  2   Update E_{t+1} and F_{t+1} by Algorithm 1;
  3   Update W_{t+1} = (XX^T + λ D_t)^{-1} X E_{t+1} B_t^T by (6);
  4   Update B_{t+1} = V_B I_{m,c} U_B^T by (10), where E_{t+1}^T X^T W_{t+1} = U_B Σ_B V_B^T;
  5   Update the i-th diagonal element of the diagonal matrix D_{t+1} to 1/(2‖w^i_{t+1}‖_2);
  6   t = t + 1;
  7 until ‖∇J(W_t, B_t, E_t, F_t)‖ ≤ ε or t ≥ T;
  Output: Features corresponding to the largest values of ‖w^i_t‖, i = 1, ..., d, sorted in descending order.
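Putting the pieces together, a compact sketch of Algorithm 2 follows, reusing the helpers above. The random orthonormal initialization of B and E, the warm start for W, and the fixed iteration budget are our assumptions: the paper initializes D_t = I and some B_t, E_t without fixing a scheme, and stops on a gradient criterion.

```python
def socfs(X, c, lam, gamma, n_outer=100, k=100, seed=0):
    """Sketch of SOCFS (Algorithm 2); returns indices of k selected features."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    m = c  # the paper sets the projected dimension m to the number of clusters c
    B = nearest_orthonormal(rng.standard_normal((m, c)))
    E = nearest_orthonormal(rng.standard_normal((n, c)))
    F = 0.5 * (E + np.abs(E))
    D = np.eye(d)
    W = update_W(X, E, B, D, lam)  # warm start so Algorithm 1 has a W to use
    for _ in range(n_outer):
        E, F = update_E_F(X, W, B, F, gamma)   # Algorithm 1
        W = update_W(X, E, B, D, lam)          # Eq. (6)
        B = nearest_orthonormal(W.T @ X @ E)   # Eq. (10)
        D = update_D(W)                        # re-weighting for the l2,1 term
    scores = np.linalg.norm(W, axis=1)         # rank features by row norms of W
    return np.argsort(-scores)[:k]
```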
4.3. Convergence Analysis

We provide the convergence analysis of the proposed optimization algorithm of SOCFS, with all update rules, in the supplementary material. SOCFS is found to converge within 100 iterations. We therefore use a single convergence criterion in terms of the number of iterations, setting a maximum iteration value for all experiments in this paper.
5. Experiments

In this section, we evaluate the performance of the proposed method SOCFS. We follow the same experimental setups as the previous works [13, 3, 30, 18, 23].

5.1. Experimental Setup

The experiments were conducted on seven publicly available datasets: one cancer dataset (LUNG¹ [2]), one object image dataset (COIL20² [19]), one spoken letter recognition dataset (Isolet1² [7]), one handwritten digit dataset (USPS² [14]), and three face image datasets (YaleB² [8], UMIST³ [10], and AT&T⁴ [25]). Detailed information on the datasets is summarized in Table 1.

Table 1. Dataset Description
          # of Classes   # of Instances   # of Features
LUNG            5              203             3312
COIL20         20             1440             1024
Isolet1        26             1560              617
USPS           10             9258              256
YaleB          38             2414             1024
UMIST          20              575              644
AT&T           40              400              644

On each dataset, SOCFS is compared to the following six unsupervised feature selection methods as well as to the all-features case:

• Max Variance (MV) selects features corresponding to the largest variances.

• Laplacian Score (LS)⁵ [13] selects features corresponding to the largest Laplacian scores, which are computed to reflect the locality preserving power.

• Multi-Cluster Feature Selection (MCFS)⁵ [3] selects features by a two-step method of spectral regression with an l1 regularizer.

• Unsupervised Discriminative Feature Selection (UDFS)⁶ [30] selects features by a local discriminative score that reflects local structure information, with an l2,1 regularizer.

• Nonnegative Discriminative Feature Selection (NDFS)⁶ [18] selects features by a joint framework of nonnegative spectral analysis and l2,1 regularized regression.

• Robust Unsupervised Feature Selection (RUFS)⁷ [23] selects features by a joint framework of l2,1 norm-based nonnegative matrix factorization with local learning and l2,1 regularized regression.

¹ https://sites.google.com/site/feipingnie/publications
² http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
³ http://www.sheffield.ac.uk/eee/research/iel/research/face
⁴ http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
⁵ http://www.cad.zju.edu.cn/home/dengcai/Data/MCFS.html
⁶ http://www.cs.cmu.edu/~yiyang/Publications.html
⁷ https://sites.google.com/site/qianmingjie/home/publications
According to the experimental setups in the previous works, clustering accuracy (ACC) and normalized mutual information (NMI) [29] are measured to evaluate the clustering results of each feature set selected by the various methods. For LS, MCFS, UDFS, NDFS, and RUFS, we fix the neighborhood size parameter k = 5, following the previous works. Since SOCFS does not consider the local structure information of data points, no neighborhood parameter needs to be set for SOCFS. We set the dimension of the projected space m to the number of latent clusters c.

To fairly compare the unsupervised feature selection methods, we tune all parameters for each method by a grid-search strategy over {10^{-6}, 10^{-4}, ..., 10^{4}, 10^{6}}. The numbers of selected features are set to {50, 100, 150, ..., 300} for the first six datasets. For the USPS dataset, we set the numbers of features to {50, 80, 110, ..., 200} because of its smaller total number of features. Clustering experiments are conducted by K-means on each feature set selected for the different datasets and methods. To report the best clustering result of each method, different parameters may be used for different datasets and methods. Since the results of K-means depend on the initialization of the clustering seeds, we repeat each experiment 20 times with random seed initialization, and the mean and standard deviation of ACC and NMI are reported. We also repeat the experiments 20 times with random initialization of the variables for the methods that require it, such as NDFS, RUFS, and SOCFS. Methods with better performance will have larger means and smaller standard deviations of ACC and NMI. This experimental setup also serves to verify the reliability of the methods' performance on practical unlabeled data. All the results in the figures and the tables are produced with the authors' published source codes.
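As an illustration of this protocol, the sketch below evaluates one selected feature set with repeated K-means, computing NMI with scikit-learn and ACC by the usual Hungarian matching of cluster ids to classes. The helper names are ours; the paper does not specify its implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Best accuracy over one-to-one matchings of cluster ids to class labels.

    y_true and y_pred are assumed to be integer arrays coded from 0."""
    n_cls = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n_cls, n_cls), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize total agreement
    return cost[rows, cols].sum() / len(y_true)

def evaluate(X_sel, y, c, n_runs=20):
    """Repeat K-means on the selected features (samples x features) 20 times."""
    accs, nmis = [], []
    for seed in range(n_runs):
        pred = KMeans(n_clusters=c, n_init=1, random_state=seed).fit_predict(X_sel)
        accs.append(clustering_acc(y, pred))
        nmis.append(normalized_mutual_info_score(y, pred))
    return np.mean(accs), np.std(accs), np.mean(nmis), np.std(nmis)
```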
5.2. Experimental Results

The experimental results are shown in Figures 2-3 and Tables 2-3. We notice that feature selection can effectively improve the clustering results on all datasets while reducing the redundant features. Thus, it is desirable to use selected features for learning tasks.

We evaluate the clustering results versus the number of selected features in Figures 2-3. We notice that, even with a small number of features, particularly fewer than 100, SOCFS selects the most discriminative features among the seven methods. Tables 2-3 also show that SOCFS achieves the best clustering results on all datasets, which indicates that SOCFS selects the most discriminative features under the multi-label condition.

We have the following further observations from the figures and the tables. First, the methods MCFS, UDFS, NDFS, RUFS, and SOCFS, which select features simultaneously, achieve better results than the others, which select features one by one. Since the projection matrices of those methods are determined all at once during the iterations, the corresponding features are selected so as to prevent high correlation. Second, the regularized regression-based methods MCFS, NDFS, RUFS, and SOCFS show relatively better results. MCFS, NDFS, and RUFS use nonlinear local structure learning, which can help the regression fit the data more accurately. Although SOCFS does not use local structure information explicitly, it estimates latent cluster centers by orthogonal basis clustering of the target matrix, so the regression can yield better clustering results.
[Figures 2-3: mean ACC (%) (Figure 2) and mean NMI (%) (Figure 3) of MV, LS, MCFS, UDFS, NDFS, RUFS, and SOCFS versus the number of selected features (50-300; 50-200 for USPS) on the seven datasets.]

5.3. Sensitivity of Parameters

To study the sensitivity of the parameters, clustering results under varying parameters and numbers of selected features are measured. Since we cannot tune the parameters according to the measures ACC and NMI w.r.t. ground-truth labels on practical unlabeled data, parameter sensitivity is a critical issue. Figures 4-5 tell us that the clustering results of SOCFS are 1) not highly sensitive to λ and γ within wide ranges and 2) slightly more sensitive to λ than to γ. This is because λ controls the sparsity of W and the proper value of λ is data dependent. Since γ is a relatively insensitive parameter, it takes less time to tune γ.

Therefore, SOCFS is not affected much by varying the values of the parameters, which means that most λ and γ values yield satisfactory performance. In particular, we found that almost equally prominent performance is obtained by setting λ between 1 and 100. Furthermore, we can fix γ = λ to simplify the tuning process for efficiency in practice. This is shown empirically in the experiments and in Figures 4-5.

[Figures 4-5: mean ACC (%) and mean NMI (%) of SOCFS under varying γ and λ over {10^{-6}, 10^{-4}, ..., 10^{6}} and varying numbers of selected features on the seven datasets.]
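Under the γ = λ simplification, the grid search of Section 5.1 reduces to a one-dimensional sweep. A short sketch reusing the hypothetical `socfs` and `evaluate` helpers above, where X is the d×n data matrix, y holds integer class labels, and c is the number of clusters:

```python
# Sweep the paper's grid {1e-6, 1e-4, ..., 1e6} with gamma tied to lambda.
for lam in [10.0 ** p for p in range(-6, 7, 2)]:
    idx = socfs(X, c, lam=lam, gamma=lam, k=150)
    acc_m, acc_s, nmi_m, nmi_s = evaluate(X[idx, :].T, y, c)
    print(f"lambda = gamma = {lam:g}: "
          f"ACC {acc_m:.3f} ± {acc_s:.3f}, NMI {nmi_m:.3f} ± {nmi_s:.3f}")
```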
We now discuss the advantages of SOCFS over the most recent works RUFS and NDFS. SOCFS differs in the following respects. First, SOCFS has fewer parameters to tune than NDFS and RUFS. Moreover, by setting γ = λ, SOCFS has only one parameter λ yet yields better performance, so SOCFS is more suitable for practical use. Second, SOCFS has a mixed-sign target matrix, which suits practical mixed-sign data, in contrast to the nonnegative target matrices (pseudo-label indicators) of NDFS and RUFS. With those methods, if mixed-sign data is provided, the projected data points must be adjusted into a nonnegative region, which reduces the possibility of accurate projection and makes the selected features less discriminative. This indicates that the nonnegativity constraint on the target matrix is too strict to cover mixed-sign data. SOCFS instead has a mixed-sign target matrix for the purpose of estimating latent cluster centers, so it has more flexibility. Third, since both NDFS and RUFS build on the nonnegative matrix factorization (NMF) formulation, their approximation of the orthogonality constraint (‖F^T F − I‖_F^2) cannot guarantee full orthogonality, so their target matrices cannot act as label indicator matrices, as mentioned above. SOCFS, in contrast, always guarantees the orthogonality of the encoding matrix E by the constraint, so the target matrix T effectively determines the latent cluster centers of the projected data points.
Table 2. ACC (% ± std) of various unsupervised feature selection methods on several datasets. The best results are in boldface.
              LUNG        AT&T        YaleB       UMIST       COIL20      Isolet1     USPS
All Features  70.0 ± 8.9  60.9 ± 3.4   9.6 ± 0.6  42.1 ± 2.3  59.4 ± 4.9  57.9 ± 3.6  65.7 ± 2.4
MV            66.4 ± 8.9  61.1 ± 3.4   9.1 ± 0.4  46.7 ± 2.8  56.7 ± 4.6  58.5 ± 3.3  65.3 ± 4.4
LS [13]       60.1 ± 9.5  61.3 ± 3.5   8.8 ± 0.4  45.1 ± 3.4  56.3 ± 4.8  55.6 ± 3.3  62.5 ± 4.4
MCFS [3]      64.3 ± 7.9  61.2 ± 3.7   9.8 ± 0.7  45.1 ± 3.2  58.7 ± 5.3  60.7 ± 4.0  65.2 ± 4.2
UDFS [30]     66.2 ± 7.8  61.7 ± 3.8  11.5 ± 0.7  44.9 ± 2.7  58.9 ± 5.1  57.9 ± 3.0  62.4 ± 3.1
NDFS [18]     66.9 ± 9.1  61.4 ± 3.5  11.8 ± 0.6  47.8 ± 3.1  59.2 ± 5.0  64.6 ± 4.4  64.9 ± 3.1
RUFS [23]     68.4 ± 8.3  61.6 ± 3.2  14.5 ± 0.9  46.4 ± 3.0  59.9 ± 4.9  62.8 ± 3.8  65.8 ± 3.1
SOCFS         74.0 ± 8.9  62.7 ± 3.1  15.3 ± 0.7  49.4 ± 3.2  60.4 ± 4.7  64.9 ± 4.4  66.1 ± 2.0

Table 3. NMI (% ± std) of various unsupervised feature selection methods on several datasets. The best results are in boldface.
              LUNG        AT&T        YaleB       UMIST       COIL20      Isolet1     USPS
All Features  51.7 ± 5.4  80.5 ± 1.8  13.0 ± 0.8  63.7 ± 2.5  74.2 ± 2.4  74.2 ± 1.7  60.9 ± 0.8
MV            49.0 ± 5.3  80.2 ± 1.7  12.8 ± 0.5  64.4 ± 2.2  69.8 ± 2.1  74.2 ± 1.4  60.8 ± 1.8
LS [13]       42.9 ± 5.0  80.4 ± 1.8  13.7 ± 0.4  65.1 ± 1.9  70.1 ± 2.3  71.7 ± 1.4  59.5 ± 2.1
MCFS [3]      45.6 ± 4.5  80.2 ± 1.8  14.2 ± 1.0  67.3 ± 2.6  73.7 ± 2.5  74.4 ± 1.9  61.2 ± 1.8
UDFS [30]     49.6 ± 5.1  80.6 ± 1.8  17.9 ± 0.9  64.4 ± 1.4  73.2 ± 2.5  73.6 ± 1.6  56.8 ± 1.4
NDFS [18]     48.3 ± 5.2  80.3 ± 1.8  18.8 ± 0.7  65.1 ± 2.0  73.7 ± 2.1  77.7 ± 2.0  60.7 ± 1.3
RUFS [23]     49.1 ± 5.1  80.9 ± 1.7  23.1 ± 0.7  64.5 ± 2.2  73.8 ± 2.4  76.8 ± 1.9  61.5 ± 1.4
SOCFS         55.7 ± 6.2  81.1 ± 1.6  25.1 ± 0.8  67.9 ± 2.1  74.8 ± 2.3  78.3 ± 1.9  61.6 ± 0.8
6. Conclusion

We have proposed a new unsupervised feature selection method in which feature selection and clustering are combined in a single term of the objective function. In this new formulation, a novel type of target matrix has also been proposed that serves as latent cluster centers by performing orthogonal basis clustering on the projected data points. The orthogonal basis clustering provides latent projected cluster information, which yields a more accurate selection of discriminative features. The formulation turns out to be minimizable by the proposed simple optimization procedure, without resorting to other complex optimization algorithms. The effectiveness of the proposed method has been carefully demonstrated by extensive experiments on several real-world datasets.

Acknowledgments: This work was supported in part by the Technology Innovation Program, 10045252, Development of robot task intelligence technology, funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea), and was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (2014-003140) and (MSIP) (2010-0028680).

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.
[2] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. National Academy of Sciences (NAS), 98(24):13790–13795, 2001.
[3] D. Cai, C. Zhang, and X. He. Unsupervised feature selection for multi-cluster data. In KDD, pages 333–342, 2010.
[4] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, volume 5, pages 606–610, 2005.
[5] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In KDD, pages 126–135, 2006.
[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd edition, 2001.
[7] M. A. Fanty and R. Cole. Spoken letter recognition. In NIPS, page 220, 1990.
[8] A. S. Georghiades, P. N. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(6):643–660, 2001.
[9] G. H. Golub and C. F. Van Loan. Matrix Computations. 1996.
[10] D. B. Graham and N. M. Allinson. Characterising virtual eigensignatures for general purpose face recognition. In Face Recognition, pages 446–456. 1998.
[11] Q. Gu and J. Zhou. Local learning regularized nonnegative matrix factorization. In IJCAI, 2009.
[12] R. He, T. Tan, L. Wang, and W.-S. Zheng. l2,1 regularized correntropy for robust feature selection. In CVPR, pages 2504–2511, 2012.
[13] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In NIPS, pages 507–514, 2005.
[14] J. J. Hull. A database for handwritten text recognition research. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16(5):550–554, 1994.
[15] K. Kira and L. A. Rendell. A practical approach to feature selection. In International Workshop on Machine Learning (ML), pages 249–256, 1992.
[16] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2001.
[17] Z. Li, X. Wu, and H. Peng. Nonnegative matrix factorization on orthogonal subspace. Pattern Recognition Letters, 31(9):905–911, 2010.
[18] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu. Unsupervised feature selection using nonnegative spectral analysis. In AAAI, pages 1026–1032, 2012.
[19] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-20). Technical report, CUCS-005-96, 1996.
[20] F. Nie, H. Huang, X. Cai, and C. H. Ding. Efficient and robust feature selection via joint l2,1-norms minimization. In NIPS, pages 1813–1821, 2010.
[21] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan. Trace ratio criterion for feature selection. In AAAI, pages 671–676, 2008.
[22] H. Park. A parallel algorithm for the unbalanced orthogonal Procrustes problem. Parallel Computing, 17(8):913–923, 1991.
[23] M. Qian and C. Zhai. Robust unsupervised feature selection. In IJCAI, pages 1621–1627, 2013.
[24] L. E. Raileanu and K. Stoffel. Theoretical comparison between the Gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93, 2004.
[25] F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In IEEE Workshop on Applications of Computer Vision, pages 138–142, 1994.
[26] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
[27] T. Viklands. Algorithms for the Weighted Orthogonal Procrustes Problem and Other Least Squares Problems. PhD thesis, Umeå University, 2006.
[28] D. Wang, F. Nie, and H. Huang. Unsupervised feature selection via unified trace ratio formulation and k-means clustering (TRACK). In ECML PKDD, pages 306–321, 2014.
[29] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267–273, 2003.
[30] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou. l2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, pages 1589–1594, 2011.
[31] J. Yoo and S. Choi. Orthogonal nonnegative matrix factorization: Multiplicative updates on Stiefel manifolds. In Intelligent Data Engineering and Automated Learning, pages 140–147, 2008.
[32] Z. Zhao and H. Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, pages 1151–1157, 2007.