Graph-dual Laplacian principal component analysis
J. He et al.
https://doi.org/10.1007/s12652-018-1096-5
ORIGINAL RESEARCH
Abstract
Principal component analysis (PCA) is the most widely used method for linear dimensionality reduction, due to its effectiveness in exploring low-dimensional global geometric structures embedded in data. To preserve the intrinsic local geometrical structures of data, graph-Laplacian PCA (gLPCA) incorporates Laplacian embedding into the PCA framework for learning local similarities between data points, which leads to significant performance improvements in clustering and classification. Some recent works showed that not only does high dimensional data reside on a low-dimensional manifold in the data space, but the features also lie on a manifold in the feature space. However, both PCA and gLPCA overlook the local geometric information contained in the feature space. By considering the duality between the data manifold and the feature manifold, graph-dual Laplacian PCA (gDLPCA) is proposed, which incorporates data graph regularization and feature graph regularization into the PCA framework to exploit the local geometric structures of the data manifold and the feature manifold simultaneously. The experimental results on four benchmark data sets have confirmed its effectiveness and suggested that gDLPCA outperforms gLPCA on classification and clustering tasks.

Keywords Principal component analysis · Graph-Laplacian PCA · Dual graph · Feature manifold · Graph-Dual Laplacian PCA
1 Introduction
[…]

Another limitation of PCA is that the interpretation of the principal components may be difficult. Although the dimensions identified by PCA are uncorrelated variables constructed as linear combinations of the original features, they do not have meaningful physical interpretations. Many variants of PCA have been proposed to enhance the interpretability of the principal components extracted by classical PCA. Non-negative matrix factorization (NMF) (Lee 1999) was introduced to give a meaningful approximation of a non-negative data matrix by non-negative low-rank factorization. In order to enhance the interpretability of PCA, sparse PCA (SPCA) (Zou et al. 2006) extracts principal components of the given data with sparse non-zero loadings.

The third limitation of PCA is that it is sensitive to grossly corrupted entries of the data matrix. Since the quadratic term in the classical PCA formulation is sensitive to outliers, many L1-norm based PCA (Brooks et al. 2013; Lin et al. 2014; Wang 2012) and Lp-norm based PCA (Kwak 2014; Liang et al. 2013; Wang 2016) methods have been proposed. Recently, many researchers claimed that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted, by decomposing the data matrix into a low-rank component and a sparse component, which is called robust PCA (RPCA) (Candes et al. 2009). Since then, many extensions of RPCA have been proposed, such as inductive RPCA (IRPCA) (Bao et al. 2012), RPCA with capped norms (RPCA-Capped) (Sun et al. 2013), and so on. IRPCA learns the underlying projection matrix by solving a nuclear-norm regularized minimization model, which can be used to efficiently remove the gross corruptions in the data matrix. Without recalculating over all the data, IRPCA can handle new samples directly. Since the L1-norm model is based on a strong assumption which may not hold in real-world applications, RPCA-Capped is instead based on a difference-of-convex-functions framework using the capped trace norm and the capped L1-norm.

Recent works indicate the critical importance of preserving the local geometric structure of data in dimensionality reduction. In order to discover the local geometrical structure and discriminant structure of the data manifold, many researchers have proposed a series of graph based dimensionality reduction methods using a geometrically induced regularizer, such as the graph Laplacian. Since local geometric structure can be captured by a k-nearest neighbor graph on data samples, such graphs have been widely used to explore the geometrical structures of data (Belkin and Niyogi 2001). Jiang et al. proposed graph Laplacian PCA (gLPCA) (2013), which imposes graph regularization on the projected low-dimensional representations; they also proposed a robust version of gLPCA using the L2,1-norm on the reconstruction term, where the augmented Lagrange multiplier method is used to optimize the robust gLPCA model. Shahid et al. (2015) incorporated spectral graph regularization into the robust PCA framework and proposed RPCA on graphs to improve the clustering performance. They also found that the low rank representation is piecewise constant on the underlying graph, and then introduced a graph total variation regularization to enforce the piecewise constant assumption (Shahid et al. 2016).

Recent studies have found that not only do the observed data lie on a nonlinear low dimensional manifold, which is called the data manifold, but the features also lie on a low dimensional manifold, which is called the feature manifold. In order to consider the geometrical information of both the data manifold and the feature manifold simultaneously, the graph dual regularization technique has attracted much attention in dimensionality reduction. For example, by enforcing the preservation of geometric information in both the data space and the feature space, Shang et al. proposed dual graph based non-negative matrix factorization (DNMF) (2012), and Yin et al. proposed dual graph regularized low rank representation (DGLRR) (2015). All these methods have achieved promising performance, which demonstrates that the duality between data points and feature vectors can be used to improve the performance of dimensionality reduction methods.

Inspired by the idea of dual regularization learning (Gu and Zhou 2009; Sindhwani and Hu 2009), graph-Dual Laplacian PCA (gDLPCA) is proposed in this paper. By combining PCA and the graph dual regularization method, gDLPCA simultaneously preserves the geometric structures of the data manifold and the feature manifold through two graphs, derived from the data space and the feature space, that are constructed by the k-nearest neighbor method. In summary, the main contributions of the paper are summarized as follows:

1. We propose a dual graph regularized PCA model (gDLPCA) by discovering the local geometric structures contained in the data manifold and the feature manifold. gDLPCA can effectively discover local geometrical structures in both the data space and the feature space. Unlike gLPCA, gDLPCA uses the dual graph embedding as the regularization term, which can preserve the local geometrical structure of both the feature manifold and the data manifold.
2. An optimization algorithm is proposed to solve the gDLPCA model. We construct a compact closed-form solution, so the model can be solved efficiently. The closed-form solution provides an exact and efficient algorithm with little implementation effort. The computational complexity of gDLPCA is O(d^3), where d is the dimensionality of a data sample.
3. Comprehensive experiments on clustering and classification tasks are conducted to confirm the effectiveness of the proposed gDLPCA method, and the results on four high-dimensional data sets have demonstrated its advantages over traditional PCA and gLPCA methods.
The rest of the paper is organized as follows. In Sect. 2, we describe the preliminary notations and formulations used in the paper. Section 3 briefly reviews some related works, including traditional PCA and gLPCA. The proposed gDLPCA method is introduced in Sect. 4. Then the classification and clustering results on four data sets are reported in Sect. 5. Finally, we give our concluding remarks in Sect. 6.

2 Preliminaries

Some notations used throughout the paper are summarized in Table 1. Based on these notations, dimensionality reduction or data representation aims to generate low-dimensional representations Y from a data matrix X, while preserving some structures of the data set.

Suppose that we are given a high-dimensional data matrix X = (x_1, x_2, \ldots, x_n) \in R^{d \times n}, whose column vectors are samples. Dimensionality reduction aims to project the d-dimensional samples into an r-dimensional subspace, i.e., Y = V^T X, where Y = (y_1, y_2, \ldots, y_n) \in R^{r \times n} is the low-dimensional embedding matrix and V = (v_1, v_2, \ldots, v_r) \in R^{d \times r} is the linear projection matrix. There are many methods to find the projection matrix V and the low-dimensional embedding Y. The most popular of them is graph based dimensionality reduction, namely graph embedding (Yan et al. 2007). Most dimensionality reduction algorithms can be unified into a graph embedding framework.

Let G = \{X, W\} be an undirected weighted graph with vertex set X and similarity weight matrix W \in R^{n \times n}. The graph Laplacian matrix of graph G is defined as L = D - W, where D is a diagonal matrix with D_{ii} = \sum_{j \neq i} W_{ij}. The weight between x_i and x_j is defined as

W_{ij} = \begin{cases} 1, & x_i \in N_k(x_j) \vee x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases}    (1)

where N_k(x_i) denotes the set of the k nearest neighbors of x_i. Several methods have been proposed to construct the weight matrix W, such as the heat kernel (Belkin and Niyogi 2001), local linear reconstruction coefficients (Roweis and Saul 2000) and correlation distance (Jin et al. 2015).

In dimensionality reduction, there is a fundamental assumption that nearby sample points are likely to have similar embeddings. Thus, the graph embedding model aims to preserve the local information of the manifold structure through the following optimization problem:

\min_Y J(Y) = \sum_{i,j=1}^{n} W_{ij} \left\| y_i - y_j \right\|^2    (2)

which can be further formulated in trace form as follows:

\min_V \operatorname{tr}\left( V^T X L X^T V \right)    (3)

where S = X L X^T is the scatter matrix. From Eq. (2) one can see that minimizing (2) is actually enforcing Y to reproduce the similarity structure coded in L.
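As a concrete illustration of Eqs. (1)-(3), the short NumPy sketch below builds the binary symmetric k-nearest-neighbor weight matrix W and the graph Laplacian L = D - W, and numerically checks that the pairwise objective of Eq. (2) equals 2 tr(Y L Y^T) (the factor of 2 is dropped in the paper because it does not affect the minimizer). The function and variable names are illustrative and are not taken from the paper.

import numpy as np

def knn_weight_matrix(X, k):
    """Binary symmetric k-NN weight matrix W on the columns of X, as in Eq. (1)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)                        # a sample is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[:k]] = 1.0               # the k nearest neighbors of x_i
    return np.maximum(W, W.T)                             # x_i in N_k(x_j) OR x_j in N_k(x_i)

def graph_laplacian(W):
    """L = D - W with D_ii = sum_{j != i} W_ij."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))        # d = 20 features, n = 100 samples (columns)
W = knn_weight_matrix(X, k=5)
L = graph_laplacian(W)
Y = rng.standard_normal((3, 100))         # an arbitrary r x n embedding
pairwise = sum(W[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
               for i in range(100) for j in range(100))
assert np.isclose(pairwise, 2.0 * np.trace(Y @ L @ Y.T))  # Eq. (2) = 2 * tr(Y L Y^T)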
3 Related works

3.1 Principal component analysis

Due to its simplicity and efficiency, PCA is a widely used method for data representation and feature extraction. PCA learns a set of orthonormal projection vectors so that the variance of the original data in the low-dimensional feature space is maximized, which is equivalent to minimizing the following reconstruction error in the L2 norm:

\min_V \left\| X - V V^T X \right\|_F^2 \quad \text{s.t.} \; V^T V = I    (4)

where X is a centered data matrix. Traditionally, V and Y = V^T X are termed the principal directions and the principal components. The optimal projection matrix V can be obtained from the eigen-decomposition of the covariance matrix C = X X^T, and V = [v_1, \ldots, v_r], where v_i is the eigenvector corresponding to the i-th largest eigenvalue of C. Then the r-dimensional representations are given as

y_i = V^T x_i, \quad i = 1, \ldots, n.

Note that y_i is a descriptive and compact representation of the high-dimensional data sample x_i, and the corresponding low-dimensional data space is usually called the feature space, which is the learning objective of the PCA model (Guan et al.).

3.2 Graph-Laplacian PCA

Graph-Laplacian PCA (gLPCA) assumes that the high-dimensional representations lie on a smooth manifold, and since the manifold structure can be encoded in the weight matrix W of the graph, gLPCA learns the low-dimensional representations of a high-dimensional data matrix X through the following optimization model:

\min_{V,Y} J = \left\| X - V Y \right\|_F^2 + \lambda \operatorname{tr}\left( Y L Y^T \right) \quad \text{s.t.} \; Y Y^T = I    (5)

where \lambda is the regularization parameter and L = D - W. Although the optimization model (5) is not a convex problem, it has a closed-form solution, which means that it can be solved efficiently.

Fixing Y and setting the first order derivative of the objective function in (5) to zero, i.e.,

\frac{\partial J}{\partial V} = -2 X Y^T + 2 V = 0,

we obtain the optimal projection matrix V^* = X Y^T.

Substituting V = X Y^T into the objective function (5), we get

\min_{Y} J = \left\| X - X Y^T Y \right\|_F^2 + \lambda \operatorname{tr}\left( Y L Y^T \right) \quad \text{s.t.} \; Y Y^T = I

By some algebra, it can be rewritten as

J = \left\| X - X Y^T Y \right\|_F^2 + \lambda \operatorname{tr}\left( Y L Y^T \right)
  = \operatorname{tr}\left( \left( X - X Y^T Y \right) \left( X^T - Y^T Y X^T \right) \right) + \lambda \operatorname{tr}\left( Y L Y^T \right)
  = \operatorname{tr}\left( X X^T \right) - \operatorname{tr}\left( Y X^T X Y^T \right) + \lambda \operatorname{tr}\left( Y L Y^T \right)
  = \operatorname{tr}\left( X X^T \right) + \operatorname{tr}\left( Y \left( -X^T X + \lambda L \right) Y^T \right)

Equivalently, we have the following optimization problem:

\min_Y \operatorname{tr}\left( Y \left( -X^T X + \lambda L \right) Y^T \right) \quad \text{s.t.} \; Y Y^T = I

Then, the optimal low-dimensional embedding matrix can be represented as Y^* = \left( u_1, u_2, \ldots, u_r \right)^T, where u_1, u_2, \ldots, u_r are the eigenvectors corresponding to the first r smallest eigenvalues of the matrix

G_{\alpha} = -X^T X + \alpha L    (6)

and the optimal projection matrix is V^* = X Y^{*T}.
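The closed-form solution above can be realized in a few lines: form G_alpha as in Eq. (6), take the eigenvectors belonging to the r smallest eigenvalues as the rows of Y, and recover V = X Y^T. The sketch below is a minimal illustration under the assumption that X is already centered and that L is the Laplacian of the k-NN graph of Eq. (1); it is not the authors' reference implementation.

import numpy as np

def glpca(X, L, alpha, r):
    """Minimal gLPCA solver via Eq. (6).

    X : d x n centered data matrix; L : n x n data-graph Laplacian;
    alpha : regularization weight; r : target dimensionality.
    Returns the projection V (d x r) and the embedding Y (r x n) with Y Y^T = I.
    """
    G = -(X.T @ X) + alpha * L        # n x n symmetric matrix G_alpha of Eq. (6)
    _, U = np.linalg.eigh(G)          # eigenvectors sorted by ascending eigenvalue
    Y = U[:, :r].T                    # rows u_1^T, ..., u_r^T, hence Y Y^T = I
    V = X @ Y.T                       # optimal projection V* = X Y^T
    return V, Y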
4 Graph-dual Laplacian PCA

In this section, we propose a dual graph regularized PCA model, called the gDLPCA algorithm. Since recent studies have shown that learning representations from the data manifold and the feature manifold simultaneously can improve the performance of data clustering, gDLPCA incorporates feature manifold embedding into the graph-Laplacian PCA model. In this way, the proposed gDLPCA method is able to discover the local geometric structures of the data space and the feature space simultaneously.

4.1 Dual graph

A data set endowed with pairwise relationships can be naturally illustrated as a graph, in which the samples are represented as vertices and the relationship between any two vertices can be represented by an edge. If the pairwise relationships among samples are symmetric, the graph can be undirected. Otherwise, it can be directed.

The data matrix has two modes, namely column vectors and row vectors, which correspond to the sample point set and the feature point set. To be clear, the column space is called the data space, and the row space is called the feature space. Originally, the duality between data samples and features was considered for co-clustering. In this work, it is used for dimensionality reduction.

Given a high-dimensional data matrix X, we can construct the feature graph G^{(f)} = \{X^T, W^{(f)}\} from the feature sample set \{(x^1)^T, (x^2)^T, \ldots, (x^d)^T\}, where x^i is the i-th row of the data matrix X and W^{(f)} is the weight matrix of the feature graph. Similar to Eq. (1), W^{(f)} can be defined as

W^{(f)}_{ij} = \begin{cases} 1, & x^i \in N_k(x^j) \vee x^j \in N_k(x^i) \\ 0, & \text{otherwise} \end{cases}    (7)

[…]

where \lambda and \mu are the alternative model parameters used instead of the regularization parameters \alpha and \beta. Substituting Eqs. (14) and (15) into Eq. (13), we have

\operatorname{tr}\left( V^T \left( -X X^T + \alpha \cdot L^{(f)} + \beta \cdot X L^{(d)} X^T \right) V \right) = \frac{wn}{1 - \lambda - \mu} \cdot \operatorname{tr}\left\{ V^T \left[ (1 - \lambda - \mu) \left( I - \frac{X X^T}{wn} \right) + \lambda \cdot \frac{L^{(f)}}{\alpha n} + \mu \cdot \frac{X L^{(d)} X^T}{\beta n} \right] V \right\}    (16)

Therefore, the solution of V in (16) can be stably computed; its columns are eigenvectors of G_{\lambda,\mu}, defined as

G_{\lambda,\mu} = (1 - \lambda - \mu) \left( I - \frac{X X^T}{wn} \right) + \lambda \cdot \frac{L^{(f)}}{\alpha n} + \mu \cdot \frac{X L^{(d)} X^T}{\beta n}    (17)

Note that G_{\lambda,\mu} is positive semi-definite, and the all-ones vector is an eigenvector of G_{\lambda,\mu}, which is orthogonal to any other eigenvector. The procedure of gDLPCA is summarized in Algorithm 1.

Since the eigen-decomposition is the most time consuming step of gDLPCA and the size of G_{\lambda,\mu} is d × d, the computational complexity of gDLPCA is O(d^3). Note that PCA and truncated singular value decomposition (TSVD) have almost the same computational complexity. Moreover, gDLPCA is in fact a manifold regularized low-rank matrix factorization method (Zhang and Zhao 2013).
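Although the intermediate derivation (Eqs. (8)-(15)) is not reproduced above, Eq. (17) already determines the solver: build the data-graph and feature-graph Laplacians, assemble G_{λ,μ}, and keep the eigenvectors of its r smallest eigenvalues as the columns of V. The sketch below is one way to realize this. The normalizers wn, αn and βn are chosen here as the largest eigenvalues of the corresponding terms so that G_{λ,μ} remains positive semi-definite; this choice, like the final step Y = V^T X, is an assumption on our part rather than the paper's exact prescription.

import numpy as np

def gdlpca(X, L_data, L_feat, lam, mu, r):
    """Sketch of a gDLPCA solver built around Eq. (17).

    X      : d x n centered data matrix
    L_data : n x n Laplacian of the data graph    (L^(d))
    L_feat : d x d Laplacian of the feature graph (L^(f))
    lam, mu: trade-off parameters with lam + mu < 1
    r      : target dimensionality
    The normalizers wn, an, bn are set to the largest eigenvalue of the
    corresponding term (an assumption, not necessarily the paper's choice),
    so every term of G is positive semi-definite.
    """
    d, n = X.shape
    S = X @ X.T                                   # d x d scatter of the data
    M = X @ L_data @ X.T                          # d x d data-graph term
    wn = np.linalg.eigvalsh(S)[-1]
    an = np.linalg.eigvalsh(L_feat)[-1]
    bn = np.linalg.eigvalsh(M)[-1]
    G = ((1.0 - lam - mu) * (np.eye(d) - S / wn)  # Eq. (17)
         + lam * L_feat / an
         + mu * M / bn)
    _, V = np.linalg.eigh(G)
    V = V[:, :r]                                  # eigenvectors of the r smallest eigenvalues
    Y = V.T @ X                                   # low-dimensional embedding (assumed Y = V^T X)
    return V, Y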
5 Experimental results

Table 2 Data sets used in the experiments

Name      #Samples   #Dimensionality   #Class
USPS      1000       256               10
ISOLET1   1560       617               26
COIL20    1440       784               20
SHD       1000       256               10

5.1 Data sets description

In the experiments, the data sets used to evaluate the proposed algorithm are listed in Table 2. These four benchmark data sets are widely used for data clustering and classification.

The USPS data set is obtained by scanning handwritten digits from envelopes of the U.S. Postal Service. After size normalization, all the images are resized into 16 × 16 grayscale images. Sample images of the USPS data set are shown in Fig. 1. Considering computational costs, we only use a subset of the original data set: in the experiments, 100 images for each class are selected randomly, so there are 1000 images in our USPS data set.

The ISOLET1 data set is the audio sequence data generated by a group of 30 speakers speaking the name of each letter of the alphabet.

The COIL20 data set is collected by the Columbia University Image Library and contains images of 20 objects in which […]

Fig. 5 The relation between classification accuracy and parameters λ, µ
Fig. 6 The relation between classification accuracy and parameters fk, dk
Table 3 Optimal clustering results with ACC metric (%)

Name      Original   PCA     LPP     gLPCA   gDLPCA
USPS      66.65      68.05   67.90   66.85   70.45
ISOLET1   60.80      60.90   67.70   56.90   59.90
COIL20    59.40      62.80   66.30   68.60   69.20
SHD       59.70      60.60   63.00   65.30   65.30

Table 4 Optimal clustering results with NMI metric (%)

Name      Original   PCA     LPP     gLPCA   gDLPCA
USPS      61.08      61.56   62.66   59.56   62.76
ISOLET1   76.20      76.00   81.30   72.40   75.70
COIL20    75.00      76.00   79.20   80.10   81.00
SHD       54.70      55.00   59.10   58.50   59.10

[…] and τ_i denotes the cluster label of x_i, l_i denotes the true class label, and map(τ_i) is the permutation mapping function (Lovász and Plummer 2009) that maps the cluster label τ_i to the equivalent label from the data set. A larger value of ACC indicates a better clustering performance.

In addition, NMI is used to determine the quality of clusters, which is defined as follows:

\mathrm{NMI} = \frac{\sum_{i=1}^{c} \sum_{j=1}^{c} n_{i,j} \log \frac{n \cdot n_{i,j}}{n_i \hat{n}_j}}{\sqrt{\left( \sum_{i=1}^{c} n_i \log \frac{n_i}{n} \right) \left( \sum_{j=1}^{c} \hat{n}_j \log \frac{\hat{n}_j}{n} \right)}}

where n_i and \hat{n}_j are the numbers of data samples in the true cluster C_i and the predicted cluster T_j, respectively, and n_{i,j} denotes the number of samples in the intersection between the true cluster C_i and the predicted cluster T_j. Similar to ACC, a larger NMI value indicates a better clustering performance.

The ACC metric is based on a one-to-one match between cluster labels and true labels, while NMI is an external criterion which evaluates the degree of similarity between cluster labels and true labels. Table 3 reports the average ACC on the four data sets, with the best results highlighted in bold. Table 4 shows the clustering results of these unsupervised linear dimensionality reduction algorithms in terms of NMI on the four data sets; the best results are again highlighted in bold.

From Tables 3 and 4, we have the following observations. gDLPCA is superior to all the other methods and acquires the best results in terms of the clustering evaluation metrics, ACC and NMI, on almost all the data sets. However, LPP performs the best on the ISOLET1 data set, while gDLPCA and gLPCA perform the worst. The reason is perhaps that local similarity is a typical characteristic of the ISOLET1 data set, which is a sequential data set. Since LPP can preserve the local similarity of a sequential data set, it captures the cluster structure of ISOLET1, whereas PCA, gLPCA and gDLPCA take total variance maximization as prior information, which is not suitable for speech sequence data.

The clustering performances with different reduced dimensionalities r are shown in Figs. 7, 8, 9, 10, 11 and 12. Compared with the gLPCA method, the main improvement is that gDLPCA utilizes the local geometrical information of the feature space through the feature graph regularization added to the gLPCA model. Therefore, we can draw the conclusion that the information in the feature space is of great importance for data clustering.
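For completeness, the two metrics can be computed as in the sketch below: ACC realizes the permutation mapping map(τ_i) with the Hungarian algorithm (scipy.optimize.linear_sum_assignment), and NMI follows the formula given above. This is an illustrative re-implementation, not the evaluation code used by the authors.

import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(true_labels, cluster_labels):
    """n_ij: number of samples in true class C_i and predicted cluster T_j."""
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    return np.array([[np.sum((true_labels == c) & (cluster_labels == t))
                      for t in clusters] for c in classes], dtype=float)

def clustering_acc(true_labels, cluster_labels):
    """ACC: best one-to-one match between cluster labels and true labels."""
    C = contingency(np.asarray(true_labels), np.asarray(cluster_labels))
    row, col = linear_sum_assignment(-C)          # maximize the number of matched samples
    return C[row, col].sum() / len(true_labels)

def clustering_nmi(true_labels, cluster_labels):
    """NMI: mutual information normalized by sqrt(H(true) * H(predicted))."""
    C = contingency(np.asarray(true_labels), np.asarray(cluster_labels))
    n = C.sum()
    ni = C.sum(axis=1)                            # true class sizes n_i
    nj = C.sum(axis=0)                            # predicted cluster sizes n^_j
    nz = C > 0
    num = np.sum(C[nz] * np.log(n * C[nz] / np.outer(ni, nj)[nz]))
    den = np.sqrt(np.sum(ni * np.log(ni / n)) * np.sum(nj * np.log(nj / n)))
    return num / den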