
Knowledge-Based Systems 225 (2021) 107130


Label propagation with structured graph learning for semi-supervised dimension reduction

Fei Wang a, Lei Zhu a,∗, Liang Xie b, Zheng Zhang c, Mingyang Zhong d

a School of Information Science and Engineering, Shandong Normal University, China
b School of Sciences, Wuhan University of Technology, China
c Bio-Computing Research Center, Harbin Institute of Technology, China
d School of Artificial Intelligence, Southwest University, China

∗ Corresponding author. E-mail address: leizhu0608@gmail.com (L. Zhu).

Article info

Article history:
Received 8 January 2021
Received in revised form 21 March 2021
Accepted 5 May 2021
Available online 7 May 2021

Keywords:
Semi-supervised structured graph learning
Label propagation
Dimension reduction

Abstract

Graph learning has been demonstrated as one of the most effective methods for semi-supervised dimension reduction, as it can achieve label propagation between labeled and unlabeled samples to improve the feature projection performance. However, most existing methods perform this important label propagation process on a graph with sub-optimal structure, which will reduce the quality of the learned labels and thus affect the subsequent dimension reduction. To alleviate this problem, in this paper, we propose an effective Label Propagation with Structured Graph Learning (LPSGL) method for semi-supervised dimension reduction. In our model, label propagation, semi-supervised structured graph learning and dimension reduction are simultaneously performed in a unified learning framework. We propose a semi-supervised structured graph learning method to characterize the intrinsic semantic relations of samples more accurately. Further, we assign different importance scores to the given and learned labeled samples to differentiate their effects on learning the feature projection matrix. In our method, the semantic information can be propagated more effectively from labeled samples to unlabeled samples on the learned structured graph, and a more discriminative feature projection matrix can be learned to perform the dimension reduction. An iterative optimization with proved convergence is proposed to solve the formulated learning framework. Experiments demonstrate the state-of-the-art performance of the proposed method. The source codes and testing datasets are available at https://github.com/FWang-sdnu/LPSGL-code.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Although high-dimensional features can provide more comprehensive descriptions of the objective world, they also bring considerable computing burden and high storage cost. Dimension reduction [1] can effectively solve these problems by projecting the original features into a low-dimensional representation that preserves the original data information. Thus, it has aroused widespread interest among researchers [2–4].

According to label usage, dimension reduction approaches can be coarsely divided into three categories: supervised, unsupervised, and semi-supervised. Linear Discriminant Analysis (LDA) [5] is a typical supervised dimension reduction method which simultaneously maximizes the inter-class distance and minimizes the intra-class scatter. Principal Component Analysis (PCA) [6], as a representative unsupervised approach, aims to pursue the projection directions with the maximum variance for optimal reconstruction. The performance of supervised approaches is significantly better than that of unsupervised ones. However, the labeling process for a large amount of samples is time-consuming and laborious. Meanwhile, it is much easier to obtain a large amount of unlabeled data; therefore, semi-supervised dimension reduction approaches, which jointly employ unlabeled data as well as labeled data, have become more promising. Typical semi-supervised dimension reduction methods include Semi-supervised Discriminant Analysis (SDA) [7], Local and Global Consistency (LGC) [8], Gaussian Fields and Harmonic Functions (GFHF) [9], and Flexible Manifold Embedding (FME) [10].

Graph-based semi-supervised dimension reduction [10–14] is proposed to preserve the geometric structure of all data points in the form of an affinity graph [15–17] and perform label propagation on the affinity graph to improve the subsequent feature projection performance. It has been demonstrated as one of the most effective approaches for semi-supervised dimension reduction. In this kind of methods, label propagation is performed to
transfer the discriminative semantics from the labeled samples to the unlabeled samples through the graph structure. Under such circumstance, the performance of graph-based semi-supervised dimension reduction approaches heavily depends on the quality of the graph structure.

In most existing graph-based dimension reduction methods [8–10,18–20], graph construction and dimension reduction are separated into two independent steps. Moreover, real-world data usually contain a lot of redundant and noisy information, which will reduce the quality of the pre-constructed graph. Under such circumstance, label propagation on the low-quality graph may generate unreliable labels for unlabeled samples. In order to learn a better graph, several approaches such as Non-negative Sparse Graph (NNSG) [21], Semi-supervised Manifold Regularization with Adaptive Graph (AGMR) [22], Non-negative LRR (NNLRR) [23] and Semi-supervised Projection with Graph Optimization (SPGO) [14] integrate these two independent steps into a joint learning framework. Although these methods have achieved certain success, the learned graph with sub-optimal structure still has limited capability to support the label propagation process. Under such circumstance, unreliable labels will be generated and thus affect the feature projection learning process.

Motivated by the above analysis, we propose a new semi-supervised learning framework that simultaneously performs label propagation and semi-supervised structured graph learning for dimension reduction. Via the joint optimization, the label propagation can be performed more effectively on the structured graph to accurately characterize the semantic relations of samples in an adaptive semi-supervised learning paradigm. This process further improves the quality of the learned pseudo labels and finally the feature projection performance to differentiate the samples. The main contributions of this paper are summarized as follows:

(1) We propose an effective label propagation method with structured graph learning for semi-supervised dimension reduction. Our model simultaneously performs label propagation, semi-supervised structured graph learning and dimension reduction in a unified framework. It could improve the feature projection performance through effective label propagation on the structured graph. To the best of our knowledge, there is still no similar work.

(2) Different from existing methods, the structured graph in our method is adaptively learned in a semi-supervised learning paradigm, which jointly exploits the labeled and unlabeled samples to characterize the semantic relations of samples. Besides, we assign different importance scores to the given and learned labeled samples to differentiate their effects on learning the feature projection matrix.

(3) An effective iterative optimization strategy with convergence is proposed to solve the formulated learning problem. Experimental results on several public datasets demonstrate the promising performance of the proposed model compared with state-of-the-art approaches.

The rest of this paper is arranged as follows. We briefly review the related work in Section 2. The details of the proposed method are presented in Section 3. In Section 4, we give the theoretical analysis of LPSGL. Extensive experiments are conducted in Section 5. Finally, we conclude the paper in Section 6.

2. Related work

In this section, we briefly review the related research on supervised dimension reduction, unsupervised dimension reduction and semi-supervised dimension reduction. For a detailed survey of dimension reduction methods, please refer to [24].

2.1. Supervised dimension reduction

Supervised dimension reduction projects high-dimensional data into a low-dimensional subspace with the supervision of labeled samples, while preserving the sample similarity. Linear Discriminant Analysis (LDA) [5], as a representative supervised approach, simultaneously maximizes the inter-class distance and minimizes the intra-class scatter. It requires the data distribution of each class to approximate a Gaussian distribution. However, this requirement cannot always be satisfied in real-world problems. Marginal Fisher Analysis (MFA) [20] constructs an intrinsic graph and a penalty graph to characterize the intra-class compactness and inter-class separability, respectively. Nevertheless, in MFA, graph construction and projection learning are separated into two independent steps, which may lead to sub-optimal results. Recently, Simultaneously Learning Neighborship and Projection Matrix (SLNP) [25] is proposed to integrate similarity learning and projection learning within a unified framework for learning a more discriminative projection.

2.2. Unsupervised dimension reduction

Unsupervised dimension reduction approaches perform the projection process without the dependence on labels. Principal Component Analysis (PCA) [6] aims to map high-dimensional data into a low-dimensional subspace with maximum variance preserving. Projective Unsupervised Flexible Embedding Models with Optimal Graph (PUFE-OG) [26] jointly performs optimal graph learning and projection learning by integrating the manifold regularizer and regression residual into a unified framework. Unsupervised Projection with Graph Optimization (UPGO) [14] simultaneously learns the projection matrix and an ideal structured graph with clustering structure.

2.3. Semi-supervised dimension reduction

Semi-supervised dimension reduction can improve the feature projection performance by jointly exploiting unlabeled and labeled data. Among the existing methods, graph-based semi-supervised dimension reduction technology has attracted great attention from researchers, and various methods have been developed. Local and Global Consistency (LGC) [8] and Gaussian Fields and Harmonic Functions (GFHF) [9] force the predicted label matrix to be close to the given labels for the labeled nodes and meanwhile keep the manifold smoothness on the whole graph. Although promising performance has been reported, they cannot be applied to the out-of-sample extension scenario. To solve this problem, Linear Manifold Regularization (LapRLS) [18] is proposed to minimize the linear regression errors while preserving the manifold smoothness, but it is unreasonable for the linear regression function to force the predicted label matrix to lie within the space spanned by all the training samples. In Flexible Manifold Embedding (FME) [10], a regression deviation is introduced to relax the strict constraint in LapRLS, which can better fit samples from a nonlinear manifold. In these methods, graph construction and the projection process are separated into independent steps, which may lead to sub-optimal results. Besides, real-world data usually contain lots of noise and redundant information, which will affect the label propagation on the pre-constructed graph and further reduce the subsequent feature projection performance. The Non-negative Sparse Graph (NNSG) [21] method improves the feature projection performance by performing graph construction, label prediction and semi-supervised dimension reduction within a unified framework. Nevertheless, it adopts Local Coordinate Coding (LCC) [27] to construct the graph and directly learns the labels with mixed information, which will
bring considerable computation cost and thus impact the projection learning process. Recently, Semi-supervised Projection with Graph Optimization (SPGO) [14] is proposed to learn the graph similarity matrix adaptively based on the relations characterized by low-dimensional representations. However, it fails to take into account the different importance of the given and learned labeled samples on learning the feature projection matrix, which may degrade the feature projection performance.

Different from the above methods, in this paper, we propose an effective Label Propagation with Structured Graph Learning (LPSGL) method to simultaneously perform label propagation, semi-supervised structured graph learning and dimension reduction in a unified framework. Specifically, we learn the structured graph in a semi-supervised paradigm to improve the label propagation performance. Besides, we assign different importance scores to the given and learned labeled samples to differentiate their effects on learning the feature projection matrix. To the best of our knowledge, there is still no similar work.

3. The proposed method

In this section, we present the details of the proposed method. First, we give the relevant notations and definitions used in this paper. Then, we formulate the objective function. Finally, we derive an efficient algorithm to optimize it.

3.1. Notations and definitions

Throughout the paper, all matrices are written in uppercase while vectors are written in lowercase. For a matrix M ∈ R^{d×n}, the element in the ith row and jth column is represented as m_ij. The trace and the transpose of M are denoted as Tr(M) and M^T, respectively. ∥v∥_2 = (∑_{i=1}^{n} v_i^2)^{1/2} denotes the l2-norm of vector v. The Frobenius norm of matrix M is denoted by ∥M∥_F = (∑_{i=1}^{d} ∑_{j=1}^{n} m_ij^2)^{1/2}. 1 denotes a column vector with all elements equal to 1, and the identity matrix is denoted by I. The main notations used in this paper are listed in Table 1.

Table 1
Main notations used in this paper.

Notations                                            Description
X = [x_1, ..., x_l, x_{l+1}, ..., x_n] ∈ R^{d×n}     Original data matrix
F = [f_1, ..., f_l, f_{l+1}, ..., f_n] ∈ R^{n×c}     Predicted label matrix
Y = [y_1, ..., y_l, y_{l+1}, ..., y_n] ∈ R^{n×c}     Given label matrix
V ∈ R^{n×n}                                          Importance score matrix
S ∈ R^{n×n}                                          The learned structured graph
A ∈ R^{n×n}                                          The learned similarity matrix
x_i ∈ R^{d×1}                                        The ith data sample
d                                                    Dimension of original data
n                                                    The number of data samples
k                                                    The number of neighbors
c                                                    The number of clusters

3.2. Objective formulation

The formulated objective of the proposed method is comprised of three parts: semi-supervised structured graph learning, label propagation and projection matrix learning.

Semi-supervised Structured Graph Learning. A high-quality structured graph that reveals the semantic relations of samples will be more conducive to label propagation. In this paper, we propose to adaptively learn a graph with a more desirable structure to support the label propagation process. Specifically, we propose to learn a structured graph S based on a similarity matrix A which is obtained in a semi-supervised learning paradigm.

(1) Semi-supervised Similarity Learning. To support the structured graph learning, we first construct a nonnegative and normalized similarity matrix A in a semi-supervised learning paradigm, where the sum of each row of A is set to one. For simplicity, we denote the labels of samples x_i and x_j as y_i and y_j, respectively. We calculate the distance between two samples as follows

d^x_{ij} = \begin{cases} m, & \text{if } y_i \text{ and } y_j \text{ are known and } y_i \neq y_j \\ \|x_i - x_j\|_2^2, & \text{otherwise} \end{cases}    (1)

where m represents a relatively large constant. In spectral learning theory [28], a larger distance d^x_{ij} between x_i and x_j corresponds to a smaller similarity value a_{ij}. In addition, we simply set a_{ii} = 0 to avoid that sample x_i is only related to itself, which would violate our original intention of dividing all samples into c classes by calculating the similarities between samples. Then we define a_i as the vector representing the ith column of the similarity matrix A, and it can be computed by solving the following problem

\min_{a_i} \sum_{j=1}^{n} d^x_{ij} a_{ij} + \gamma \sum_{j=1}^{n} a_{ij}^2 \quad \text{s.t. } a_i^T 1 = 1, \; a_i \geq 0, \; a_{ii} = 0    (2)

where γ is the regularization parameter. The first term ensures the sample distance to reflect the similarities between samples. The second term is used to avoid trivial results. Eq. (2) can be written in the vector form of Eq. (3)

\min_{a_i} \left\| a_i + \frac{d^x_i}{2\gamma} \right\|_2^2 \quad \text{s.t. } a_i^T 1 = 1, \; a_i \geq 0, \; a_{ii} = 0    (3)

The Lagrangian function of Eq. (3) is

L(a_i, \eta, \beta_i) = \frac{1}{2} \left\| a_i + \frac{d^x_i}{2\gamma} \right\|_2^2 - \eta (a_i^T 1 - 1) - \beta_i^T a_i    (4)

where η and β_i ≥ 0 are the Lagrangian multipliers. According to the KKT condition, the optimal a_i is obtained by

a_{ij} = \left( -\frac{d^x_{ij}}{2\gamma} + \eta \right)_+    (5)

where (z)_+ = max(z, 0).

To learn a sparse similarity matrix A, we consider that each sample has only k nearest neighbors. In other words, the optimal solution of a_i to Eq. (2) has exactly k nonzero values. Without loss of generality, suppose the elements of d^x_{ij} are in ascending order. Then we have

\begin{cases} a_{ik} = -\frac{d^x_{ik}}{2\gamma} + \eta > 0 \\ a_{i,k+1} = -\frac{d^x_{i,k+1}}{2\gamma} + \eta \leq 0 \\ a_i^T 1 = \sum_{j=1}^{k} \left( -\frac{d^x_{ij}}{2\gamma} + \eta \right) = 1 \end{cases}    (6)

\Rightarrow \begin{cases} \gamma = \frac{k}{2} d^x_{i,k+1} - \frac{1}{2} \sum_{j=1}^{k} d^x_{ij} \\ \eta = \frac{1}{k} + \frac{1}{2k\gamma} \sum_{j=1}^{k} d^x_{ij} \end{cases}

We get the value range of γ and set it to the maximum in the above derivations, thus the constraint that a_i has exactly
k nonzero values can be satisfied. Consequently, the similarity matrix A can be computed by

a_{ij} = \begin{cases} \dfrac{d^x_{i,k+1} - d^x_{ij}}{k \, d^x_{i,k+1} - \sum_{j=1}^{k} d^x_{ij}}, & \text{if } j \leq k \\ 0, & \text{if } j > k \end{cases}    (7)
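The row-wise closed form of Eq. (7), driven by the label-aware distances of Eq. (1), can be computed directly. The following NumPy sketch is our own illustration rather than the released LPSGL code; the function name and the defaults k = 10 and m = 10 (the value of m used in Section 5.1.3) are assumptions.

```python
import numpy as np

def semi_supervised_similarity(X, labels, k=10, m=10.0):
    """Sketch of Eqs. (1) and (7): a nonnegative, row-normalized similarity
    matrix A with exactly k nonzero entries per row.
    X: d x n data matrix; labels: length-n array, with -1 marking unlabeled samples."""
    n = X.shape[1]
    # Squared Euclidean distances (second case of Eq. (1)).
    sq = np.sum(X ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.maximum(D, 0.0, out=D)
    # Samples with known but different labels get the large constant distance m.
    known = labels >= 0
    diff = (labels[:, None] != labels[None, :]) & known[:, None] & known[None, :]
    D[diff] = m
    A = np.zeros((n, n))
    for i in range(n):
        d = D[i].copy()
        d[i] = np.inf                       # enforce a_ii = 0 (self excluded)
        idx = np.argsort(d)[:k + 1]         # k neighbors plus the (k+1)-th distance
        dk = d[idx]
        denom = k * dk[k] - np.sum(dk[:k])  # k * d_{i,k+1} - sum_{j<=k} d_{ij}
        if denom <= 1e-12:
            A[i, idx[:k]] = 1.0 / k         # degenerate case: uniform weights
        else:
            A[i, idx[:k]] = (dk[k] - dk[:k]) / denom   # Eq. (7)
    return A
```

By construction each row of A sums to one and is nonnegative, matching the constraints of Eq. (2).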

(2) Structured Graph Learning and Label Propagation. A graph that reveals the semantic relations of samples can effectively support the label propagation. In this paper, we aim to learn a structured graph which can achieve an ideal neighbor assignment. That is, S has exactly c connected components, where each connected component corresponds to one cluster. To obtain such a graph, we explicitly impose the constraint rank(L_s) = n − c on the graph learning process according to the following theorem [29].

Theorem 1. The multiplicity c of the eigenvalue 0 of the Laplacian matrix L_S is equal to the number of connected components in the graph associated with S.

Then the optimization function is formulated as follows

\min_{S} \|S - A\|_F^2 \quad \text{s.t. } S1 = 1, \; S \geq 0, \; \text{rank}(L_s) = n - c    (8)

where L_s = D_s − (S^T + S)/2 and D_s is a diagonal matrix whose ith diagonal element is ∑_j (s_ij + s_ji)/2. In Eq. (8), we use the Frobenius-norm term to measure the disparity between the structured graph S and the pre-learned similarity matrix A. Since the element s_ij of S represents the probability that x_j is the neighbor of x_i, we have ∑_{j=1}^{n} s_ij = 1, which can be formulated as S1 = 1. It is natural that a larger distance should be assigned a smaller probability, and vice versa. Following the previous work [14], we denote the number of labeled and unlabeled samples as l and u, respectively. Y_l and F_l denote the given labels and the predicted labels of the labeled samples. Without loss of generality, we use the method in [9] to rearrange all the data samples and let the front l data samples be labeled. Then we split L_s and F into blocks, so that they can be expressed as

L_s = \begin{bmatrix} L_{ll} & L_{lu} \\ L_{ul} & L_{uu} \end{bmatrix}, \quad F = [F_l; F_u], \quad F_l = Y_l.

Given a large λ, Eq. (8) can be transformed to the following equivalent problem

\min_{S,F} \|S - A\|_F^2 + 2\lambda \, \text{Tr}(F^T L_s F) \quad \text{s.t. } S1 = 1, \; S \geq 0, \; F \in R^{n \times c}, \; F_l = Y_l    (9)
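Theorem 1 is what makes the rank constraint in Eq. (8) equivalent to requiring exactly c connected components: the multiplicity of the zero eigenvalue of L_S counts the components. This can be checked numerically on any learned S; the small sketch below is our own illustration and not part of the LPSGL code.

```python
import numpy as np

def num_connected_components(S, tol=1e-8):
    """Count connected components of the graph S via Theorem 1:
    the multiplicity of the zero eigenvalue of L_S = D_S - (S + S^T)/2."""
    W = 0.5 * (S + S.T)                    # symmetrized affinities
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian L_S
    eigvals = np.linalg.eigvalsh(L)        # L is symmetric positive semi-definite
    return int(np.sum(eigvals < tol))

# Example: two disjoint pairs of connected nodes -> two zero eigenvalues.
S = np.zeros((4, 4))
S[0, 1] = S[1, 0] = 1.0
S[2, 3] = S[3, 2] = 1.0
print(num_connected_components(S))  # prints 2
```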
Projection Matrix Learning. With the learned pseudo labels, we learn the projection matrix for dimension reduction. Theoretically, the predicted labels for the labeled samples, with the mandatory equality constraint to the given labels, will be more reliable than the labels for the unlabeled samples obtained from label propagation. Therefore, we assign different importance scores to the labeled and unlabeled samples to differentiate their effects on learning the projection matrix. To this end, we derive the following formula

\min_{W} \sum_{i=1}^{n} v_i \|W^T x_i - f_i\|_2^2 \quad \text{s.t. } F \in R^{n \times c}, \; F_l = Y_l    (10)

where W ∈ R^{d×c} is the projection matrix, v_i denotes the importance score of the training sample x_i and f_i is the vector representing the ith row of F. Besides, f_ij = 1 if f_i is labeled as j and f_ij = 0 otherwise.

Overall Learning Framework. After jointly considering the semi-supervised structured graph learning, label propagation and dimension reduction, we derive the overall formulation of LPSGL as

\min_{S,F,W} \|S - A\|_F^2 + 2\lambda \, \text{Tr}(F^T L_s F) + \alpha \sum_{i=1}^{n} v_i \|W^T x_i - f_i\|_2^2 \quad \text{s.t. } S1 = 1, \; S \geq 0, \; F \in R^{n \times c}, \; F_l = Y_l    (11)

where α is a balance parameter. In Eq. (11), the first two terms perform semi-supervised structured graph learning and label propagation. The third term aims to learn the projection matrix for dimension reduction. In particular, we assign different importance scores v_i|_{i=1}^{n} from the set {1, 10, 10^2, 10^3, 10^4} to the unlabeled and labeled samples to differentiate their effects on learning the feature projection matrix. With LPSGL, the label propagation can be performed more effectively on the learned structured graph and thus improves the feature projection performance.

3.3. Iterative optimization

We set a diagonal matrix V to represent the importance scores of the training samples on learning the feature projection matrix. The first l diagonal elements and the last n − l diagonal elements correspond to the values of the given and learned labeled samples, and are denoted as V_l and V_u, respectively. Then we rewrite Eq. (11) as Eq. (12)

\min_{S,F,W} \|S - A\|_F^2 + 2\lambda \, \text{Tr}(F^T L_s F) + \alpha \, \text{Tr}((X^T W - F)^T V (X^T W - F)) \quad \text{s.t. } S1 = 1, \; S \geq 0, \; F \in R^{n \times c}, \; F_l = Y_l    (12)

We propose an alternating optimization algorithm which optimizes one variable while fixing the remaining variables. The key steps of the iterative optimization for Eq. (12) are as follows:

Update F: When W and S are fixed, Eq. (12) becomes

\min_{F} 2\lambda \, \text{Tr}(F^T L_s F) + \alpha \, \text{Tr}((X^T W - F)^T V (X^T W - F)) \quad \text{s.t. } F \in R^{n \times c}, \; F_l = Y_l    (13)

The optimal solution to Eq. (13) can be calculated as

F_u = (2\lambda L_{uu} + \alpha V_u)^{-1} (\alpha V_u X_u^T W - 2\lambda L_{ul} Y_l), \quad F = [Y_l; F_u]    (14)

To avoid unreliable labels, we impose the following constraint on the predicted label matrix to force the value of f_ij to lie in the range [0, 1]:

f_ij = \begin{cases} 0, & \text{if } f_ij \leq 0 \\ f_ij, & \text{if } 0 \leq f_ij \leq 1 \\ 1, & \text{if } f_ij \geq 1 \end{cases}    (15)
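Under the block partition of L_s and V introduced above, the F-subproblem has the closed form of Eq. (14) followed by the clipping of Eq. (15). A hedged NumPy sketch is given below; the function and argument names are ours, and np.linalg.solve replaces the explicit matrix inverse.

```python
import numpy as np

def update_F(X, W, Yl, L, V, lam, alpha, l):
    """Closed-form F update, Eqs. (14)-(15). l = number of labeled samples,
    assumed to be the first l columns of X (as rearranged in Section 3.2)."""
    Luu = L[l:, l:]
    Lul = L[l:, :l]
    Vu = V[l:, l:]
    Xu = X[:, l:]
    rhs = alpha * (Vu @ (Xu.T @ W)) - 2.0 * lam * (Lul @ Yl)
    Fu = np.linalg.solve(2.0 * lam * Luu + alpha * Vu, rhs)   # Eq. (14)
    F = np.vstack([Yl, Fu])                                   # F = [Y_l; F_u]
    return np.clip(F, 0.0, 1.0)                               # Eq. (15)
```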
Update W: When F and S are fixed, Eq. (12) is reduced to

\min_{W} \text{Tr}((X^T W - F)^T V (X^T W - F))    (16)

By calculating the derivative of the objective function with respect to W and setting it to 0, we obtain the updating rule for W as

W = (X V X^T)^{-1} X V F    (17)
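Eq. (17) is an importance-weighted least-squares solution. A minimal sketch follows; the small ridge term is our own addition purely for numerical stability and is not part of the paper's formulation.

```python
import numpy as np

def update_W(X, F, V, eps=1e-6):
    """W update of Eq. (17): W = (X V X^T)^{-1} X V F.
    eps adds a tiny ridge so that X V X^T is always invertible (our assumption)."""
    d = X.shape[0]
    G = X @ V @ X.T + eps * np.eye(d)
    return np.linalg.solve(G, X @ V @ F)
```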
Update S: When W and F are fixed, the optimization for S can be derived as

\min_{S} \sum_{i,j} (s_ij - a_ij)^2 + \lambda \sum_{i,j} \|f_i - f_j\|_2^2 \, s_ij \quad \text{s.t. } \sum_{j} s_ij = 1, \; s_ij \geq 0    (18)
We can obtain the optimal solution of S by solving the following problem separately for each i

\min_{s_i} \sum_{j} (s_ij - a_ij)^2 + \lambda \sum_{j} \|f_i - f_j\|_2^2 \, s_ij \quad \text{s.t. } \sum_{j} s_ij = 1, \; s_ij \geq 0    (19)

Denoting e_i as a vector whose jth element is e_ij = ∥f_i − f_j∥_2^2, Eq. (19) can be written in the following form

\min_{s_i} \left\| s_i - \left( a_i - \frac{\lambda}{2} e_i \right) \right\|_2^2 \quad \text{s.t. } s_i^T 1 = 1, \; s_i \geq 0    (20)

This problem can be solved by an efficient iterative algorithm [30]. The basic optimization steps to solve Eq. (11) are summarized in Algorithm 1. In this algorithm, F, W and S are alternately updated until convergence.

Algorithm 1 Optimization Algorithm for LPSGL
Input: Original data matrix X ∈ R^{d×n}, where X_l ∈ R^{d×l} and X_u ∈ R^{d×u} are the labeled and unlabeled data matrices, respectively; the learned similarity matrix A ∈ R^{n×n}; known label matrix Y_l ∈ R^{l×c}; parameters α, v, λ.
Output: The projection matrix W ∈ R^{d×c}.
1: Compute the importance score matrix V ∈ R^{n×n}.
2: Initialize S = A, W = 0_{d×c}.
3: repeat
4:   Compute the graph Laplacian matrix L_S = D_S − (S^T + S)/2.
5:   Update F with Eq. (14).
6:   Adjust F according to Eq. (15).
7:   Update W by solving Eq. (17).
8:   Update each row s_i of S by optimizing Eq. (20).
9: until convergence
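Algorithm 1 can be sketched end-to-end as below. The per-row subproblem of Eq. (20) is a Euclidean projection onto the probability simplex; the paper solves it with the iterative method of [30], whereas this sketch substitutes the standard sorting-based projection. The code assumes the update_F and update_W helpers sketched above, and all names and default parameter values are our own illustration, not the released implementation.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, 1^T s = 1}, the constraint set of Eq. (20)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def lpsgl(X, Yl, A, l, lam=1.0, alpha=1.0, v_labeled=10.0, n_iter=20):
    """Sketch of Algorithm 1. X: d x n data (first l columns labeled), Yl: l x c one-hot
    labels, A: pre-learned semi-supervised similarity matrix (n x n).
    Relies on the update_F and update_W sketches given earlier."""
    d, n = X.shape
    c = Yl.shape[1]
    # Step 1: importance scores, larger for the given labeled samples (Section 3.3).
    V = np.diag(np.concatenate([np.full(l, v_labeled), np.ones(n - l)]))
    # Step 2: initialization.
    S, W = A.copy(), np.zeros((d, c))
    for _ in range(n_iter):
        # Step 4: Laplacian of the current structured graph, L_S = D_S - (S + S^T)/2.
        Wsym = 0.5 * (S + S.T)
        L = np.diag(Wsym.sum(axis=1)) - Wsym
        # Steps 5-6: F update, Eqs. (14)-(15).
        F = update_F(X, W, Yl, L, V, lam, alpha, l)
        # Step 7: W update, Eq. (17).
        W = update_W(X, F, V)
        # Step 8: row-wise S update, Eq. (20): s_i = Proj_simplex(a_i - (lambda/2) e_i).
        E = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)
        for i in range(n):
            S[i] = project_simplex(A[i] - 0.5 * lam * E[i])
    return W, F, S
```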
4. Theoretical analysis

In this section, we first provide the convergence analysis and computational complexity of Algorithm 1, and then discuss some details about our proposed method.

4.1. Convergence analysis

We theoretically prove the convergence of the proposed Algorithm 1 according to the following Theorem 2.

Theorem 2. The iterative optimization in Algorithm 1 can monotonically decrease the objective function of Eq. (11) until convergence.

Proof. The optimization problem in Eq. (11) is divided into three iterative optimization steps. In each iteration, when fixing the other variables and optimizing only one variable, the value of the objective function will decrease or remain unchanged. Specifically, let t be the number of iterations.

When fixing the other variables, updating W will monotonically decrease the objective function in Eq. (11):

\Omega(W^t, F, S) \geq \Omega(W^{t+1}, F, S)    (21)

By fixing the other variables and updating F, the Hessian matrix of the Lagrangian function of Eq. (13) is positive semi-definite, thus Eq. (13) becomes a convex optimization problem. Then, we have

\Omega(W, F^t, S) \geq \Omega(W, F^{t+1}, S)    (22)

When fixing the other variables to update S, the optimization of Eq. (20) becomes a typical quadratic programming problem. The Hessian matrix of the Lagrangian function of Eq. (20) is also positive semi-definite. Thus, we arrive at

\Omega(W, F, S^t) \geq \Omega(W, F, S^{t+1})    (23)

From the above analysis, the convergence of Algorithm 1 can be achieved after a number of iterations.

4.2. Computation complexity analysis

We analyze the computational complexity of LPSGL described in Algorithm 1. The entire process can be divided into t iterations. In each iteration, the computational complexity required to update F is O(u^3), the complexity for updating W is O(nd^2), and it costs O(n^2) to update S. In real-world applications, n ≫ d. Therefore, the total computational complexity of the proposed method is O(max{n^2, u^3} t).

4.3. Discussion

In this subsection, we discuss the advantages of the joint learning framework in Eq. (11) compared with previous approaches. The projection matrix for dimension reduction learned from linear regression is used to differentiate the samples. Its quality is largely dependent on the labels used for semantic supervision. As the labels are mainly generated from the label propagation process, the quality of the projection matrix will be affected by the label propagation under the semi-supervised learning paradigm. In this paper, we jointly perform label propagation and learn an adaptive structured graph that better reveals the semantic structure of samples, which thereby improves the feature projection performance.

Specifically, we propose to first learn a similarity matrix A in a semi-supervised learning paradigm. Then a structured graph is adaptively constructed based on the pre-learned A during the iterative optimization process. Based on the graph with better structure, more discriminative semantics are propagated from the given labeled samples to the unlabeled samples and thus can guide the projection matrix learning process. Besides, in order to take advantage of the given labels on learning the feature projection matrix, we empirically assign larger importance scores to the given labeled samples.

Via the joint optimization, our method can learn a more discriminative feature projection matrix to differentiate samples. Our model is a new attempt to improve the feature projection performance by performing more effective label propagation.

5. Experiments

5.1. Experimental settings

In this subsection, we first give the descriptions of six benchmark datasets. Then, we introduce the comparison approaches. Finally, we present the implementation details of our experiments.

5.1.1. Datasets descriptions

Our experiments are conducted on six datasets which are widely adopted for semi-supervised dimension reduction [10,14,21]. The basic descriptions of these datasets are summarized in Table 2.

Table 2
Descriptions of six datasets.

Type                 Datasets   Samples   Dimension   Class
Face                 ORL        400       1024        40
Face                 MSRA25     1799      256         12
Object               Coil20     1440      1024        20
Object               USPS20     1854      256         10
Text                 CNAE-9     1080      856         9
Handwritten Digit    Dig1–10    1797      64          10

(1) Face Datasets: The ORL [31] dataset contains 400 face images of 40 different people; each image is collected under different times, lighting, facial expressions (open or closed eyes, smiling or not smiling) and facial details (glasses or no glasses). All images are cropped and then resized to 32 × 32 pixels. For the MSRA25 [31] dataset, 1799 images from 12 people are used in our experiment.

(2) Object Dataset: The Coil20 [32] dataset consists of 1440 images from 20 objects, with each object providing 72 images. The images of each object are captured from varying angles at intervals of five degrees. We resize each image to 32 × 32 pixels.

(3) Text Dataset: The CNAE-9 dataset¹ has 9 categories and totally 1080 free-text business description documents, which are from Brazilian companies.

(4) Handwritten Digit Datasets: The USPS [33] dataset has 7291 training and 2007 testing images. The size of each image is 16 × 16 pixels. In our experiment, we randomly select 20% of the images from each category to construct a subset containing 1854 images, which is denoted as USPS20. The Dig1–10 dataset¹ contains 1797 images of digits from 1 to 10.

¹ http://www.escience.cn/people/chenxiaojun/index.html

5.1.2. Comparison approaches

To demonstrate the performance of our method, we compare our model with 6 state-of-the-art approaches. The compared approaches are briefly introduced as follows:

(1) Baseline. It employs the original data without performing dimension reduction in the experiment.

(2) Marginal Fisher Analysis (MFA) [20]. MFA is a supervised dimension reduction method. It constructs two graphs to describe the intra-class compactness and the inter-class separability, respectively.

(3) Semi-supervised Discriminant Analysis (SDA) [7]. SDA is proposed to use unlabeled data for maximizing the locality preserving power and labeled data for maximizing the discriminating power.

(4) Flexible Manifold Embedding (FME) [10]. FME can effectively project unlabeled data and process new samples by modeling the regression residue. In addition, it considers the label information as well as the manifold smoothness from labeled and unlabeled data during the training process.

(5) Non-negative Sparse Graph (NNSG) [21]. It integrates linear regression and graph learning into a unified framework to achieve an overall optimum. NNSG is proposed to enable the label information to be accurately propagated on the non-negative sparse graph structure, and thus improve the feature projection performance.

(6) Semi-supervised Projection with Graph Optimization (SPGO) [14]. SPGO simultaneously learns a feature projection matrix and an ideal structured graph. It further has 4 variant methods, SPGO_KNN, SPGO_SSC, SPGO_LRR and SPGO_ADP, with different graph construction methods.

5.1.3. Implementation details

In our experiments, we follow implementations similar to the previous works [10,14,21]. The discriminative capability of the feature projection matrices learned by the compared approaches is measured by performing the nearest neighbor (NN) classifier on the projected low-dimensional representation of the original data. For all the comparison approaches, the reduced dimension is set to c (the number of clusters). All approaches except MFA are semi-supervised dimension reduction methods. We construct KNN graphs to obtain the pre-defined graph Laplacian matrix L required by SDA and FME, where k is set to 10 and a Gaussian kernel is used to calculate the edge weights. The parameters of these algorithms are ranged over {10^{-9}, 10^{-6}, ..., 10^6, 10^9}. The α, β, γ of NNSG are selected from {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1}. For SPGO and LPSGL, the parameters α and λ are chosen from {10^{-6}, 10^{-5}, ..., 10^5, 10^6}, and the importance score v for the given labeled samples takes its value from {1, 10, 10^2, 10^3, 10^4}, while the v for the learned labeled samples is set to 1. Empirically, we set m to 10 in the semi-supervised similarity learning process.

First, PCA is used as a preprocessing step to preserve 95% of the data energy. Then we randomly select 50% of the dataset as the training samples, and the others as the testing samples. Among the training samples, p samples per class are randomly selected as labeled samples, and the rest are used as unlabeled samples. We set p to 1, 2, 3. For all approaches except MFA, we use the entire set of training samples to learn the feature projection matrix. The unlabeled samples are used to test the semi-supervised classification performance after dimension reduction, while the testing samples are used to test the classification performance of the learned feature projection matrix on new samples. For the supervised method MFA, we only adopt the labeled samples to learn the feature projection matrix.

5.2. Experimental results

In this subsection, we first show the comparison results and running time comparison, then we conduct ablation experiments to verify the effects of the semi-supervised structured graph learning, pseudo label learning and importance scores for semi-supervised dimension reduction. Finally, we give the parameter sensitivity analysis and convergence results.

5.2.1. Comparison results

Following the previous works [10,14,21], we report the mean classification results and standard deviations over 20 random splits on the unlabeled training samples and the testing samples, denoted as "Unlabel" and "Test", respectively. Since MFA does not have enough training samples when p is set to 1, we omit the results for MFA in that setting. As shown in Tables 3 to 8, the best result in each column is marked in bold.

We can draw the following observations from these tables:

(1) In terms of mean classification results, our LPSGL method is better than the comparison methods in all cases. The advantage becomes more obvious especially when there are fewer labeled data. For example, the semi-supervised classification result obtained by LPSGL on the CNAE-9 dataset is about 18.2% higher than the second best result when p is set to 1. It demonstrates that the labels can be propagated more effectively via the learned structured graph, which further improves the feature projection performance. Another obvious phenomenon is that the mean classification results obtained from "Test" are lower than those from "Unlabel" in most cases.

(2) For all semi-supervised dimension reduction approaches, the mean classification results after dimension reduction increase as the number of given labeled samples increases. This phenomenon indicates that labeled samples are beneficial for projection matrix learning.
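For reference, the evaluation protocol of Section 5.1.3 reduces to projecting samples with the learned W and running a nearest-neighbor classifier. The following minimal sketch is our own, not the released evaluation script.

```python
import numpy as np

def nn_classify(W, X_train, y_train, X_eval):
    """Project with the learned W (Eq. (17)) and classify by 1-nearest neighbor,
    as in the protocol of Section 5.1.3. X_* are d x n matrices, y_train a label array."""
    Z_train = W.T @ X_train          # c-dimensional representations
    Z_eval = W.T @ X_eval
    d2 = (np.sum(Z_eval ** 2, axis=0)[:, None]
          + np.sum(Z_train ** 2, axis=0)[None, :]
          - 2.0 * Z_eval.T @ Z_train)
    return y_train[np.argmin(d2, axis=1)]
```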

Table 3
Mean classification results for Unlabel and Test on the ORL dataset.
p=1 p=2 p=3
Methods
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
Baseline 50.6±4.4 50.1±4.8 67.1±3.5 66.3±4.4 76.4±3.5 76.3±3.7
MFA [20] – – 65.0±4.6 64.5±4.8 71.9±6.6 71.3±6.5
SDA [7] 43.0±5.2 47.7±3.5 66.5±4.8 69.5±4.1 79.6±4.1 80.9±3.4
FME [10] 47.8±4.1 47.9±2.8 66.2±3.5 67.4±3.1 75.3±5.3 76.9±3.3
NNSG [21] 58.7±3.9 56.5±3.5 76.1±3.8 75.7±2.3 82.4±5.5 83.6±2.9
SPGO_KNN [14] 52.4±4.5 50.7±4.7 63.7±4.4 62.2±4.9 73.9±4.3 74.3±3.3
SPGO_SSC [14] 59.0±4.1 56.0±3.9 73.1±3.4 67.7±3.1 77.3±4.1 75.9±3.5
SPGO_LRR [14] 59.2±3.8 59.6±2.8 73.4±4.3 72.0±3.4 79.9±3.6 79.7±2.8
SPGO_ADP [14] 64.0±2.4 61.4±3.9 73.0±4.1 74.3±3.5 78.1±3.8 79.0±2.5
LPSGL 69.8±4.3 64.8±2.2 80.5±3.5 75.9±1.9 86.9±3.2 83.7±2.2

Table 4
Mean classification results for Unlabel and Test on the MSRA25 dataset.
p=1 p=2 p=3
Methods
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
Baseline 47.7±3.9 46.0±3.7 65.7±4.1 64.9±4.1 76.3±4.9 75.3±5.2
MFA [20] – – 73.4±5.2 73.3±5.7 82.5±3.7 83.3±3.4
SDA [7] 56.5±5.1 54.9±4.5 78.0±5.6 77.4±5.0 88.7±4.2 88.1±5.1
FME [10] 56.5±4.8 55.7±4.3 77.6±7.0 76.1±6.4 88.1±6.2 87.8±6.8
NNSG [21] 59.8±4.4 56.2±4.9 80.9±4.5 80.3±4.2 89.6±3.9 89.0±4.2
SPGO_KNN [14] 69.0±5.9 66.8±5.7 82.1±4.5 81.8±4.5 90.0±5.1 89.5±4.7
SPGO_SSC [14] 72.4±4.7 72.0±3.3 85.7±5.3 85.4±5.3 89.5±3.8 89.0±3.8
SPGO_LRR [14] 72.2±5.8 70.8±6.5 85.5±4.7 85.7±5.0 91.6±4.2 91.2±4.7
SPGO_ADP [14] 69.2±7.0 67.2±7.8 83.1±5.5 83.2±5.3 89.9±4.1 89.5±4.1
LPSGL 81.4±4.5 79.7±2.9 91.8±4.5 90.8±3.7 95.9±2.5 96.0±2.2

Table 5
Mean classification results for Unlabel and Test on the Coil20 dataset.
p=1 p=2 p=3
Methods
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
Baseline 62.2±2.9 61.3±2.6 72.5±3.0 72.3±3.2 77.9±2.1 77.3±2.1
MFA [20] – – 70.9±2.9 70.8±3.0 75.3±2.9 75.0±3.0
SDA [7] 65.2±2.6 64.0±2.7 74.4±2.5 73.6±2.6 79.2±1.7 77.9±1.6
FME [10] 67.4±3.0 66.1±2.9 77.0±2.5 76.4±2.5 80.8±2.2 80.0±1.9
NNSG [21] 76.2±2.5 75.0±2.1 80.0±2.6 79.2±2.0 83.6±1.7 80.4±2.2
SPGO_KNN [14] 77.0±4.8 75.9±4.8 79.1±2.9 78.8±2.7 84.0±2.8 81.1±2.6
SPGO_SSC [14] 71.8±4.1 71.1±3.6 75.2±2.0 75.2±2.8 80.0±3.5 79.3±2.7
SPGO_LRR [14] 65.9±3.6 65.1±3.6 70.9±2.9 69.8±2.8 79.3±3.6 78.6±3.4
SPGO_ADP [14] 79.7±3.0 76.5±3.3 81.3±2.6 80.4±2.2 85.6±2.2 82.0±1.3
LPSGL 81.4±2.3 80.7±1.7 84.4±1.7 84.2±1.9 87.3±1.7 86.2±1.5

Table 6
Mean classification results for Unlabel and Test on the USPS20 dataset.
p=1 p=2 p=3
Methods
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
Baseline 54.6±4.7 53.5±6.3 66.5±3.3 64.3±4.7 68.9±3.8 70.2±3.5
MFA [20] – – 69.2±4.3 69.0±4.8 71.3±2.9 69.4±5.3
SDA [7] 61.4±4.1 60.6±3.6 71.5±3.9 69.3±5.0 76.2±3.1 76.6±3.3
FME [10] 63.3±4.6 65.4±5.8 71.4±3.9 72.6±4.1 76.1±3.7 75.4±3.9
NNSG [21] 57.8±6.8 56.2±5.6 65.4±3.9 65.7±3.9 72.6±3.0 70.9±4.3
SPGO_KNN [14] 33.7±4.2 34.1±3.3 41.9±5.1 41.2±4.1 45.5±3.7 44.3±3.9
SPGO_SSC [14] 39.0±4.9 36.5±4.7 40.0±3.8 40.6±5.1 44.7±4.1 45.0±2.7
SPGO_LRR [14] 56.7±5.6 57.7±6.5 67.7±5.1 66.7±3.7 71.5±3.7 69.9±4.2
SPGO_ADP [14] 37.3±4.6 36.3±4.1 42.5±4.7 43.2±5.0 45.9±2.9 45.9±4.1
LPSGL 69.6±6.0 70.9±4.7 78.7±3.3 77.5±3.7 83.0±2.1 82.9±2.1

(3) The performance of most semi-supervised dimension reduction algorithms is better than MFA, which indicates that unlabeled data are useful for improving the feature projection performance.

(4) From the MSRA25, Coil20 and ORL datasets, we can clearly observe that the dimension reduction approaches are better than Baseline, which demonstrates the importance of dimension reduction for eliminating data noise and redundant information.

5.2.2. Running time and computation complexity comparison

The running time and computation complexity comparison between LPSGL and the comparison approaches is shown in Table 9, in which ι denotes the number of iterative steps required to calculate S. All the results are obtained on MATLAB R2016b with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz. For SPGO, we only report the lowest running time among the 4 variant methods. From Table 9, we can draw the following conclusions: (1) MFA, SDA, and FME, which separate graph learning and dimension reduction into two independent steps, are significantly

Table 7
Mean classification results for Unlabel and Test on the CNAE-9 dataset.
p=1 p=2 p=3
Methods
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
Baseline 48.1±1.7 48.4±4.8 61.8±4.5 62.1±3.7 65.6±4.3 65.7±3.7
MFA [20] – – 57.1±8.3 58.8±8.2 60.7±9.2 60.9±9.8
SDA [7] 50.5±8.1 51.1±6.2 59.2±8.7 58.2±7.0 67.1±7.0 66.3±4.2
FME [10] 51.8±5.9 53.5±8.2 68.0±5.5 64.5±4.5 72.4±3.9 71.6±4.3
NNSG [21] 30.7±7.5 28.3±4.2 46.3±8.1 43.2±5.9 56.9±7.1 54.2±3.6
SPGO_KNN [14] 38.0±4.8 34.1±5.3 47.0±6.6 39.9±4.1 51.6±4.2 44.7±3.7
SPGO_SSC [14] 31.1±5.9 28.9±5.7 41.4±6.6 38.3±4.0 45.9±5.3 38.5±5.6
SPGO_LRR [14] 45.0±9.2 41.3±6.5 54.0±3.4 56.9±5.4 56.8±3.9 54.1±6.3
SPGO_ADP [14] 36.7±4.7 33.6±5.4 46.9±5.2 41.8±3.8 52.7±4.6 44.3±4.3
LPSGL 70.0±3.8 60.8±4.9 77.0±4.0 69.5±4.4 80.0±3.6 73.6±2.9

Table 8
Mean classification results for Unlabel and Test on the Dig1–10 dataset.
p=1 p=2 p=3
Methods
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
Baseline 62.5±5.7 61.4±6.6 74.2±4.3 75.5±3.7 81.6±2.6 81.4±2.6
MFA [20] – – 71.7±3.1 71.9±4.7 74.6±3.2 73.9±2.7
SDA [7] 68.4±5.2 67.0±6.2 78.8±4.1 77.1±4.2 82.3±2.4 82.2±2.8
FME [10] 74.8±4.9 74.3±4.9 79.2±4.2 77.8±3.2 82.2±3.0 82.3±2.0
NNSG [21] 64.5±4.7 64.5±4.7 74.0±4.4 74.2±4.4 79.4±3.3 79.2±4.2
SPGO_KNN [14] 44.1±3.6 44.4±4.9 51.8±4.7 52.3±4.9 58.3±4.3 58.9±3.9
SPGO_SSC [14] 47.9±5.0 45.1±4.4 53.7±4.6 54.0±4.9 59.3±4.9 59.2±4.5
SPGO_LRR [14] 67.1±5.4 68.0±4.6 75.2±4.9 74.9±4.6 79.6±2.3 79.7±2.9
SPGO_ADP [14] 45.6±4.3 44.9±5.7 55.8±3.7 54.0±4.2 59.0±4.1 58.9±4.0
LPSGL 79.9±4.9 79.7±4.4 86.5±2.2 85.7±2.5 88.8±1.5 87.6±1.9

faster than the other methods. The reason is that they do not need to update the graph during the iterative optimization. (2) NNSG consumes the longest running time among all approaches. This is because it employs Local Coordinate Coding (LCC) when learning the graph. (3) Compared with the other methods, our proposed LPSGL can achieve competitive performance with shorter running time and lower computation complexity.

5.2.3. Effects of semi-supervised structured graph learning

(1) Effects of Semi-supervised Similarity Learning. In this subsection, we verify the effects of our proposed semi-supervised similarity learning part. Specifically, we compare our approach with three variant methods which adopt different graph construction methods, KNN, ϵ-neighbor and LRR [34], to construct the similarity matrix A. For the KNN graph, two samples x_i and x_j are connected if the distance between x_i and x_j is within the kth smallest distance from x_i to the other samples. The difference between the ϵ-neighbor graph and the KNN graph is that x_i selects all neighbors whose distance from it is less than a given value. The LRR graph refers to the graph built from the low-rank representation. In this experiment, the semi-supervised projection learning process is achieved by the following equation

\min_{S,F,W} \|S - A_c\|_F^2 + 2\lambda \, \text{Tr}(F^T L_s F) + \alpha \sum_{i=1}^{n} v_i \|W^T x_i - f_i\|_2^2 \quad \text{s.t. } S1 = 1, \; S \geq 0, \; F \in R^{n \times c}, \; F_l = Y_l    (24)

where A_c denotes the similarity matrix, which can be calculated by the above variant methods. The updating rules of S, F and W in Eq. (24) are similar to those of Eq. (11). According to the implementation details in Section 5.1.3, we only report the mean classification results and standard deviations over 20 random splits on the unlabeled training samples. This experiment is denoted as "Test_A" in Table 10.

(2) Effects of Structured Graph Learning. To verify the effects of the structured graph learning, we remove the first term in Eq. (11). Thus, Eq. (11) is reduced to

\min_{F,W} 2\lambda \, \text{Tr}(F^T L_s F) + \alpha \sum_{i=1}^{n} v_i \|W^T x_i - f_i\|_2^2 \quad \text{s.t. } F \in R^{n \times c}, \; F_l = Y_l    (25)

Specifically, we adopt KNN, ϵ-neighbor, LRR and our semi-supervised similarity matrix A to construct different graph Laplacian matrices L_s. The updating rules of F and W in Eq. (25) are similar to those of Eq. (11). This experiment is denoted as "Test_S" in Table 10.

From Table 10, we can observe that: (1) In "Test_A", LPSGL achieves superior classification performance over all the baselines, which illustrates that the semi-supervised similarity learning and structured graph learning can improve the label propagation process. Thus, more discriminative pseudo labels can be generated and the subsequent feature projection performance can be improved accordingly. (2) The mean classification results obtained from "Test_A" are higher than those from "Test_S" in most cases. For LPSGL, "Test_A" gains 3.9%, 4.3% and 3.4% improvements over "Test_S" on the Dig1–10 dataset for p = 1, 2, 3, respectively. The superior performance is attributed to the fact that the structured graph learned by LPSGL can better characterize the semantic relations of samples, and it can support the label propagation more effectively. (3) The results in "Test_A" obtained by LPSGL are better than those obtained by all baselines in "Test_A". These results demonstrate that our proposed semi-supervised similarity learning is effective in enhancing the graph structure and improving the feature projection performance.

5.2.4. Effects of pseudo label learning and importance scores

To avoid unreliable labels, we explicitly impose a value range constraint on the predicted labels of unlabeled samples during the iterative optimization process. In this subsection, we design a new variant method LPSGL-N for comparison. The objective function of this variant method and the updating rules of S and W are the same as those of LPSGL. For the update of F, we remove the constraint expressed by Eq. (15) and only use Eq. (14) to predict the label matrix F.

Table 9
Running time (recorded in seconds) and computation complexity comparison on six datasets.
Datasets
Methods Computation complexity
ORL MSRA25 Coil20 USPS20 CNAE-9 Dig1–10
MFA [20] 0.09 0.10 0.38 0.09 0.13 0.11 O (n3 )
SDA [7] 0.14 0.23 0.18 0.18 0.19 0.17 O (d3 )
FME [10] 0.20 0.41 0.35 0.44 0.31 0.42 O (2cn2 )
NNSG [21] 0.66 6.21 4.41 5.87 5.47 5.57 O ((n3 + max {n3 , n2 c } + ιn3 ) t +
(ndc) + (2d2 n + d3 + n2 d + n3 ))
SPGO [14] 0.40 1.57 0.91 1.27 0.99 1.39 O (max {n3 , d3 , clu} t)
LPSGL 0.36 1.42 1.09 1.19 0.89 0.82 O (max {n2 , u3 } t)

Table 10
Effects of the semi-supervised structured graph learning on three datasets.
KNN ϵ neighbor LRR [34] LPSGL
Datasets (p)
Test_S (%) Test_A (%) Test_S (%) Test_A (%) Test_S (%) Test_A (%) Test_S (%) Test_A (%)
ORL (1) 54.4±3.9 68.1±3.6 59.6±4.0 62.4±3.0 47.3±3.3 55.3±4.1 65.0±3.2 69.8±4.3
ORL (2) 68.9±2.8 79.4±3.2 72.9±4.6 73.8±3.1 66.0±4.4 69.2±3.3 78.9±3.4 80.5±3.5
ORL (3) 79.9±4.0 84.3±3.6 82.4±3.6 82.6±4.5 78.3±4.7 77.6±4.5 85.4±3.8 86.9±3.2
Coil20 (1) 71.8±2.5 78.6±2.8 69.6±2.5 51.1±2.4 57.3±2.3 60.1±2.7 78.2±1.7 81.4±2.3
Coil20 (2) 76.8±2.2 82.9±2.6 77.0±2.5 80.1±2.0 63.2±1.9 65.6±2.3 82.6±2.2 84.4±1.7
Coil20 (3) 79.9±2.6 85.7±1.4 81.9±2.1 82.6±1.4 68.7±2.8 72.2±3.5 85.6±1.5 87.3±1.7
Dig1–10 (1) 72.9±4.7 77.5±4.1 73.3±4.5 76.2±4.3 50.3±3.2 55.2±4.5 76.0±3.5 79.9±4.9
Dig1–10 (2) 81.4±3.4 84.8±2.9 82.9±2.5 82.2±3.1 61.3±2.7 66.6±4.2 82.2±2.8 86.5±2.2
Dig1–10 (3) 83.9±1.9 87.5±1.7 85.1±2.5 86.1±1.8 68.3±1.9 74.8±4.1 85.4±1.9 88.8±1.5

Besides, we design another variant method LPSGL-v to demonstrate the effects of the importance scores. In the process of learning the feature projection matrix, this variant method does not consider the different importance of given and learned labeled samples. Thus, the objective function of LPSGL-v is as follows

\min_{S,F,W} \|S - A\|_F^2 + 2\lambda \, \text{Tr}(F^T L_s F) + \alpha \sum_{i=1}^{n} \|W^T x_i - f_i\|_2^2 \quad \text{s.t. } S1 = 1, \; S \geq 0, \; F \in R^{n \times c}, \; F_l = Y_l    (26)

It can be obtained by fixing the importance score matrix V in LPSGL as the identity matrix I ∈ R^{n×n}. The updating rules of S, W and F are similar to those of LPSGL.

We conduct the experiments on three datasets. The mean classification results are shown in Table 11. From this table, we can find that: (1) LPSGL is superior to LPSGL-N in all cases. The reason for the superior performance is that the constraint can avoid unreliable labels and further improve the feature projection performance. (2) LPSGL significantly outperforms LPSGL-v, which demonstrates that the importance scores can differentiate the effects of given and learned labels on learning the projection matrix.

5.2.5. Performance variations with the number of labeled samples

In this subsection, we present the performance variations with the number of labeled samples. Specifically, we gradually increase the value of p to reflect the importance of labeled samples for projection matrix learning. Then we record the mean classification accuracy and deviation over 20 random splits on the unlabeled training samples and the testing samples in Fig. 1. From this figure, we can find that the mean classification accuracy after dimension reduction increases as the number of labeled samples increases in both the "Unlabel" and "Test" cases. These results demonstrate that more labeled samples are beneficial for improving the semi-supervised dimension reduction performance.

5.2.6. Parameter sensitivity

In this subsection, we discuss the sensitivity of the three parameters α, λ and v on three datasets. For each dataset, we randomly select three samples per class as the labeled samples. Then we record the variations of classification accuracy with different parameters for LPSGL in Fig. 2. The sub-figures (a), (b), (d), (e), (g) and (h) show the classification accuracy variations with different α and λ on unlabeled samples and test samples when v is fixed, respectively. The last column shows the sensitivity of v on different datasets when α and λ are fixed.

According to Fig. 2, we can observe that the classification accuracy variations are different on the various datasets. The performance of LPSGL on the ORL dataset is relatively stable over a wide range of parameter variations (α, λ). However, the best results for unlabeled samples and test samples on the Coil20 and Dig1–10 datasets are achieved with small α and λ. For the importance score v, both semi-supervised classification and out-of-sample classification achieve superior performance when v is set to 10. These results show that the terms in Eq. (11) corresponding to α and λ are of equal importance for learning the feature projection matrix. In addition, since the classification performance after dimension reduction can indicate the discriminative capability of the W learned from Eq. (11), it is necessary to appropriately increase the importance scores of the given labeled samples.

5.2.7. Convergence results

We have theoretically analyzed the convergence of Algorithm 1 in Section 4.1, and here we further investigate its convergence in practice. The convergence curves that record the variations of the objective function value of Eq. (11) with respect to the number of iterations are presented in Fig. 3. We can find that the objective function value drops sharply at first and does not change significantly after several iterations (fewer than 6). These experimental results demonstrate that our updating rule is effective.

6. Conclusion

In this paper, we present an effective Label Propagation with Structured Graph Learning (LPSGL) method for semi-supervised dimension reduction. Our model is developed to simultaneously perform label propagation, semi-supervised structured graph learning and dimension reduction in a unified framework. The structured graph is adaptively learned in a semi-supervised learning paradigm to better reveal the semantic relations of samples,

Table 11
Effects of pseudo label learning and importance scores for semi-supervised dimension reduction on three datasets.
LPSGL-N LPSGL-v LPSGL
Datasets (p)
Unlabel (%) Test (%) Unlabel (%) Test (%) Unlabel (%) Test (%)
ORL (1) 67.7±4.0 63.5±2.7 64.4±3.3 62.2±3.4 69.8±4.3 64.8±2.2
ORL (2) 79.7±3.7 74.6±4.1 77.8±3.8 74.8±2.8 80.5±3.5 75.9±1.9
ORL (3) 86.0±3.7 81.6±2.6 85.5±3.0 81.9±3.0 86.9±3.2 83.7±2.2
Coil20 (1) 80.2±2.1 79.5±1.9 80.0±2.6 79.4±2.5 81.4±2.3 80.7±1.7
Coil20 (2) 84.0±1.8 83.4±1.3 83.4±1.8 82.4±1.5 84.4±1.7 83.7±1.1
Coil20 (3) 86.5±1.8 85.0±1.5 86.2±2.1 85.1±1.1 87.3±1.7 85.8±2.0
Dig1–10 (1) 78.8±5.0 78.7±4.5 75.9±4.1 74.9±6.5 79.2±5.1 79.7±4.4
Dig1–10 (2) 86.0±2.5 83.8±2.7 83.0±3.5 82.4±3.0 86.5±2.2 85.7±2.5
Dig1–10 (3) 87.5±2.5 86.3±1.8 88.1±1.7 85.8±2.4 88.8±1.5 87.4±1.7

Fig. 1. Performance variations with the number of labeled samples on three datasets.

Fig. 2. Classification accuracy variation with different values of parameters α , λ and v on three datasets.


Fig. 3. Objective function value variations with the number of iterations on three datasets.

so as to enhance the label propagation process. Besides, different importance scores are assigned to the given and learned labeled samples to differentiate their effects on learning the feature projection matrix. The joint optimization framework can generate more accurate labels and finally improve the discriminative capability of the feature projection matrix. Experimental results on several widely tested datasets demonstrate the superiority of the proposed method from various aspects.

CRediT authorship contribution statement

Fei Wang: Data curation, Writing - original draft, Software, Validation. Lei Zhu: Data curation, Writing - original draft, Software, Validation. Liang Xie: Investigation, Conceptualization, Methodology, Review & editing, Supervision, Funding acquisition. Zheng Zhang: Reviewing and editing. Mingyang Zhong: Reviewing and editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive and helpful suggestions. This work was supported in part by the National Natural Science Foundation of China under Grant 61802236 and Grant U1836216, in part by the Natural Science Foundation of Shandong, China, under Grant ZR2020YQ47 and Grant ZR2019QF002, in part by the Major Fundamental Research Project of Shandong, China, under Grant ZR2019ZD03, in part by the Youth Innovation Project of Shandong Universities, China, under Grant 2019KJN040, and in part by the Taishan Scholar Project of Shandong, China, under Grant ts20190924.

References

[1] D. Shi, L. Zhu, Z. Cheng, Z. Li, H. Zhang, Unsupervised multi-view feature extraction with dynamic graph learning, J. Vis. Commun. Image Represent. 56 (2018) 256–264.
[2] H. Guo, H. Zou, J. Tan, Semi-supervised dimensionality reduction via sparse locality preserving projection, Appl. Intell. 50 (4) (2020) 1222–1232.
[3] Y. Liu, R. Zhang, F. Nie, X. Li, C. Ding, Supervised dimensionality reduction methods via recursive regression, IEEE Trans. Neural Netw. Learn. Syst. 31 (9) (2020) 3269–3279.
[4] X. Zhao, J. Guo, F. Nie, L. Chen, Z. Li, H. Zhang, Joint principal component and discriminant analysis for dimensionality reduction, IEEE Trans. Neural Netw. Learn. Syst. 31 (2) (2019) 433–444.
[5] Z. Fan, Y. Xu, D. Zhang, Local linear discriminant analysis framework using sample neighbors, IEEE Trans. Neural Netw. 22 (7) (2011) 1119–1132.
[6] I.T. Jolliffe, Principal components in regression analysis, Princ. Compon. Anal. 87 (100) (1986) 129–155.
[7] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proceedings of the International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[8] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2004, pp. 321–328.
[9] X. Zhu, Z. Ghahramani, J.D. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the International Conference on Machine Learning (ICML), 2003, pp. 912–919.
[10] F. Nie, D. Xu, I.W. Tsang, C. Zhang, Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. 19 (7) (2010) 1921–1932.
[11] J. Ma, B. Xiao, C. Deng, Graph based semi-supervised classification with probabilistic nearest neighbors, Pattern Recognit. Lett. 133 (2020) 94–101.
[12] Z. Liu, Z. Lai, W. Ou, K. Zhang, R. Zheng, Structured optimal graph based sparse feature extraction for semi-supervised learning, Signal Process. 170 (2020) 107456.
[13] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (11) (2006) 2399–2434.
[14] F. Nie, X. Dong, X. Li, Unsupervised and semisupervised projection with graph optimization, IEEE Trans. Neural Netw. Learn. Syst. (2020) 1–13, http://dx.doi.org/10.1109/TNNLS.2020.2984958.
[15] Y. Han, L. Zhu, Z. Cheng, J. Li, X. Liu, Discrete optimal graph clustering, IEEE Trans. Cybern. 50 (4) (2018) 1697–1710.
[16] X. Dong, L. Zhu, X. Song, J. Li, Z. Cheng, Adaptive collaborative similarity learning for unsupervised multi-view feature selection, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 2064–2070.
[17] X. Bai, L. Zhu, C. Liang, J. Li, X. Nie, X. Chang, Multi-view feature selection via nonnegative structured graph learning, Neurocomputing 387 (2020) 110–122.
[18] V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi, Linear manifold regularization for large scale semi-supervised learning, in: Proceedings of the International Conference on Machine Learning (ICML), 2005.
[19] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Graph embedding: A general framework for dimensionality reduction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 830–837.
[20] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: A general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 40–51.
[21] X. Fang, Y. Xu, X. Li, Z. Lai, W.K. Wong, Learning a nonnegative sparse graph for linear regression, IEEE Trans. Image Process. 24 (9) (2015) 2760–2771.
[22] Y. Wang, Y. Meng, Y. Li, S. Chen, Z. Fu, H. Xue, Semi-supervised manifold regularization with adaptive graph construction, Pattern Recognit. Lett. 98 (15) (2017) 90–95.
[23] X. Fang, Y. Xu, X. Li, Z. Lai, W.K. Wong, Robust semi-supervised subspace clustering via non-negative low-rank representation, IEEE Trans. Cybern. 46 (8) (2016) 1828–1838.
[24] I.K. Fodor, A Survey of Dimension Reduction Techniques, Technical Report, Lawrence Livermore National Lab., CA, US, 2002.
[25] Y. Pang, B. Zhou, F. Nie, Simultaneously learning neighborship and projection matrix for supervised dimensionality reduction, IEEE Trans. Neural Netw. Learn. Syst. 30 (9) (2019) 2779–2793.
[26] W. Wang, Y. Yan, F. Nie, S. Yan, N. Sebe, Flexible manifold learning with optimal graph for image and video representation, IEEE Trans. Image Process. 27 (6) (2018) 2664–2675.
[27] K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2009, pp. 2223–2231.
[28] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Adv. Neural Inf. Process. Syst. 2 (2002) 849–856.
[29] B. Mohar, Y. Alavi, G. Chartrand, O. Oellermann, The Laplacian spectrum of graphs, Graph Theory Comb. Appl. 2 (12) (1991) 871–898.
[30] J. Huang, F. Nie, H. Huang, A new simplex sparse learning model to measure data similarity for clustering, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 3569–3575.
[31] M.J. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Trans. Pattern Anal. Mach. Intell. 21 (12) (1999) 1357–1362.
[32] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, Columbia University, 1996.
[33] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 550–554.
[34] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 663–670.
You might also like