Advanced Information and Knowledge Processing
Zhengming Ding
Handong Zhao
Yun Fu
Learning Representation for Multi-View Data Analysis
Models and Applications
Advanced Information and Knowledge Processing
Series editors
Lakhmi C. Jain
Bournemouth University, Poole, UK, and
University of South Australia, Adelaide, Australia
Xindong Wu
University of Vermont
Information systems and intelligent knowledge processing are playing an increasing
role in business, science and technology. Recently, advanced information systems
have evolved to facilitate the co-evolution of human and information networks
within communities. These advanced information systems use various paradigms
including artificial intelligence, knowledge management, and neural science as well
as conventional information processing paradigms. The aim of this series is to
publish books on new designs and applications of advanced information and
knowledge processing paradigms in areas including but not limited to aviation,
business, security, education, engineering, health, management, and science. Books
in the series should have a strong focus on information processing—preferably
combined with, or extended by, new results from adjacent sciences. Proposals for
research monographs, reference books, coherently integrated multi-author edited
books, and handbooks will be considered for the series and each proposal will be
reviewed by the Series Editors, with additional reviews from the editorial board and
independent reviewers where appropriate. Titles published within the Advanced
Information and Knowledge Processing series are included in Thomson Reuters’
Book Citation Index.
Zhengming Ding
Indiana University-Purdue University Indianapolis
Indianapolis, IN, USA
Yun Fu
Northeastern University
Boston, MA, USA
Handong Zhao
Adobe Research
San Jose, CA, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
transfer learning problem when all the sources are incomplete. Chapter 9 proposes
three deep domain adaptation models to address the challenge where target data has
limited or no labels. Following this, Chap. 10 provides a deep domain generalization
model for the case where the target domain is not available in the training
stage and only multiple related sources are at hand.
This book is intended for audiences with a background in computer science,
information systems, data science, statistics, or mathematics. Readers from broader
fields of science and engineering may also find it useful, since the topic has
potential applications in many disciplines.
We would like to thank our collaborators Ming Shao, Hongfu Liu, and Shuyang
Wang. We would also like to thank editor Helen Desmond from Springer for the
help and support.
Contents
1 Introduction
1.1 What Are Multi-view Data and Problem?
1.2 A Unified Perspective
1.3 Organization of the Book
Multi-view data generated from various viewpoints or multiple sensors are commonly
seen in real-world applications. For example, the popular commercial depth
sensor Kinect uses both visible light and near-infrared sensors for depth estimation;
autopilot systems use both visual and radar sensors to produce real-time 3D information
on the road; and face analysis algorithms prefer face images from different views for
high-fidelity reconstruction and recognition. However, such data pose an enormous
challenge: the large divergence across views prevents a fair comparison between them.
Generally, different views tend to be treated as different domains drawn from
different distributions. Thus, there is an urgent need to mitigate the view divergence
when facing specific problems, by either fusing the knowledge across multiple views or
adapting knowledge from some views to others. Since different terms are used for
“multi-view” data analysis and its aliases, we first give a formal definition and narrow
down our research focus to differentiate it from related works along other lines.
Fig. 1.1 Different scenarios of multi-view data analytics. a Different types of features from a single
image; b different sources to represent information; c–e images from different viewpoints
First, multi-view learning aims to merge the knowledge from different views
to either uncover common knowledge or employ the complementary knowledge in
specific views to assist learning tasks. For example, in vision, multiple features
extracted from the same object by various visual descriptors, e.g., LBP, SIFT, and
HOG, are very discriminative in recognition tasks. Another example is multi-modal data
captured, represented, and stored in varied formats, e.g., near-infrared and visible
face images, or image and text. For multi-view learning, the goal is to fuse the knowledge
from multiple views to facilitate common learning tasks, e.g., clustering and
classification. The key challenge is exploring data correspondence across multiple views.
The mappings among different views are able to couple view-specific knowledge,
while additional labels help formulate supervised regularizers. The general
setting of multi-view clustering is to group n data samples in v different views (e.g.,
v types of features, sensors, or modalities) by fusing the knowledge across different
views to seek a consistent clustering result. The general setting of multi-view
classification is to build a model from v given views of training data. In
the test stage, there are two different scenarios. In the first, one view is used
to recognize other views with the learned model; in this case, the label information
across training and test data is different. In the second, which is specific to
multi-feature-based learning, the v-view training data are used to learn a model by
fusing the cross-view knowledge, and they also serve as gallery data to recognize
v-view probe data.
Second, domain adaptation attempts to transfer knowledge from labeled source
domains to ease the learning burden in target domains with sparsely labeled or
unlabeled samples. For example, in surveillance, faces are captured by a long-wave
infrared sensor at night-time, but the recognition model is trained on regular face images
collected under visible light. Conventional domain adaptation methods seek a
domain-invariant representation for the data or modify classifiers to counter
the marginal or conditional distribution mismatch across source and target domains.
The goal of domain adaptation is to transfer knowledge from well-labeled sources
to unlabeled targets, which accounts for the more general setting in which some source
views are labeled while target views are unlabeled. In the general setting of domain
adaptation, we build a model on both labeled source data and unlabeled target
data. Then we use the model to predict the unlabeled target data, which are either the
same data used in the training stage or different data. These two cases correspond to
transductive domain adaptation and inductive domain adaptation, respectively.
There are different strategies to deal with multi-view data, e.g., translation, fusion,
alignment, co-learning, and representation learning. This book focuses on representation
learning and fusion. The following chapters discuss multi-view data analytic
algorithms, along with our proposed unified model (Sect. 1.2), from three aspects.
Furthermore, we discuss the challenging situation where test data are sampled
from unknown categories, e.g., zero-shot learning, and more challenging tasks with
incomplete data, e.g., missing modality transfer learning, incomplete multi-source
adaptation, and domain generalization.
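$$
\min_{f_1(\cdot), \ldots, f_v(\cdot)} \; \sum_{i=1,\, i<j}^{v} \mathcal{A}\big(f_i(X_i), f_j(X_j)\big) + \lambda \sum_{k=1}^{v} \mathcal{R}\big(f_k(X_k)\big),
$$
where $f_i(\cdot)$ is a feature learning function for view $i$, which can be a linear mapping, a
non-linear mapping, or a deep network.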
The first common term $\mathcal{A}(\cdot)$ is a pairwise symmetric alignment function across
multiple views that either fuses the knowledge among multiple views or transfers
knowledge across different views. Because of their different problem settings, multi-view
learning and domain adaptation explore various strategies to define the loss functions.
While multi-view learning employs data correspondence (i.e., sample-wise relationships
with or without labels) to seek a common representation, domain adaptation employs
domain- or class-wise relationships during model learning to obtain discriminative
domain-invariant features.
The second common term $\mathcal{R}(\cdot)$ is the feature learning regularizer, which
incorporates the label information, the intrinsic structure of the data, or both during
the mapping learning. To name a few, logistic regression, Softmax regression, and graph
regularizers are usually incorporated to carry the label and manifold information.
In deep learning, this term is mostly Softmax regression. Some multi-view learning
algorithms merge the feature learning regularizer into the alignment term. Generally,
the formulation of the second term is very similar between multi-view learning and
domain adaptation within our research focus.
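As a rough illustration of how the two terms of the unified objective combine, the following sketch evaluates the objective for linear feature maps. The specific choices here are assumptions made only for the example and are not taken from the book: the alignment term is a squared Frobenius distance between view embeddings, the regularizer is a placeholder squared norm on each embedding, and the names `alignment`, `regularizer`, and `unified_objective` are hypothetical.

```python
import numpy as np

def alignment(Fi, Fj):
    # Hypothetical choice for A(., .): squared Frobenius distance
    # between the embeddings of two views (symmetric in its arguments).
    return float(np.sum((Fi - Fj) ** 2))

def regularizer(F):
    # Hypothetical placeholder for R(.): squared Frobenius norm of the
    # embedding. A label-based or graph-based term could be used instead.
    return float(np.sum(F ** 2))

def unified_objective(views, maps, lam=0.1):
    # Assumed linear feature maps f_k(X_k) = W_k X_k.
    feats = [W @ X for W, X in zip(maps, views)]
    v = len(feats)
    # Pairwise alignment over all view pairs with i < j.
    align = sum(alignment(feats[i], feats[j])
                for i in range(v) for j in range(i + 1, v))
    # View-wise regularization, weighted by lambda.
    reg = sum(regularizer(F) for F in feats)
    return align + lam * reg
```

With identical views and identical maps the alignment term vanishes, so only the weighted regularizer remains, which is a quick sanity check on the structure of the sum.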
Following the unified model, we will cover both shallow structure learning and deep
learning approaches for multi-view data analysis, e.g., subspace learning, matrix
factorization, low-rank modeling, deep auto-encoders, deep neural networks, and deep
convolutional neural networks. For example, multi-view clustering models will be
explored, including multi-view matrix factorization, multi-view subspace learning,
and multi-view deep structure learning in the unsupervised setting.
The rest of this book is organized as follows. The first two parts are for multi-view
data analysis with sample-wise correspondence; and the third part is for multi-view
data analysis with class-wise correspondence.
Part I focuses on developing unsupervised multi-view clustering (MVC) models.
It consists of the following three chapters. Chapter 2 explores complementary infor-
mation across views to benefit the clustering problem and presents a deep matrix
factorization framework for MVC, where semi-nonnegative matrix factorization is
adopted to learn the hierarchical semantics of multi-view data in a layer-wise fashion.
To maximize the mutual information from each view, we enforce the non-negative
representation of each view in the final layer to be the same. Furthermore, to respect
the intrinsic geometric structure in each view data, graph regularizers are introduced
to couple the output representation of deep structures.
Chapter 3 considers an underlying problem hidden behind the emerging multi-view
techniques: what if the data of one or more views fail? We propose an unsupervised
method that handles incomplete multi-view data by transforming
the original and incomplete data to a new and complete representation in a latent
space. Different from the existing efforts that simply project data from each view
into a common subspace, a novel graph Laplacian term with a good probabilistic
interpretation is proposed to couple the incomplete multi-view samples. In this
way, a compact global structure over the entire heterogeneous data is well preserved,
leading to strong grouping discriminability.
Chapter 4 presents a multi-view outlier detection algorithm based on clustering
techniques to identify two different types of data outliers with abnormal behaviors.
We first give the definition of both types of outliers in the multi-view setting. Then we
propose a multi-view outlier detection method with a novel consensus regularizer
on the latent representations. Specifically, we explicitly characterize each kind of
outlier by the intrinsic cluster assignment labels and sample-specific errors. We
experimentally show that this practice generalizes well when the number of views is
greater than two. Last but not least, we thoroughly discuss the connection
and difference between the proposed consensus-regularization and the state-of-the-art
pairwise-regularization.
Abstract Multi-view Clustering (MVC) has garnered more attention recently since
many real-world data comprise different representations or views. The key
is to explore complementary information to benefit the clustering problem. In this
chapter, we consider the conventional complete-view scenario. Specifically, in the
first section, we present a deep matrix factorization framework for MVC, where
semi-nonnegative matrix factorization is adopted to learn the hierarchical semantics
of multi-view data in a layer-wise fashion. In the second section, we extend this
setting and consider differently sampled feature sets as multi-view data. We propose
a novel graph-based method, Ensemble Subspace Segmentation under Block-wise
constraints (ESSB), which is jointly formulated in the ensemble learning framework.
2.1.1 Overview
1 This chapter is reprinted with permission from AAAI. “Multi-view Clustering via Deep Matrix
Factorization”. 31st AAAI Conference on Artificial Intelligence, pp. 2921–2927, 2017.
© Springer Nature Switzerland AG 2019
Z. Ding et al., Learning Representation for Multi-View Data Analysis,
Advanced Information and Knowledge Processing,
https://doi.org/10.1007/978-3-030-00734-8_2
10 2 Multi-view Clustering with Complete Information
in developing effective MVC methods (Cai et al. 2013a; Gao et al. 2015; Xu et al.
2016; Zhao et al. 2016). Along this line, Kumar et al. developed co-regularized
multi-view spectral clustering, which clusters different views simultaneously under a
co-regularization constraint (Kumar et al. 2011). Gao et al. proposed to perform
clustering on the subspace representation of each view simultaneously, guided by a
common cluster structure for consistency across different views (Gao et al. 2015).
A good survey can be found in Xu et al. (2013).
Recently, many studies on MVC have achieved promising performance
based on Non-negative Matrix Factorization (NMF) and its variants, because the
non-negativity constraints allow for better interpretability (Guan et al. 2012; Trigeorgis
et al. 2014). The general idea is to seek a common latent factor through non-negative
matrix factorization among multi-view data (Liu et al. 2013; Zhang et al. 2014, 2015).
Semi Non-negative Matrix Factorization (Semi-NMF), one of the most popular
variants of NMF, extends NMF by relaxing the factorized basis matrix
to take real values. This relaxation gives Semi-NMF wider applicability in the
real world than NMF. Apart from exploring Semi-NMF in the MVC application for the
first time, our method has another distinction from the existing NMF-based MVC
methods: we adopt a deep structure to conduct Semi-NMF hierarchically, as shown in
Fig. 2.1. As illustrated, through the deep Semi-NMF structure, we push data samples
from the same class closer layer by layer. This design borrows its idea from deep
learning (Bengio 2009). Note that the proposed method
is different from the existing deep auto-encoder based MVC approaches (Andrew
et al. 2013; Wang et al. 2015), although all of them have a deep structure. One major
difference is that the methods of Andrew et al. (2013) and Wang et al. (2015) are based
on Canonical Correlation Analysis (CCA), which is limited to the two-view case, while
our method has no such limitation.
Fig. 2.1 Framework of our proposed method. Same shape denotes the same class. For demonstra-
tion purposes, we only show the two-view case, where two deep matrix factorization structures
are proposed to capture rich information behind each view in a layer-wise fashion. With the deep
structure, samples from the same class but different views gather close to each other to generate
more discriminative representation
2.1 Deep Multi-view Clustering 11
To sum up, in this section we propose a deep MVC algorithm through graph
regularized semi-nonnegative matrix factorization. The key is to build a deep structure
through semi-nonnegative matrix factorization to seek a common feature
representation with more consistent knowledge to facilitate clustering. To the best of our
knowledge, this is the first attempt to apply semi-nonnegative matrix factorization
to MVC in a deep structure. We summarize our major contributions as follows:
• A deep Semi-NMF structure is built to capture the hidden information by leveraging
the strong interpretability of Semi-NMF and the effective feature learning
of deep structures. Through this deep matrix factorization structure, we discard
unimportant factors layer by layer and generate an effective consensus
representation in the final layer for MVC.
• To respect the intrinsic geometric relationship among data samples, we introduce
graph regularizers to guide the shared representation learning in each view. This
practice makes the consensus representation in the final layer preserve most of the
structures shared across multiple graphs. It can be considered as a fusion scheme to
boost the final MVC performance.
where $X \in \mathbb{R}^{d \times n}$ denotes the input data with $n$ samples, each of which is a
$d$-dimensional feature vector. In the discussion of the equivalence of Semi-NMF and
K-means clustering (Ding et al. 2010), $Z \in \mathbb{R}^{d \times K}$ can be considered as the cluster
centroid matrix,² and $H \in \mathbb{R}^{K \times n}$, $H \ge 0$, is the “soft” cluster assignment matrix in
the latent space.³ Similar to traditional NMF, the compact representation $H$ uncovers
the hidden semantics by simulating the part-based representation in the human brain,
i.e., it has a psychological and physiological interpretation.
2 For a neat presentation, we do not follow the notation style in Ding et al. (2010), and remove the
mix-sign notation “±” on $X$ and $Z$, which does not affect the rigor.
3 In some of the literature (Ding et al. 2010; Zhao et al. 2015), Semi-NMF is also called the soft version
of K-means clustering.
In reality, however, natural data may contain different modalities (or factors), e.g.,
expression, illumination, and pose in face datasets (Samaria and Harter 1994;
Georghiades et al. 2001). A single NMF is not strong enough to eliminate the effect of
those undesirable factors and extract the intrinsic class information. To solve this,
Trigeorgis et al. (2014) showed that a deep model based on Semi-NMF yields promising
results in data representation. The multi-layer decomposition process can be
expressed as
$$
\begin{aligned}
X &\approx Z_1 H_1^{+} \\
X &\approx Z_1 Z_2 H_2^{+} \\
&\;\;\vdots \\
X &\approx Z_1 \cdots Z_m H_m^{+}
\end{aligned}
\tag{2.2}
$$
where $Z_i$ denotes the $i$th layer basis matrix and $H_i^{+}$ is the $i$th layer representation matrix.
Trigeorgis et al. (2014) showed that each hidden representation layer is able to
identify different attributes. Inspired by this work, we propose an MVC method
based on the deep matrix factorization technique.
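The layer-wise decomposition of Eq. (2.2) can be sketched for a single view as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each layer is fitted with the Semi-NMF alternating updates commonly attributed to Ding et al. (2010), which this section does not spell out, and the function names `semi_nmf` and `deep_semi_nmf`, the random initialization, and the small constant `eps` are hypothetical choices for the example.

```python
import numpy as np

def pos(M):
    # Keeps positive entries of M, zeros elsewhere.
    return (np.abs(M) + M) / 2.0

def neg(M):
    # Magnitudes of the negative entries of M, zeros elsewhere.
    return (np.abs(M) - M) / 2.0

def semi_nmf(X, k, n_iter=100, seed=0, eps=1e-9):
    # Factorizes X (d x n) as Z (d x k) times H (k x n) with H >= 0.
    rng = np.random.default_rng(seed)
    H = rng.random((k, X.shape[1])) + 0.1      # strictly positive start
    for _ in range(n_iter):
        Z = X @ np.linalg.pinv(H)              # least-squares basis for fixed H
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        # Multiplicative update keeps H non-negative (assumed standard rule).
        H = H * np.sqrt((pos(ZtX) + neg(ZtZ) @ H + eps)
                        / (neg(ZtX) + pos(ZtZ) @ H + eps))
    Z = X @ np.linalg.pinv(H)                  # basis matching the final H
    return Z, H

def deep_semi_nmf(X, layer_sizes, **kwargs):
    # Layer by layer: X ~ Z1 H1, then H1 ~ Z2 H2, and so on.
    Zs, H = [], X
    for p in layer_sizes:
        Z, H = semi_nmf(H, p, **kwargs)
        Zs.append(Z)
    return Zs, H
```

For example, `deep_semi_nmf(X, [100, 50])` would return two basis matrices and a final 50-row non-negative representation; the layer sizes here are arbitrary.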
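In the MVC setting, let us denote $X = \{X^{(1)}, \ldots, X^{(v)}, \ldots, X^{(V)}\}$ as the data
sample set. $V$ represents the number of views. $X^{(v)} \in \mathbb{R}^{d_v \times n}$, where $d_v$ denotes the
dimensionality of the $v$th view data and $n$ is the number of data samples. Then we
formulate our model as:
$$
\begin{aligned}
\min_{Z_i^{(v)},\, H_i^{(v)},\, H_m,\, \alpha^{(v)}} \;& \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} \Big( \big\| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \big\|_F^2 + \beta\, \mathrm{tr}\big(H_m L^{(v)} H_m^T\big) \Big) \\
\text{s.t.} \;& H_i^{(v)} \ge 0, \quad H_m \ge 0, \quad \sum_{v=1}^{V} \alpha^{(v)} = 1, \quad \alpha^{(v)} \ge 0,
\end{aligned}
\tag{2.3}
$$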
where $X^{(v)}$ is the given data for the $v$th view, $Z_i^{(v)}$, $i \in \{1, 2, \ldots, m\}$, is the $i$th layer
mapping for view $v$, and $m$ is the number of layers. $H_m$ is the consensus latent representation
for all views. $\alpha^{(v)}$ is the weighting coefficient for the $v$th view, and $\gamma$ is the parameter that
controls the weight distribution. $L^{(v)}$ is the graph Laplacian of the graph for view $v$,
where each graph is constructed in a k-nearest neighbor (k-NN) fashion. The weight
matrix of the graph for view $v$ is $A^{(v)}$, and $L^{(v)} = D^{(v)} - A^{(v)}$, where $D_{ii}^{(v)} = \sum_j A_{ij}^{(v)}$
(He and Niyogi 2003; Ding and Fu 2016).
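The graph construction described above can be sketched as follows for one view. Several details are assumptions made for the example, since the text does not fix them: binary edge weights, symmetrization by taking the union of neighbor relations, Euclidean distance, and the standard convention in which the Laplacian is the degree matrix minus the weight matrix (which is positive semidefinite). The function name `knn_graph_laplacian` is hypothetical.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    # X is d x n (one column per sample), matching the notation in the text.
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Pairwise squared Euclidean distances between columns.
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist, np.inf)             # exclude self-neighbors
    idx = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbors per sample
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0   # assumed binary weights
    A = np.maximum(A, A.T)                     # symmetrize (assumed: union rule)
    D = np.diag(A.sum(axis=1))                 # degree matrix, D_ii = sum_j A_ij
    L = D - A                                  # standard Laplacian convention
    return A, D, L
```

Heat-kernel weights or mutual k-NN symmetrization would be equally plausible choices; only the row sums and the resulting Laplacian change.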
Remark 1 Due to the homology of multi-view data, the final layer representations
$H_m^{(v)}$ for the different views should be close to each other. Here, we use the consensus
$H_m$ as a constraint to enforce multi-view data to share the same representation after
multi-layer factorization.
2.1.2.2 Optimization
To expedite the approximation of the variables in the proposed model, each of the
layers is pre-trained to obtain an initial approximation of the variables $Z_i^{(v)}$ and $H_i^{(v)}$
for the $i$th layer in the $v$th view. The effectiveness of pre-training has been demonstrated
by Hinton and Salakhutdinov (2006) on deep autoencoder networks. Similar
to Trigeorgis et al. (2014), we decompose the input data matrix $X^{(v)} \approx Z_1^{(v)} H_1^{(v)}$
to perform the pre-training, where $Z_1^{(v)} \in \mathbb{R}^{d_v \times p_1}$ and $H_1^{(v)} \in \mathbb{R}^{p_1 \times n}$. Then the $v$th
view feature matrix $H_1^{(v)}$ is decomposed as $H_1^{(v)} \approx Z_2^{(v)} H_2^{(v)}$, where $Z_2^{(v)} \in \mathbb{R}^{p_1 \times p_2}$
and $H_2^{(v)} \in \mathbb{R}^{p_2 \times n}$. $p_1$ and $p_2$ are the dimensionalities of layer 1 and layer 2,
respectively.⁴ We continue in this way until all layers have been pre-trained. Following
this, the weights of each layer are fine-tuned by alternating minimization of
the proposed objective function, Eq. (2.3). First, we denote the cost function as
$$
\mathcal{C} = \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} \Big( \big\| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \big\|_F^2 + \beta\, \mathrm{tr}\big(H_m L^{(v)} H_m^T\big) \Big).
$$
Update rule for weight matrix $Z_i^{(v)}$. We minimize the objective value with
respect to $Z_i^{(v)}$ by fixing the rest of the variables in the $v$th view for the $i$th layer. By setting
$\partial \mathcal{C} / \partial Z_i^{(v)} = 0$, we give the solutions as
where $[M]^{\mathrm{pos}}$ denotes a matrix in which all the negative elements are replaced by 0.
Similarly, $[M]^{\mathrm{neg}}$ denotes one that has all the positive elements replaced by 0. That is,
4 For ease of presentation, we denote the dimensionalities (layer sizes) from layer 1 to layer $m$
as $[p_1 \ldots p_m]$ in the experiments.
Update rule for weight matrix $H_m$ (i.e., $H_i^{(v)}$ with $i = m$). Since $H_m$ involves the
graph term, the updating rule and its convergence property have not been investigated
before. We give the updating rule first, followed by the proof of its convergence
property.
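$$
H_m = H_m \odot \sqrt{\frac{[\Phi^T X^{(v)}]^{\mathrm{pos}} + [\Phi^T \Phi H_m]^{\mathrm{neg}} + G_u(H_m, A)}{[\Phi^T X^{(v)}]^{\mathrm{neg}} + [\Phi^T \Phi H_m]^{\mathrm{pos}} + G_d(H_m, A)}}
\tag{2.7}
$$
where $G_u(H_m, A) = \beta\big([H_m A^{(v)}]^{\mathrm{pos}} + [H_m D^{(v)}]^{\mathrm{neg}}\big)$ and
$G_d(H_m, A) = \beta\big([H_m A^{(v)}]^{\mathrm{neg}} + [H_m D^{(v)}]^{\mathrm{pos}}\big)$.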
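One step of the update in Eq. (2.7) might look as follows for a single view. This sketch rests on several assumptions that are not confirmed by the surrounding text: the symbol Φ is taken to be the product of the layer mappings of that view, the update is read as an element-wise product with an element-wise square root of the ratio, the bracket operators return non-negative matrices, and a small constant `eps` is added to avoid division by zero. The name `update_Hm` is hypothetical.

```python
import numpy as np

def pos(M):
    # Keeps positive entries of M, zeros elsewhere.
    return (np.abs(M) + M) / 2.0

def neg(M):
    # Magnitudes of the negative entries of M, zeros elsewhere.
    return (np.abs(M) - M) / 2.0

def update_Hm(X, Zs, Hm, A, beta=0.1, eps=1e-9):
    # Assumption: Phi is the product Z_1 Z_2 ... Z_m of the layer mappings.
    Phi = Zs[0]
    for Z in Zs[1:]:
        Phi = Phi @ Z
    D = np.diag(A.sum(axis=1))                 # degree matrix of the view graph
    PtX = Phi.T @ X
    PtPH = Phi.T @ Phi @ Hm
    HA, HD = Hm @ A, Hm @ D
    Gu = beta * (pos(HA) + neg(HD))            # graph part of the numerator
    Gd = beta * (neg(HA) + pos(HD))            # graph part of the denominator
    ratio = (pos(PtX) + neg(PtPH) + Gu + eps) / (neg(PtX) + pos(PtPH) + Gd + eps)
    return Hm * np.sqrt(ratio)                 # element-wise, keeps Hm >= 0
```

Because every factor in the ratio is non-negative, a non-negative starting point stays non-negative after the step, which is the property the constraint in Eq. (2.3) requires.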
Theorem 2.1 The limiting solution of the update rule in Eq. (2.7) satisfies the KKT
condition.
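$$
\mathcal{L}(H_m) = \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} \Big( \big\| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \big\|_F^2 + \beta\, \mathrm{tr}\big(H_m L^{(v)} H_m^T\big) \Big) - \eta H_m,
\tag{2.8}
$$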
This is a fixed point equation that the solution must satisfy at convergence.
The limiting solution of Eq. (2.7) satisfies the fixed point equation. At convergence,
$H_m^{(\infty)} = H_m^{(t+1)} = H_m^{(t)} = H_m$, i.e.,
Equation (2.11) is identical to Eq. (2.9). Both equations require that at least one of
the two factors is equal to zero. The first factors in both equations are identical. For
the second factor $(H_m)_{kl}$ or $(H_m^2)_{kl}$, if $(H_m)_{kl} = 0$ then $(H_m^2)_{kl} = 0$, and vice versa.
Therefore, if Eq. (2.9) holds, Eq. (2.11) also holds, and vice versa.
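Update rule for weight $\alpha^{(v)}$. Similar to Cai et al. (2013b), for ease of
presentation, let us denote $R^{(v)} = \| X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m \|_F^2 + \beta\, \mathrm{tr}(H_m L^{(v)} H_m^T)$.
The objective in Eq. (2.3) with respect to $\alpha^{(v)}$ is written as
$$
\min_{\alpha^{(v)}} \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} R^{(v)}, \quad \text{s.t.} \; \sum_{v=1}^{V} \alpha^{(v)} = 1, \; \alpha^{(v)} \ge 0.
\tag{2.12}
$$
The Lagrange function of Eq. (2.12) is
$$
\min_{\alpha^{(v)}} \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma} R^{(v)} - \lambda \Big( \sum_{v=1}^{V} \alpha^{(v)} - 1 \Big),
\tag{2.13}
$$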
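where $\lambda$ is the Lagrange multiplier. By taking the derivative of Eq. (2.13) with respect
to $\alpha^{(v)}$ and setting it to zero, we obtain
$$
\alpha^{(v)} = \left( \frac{\lambda}{\gamma R^{(v)}} \right)^{\frac{1}{\gamma - 1}}.
\tag{2.14}
$$
Then we substitute $\alpha^{(v)}$ in Eq. (2.14) into $\sum_{v=1}^{V} \alpha^{(v)} = 1$ and obtain
$$
\alpha^{(v)} = \frac{\big( \gamma R^{(v)} \big)^{\frac{1}{1 - \gamma}}}{\sum_{v=1}^{V} \big( \gamma R^{(v)} \big)^{\frac{1}{1 - \gamma}}}.
\tag{2.15}
$$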
It is interesting to see that with only one parameter γ , we could control the different
weights for different views. When γ approaches ∞, we get equal weights. When γ
is close to 1, the weight of the view whose R (v) value is the smallest is assigned to
1, and the others are assigned to 0.
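The closed-form weights of Eq. (2.15) and their two limiting behaviors can be checked numerically with a small sketch. The function name `view_weights` and the example residual values are hypothetical, and the code assumes every per-view cost is strictly positive and that the exponent parameter is strictly greater than 1.

```python
def view_weights(R, gamma):
    # Closed-form weights: alpha_v is proportional to (gamma * R_v)^(1 / (1 - gamma)).
    # Assumes gamma > 1 and every per-view cost R_v > 0.
    if gamma <= 1:
        raise ValueError("gamma must be greater than 1")
    expo = 1.0 / (1.0 - gamma)
    raw = [(gamma * r) ** expo for r in R]
    total = sum(raw)
    return [x / total for x in raw]

# Hypothetical per-view costs: the first view fits best.
costs = [1.0, 2.0, 4.0]
balanced = view_weights(costs, gamma=2.0)      # inverse-cost weighting
near_equal = view_weights(costs, gamma=1000.0) # weights approach 1/3 each
winner = view_weights(costs, gamma=1.01)       # best view takes almost all weight
```

With the exponent parameter set to 2 the weights are simply proportional to the inverse costs, which makes the intermediate regime easy to reason about.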
We now have all the update rules. We repeat the updates iteratively
until convergence. The entire algorithm is outlined in Algorithm 2.1. After obtaining
the optimized $H_m$, standard spectral clustering (Ng et al. 2001) is performed on the
graph built on $H_m$ via the k-NN algorithm.
Our deep matrix factorization model is composed of two stages, i.e., pre-training and
fine-tuning, so we analyze them separately. To simplify the analysis, we assume the
dimensions of all the layers (i.e., layer sizes) are the same, denoted by $p$, and the original
feature dimensions of all the views are the same, denoted by $d$. $V$ is the number of
views and $m$ is the number of layers.
In the pre-training stage, the Semi-NMF process and graph construction are the
time-consuming parts. The complexity is of order $O\big(V m t_p (dnp + np^2 + pd^2 + pn^2 + dn^2)\big)$,
where $t_p$ is the number of iterations to achieve convergence in the Semi-NMF
optimization process. Normally, $p < d$, thus the computational cost is
$T_{pre.} = O\big(V m t_p (dnp + pd^2 + dn^2)\big)$ for the pre-training stage. Similarly, in the fine-tuning
stage, the time complexity is of order $T_{fine.} = O\big(V m t_f (dnp + pd^2 + pn^2)\big)$, where
$t_f$ is the number of iterations in the fine-tuning stage. To sum up, the overall
computational cost is $T_{total} = T_{pre.} + T_{fine.}$.
For these datasets, we follow the preprocessing strategy of Cao et al. (2015). First,
all the images are resized to 48 × 48, and then three kinds of features are extracted,
i.e., intensity, LBP (Ahonen et al. 2006), and Gabor (Feichtinger and Strohmer 1998).
Specifically, LBP is a 59-dimension histogram over 9 × 10 pixel patches generated
from cropped images. The scale parameter λ in Gabor wavelets is fixed as 4 at four
orientations θ = {0°, 45°, 90°, 135°} with a cropped image of size 25 × 30 pixels.
For the comparison baselines, we have the following. (1) BestSV performs standard
spectral clustering (Ng et al. 2001) on the features in each view. We report the best
performance. (2) ConcatFea concatenates all the features and then performs standard
spectral clustering. (3) ConcatPCA concatenates all the features and then projects
the original features into a low-dimensional subspace via PCA. Spectral clustering
is applied on the projected feature representation. (4) Co-Reg (SPC) (Kumar et al.
2011) co-regularizes the clustering hypotheses to enforce that the memberships from
different views agree with each other. (5) Co-Training (SPC) (Kumar and Daume
III 2011) borrows the idea of the co-training strategy to alternately modify the graph
structure of each view using other views’ information. (6) Min-D(isagreement) (de
Sa 2005) builds a bipartite graph derived from the “minimizing-disagreement”
idea. (7) MultiNMF (Liu et al. 2013) applies NMF to project each view's data to the
common latent subspace. This method can be roughly considered as a one-layer
version of our proposed method. (8) NaMSC (Cao et al. 2015) first applies the method of
Hu et al. (2014) to each view's data, then combines the learned representations and feeds
them to spectral clustering. (9) DiMSC (Cao et al. 2015) investigates the complementary
information of representations of multi-view data by introducing a diversity term.
This work is also one of the most recent approaches in MVC. We do not compare
with deep auto-encoder based methods (Andrew et al. 2013; Wang et al. 2015),
because these CCA-based methods cannot fully utilize data with more than two views,
which would lead to an unfair comparison.
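To make the simplest baselines concrete, the feature-fusion step of ConcatFea and ConcatPCA can be sketched as below. This is only an illustration under assumptions: PCA is computed through a singular value decomposition of the centered, concatenated features, the subsequent spectral clustering step is omitted, and the function name `concat_pca` and the number of retained components are hypothetical.

```python
import numpy as np

def concat_pca(views, n_components):
    # views: list of (d_v x n) arrays that share the same n samples.
    X = np.vstack(views)                       # ConcatFea: stack features, (sum d_v) x n
    Xc = X - X.mean(axis=1, keepdims=True)     # center every feature across samples
    # Left singular vectors of the centered data are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components].T @ Xc          # projected data, n_components x n
```

The returned matrix would then be passed to a spectral clustering routine, exactly as the concatenated features are in the ConcatFea baseline.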
To make a comprehensive evaluation, we use six different evaluation metrics
including normalized mutual information (NMI), accuracy (ACC), adjusted
Rand index (AR), F-score, Precision and Recall. For details about the metrics,
readers can refer to Kumar and Daume III (2011) and Cao et al. (2015). For all the
metrics, a higher value denotes better performance. Different measurements favor
different properties, thus a comprehensive view can be acquired from the diverse results.
For each experiment, we repeat the run 10 times and report the mean values along with
standard deviations.
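As an illustration of the reporting protocol, the sketch below computes NMI for one clustering and summarizes repeated runs as a mean and a standard deviation. It makes assumptions the text does not settle: mutual information is normalized by the geometric mean of the two label entropies (one common convention among several), the sample standard deviation is used, and the names `nmi` and `summarize` are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev

def nmi(labels_true, labels_pred):
    # Normalized mutual information with geometric-mean normalization (assumed).
    n = len(labels_true)
    cu, cv = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(c / n * math.log(c * n / (cu[u] * cv[v]))
             for (u, v), c in joint.items())
    hu = -sum(c / n * math.log(c / n) for c in cu.values())
    hv = -sum(c / n * math.log(c / n) for c in cv.values())
    if hu == 0 or hv == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if hu == hv else 0.0
    return mi / math.sqrt(hu * hv)

def summarize(scores):
    # Mean and sample standard deviation over repeated runs.
    return mean(scores), stdev(scores)
```

NMI is invariant to a relabeling of the clusters, so a prediction that swaps the two cluster names of a perfect partition still scores 1.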
2.1.3.1 Result
Tables 2.1 and 2.2 tabulate the results on datasets Yale and Extended YaleB. Our
method outperforms all the other competitors. For the dataset Yale, we raise the
performance bar over the strongest baseline, DiMSC, by around 7.57% in NMI, 5.08% in
ACC, 8.22% in AR, 6.56% in F-score, 10.13% in Precision and 4.61% in Recall (relative
gains). On average, we improve the state-of-the-art DiMSC by more than 7%. A possible
reason for this large improvement is that the image data in both Yale and Extended
YaleB contain multiple factors, i.e., pose, expression, illumination, etc. The existing
MVC methods only involve one layer of representation, e.g., the one-layer factor
decomposition in MultiNMF or the practice of self-representation (i.e., coefficient
matrix Z in NaMSC and DiMSC, Cao et al. 2015). In contrast, our proposed approach can
extract the meaningful representation layer by layer. Through the deep representation,
we eliminate the influence of undesirable factors and keep the core information (i.e.,
class/id information) in the final layer.
Table 2.1 Results of 6 different metrics (mean ± standard deviation) on dataset Yale
| Method | NMI | ACC | AR | F-score | Precision | Recall |
|---|---|---|---|---|---|---|
| BestSV | 0.654 ± 0.009 | 0.616 ± 0.030 | 0.440 ± 0.011 | 0.475 ± 0.011 | 0.457 ± 0.011 | 0.495 ± 0.010 |
| ConcatFea | 0.641 ± 0.006 | 0.544 ± 0.038 | 0.392 ± 0.009 | 0.431 ± 0.008 | 0.415 ± 0.007 | 0.448 ± 0.008 |
| ConcatPCA | 0.665 ± 0.037 | 0.578 ± 0.038 | 0.396 ± 0.011 | 0.434 ± 0.011 | 0.419 ± 0.012 | 0.450 ± 0.009 |
| Co-Reg | 0.648 ± 0.002 | 0.564 ± 0.000 | 0.436 ± 0.002 | 0.466 ± 0.000 | 0.455 ± 0.004 | 0.491 ± 0.003 |
| Co-Train | 0.672 ± 0.006 | 0.630 ± 0.001 | 0.452 ± 0.010 | 0.487 ± 0.009 | 0.470 ± 0.010 | 0.505 ± 0.007 |
| Min-D | 0.645 ± 0.005 | 0.615 ± 0.043 | 0.433 ± 0.006 | 0.470 ± 0.006 | 0.446 ± 0.005 | 0.496 ± 0.006 |
| MultiNMF | 0.690 ± 0.001 | 0.673 ± 0.001 | 0.495 ± 0.001 | 0.527 ± 0.000 | 0.512 ± 0.000 | 0.543 ± 0.000 |
| NaMSC | 0.671 ± 0.011 | 0.636 ± 0.000 | 0.475 ± 0.004 | 0.508 ± 0.007 | 0.492 ± 0.003 | 0.524 ± 0.004 |
| DiMSC | 0.727 ± 0.010 | 0.709 ± 0.003 | 0.535 ± 0.001 | 0.564 ± 0.002 | 0.543 ± 0.001 | 0.586 ± 0.003 |
| Ours | 0.782 ± 0.010 | 0.745 ± 0.011 | 0.579 ± 0.002 | 0.601 ± 0.002 | 0.598 ± 0.001 | 0.613 ± 0.002 |
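The percentages quoted for Yale appear to be relative gains of the proposed method over DiMSC, computed from the mean values in Table 2.1. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical, and the interpretation as relative gain is an inference from the numbers rather than something the text states.

```python
# Mean values copied from the "Ours" and "DiMSC" rows of Table 2.1 (Yale).
ours = {"NMI": 0.782, "ACC": 0.745, "AR": 0.579,
        "F-score": 0.601, "Precision": 0.598, "Recall": 0.613}
dimsc = {"NMI": 0.727, "ACC": 0.709, "AR": 0.535,
         "F-score": 0.564, "Precision": 0.543, "Recall": 0.586}

def relative_gain(new, old):
    # Relative improvement of `new` over `old`, in percent.
    return 100.0 * (new - old) / old

gains = {metric: relative_gain(ours[metric], dimsc[metric]) for metric in ours}
average_gain = sum(gains.values()) / len(gains)

for metric, gain in gains.items():
    print(f"{metric}: {gain:.2f}%")
print(f"average: {average_gain:.2f}%")
```

Rounded to two decimals, the per-metric values match the figures quoted in the paragraph, and the average is slightly above 7%.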
Table 2.3 lists the performance on the video data Notting-Hill. This dataset is more
challenging than the previous two image datasets, since the illumination conditions
vary dramatically and the source of lighting is arbitrary. Moreover, there is no fixed
expression pattern in the Notting-Hill movie, in contrast to datasets Yale and
Extended YaleB. We observe from the table that our method reports superior
results in five metrics. The only exception is NMI, where our performance is
worse than DiMSC by only 0.25%. Therefore, we conclude that our
proposed method generally achieves better clustering performance on the challenging
video dataset Notting-Hill.
2.1.3.2 Analysis
In this subsection, the robustness and stability of the proposed model are evaluated.
The convergence property is first studied in terms of objective value and NMI
Table 2.2 Results of 6 different metrics (mean ± standard deviation) on dataset Extended YaleB
| Method | NMI | ACC | AR | F-score | Precision | Recall |
|---|---|---|---|---|---|---|
| BestSV | 0.360 ± 0.016 | 0.366 ± 0.059 | 0.225 ± 0.018 | 0.303 ± 0.011 | 0.296 ± 0.010 | 0.310 ± 0.012 |
| ConcatFea | 0.147 ± 0.005 | 0.224 ± 0.012 | 0.064 ± 0.003 | 0.159 ± 0.002 | 0.155 ± 0.002 | 0.162 ± 0.002 |
| ConcatPCA | 0.152 ± 0.003 | 0.232 ± 0.005 | 0.069 ± 0.002 | 0.161 ± 0.002 | 0.158 ± 0.001 | 0.164 ± 0.002 |
| Co-Reg | 0.151 ± 0.001 | 0.224 ± 0.000 | 0.066 ± 0.001 | 0.160 ± 0.000 | 0.157 ± 0.001 | 0.162 ± 0.000 |
| Co-Train | 0.302 ± 0.007 | 0.186 ± 0.001 | 0.043 ± 0.001 | 0.140 ± 0.001 | 0.137 ± 0.001 | 0.143 ± 0.002 |
| Min-D | 0.186 ± 0.003 | 0.242 ± 0.018 | 0.088 ± 0.001 | 0.181 ± 0.001 | 0.174 ± 0.001 | 0.189 ± 0.002 |
| MultiNMF | 0.377 ± 0.006 | 0.428 ± 0.002 | 0.231 ± 0.001 | 0.329 ± 0.001 | 0.298 ± 0.001 | 0.372 ± 0.002 |
| NaMSC | 0.594 ± 0.004 | 0.581 ± 0.013 | 0.380 ± 0.002 | 0.446 ± 0.004 | 0.411 ± 0.002 | 0.486 ± 0.001 |
| DiMSC | 0.635 ± 0.002 | 0.615 ± 0.003 | 0.453 ± 0.000 | 0.504 ± 0.006 | 0.481 ± 0.002 | 0.534 ± 0.001 |
| Ours | 0.649 ± 0.002 | 0.763 ± 0.001 | 0.512 ± 0.002 | 0.564 ± 0.001 | 0.525 ± 0.001 | 0.610 ± 0.001 |