
Hongxing Wang · Chaoqun Weng · Junsong Yuan

Visual Pattern Discovery and Recognition

Hongxing Wang
Chongqing University
Chongqing, China

Chaoqun Weng
Nanyang Technological University
Singapore, Singapore

Junsong Yuan
Nanyang Technological University
Singapore, Singapore
SpringerBriefs in Computer Science

ISSN 2191-5768          ISSN 2191-5776 (electronic)
ISBN 978-981-10-4839-5          ISBN 978-981-10-4840-1 (eBook)
DOI 10.1007/978-981-10-4840-1

Library of Congress Control Number: 2017942976

This book was advertised with a copyright holder in the name of the publisher in error, whereas the
author(s) holds the copyright.

© The Author(s) 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface

Patterns are ubiquitous in almost any kind of data. Finding patterns is of great
importance and interest to data analytics. This book presents visual pattern dis-
covery for visual data analytics. It provides a systematic study of visual pattern
discovery and recognition, from unsupervised to semi-supervised approaches, and
from dealing with a single feature to multiple types of features. We start
with a brief overview of visual pattern discovery, then move on to specific
approaches. Chapters 2 and 3 focus on discovering spatial context-aware visual
co-occurrence patterns incorporating single or multiple types of features. Chapter 4
studies the visual pattern discovery problem given a small amount of labeled data to
enable visual categorization and recognition through label propagation based on
similar feature co-occurrence patterns. Chapter 5 introduces a multi-feature pattern
embedding method for visual data clustering using only the multiple feature evi-
dences. Chapter 6 finally concludes this book, discusses potential visual search and
recognition applications of discovering visual patterns, and suggests worthy
directions for further research.
This is a reference book for advanced undergraduates or postgraduate students
who are interested in visual data analytics. Readers of this book will be able to
quickly access the research front and acquire a systematic methodology rather than
a few isolated techniques to analyze visual data with large variations. It may also be
inspiring for researchers working in the computer vision and pattern recognition fields.
Basic knowledge of linear algebra, computer vision, and pattern recognition would
be helpful to read this book.

Chongqing, China Hongxing Wang


Singapore, Singapore Chaoqun Weng
Singapore, Singapore Junsong Yuan
April 2017

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Discovering Spatial Co-occurrence Patterns . . . . . . . . . . . . . . . . . . . 3
1.3 Discovering Feature Co-occurrence Patterns . . . . . . . . . . . . . . . . . . 5
1.4 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Context-Aware Discovery of Visual Co-occurrence Patterns . ....... 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....... 15
2.2 Multi-context-aware Clustering . . . . . . . . . . . . . . . . . . . . . ....... 16
2.2.1 Regularized k-means Formulation with Multiple
Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Self-learning Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Spatial Visual Pattern Discovery . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Image Region Clustering Using Multiple Contexts . . . . . . . . 23
2.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Hierarchical Sparse Coding for Visual Co-occurrence
Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Spatial Context-Aware Multi-feature Sparse Coding . . . . . . . . . . . . 30
3.2.1 Learning Spatial Context-Aware Visual Phrases . . . . . . . . . . 30
3.2.2 Learning Multi-feature Fused Visual Phrases . . . . . . . . . . . . 35
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Spatial Visual Pattern Discovery . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Scene Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.3 Scene Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


4 Feature Co-occurrence for Visual Labeling . . . . . . . . . . . . . . . . . . . . . 45


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Multi-feature Collaboration for Transductive Learning . . . . . . . . . . . 47
4.2.1 Spectral Embedding of Multi-feature Data . . . . . . . . . . . . . . 48
4.2.2 Embedding Co-occurrence for Data Representation . . . . . . . 49
4.2.3 Transductive Learning with Feature Co-occurrence
Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 50
4.2.4 Collaboration Between Pattern Discovery and Label
Propagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Experimental Setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Label Propagation on Synthetic Data . . . . . . . . . . . . . . . . . . 54
4.3.3 Digit Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.4 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.5 Body Motion Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.6 Scene Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Visual Clustering with Minimax Feature Fusion . . . . . . . . . . . . . . . .. 67
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 67
5.2 Minimax Optimization for Multi-feature Spectral Clustering . . . . .. 69
5.2.1 Spectral Embedding for Regularized Data-Cluster
Similarity Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.2 Minimax Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.3 Minimax Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.1 Datasets and Experimental Setting . . . . . . . . . . . . . . . . . . . . 74
5.3.2 Baseline Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.5 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.6 Sensitivity of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 1
Introduction

Abstract As re-occurring compositions of visual data, visual patterns exist in com-


plex spatial structures and diverse feature views of image and video data. Discovering
visual patterns is of great interest to visual data analysis and recognition. Many meth-
ods have been proposed to address the problem of visual pattern discovery over the past
dozen years. In this chapter, we start with an overview of the visual pattern discovery
problem and then discuss the major progress of spatial and feature co-occurrence
pattern discovery.

Keywords Spatial co-occurrence pattern discovery · Feature co-occurrence pattern discovery · Bottom-up methods · Top-down methods · Subspace learning · Co-training · Multiple kernel learning

1.1 Overview

Similar to frequent patterns in transaction data, visual patterns are compositions


of visual primitives that appear frequently in image and video data [74, 93]. The
visual primitives that construct visual patterns can be very diverse, e.g., local image
patches (or even pixels), semantic visual parts, and visual objects. As we show in
Fig. 1.1, the visual pattern in image or video data can be a texton that captures the
repetitiveness of image texture [106], e.g., the double-G pattern in a Gucci bag;
an abstract object model that describes its composition of visual parts [20], e.g., a
face pattern composed of two eyes, a nose, and a mouth; a scene layout pattern that
captures the key objects which compose the scene [42], e.g., a bedroom composed
of a bed, a lamp, etc.; or a human action that describes postures and motions of the human
body, e.g., a bent-leg layover spin action shown by upturning the torso and bending
the free leg. Besides the above spatial co-occurrence patterns, there is also another
type of visual patterns in multiple feature spaces, i.e., feature co-occurrence patterns.
Taking Fig. 1.2 as an example, the baboon's face shows a co-occurrence pattern of
blue color and visible texture features.
Ubiquitous visual patterns manifest themselves in protean images. Just as the perception of
repeated structures is well-nigh fundamental to the understanding of the world around


(a) bag (b) face (c) bedroom (d) bent-leg layover spin

Fig. 1.1 Examples of spatial co-occurrence patterns. (a) The repetitive double-G textures generate
the texton patterns in a Gucci bag; (b) two eyes, a nose, and a mouth sketch a face pattern.
Images are from the Caltech 101 dataset [17]; (c) a bed, a lamp, etc. usually make up a bedroom. Images
are from the MIT Indoor dataset [59]; (d) upturning of the torso and bending of the free leg together
show the bent-leg layover spin action [101]. Copyright (2014) Wiley. Used with permission from
Ref. [83]

Fig. 1.2 An example of feature co-occurrence patterns. The image patch in the left baboon picture
is composed of color and texture features

us [72], the recognition of visual patterns is essential to the understanding of image


data. In practice, visual patterns can be used to model images, which have extensive
applications in visual data analysis, such as image search, object categorization, and
scene recognition [83]. It therefore offers an interesting, practical, but challenging
issue for us to mine visual patterns from visual data. We will in this book focus on
discovering spatial and feature co-occurrence patterns for visual data analytics.
It is generally known that frequent pattern mining has been well studied in the data
mining community [26]. However, existing frequent pattern mining methods
cannot be applied to image data directly. This is because the complex spatial structures
and heterogeneous feature descriptions among visual data make the problem of visual
pattern discovery more challenging.
Similar to many computer vision problems, one important prerequisite of
visual pattern discovery is to extract stable visual primitives from image or video
data. To obtain visual primitives larger than pixels, many local feature detectors
have been proposed [44, 75], e.g., Difference of Gaussian (DoG) [50] and Hessian-
Laplacian [52]. In addition, segmentation methods, e.g., normalized cuts [65], can be
used to collect primitive regions, and object detection methods, e.g., deformable part
models [18] and object proposals [16, 107], can provide object primitives appearing

Fig. 1.3 Preprocessing of image and video data. Copyright (2014) Wiley. Used with permission
from Ref. [83]

in visual data. Once we have visual primitives, we can encode their appearance using
feature descriptors [23]. For example, Scale-Invariant Feature Transform (SIFT) [50]
and Histograms of Oriented Gradients (HOG) [11] are the widely used gradient
features. Efficient binary features [27] include Binary Robust Independent
Elementary Features (BRIEF) [7] and Oriented FAST and Rotated BRIEF (ORB) [60].
For better performance, more advanced features, including the Fisher Vector [55, 56,
63], the Vector of Locally Aggregated Descriptors (VLAD) [31], and Convolutional
Neural Network (CNN)-based features [32, 37, 64], can also be exploited.
Instead of describing visual primitives using raw features, we can further utilize a clustering
method, e.g., the k-means algorithm, to quantize feature descriptors into discrete
visual words. After that, each visual primitive can be identified by the corresponding
visual word. Meanwhile, each image can be represented as a global histogram feature
using the bag of visual words (BoW) model.
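As a rough illustration of this pipeline, the sketch below quantizes local descriptors into a visual vocabulary with k-means and builds BoW histograms. It is a minimal example assuming scikit-learn and a list of per-image descriptor arrays; it is not the implementation used in this book.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(descriptor_sets, vocab_size=200, seed=0):
    """Quantize local descriptors (e.g., SIFT) into visual words and build
    L1-normalized bag-of-visual-words histograms, one per image.
    `descriptor_sets` is a list of (n_i, d) arrays, one array per image."""
    all_desc = np.vstack(descriptor_sets)
    # Learn the visual vocabulary by k-means quantization of all descriptors.
    kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=seed).fit(all_desc)
    histograms = []
    for desc in descriptor_sets:
        words = kmeans.predict(desc)                            # visual word of each primitive
        hist = np.bincount(words, minlength=vocab_size).astype(float)
        histograms.append(hist / max(hist.sum(), 1.0))          # BoW histogram
    return np.array(histograms), kmeans
```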
We summarize the aforementioned preprocessing of image or video data in
Fig. 1.3. Based on visual data preprocessing, there have been increasing efforts to
address the visual pattern discovery problem in the literature [83]. In the follow-
ing sections, we will give detailed discussions on spatial and feature co-occurrence
pattern discovery.

1.2 Discovering Spatial Co-occurrence Patterns

Many approaches have been proposed to discover frequent spatial patterns of visual
primitives. These methods can be generally divided into bottom-up visual pattern
mining and top-down generative visual pattern modeling. The bottom-up pattern
discovery methods usually start with visual primitives and then find visual patterns
relying on the compositions of visual primitives. The basic idea is shown in Fig. 1.4.
Each image consists of a number of visual primitives that have been depicted as
visual words (colored in blue). By investigating frequent visual word configurations
in image spatial space, two types of word co-occurrence compositions, i.e., visual
patterns {cross, star} and {parallelogram, diamond, trapezoid}, are found.
Finally, we locate all instances of both types of visual patterns. Classic frequent item-
set mining (FIM) methods [26] provide off-the-shelf bottom-up techniques for pattern
discovery from transaction data and inspire early research on visual pattern discov-
ery, including Apriori algorithm [29, 58], frequent pattern growth algorithm [96],
clustering-based methods [68, 81, 95], frequent item bag mining [34], and frequent

Fig. 1.4 Bottom-up spatial


visual pattern discovery.
Copyright (2014) Wiley.
Used with permission from
Ref. [83]

local histogram mining [19]. However, the performance of FIM-based methods heav-
ily depends on the quality of transaction data. Thus, more general strategies have
been proposed to avoid the generation of transactions for image/video data mining,
e.g., voting in offset space [42, 100], spatial random partition [87, 94], ensem-
ble matching [2], multilayer match-growing [9], multilayer candidate pruning [97],
hierarchical part composition learning [20], clustering by composition [15], greedy
randomized adaptive search [47], and sparse dictionary selection [10, 13, 51, 78].
To model sophisticated spatial structures among visual primitives, some
graph-based pattern mining methods have also been proposed [21, 41, 46, 103].
Recent studies show that drawing deep learning architectures into visual pattern
mining techniques can bring impressive advances [12, 43, 54, 87].
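To make the transaction-based bottom-up idea concrete, here is a toy sketch that builds one transaction per primitive from its spatial neighborhood and keeps frequently co-occurring visual-word pairs. The input names and the restriction to pairs are assumptions of this example; full FIM methods mine itemsets of arbitrary size.

```python
from collections import Counter
from itertools import combinations

def frequent_word_pairs(word_labels, neighbor_lists, min_support=20):
    """Count visual-word pairs that co-occur within spatial neighborhood
    transactions and keep those whose support exceeds a threshold.
    word_labels[n]    : visual word assigned to primitive n.
    neighbor_lists[n] : indices of the primitives in n's spatial neighborhood."""
    counts = Counter()
    for n, neighbors in enumerate(neighbor_lists):
        # A transaction is the set of words appearing around primitive n.
        transaction = {word_labels[m] for m in neighbors} | {word_labels[n]}
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```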
In addition to the above bottom-up visual pattern mining, there are also consid-
erable methods in modeling visual patterns from top down, which start with the
modeling of visual patterns and then infer the pattern discovery result. Figure 1.5
illustrates the top-down method by using the latent Dirichlet allocation (LDA) to
model images and visual patterns [4]. The basic idea is that images are represented
as mixtures over visual patterns, where each pattern is characterized by a distribution
over visual words. This is similar to describing a document by mixtures of topics,
where each topic has its own word distribution. The pattern discovery is achieved
by inferring the posterior distribution of the visual pattern mixture variable given an
image. Most top-down methods extend classic generative topic models for visual
pattern modeling [61, 66, 67, 77]. In particular, much work incorporates the spatial
and temporal cues into topic models [28, 45, 57, 73, 85, 102]. Besides using the
statistical viewpoint to mine visual patterns, some subspace projection methods are
also proposed to approximate the semantic structure of visual patterns [70, 71].
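As a minimal stand-in for this top-down route, the sketch below fits a plain LDA topic model to visual-word count vectors (without the spatial or temporal cues of the cited variants); scikit-learn and the variable names are assumptions for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation

def discover_topic_patterns(word_counts, n_patterns=10, seed=0):
    """Model images as mixtures over visual patterns with LDA.
    `word_counts` is an (n_images, vocab_size) matrix of visual-word counts."""
    lda = LatentDirichletAllocation(n_components=n_patterns, random_state=seed)
    image_pattern = lda.fit_transform(word_counts)   # image -> pattern mixture
    pattern_word = lda.components_                   # pattern -> word weights
    return image_pattern, pattern_word
```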

Fig. 1.5 Top-down spatial visual pattern discovery. Copyright (2014) Wiley. Used with permis-
sion from Ref. [83]

The choice between bottom-up and top-down approaches is application-dependent.
Generally, when we observe a number of specific spatial compositions of visual primitives
and expect to discover common visual patterns from them, bottom-up methods
are appropriate. In contrast, when we need to model pattern mixtures
and reason about the posterior distribution of visual pattern mixtures over visual primitives,
top-down methods are preferable.

1.3 Discovering Feature Co-occurrence Patterns

Feature co-occurrence patterns arise from multiple feature represen-
tations of image and video data. The techniques for finding feature co-occurrence
patterns that can represent different attributes of visual data are also known as multi-
feature fusion or multi-view learning. By feature fusion for pattern discovery, we can
combine multiple complementary feature modalities to improve the result of cluster-
ing [14, 30, 76], classification [80, 89, 90], image search [98, 99, 105], etc. Such a
multi-feature fusion, however, is challenging due to the possible incompatibility of
heterogeneous features. As a result, a simple concatenation of them does not guar-
antee good performance [5, 89]. To deal with diverse features, various approaches
have been proposed [88, 104].
As shown in Fig. 1.6, much work aims to seek a latent subspace shared by different
features. To obtain such a common subspace, one can use canonical correlation analy-
sis (CCA) [3, 6, 8], general sparse coding [91], convex subspace learning [25], Pareto

Fig. 1.6 Each type of feature lies in its own feature space. Subspace learning aims to find a latent
space shared by different feature types

Fig. 1.7 In co-training, the features and models in different views exchange information with
one another to obtain the fusion result

embedding [86], structured feature selection [79], pattern projection method [49],
common nonnegative matrix factorization (NMF) [1], multi-view deep representa-
tion learning [53, 69, 84], etc.
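For instance, a two-view shared subspace can be obtained with CCA as sketched below; scikit-learn's CCA is used as a stand-in, and averaging the projected views is just one simple fusion choice assumed for this example.

```python
from sklearn.cross_decomposition import CCA

def shared_subspace(X1, X2, dim=10):
    """Project two feature views of the same N samples into a shared latent
    subspace with canonical correlation analysis (CCA).
    X1: (N, d1) array, X2: (N, d2) array; dim <= min(d1, d2)."""
    cca = CCA(n_components=dim)
    Z1, Z2 = cca.fit_transform(X1, X2)   # canonical scores of the two views
    return (Z1 + Z2) / 2.0               # a simple fused representation
```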
Whether using subspace learning or other methods, multi-feature fusion implic-
itly or explicitly applies the co-training strategy. Figure 1.7 shows the co-training
idea, where the features and models in different views exchange information with
one another to obtain a fusion result. To co-train among different feature types, mutual
regularization is often adopted and performed by disagreement minimization, which
is widely investigated with k-means regularization [81, 82, 95], NMF regulariza-
tion [48], topic model regularization [33], spectral embedding regularization [5,
38, 39], kernel multiplication [62], and low-rank affinity regularization [24]. Among
these methods, pairwise regularization is representative, which generally outputs dif-
ferent solutions from multiple feature types such that a late fusion step is required.
To avoid such a late fusion, some of them apply regularization between each feature
modality and a centroid modality and finally output the centroid result [5, 24, 39,
62, 95].

Fig. 1.8 Multiple kernel learning aims to integrate various kernels into a unified one to represent
the similarity between any pair of the input data

To better deal with nonlinearity existing in the data, kernel methods can be inte-
grated into multi-feature fusion. For example, CCA can be extended to a nonlinear
version using kernel methods, which is the so-called kernel CCA (KCCA). In fact,
kernel methods map the raw data features into pairwise similarities using kernel
functions such as the popular radial basis function. Multiple kernel learning meth-
ods perform feature fusion based on kernel representations of different features. As
shown in Fig. 1.8, the kernel matrices from multiple features are expected to be
combined into a unified kernel [35, 36, 40, 92]. The combination can be linear or
nonlinear [22].
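A minimal sketch of such a linear kernel combination is given below: per-view RBF kernels are averaged into a unified affinity, on which spectral clustering is run. The uniform weights and the helper name are assumptions of this example rather than a prescribed method.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

def multi_kernel_clustering(feature_views, n_clusters=5, weights=None):
    """Fuse per-view RBF kernels by a weighted linear combination and cluster
    on the unified kernel. `feature_views` is a list of (N, d_v) matrices
    describing the same N samples."""
    if weights is None:
        weights = np.ones(len(feature_views)) / len(feature_views)
    K = sum(w * rbf_kernel(X) for w, X in zip(weights, feature_views))
    labels = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',
                                random_state=0).fit_predict(K)
    return labels, K
```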
It is worth noting that these routes to feature fusion can be compatible with each
other. For example, the methods presented in Chaps. 4 and 5 can be seen as cases
of subspace learning, as they both learn a common embedding representation for
each data sample based on multiple features. But from the optimization perspective, they
belong to the co-training paradigm. Moreover, they both utilize graph (diffusion)
kernels for multiple kernel learning.

1.4 Outline of the Book

This book presents the visual data analytics techniques based on visual pattern dis-
covery to handle the large variations in spatial and feature domains. The proposed
visual data analytics approaches work in both unsupervised and semi-supervised
fashions, thereby suiting different needs in real applications. Chapter 1 gives an

Table 1.1 The information used in each chapter

            Unlabeled data   Labeled data   Multiple features   Spatial context
Chapter 2   ✓                               ✓                   ✓
Chapter 3   ✓                ✓              ✓                   ✓
Chapter 4   ✓                ✓              ✓
Chapter 5   ✓                               ✓

overview of the recent developments in visual pattern discovery. In Chaps. 2–5, we
introduce four promising visual data analytics approaches by incorporating visual
co-occurrence information and multi-feature evidence.
Chapter 2 introduces a multi-context-aware clustering method with spatial and
feature contexts for visual co-occurrence pattern discovery. A self-learning opti-
mization is developed for visual disambiguity, which can leverage the discovered
co-occurrence patterns to guide visual primitive clustering.
Chapter 3 presents a hierarchical sparse coding method for mid-level visual phrase
learning. Following Chap. 2, it still exploits spatial contexts and multi-feature infor-
mation, but utilizes sparse coding rather than k-means hard quantization. Further-
more, the category information of visual data is leveraged to make the learned
visual phrase sparse codes representative and discriminative. A back-propagation
algorithm is developed to optimize the visual phrase learning objective.
Chapter 4 presents a feature co-occurrence pattern discovery method based on spectral
embedding and transductive learning instead of the k-means regularization used in
Chap. 2. The proposed algorithm can iteratively refine the results of feature co-
occurrence pattern discovery and label propagation. It eventually allows visual
data with similar feature co-occurrence patterns to share the same label.
Chapter 5 introduces a visual clustering method based on spectral embedding learn-
ing and fusion of multiple features. Different from Chaps. 2–4, it uses neither extra
spatial context nor data label information. A universal feature embedding is finally
learned for a consensus clustering of multiple features by optimizing a minimax
loss function.
Table 1.1 summarizes the information used by the proposed approach in each
chapter.

References

1. Akata, Z., Thurau, C., Bauckhage, C., et al.: Non-negative matrix factorization in multimodality data for segmentation and label prediction. In: Proceedings of Computer Vision Winter Workshop (2011)
2. Bagon, S., Brostovski, O., Galun, M., Irani, M.: Detecting and sketching the common. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 33–40 (2010)

3. Blaschko, M., Lampert, C.: Correlational spectral clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2008)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
5. Cai, X., Nie, F., Huang, H., Kamangar, F.: Heterogeneous image feature integration via multi-modal spectral clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1977–1984 (2011)
6. Cai, Z., Wang, L., Peng, X., Qiao, Y.: Multi-view super vector for action recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2014)
7. Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: binary robust independent elementary features. In: Proceedings of European Conference on Computer Vision, pp. 778–792 (2010)
8. Chaudhuri, K., Kakade, S.M., Livescu, K., Sridharan, K.: Multi-view clustering via canonical correlation analysis. In: Proceedings of International Conference on Machine Learning, pp. 129–136 (2009)
9. Cho, M., Shin, Y.M., Lee, K.M.: Unsupervised detection and segmentation of identical objects. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1617–1624 (2010)
10. Cong, Y., Yuan, J., Luo, J.: Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Trans. Multimed. 14(1), 66–75 (2012)
11. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
12. Diba, A., Pazandeh, A.M., Pirsiavash, H., Gool, L.V.: DeepCAMP: deep convolutional action and attribute mid-level patterns. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3557–3565 (2016)
13. Elhamifar, E., Sapiro, G., Vidal, R.: See all by looking at a few: sparse modeling for finding representative objects. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1600–1607 (2012)
14. Eynard, D., Kovnatsky, A., Bronstein, M.M., Glashoff, K., Bronstein, A.M.: Multimodal manifold analysis by simultaneous diagonalization of Laplacians. IEEE Trans. Pattern Anal. Mach. Intell. 37(12), 2505–2517 (2015)
15. Faktor, A., Irani, M.: Clustering by composition – unsupervised discovery of image categories. In: Proceedings of European Conference on Computer Vision, pp. 474–487 (2012)
16. Fang, Z., Cao, Z., Xiao, Y., Zhu, L., Yuan, J.: Adobe boxes: locating object proposals using object adobes. IEEE Trans. Image Process. 25(9), 4116–4128 (2016)
17. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: CVPR Workshop on Generative-Model Based Vision, pp. 178–178 (2004)
18. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
19. Fernando, B., Fromont, E., Tuytelaars, T.: Mining mid-level features for image classification. Int. J. Comput. Vis. 108(3), 186–203 (2014)
20. Fidler, S., Leonardis, A.: Towards scalable representations of object categories: learning a hierarchy of parts. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
21. Gao, J., Hu, Y., Liu, J., Yang, R.: Unsupervised learning of high-order structural semantics from images. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2122–2129 (2009)
22. Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)
23. Grauman, K., Leibe, B.: Visual Object Recognition (Synthesis Lectures on Artificial Intelligence and Machine Learning). Morgan & Claypool Publishers, San Rafael, CA (2011)
24. Guo, X., Liu, D., Jou, B., Zhu, M., Cai, A., Chang, S.F.: Robust object co-detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)

25. Guo, Y.: Convex subspace representation learning from multi-view data. In: Proceedings of AAAI Conference on Artificial Intelligence (2013)
26. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15(1), 55–86 (2007)
27. Heinly, J., Dunn, E., Frahm, J.M.: Comparative evaluation of binary features. In: Proceedings of European Conference on Computer Vision, pp. 759–773 (2012)
28. Hong, P., Huang, T.: Spatial pattern discovery by learning a probabilistic parametric model from multiple attributed relational graphs. Discret. Appl. Math. 139(1), 113–135 (2004)
29. Hsu, W., Dai, J., Lee, M.: Mining viewpoint patterns in image databases. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 553–558 (2003)
30. Huang, H.C., Chuang, Y.Y., Chen, C.S.: Affinity aggregation for spectral clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 773–780 (2012)
31. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3304–3311 (2010)
32. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. arXiv:1408.5093 (2014)
33. Jiang, Y., Liu, J., Li, Z., Li, P., Lu, H.: Co-regularized PLSA for multi-view clustering. In: Proceedings of Asian Conference on Computer Vision, pp. 202–213 (2012)
34. Kim, S., Jin, X., Han, J.: DisIClass: discriminative frequent pattern-based image classification. In: KDD Workshop on Multimedia Data Mining, pp. 7:1–7:10 (2010)
35. Kobayashi, T.: Low-rank bilinear classification: efficient convex optimization and extensions. Int. J. Comput. Vis. 110(3), 308–327 (2014)
36. Kong, Y., Fu, Y.: Bilinear heterogeneous information machine for RGB-D action recognition. In: CVPR, pp. 1054–1062 (2015)
37. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
38. Kumar, A., Daumé III, H.: A co-training approach for multi-view spectral clustering. In: Proceedings of International Conference on Machine Learning, pp. 393–400 (2011)
39. Kumar, A., Rai, P., Daumé III, H.: Co-regularized multi-view spectral clustering. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1413–1421 (2011)
40. Lange, T., Buhmann, J.M.: Fusion of similarity data in clustering. In: Proceedings of Advances in Neural Information Processing Systems (2005)
41. Leordeanu, M., Hebert, M.: A spectral technique for correspondence problems using pairwise constraints. In: Proceedings of IEEE International Conference on Computer Vision, vol. 2, pp. 1482–1489 (2005)
42. Li, C., Parikh, D., Chen, T.: Automatic discovery of groups of objects for scene understanding. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2012)
43. Li, Y., Liu, L., Shen, C., van den Hengel, A.: Mining mid-level visual patterns with deep CNN activations. Int. J. Comput. Vis. 121(3), 344–364 (2017)
44. Li, Y., Wang, S., Tian, Q., Ding, X.: A survey of recent advances in visual feature detection. Neurocomputing 149, 736–751 (2015)
45. Liu, D., Chen, T.: A topic-motion model for unsupervised video object discovery. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, USA (2007)
46. Liu, H., Yan, S.: Common visual pattern discovery via spatially coherent correspondences. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1609–1616 (2010)
47. Liu, J., Liu, Y.: Grasp recurring patterns from a single view. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)

48. Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative matrix factorization. In: Proceedings of SIAM International Conference on Data Mining (2013)
49. Long, B., Philip, S.Y., Zhang, Z.M.: A general model for multiple view unsupervised learning. In: Proceedings of SIAM International Conference on Data Mining, pp. 822–833 (2008)
50. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
51. Meng, J., Wang, H., Yuan, J., Tan, Y.P.: From keyframes to key objects: video summarization by representative object proposal selection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1039–1048 (2016)
52. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vis. 65(1–2), 43–72 (2005)
53. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of International Conference on Machine Learning, pp. 689–696 (2011)
54. Oramas, J.M., Tuytelaars, T.: Modeling visual compatibility through hierarchical mid-level elements. arXiv:1604.00036 (2016)
55. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
56. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: Proceedings of European Conference on Computer Vision, pp. 143–156 (2010)
57. Philbin, J., Sivic, J., Zisserman, A.: Geometric latent Dirichlet allocation on a matching graph for large-scale image datasets. Int. J. Comput. Vis. 95(2), 138–153 (2011)
58. Quack, T., Ferrari, V., Leibe, B., Van Gool, L.: Efficient mining of frequent and distinctive feature configurations. In: Proceedings of IEEE International Conference on Computer Vision (2007)
59. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
60. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2564–2571 (2011)
61. Russell, B., Freeman, W., Efros, A., Sivic, J., Zisserman, A.: Using multiple segmentations to discover objects and their extent in image collections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1605–1614 (2006)
62. de Sa, V.R., Gallagher, P.W., Lewis, J.M., Malave, V.L.: Multi-view kernel construction. Mach. Learn. 79(1–2), 47–71 (2010)
63. Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the Fisher vector: theory and practice. Int. J. Comput. Vis. 105(3), 222–245 (2013)
64. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229 (2013)
65. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
66. Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering objects and their location in images. In: Proceedings of IEEE International Conference on Computer Vision, pp. 370–377 (2005)
67. Sivic, J., Russell, B., Zisserman, A., Freeman, W., Efros, A.: Unsupervised discovery of visual object class hierarchies. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
68. Sivic, J., Zisserman, A.: Video data mining using configurations of viewpoint invariant regions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 488–495 (2004)
69. Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15(1), 2949–2980 (2014)

70. Sun, M., Van hamme, H.: Image pattern discovery by using the spatial closeness of visual code words. In: Proceedings of IEEE International Conference on Image Processing, Brussels, Belgium, pp. 205–208 (2011)
71. Tang, J., Lewis, P.H.: Non-negative matrix factorisation for object class discovery and image auto-annotation. In: Proceedings of the International Conference on Content-based Image and Video Retrieval, Niagara Falls, Canada, pp. 105–112 (2008)
72. Thompson, D.W.: On Growth and Form. Cambridge University Press, Cambridge, UK (1961)
73. Todorovic, S., Ahuja, N.: Unsupervised category modeling, recognition, and segmentation in images. IEEE Trans. Pattern Anal. Mach. Intell. 30(12), 2158–2174 (2008)
74. Tuytelaars, T., Lampert, C., Blaschko, M., Buntine, W.: Unsupervised object discovery: a comparison. Int. J. Comput. Vis. 88(2), 284–302 (2010)
75. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision 3(3), 177–280 (2008)
76. Wang, B., Jiang, J., Wang, W., Zhou, Z.H., Tu, Z.: Unsupervised metric fusion by cross diffusion. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2997–3004 (2012)
77. Wang, G., Zhang, Y., Fei-Fei, L.: Using dependent regions for object categorization in a generative framework. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1597–1604 (2006)
78. Wang, H., Kawahara, Y., Weng, C., Yuan, J.: Representative selection with structured sparsity. Pattern Recognit. 63, 268–278 (2017)
79. Wang, H., Nie, F., Huang, H.: Multi-view clustering and feature learning via structured sparsity. In: Proceedings of International Conference on Machine Learning (2013)
80. Wang, H., Nie, F., Huang, H., Ding, C.: Heterogeneous visual features fusion via sparse multimodal machine. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)
81. Wang, H., Yuan, J., Tan, Y.: Combining feature context and spatial context for image pattern discovery. In: Proceedings of IEEE International Conference on Data Mining, pp. 764–773 (2011)
82. Wang, H., Yuan, J., Wu, Y.: Context-aware discovery of visual co-occurrence patterns. IEEE Trans. Image Process. 23(4), 1805–1819 (2014)
83. Wang, H., Zhao, G., Yuan, J.: Visual pattern discovery in image and video data: a brief survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 4(1), 24–37 (2014)
84. Wang, W., Arora, R., Livescu, K., Bilmes, J.A.: On deep multi-view representation learning: objectives and optimization. arXiv:1602.01024 (2016)
85. Wang, X., Grimson, E.: Spatial latent Dirichlet allocation. In: Proceedings of Advances in Neural Information Processing Systems (2008)
86. Wang, X., Qian, B., Ye, J., Davidson, I.: Multi-objective multi-view spectral clustering via Pareto optimization. In: Proceedings of SIAM International Conference on Data Mining (2013)
87. Weng, C., Wang, H., Yuan, J., Jiang, X.: Discovering class-specific spatial layouts for scene recognition. IEEE Signal Process. Lett. (2016)
88. Xu, C., Tao, D., Xu, C.: A survey on multi-view learning. arXiv:1304.5634 (2013)
89. Xu, C., Tao, D., Xu, C.: Large-margin multi-view information bottleneck. IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1559–1572 (2014)
90. Xu, C., Tao, D., Xu, C.: Multi-view intact space learning. IEEE Trans. Pattern Anal. Mach. Intell. 37, 2531–2544 (2015)
91. Yang, J., Wang, Z., Lin, Z., Shu, X., Huang, T.: Bilevel sparse coding for coupled feature spaces. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2360–2367 (2012)
92. Yu, S., Tranchevent, L.C., Liu, X., Glanzel, W., Suykens, J.A., De Moor, B., Moreau, Y.: Optimized data fusion for kernel k-means clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 1031–1039 (2012)

93. Yuan, J.: Discovering Visual Patterns in Image and Video Data: Concepts, Algorithms, Experiments. VDM Verlag Dr. Müller, Saarbrücken, Germany (2011)
94. Yuan, J., Wu, Y.: Spatial random partition for common visual pattern discovery. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1–8 (2007)
95. Yuan, J., Wu, Y.: Context-aware clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
96. Yuan, J., Wu, Y.: Mining visual collocation patterns via self-supervised subspace learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 42(2), 1–13 (2012)
97. Yuan, J., Zhao, G., Fu, Y., Li, Z., Katsaggelos, A., Wu, Y.: Discovering thematic objects in image collections and videos. IEEE Trans. Image Process. 21, 2207–2219 (2012)
98. Zhang, S., Yang, M., Cour, T., Yu, K., Metaxas, D.N.: Query specific fusion for image retrieval. In: Proceedings of European Conference on Computer Vision, pp. 660–673 (2012)
99. Zhang, S., Yang, M., Wang, X., Lin, Y., Tian, Q.: Semantic-aware co-indexing for image retrieval. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1673–1680 (2013)
100. Zhang, Y., Jia, Z., Chen, T.: Image retrieval with geometry-preserving visual phrases. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 809–816 (2011)
101. Zhao, G., Yuan, J.: Discovering thematic patterns in videos via cohesive sub-graph mining. In: Proceedings of IEEE International Conference on Data Mining, pp. 1260–1265 (2011)
102. Zhao, G., Yuan, J., Hua, G.: Topical video object discovery from key frames by modeling word co-occurrence prior. IEEE Trans. Image Process. (2015)
103. Zhao, G., Yuan, J., Xu, J., Wu, Y.: Discovery of the thematic object in commercial videos. IEEE Multimed. Mag. 18(3), 56–65 (2011)
104. Zhao, J., Xie, X., Xu, X., Sun, S.: Multi-view learning overview: recent progress and new challenges. Inf. Fusion 38, 43–54 (2017)
105. Zheng, L., Wang, S., Liu, Z., Tian, Q.: Packing and padding: coupled multi-index for accurate image retrieval. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1947–1954 (2014)
106. Zhu, S., Guo, C., Wang, Y., Xu, Z.: What are textons? Int. J. Comput. Vis. 62(1), 121–143 (2005)
107. Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Proceedings of European Conference on Computer Vision, pp. 391–405 (2014)
Chapter 2
Context-Aware Discovery of Visual
Co-occurrence Patterns

Abstract Once images are decomposed into a number of visual primitives, it is of


great interest to cluster these primitives into mid-level visual patterns. However,
conventional clustering of visual primitives, e.g., bag-of-words, usually ignores the
spatial context and multi-feature information among the visual primitives and thus
cannot discover mid-level visual patterns of complex structure. To overcome this
problem, we propose to consider both spatial and feature contexts among visual
primitives for visual pattern discovery in this chapter. We formulate the pattern dis-
covery task as a multi-context-aware clustering problem and propose a self-learning
procedure to iteratively refine the result until it converges. By discovering both spatial
co-occurrence patterns among visual primitives and feature co-occurrence patterns
among different types of features, the proposed method can better address the ambi-
guities of visual primitives.

Keywords Co-occurrence pattern discovery · Visual disambiguity · Multi-context-aware clustering · k-means regularization · Self-learning optimization

2.1 Introduction

It has been a common practice to build a visual vocabulary for image analysis
by visual primitive clustering. However, most existing clustering methods ignore
the spatial structure among the visual primitives [7], thus bringing unsatisfac-
tory results. For example, the popular k-means clustering of visual primitives can
lead to synonymous visual words that overrepresent visual primitives, as well
as polysemous visual words that bring large uncertainties and ambiguities in the
representation [5, 6].
Since visual primitives are not independent of each other, to better address the
visual polysemous and synonymous phenomena, the ambiguities and uncertainties
of visual primitives can be partially resolved through analyzing their spatial contexts
[12, 13], i.e., other primitives in the spatial neighborhood. Two visual primitives,
although they exhibit dissimilar visual features, may belong to the same pattern if they
have the same spatial contexts. Even though they share similar features, they may not

belong to the same visual pattern if their spatial contexts are completely different.
Besides the spatial dependencies among visual primitives, a visual pattern can exhibit
certain feature dependencies among multiple types of features or attributes as well.
Therefore, it is equally interesting to discover spatial and feature co-occurrence
patterns in image data so that we can leverage visual patterns to improve the clustering
of visual primitives.
To address the above problem, we propose to consider spatial and feature contexts
among visual primitives for pattern discovery. By discovering spatial co-occurrence
patterns among visual primitives and feature co-occurrence patterns among different
types of features, our method can effectively reduce the ambiguities of visual primi-
tive clustering. We formulate the pattern discovery problem as multi-context-aware
clustering, where spatial and feature contexts serve as constraints of k-means
clustering to improve the pattern discovery results. A novel self-learning procedure
is proposed to integrate visual pattern discovery into the process of visual primi-
tive clustering. The proposed self-learning procedure is guaranteed to converge, and
experiments on real images validate the effectiveness of our method.

2.2 Multi-context-aware Clustering

In multi-context-aware clustering, each visual primitive $x_n \in \mathcal{X}$ is characterized by $V$ types of features $\{\mathbf{f}_n^{(v)}\}_{v=1}^{V}$, where $\mathbf{f}_n^{(v)} \in \mathbb{R}^{d_v}$. These features of $x_n$ correspond to a feature context group $\mathcal{G}_n^{(f)}$. Meanwhile, collocating with a visual primitive in a local spatial neighborhood, the inclusive visual primitives constitute the spatial contexts of the central one. For each visual primitive $x_n \in \mathcal{X}$, we denote by $\mathcal{G}_n^{(s)} = \{x_n, x_{n_1}, x_{n_2}, \ldots, x_{n_K}\}$ its spatial context group, which can be built by $K$-nearest neighbors ($K$-NN) or $\epsilon$-nearest neighbors ($\epsilon$-NN).
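As a concrete illustration, the following sketch builds the K-NN spatial context group of every primitive from its image coordinates; scikit-learn and the coordinate array are assumptions of the example, not part of the method's specification.

```python
from sklearn.neighbors import NearestNeighbors

def spatial_context_groups(coords, K=8):
    """Return, for each visual primitive, the index set {n, n_1, ..., n_K} of
    itself and its K nearest spatial neighbors.
    coords: (N, 2) array of primitive locations in the image."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(coords)   # +1: the point itself
    _, idx = nn.kneighbors(coords)                         # idx[n, 0] == n
    return [list(row) for row in idx]
```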

2.2.1 Regularized k-means Formulation with Multiple Contexts

Each type of features $\{\mathbf{f}_n^{(v)}\}_{n=1}^{N}$ can produce a feature word lexicon $\Omega_v$ ($|\Omega_v| = M_v$) by k-means clustering with the objective function (2.1) minimized:
$$Q_v = \sum_{m=1}^{M_v} \sum_{n=1}^{N} r_{mn}^{(v)}\, d_v\big(\mathbf{u}_m^{(v)}, \mathbf{f}_n^{(v)}\big) = \operatorname{tr}\big(R_v^{\top} D_v\big), \qquad (2.1)$$
where
• $\{\mathbf{u}_m^{(v)}\}_{m=1}^{M_v}$ denote the $M_v$ quantized feature words after clustering, and together they form a feature word matrix $U_v \in \mathbb{R}^{d_v \times M_v}$;
• $R_v \in \mathbb{R}^{M_v \times N}$ is a binary label indicator matrix, whose entry $r_{mn}^{(v)} = 1$ only if $\mathbf{f}_n^{(v)}$ is labeled with the $m$th discovered feature word $\mathbf{u}_m^{(v)}$ via clustering;
• $D_v \in \mathbb{R}^{M_v \times N}$ denotes a distortion matrix, whose entry in the $m$th row and $n$th column is given by $d_v(\mathbf{u}_m^{(v)}, \mathbf{f}_n^{(v)})$, i.e., the distortion between $\mathbf{u}_m^{(v)}$ and $\mathbf{f}_n^{(v)}$.
To consider multiple types of features, we let each $x_n \in \mathcal{X}$ generate a feature context transaction $\mathbf{t}_n^{(f)} \in \mathbb{R}^{\sum_{v=1}^{V} M_v}$ to represent $\mathcal{G}_n^{(f)}$.

Definition 2.1 (Feature context transaction) The feature context transaction of the visual primitive $x_n$, denoted by $\mathbf{t}_n^{(f)}$, refers to the co-occurrences of multiple types of feature words in the feature context group of $x_n$.

Using the label indicator matrices $\{R_v\}_{v=1}^{V}$ obtained from k-means clustering on the $V$ types of features, we can represent the feature context transaction database as a binary matrix
$$T_f = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_V \end{bmatrix}. \qquad (2.2)$$
Therefore, $T_f \in \mathbb{R}^{\sum_{v=1}^{V} M_v \times N}$, and $\mathbf{t}_n^{(f)}$ is the $n$th column of $T_f$. Similar to single-feature clustering, we propose to minimize the objective function (2.3) to obtain a mid-level feature pattern lexicon $\Omega_f$ ($|\Omega_f| = M_f$), which actually provides a partition of the given data in $\mathcal{X}$ using multiple features:
$$Q_f = \sum_{m=1}^{M_f} \sum_{n=1}^{N} r_{mn}^{(f)}\, d_f\big(\mathbf{u}_m^{(f)}, \mathbf{t}_n^{(f)}\big) = \operatorname{tr}\big(R_f^{\top} D_f\big), \qquad (2.3)$$
where
• $\{\mathbf{u}_m^{(f)}\}_{m=1}^{M_f}$ denote the $M_f$ quantized feature patterns after clustering, and they form a feature pattern matrix $U_f \in \mathbb{R}^{\sum_{j=1}^{V} M_j \times M_f}$;
• $R_f \in \mathbb{R}^{M_f \times N}$ is a binary label indicator matrix, whose entry $r_{mn}^{(f)} = 1$ only if $\mathbf{t}_n^{(f)}$ is included in the $m$th discovered feature pattern $\mathbf{u}_m^{(f)}$ via clustering;
• $D_f \in \mathbb{R}^{M_f \times N}$ denotes a distortion matrix, whose entry in the $m$th row and $n$th column is given by $d_f(\mathbf{u}_m^{(f)}, \mathbf{t}_n^{(f)})$, i.e., the distortion between $\mathbf{u}_m^{(f)}$ and $\mathbf{t}_n^{(f)}$.
Besides multi-feature information, we further explore the spatial dependencies among visual primitives and represent $\mathcal{G}_n^{(s)}$ as a spatial context transaction.

Definition 2.2 (Spatial context transaction) The spatial context transaction of the visual primitive $x_n$, denoted by $\mathbf{t}_n^{(s)}$, refers to the co-occurrences of different categories of visual primitives appearing in the spatial context group of $x_n$.

The spatial context transaction database can be represented as a sparse integer matrix $T_s \in \mathbb{R}^{M_f \times N}$, where each column is a spatial context transaction $\mathbf{t}_n^{(s)} \in \mathbb{Z}^{M_f}$. The entry $t_{mn}^{(s)} = c$ indicates that the $n$th transaction contains $c$ visual primitives belonging to the $m$th category. Similarly, we can find a higher-level spatial pattern lexicon $\Omega_s$ ($|\Omega_s| = M_s$) by clustering the spatial context transactions. The minimization objective function is given by
$$Q_s = \sum_{m=1}^{M_s} \sum_{n=1}^{N} r_{mn}^{(s)}\, d_s\big(\mathbf{u}_m^{(s)}, \mathbf{t}_n^{(s)}\big) = \operatorname{tr}\big(R_s^{\top} D_s\big), \qquad (2.4)$$
where
• $\{\mathbf{u}_m^{(s)}\}_{m=1}^{M_s}$ denote the $M_s$ quantized spatial patterns after clustering, and they form a spatial pattern matrix $U_s \in \mathbb{R}^{M_f \times M_s}$;
• $R_s \in \mathbb{R}^{M_s \times N}$ is a binary label indicator matrix, whose entry $r_{mn}^{(s)} = 1$ only if $\mathbf{t}_n^{(s)}$ is included in the $m$th discovered spatial pattern $\mathbf{u}_m^{(s)}$ via clustering;
• $D_s \in \mathbb{R}^{M_s \times N}$ denotes a distortion matrix, whose entry in the $m$th row and $n$th column is given by $d_s(\mathbf{u}_m^{(s)}, \mathbf{t}_n^{(s)})$, i.e., the distortion between $\mathbf{u}_m^{(s)}$ and $\mathbf{t}_n^{(s)}$.

Fig. 2.1 Pattern discovery along the solid arrows and visual disambiguity along the dashed arrows
After obtaining spatial patterns, we aim to refine the uncertain clustering of visual primitives. Such a refinement should enable spatial patterns to help improve the construction of feature patterns. Afterward, each type of feature words will also be adjusted according to the tuned feature patterns. Then, the multiple types of updated feature words can learn more accurate feature patterns and spatial patterns from the bottom up again. We show the idea in Fig. 2.1. To achieve this objective, we propose to minimize (2.1) regularized by (2.3) and (2.4). The objective function thus becomes
$$Q = \sum_{v=1}^{V} \operatorname{tr}\big(R_v^{\top} D_v\big) + \lambda_f\, \operatorname{tr}\big(R_f^{\top} D_f\big) + \lambda_s\, \operatorname{tr}\big(R_s^{\top} D_s\big) = \underbrace{\operatorname{tr}\big(R^{\top} D\big)}_{Q_{\mathcal{F}}} + \lambda_f \underbrace{\operatorname{tr}\big(R_f^{\top} D_f\big)}_{Q_f} + \lambda_s \underbrace{\operatorname{tr}\big(R_s^{\top} D_s\big)}_{Q_s}, \qquad (2.5)$$
where
• $\lambda_f > 0$ and $\lambda_s > 0$ are constants for regularization;
• $Q_{\mathcal{F}}$, $Q_f$, and $Q_s$ are the total quantization distortion of the multiple types of features, the quantization distortion of the feature context transactions, and the quantization distortion of the spatial context transactions, respectively;
• $R$ and $D$ are block diagonal matrices formed from $\{R_v\}_{v=1}^{V}$ and $\{D_v\}_{v=1}^{V}$.

As $Q_{\mathcal{F}}$, $Q_f$, and $Q_s$ are correlated with each other, it is intractable to minimize $Q$ by minimizing the three terms separately, which makes the objective function (2.5) challenging. We will in Sect. 2.2.2 show how to decouple the dependencies among them and propose our algorithm to solve this optimization problem.

2.2.2 Self-learning Optimization

We initialize feature words, feature patterns, and spatial patterns gradually by k-means clustering, minimizing (2.1), (2.3), and (2.4). During k-means clustering, we use the squared Euclidean distance to measure $d_v(\cdot, \cdot)$ in each feature space. Since feature context transactions are binary, we use the Hamming distance to measure $d_f(\cdot, \cdot)$, which leads to
$$D_f = -2U_f^{\top} T_f + \mathbf{1}_{T_f} T_f + U_f^{\top} \mathbf{1}_{U_f} = -2U_f^{\top} R Z_f + \mathbf{1}_{T_f} R Z_f + U_f^{\top} \mathbf{1}_{U_f}, \qquad (2.6)$$
where $\mathbf{1}_{T_f}$ is an $M_f \times \sum_{v=1}^{V} M_v$ all-ones matrix, $\mathbf{1}_{U_f}$ is a $\sum_{v=1}^{V} M_v \times N$ all-ones matrix, and $Z_f \in \mathbb{R}^{VN \times N}$ is the concatenation of $V$ identity matrices of size $N \times N$. Following (2.6), we can obtain a similar distortion matrix for spatial context transactions:
$$D_s = -2U_s^{\top} T_s + \mathbf{1}_{T_s} T_s + U_s^{\top} \mathbf{1}_{U_s} = -2U_s^{\top} R_f Z_s + \mathbf{1}_{T_s} R_f Z_s + U_s^{\top} \mathbf{1}_{U_s}, \qquad (2.7)$$
where $\mathbf{1}_{T_s}$ is an $M_s \times M_f$ all-ones matrix, $\mathbf{1}_{U_s}$ is an $M_f \times N$ all-ones matrix, and $Z_s$ is an $N \times N$ matrix whose entry $q_{ij} = 1$ only if $x_i$ and $x_j$ are local spatial neighbors. It is worth noting that the matrix in (2.7) no longer indicates pairwise distances but only distortion penalties, unless the spatial context transactions are all binary.
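For illustration, a minimal NumPy sketch of the matrix form in (2.6) is given below; the function name and the binary inputs are assumed for the example.

```python
import numpy as np

def hamming_distortion(Uf, Tf):
    """Distortion matrix between feature-pattern centroids and binary feature
    context transactions, following Eq. (2.6):
        D_f = -2 Uf^T Tf + 1_{Tf} Tf + Uf^T 1_{Uf}.
    Uf: (sum_v M_v, M_f) pattern matrix; Tf: (sum_v M_v, N) binary transactions."""
    ones_T = np.ones((Uf.shape[1], Uf.shape[0]))   # 1_{Tf}: M_f x sum_v M_v
    ones_U = np.ones((Uf.shape[0], Tf.shape[1]))   # 1_{Uf}: sum_v M_v x N
    return -2 * Uf.T @ Tf + ones_T @ Tf + Uf.T @ ones_U
```

For binary inputs, entry (m, n) of the returned matrix equals the Hamming distance between pattern m and transaction n.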
To decouple the dependencies among the terms of (2.5), we take each of $R_f$, $R$, and $R_s$ as the common factor for extraction and derive (2.5) as
$$Q(R, R_f, R_s, D, D_f, D_s) = \operatorname{tr}\big(R_f^{\top} H_f\big) + \operatorname{tr}\big(R^{\top} D\big) + \lambda_s \operatorname{tr}\big(R_s^{\top} U_s^{\top} \mathbf{1}_{U_s}\big) \qquad (2.8)$$
$$= \operatorname{tr}\big(R^{\top} H\big) + \lambda_s \operatorname{tr}\big(R_s^{\top} D_s\big) + \lambda_f \operatorname{tr}\big(R_f^{\top} U_f^{\top} \mathbf{1}_{U_f}\big) \qquad (2.9)$$
$$= \operatorname{tr}\big(R_s^{\top} H_s\big) + \operatorname{tr}\big(R^{\top} D\big) + \lambda_f \operatorname{tr}\big(R_f^{\top} D_f\big), \qquad (2.10)$$
in which
$$H_f = \lambda_f D_f - \lambda_s \big(2U_s^{\top} - \mathbf{1}_{T_s}\big)^{\top} R_s Z_s^{\top}, \qquad (2.11)$$
$$H = D - \lambda_f \big(2U_f^{\top} - \mathbf{1}_{T_f}\big)^{\top} R_f Z_f^{\top}, \qquad (2.12)$$
$$H_s = \lambda_s D_s, \qquad (2.13)$$

where the sizes of $H_f$, $H$, and $H_s$ are $M_f \times N$, $\sum_{v=1}^{V} M_v \times VN$, and $M_s \times N$, respectively, and $H$ contains $V$ diagonal blocks $\{H_v\}_{v=1}^{V}$.

Algorithm 1: Visual Pattern Discovery with Multi-Context-Aware Clustering (MCAC)
Input: $\mathcal{X} = \{x_n\}_{n=1}^{N}$; $Z_f$; $Z_s$; parameters $\{M_v\}_{v=1}^{V}$, $M_f$, $M_s$, $\lambda_f$, $\lambda_s$
Output: feature word lexicons $\{\Omega_v\}_{v=1}^{V}$ ($\{U_v\}_{v=1}^{V}$); feature pattern lexicon $\Omega_f$ ($U_f$); spatial pattern lexicon $\Omega_s$ ($U_s$); clustering results $\{R_v\}_{v=1}^{V}$, $R_f$, $R_s$
// Initialization
1: perform k-means clustering from bottom up to obtain $\{U_v\}_{v=1}^{V}$, $U_f$, $U_s$
// Main loop
2: repeat
3:   repeat
4:     R-step: fix $\{U_v\}_{v=1}^{V}$, $U_f$, $U_s$; successively update $\{R_v\}_{v=1}^{V}$, $R_f$, $R_s$ top-down and bottom-up
5:   until $Q$ is no longer decreasing
6:   D-step: fix $\{R_v\}_{v=1}^{V}$, $R_f$, $R_s$; update $\{U_v\}_{v=1}^{V}$, $U_f$, $U_s$
7: until $Q$ has converged
// Solution
8: return $\{U_v\}_{v=1}^{V}$, $U_f$, $U_s$, $\{R_v\}_{v=1}^{V}$, $R_f$, $R_s$

We then successively update the three label indicator matrices R_f, R, and R_s while fixing the cluster centroid matrices U_f, {U_v}_{v=1}^V, and U_s. To minimize (2.5), the following update criteria for the label indicator matrices are adopted, for n = 1, 2, . . . , N:

r_{mn}^{(f)} = 1 if m = arg min_k h_{kn}^{(f)}, and 0 otherwise,   (2.14)
r_{mn}^{(v)} = 1 if m = arg min_k h_{kn}^{(v)}, and 0 otherwise,   (2.15)
r_{mn}^{(s)} = 1 if m = arg min_k h_{kn}^{(s)}, and 0 otherwise,   (2.16)

where h_{kn}^{(f)}, r_{mn}^{(f)}, h_{kn}^{(v)}, r_{mn}^{(v)}, h_{kn}^{(s)}, and r_{mn}^{(s)} are the entries of H_f, R_f, H_v, R_v, H_s, and R_s,
respectively. As long as the objective function of (2.5) is decreasing, R_v and R can be continually refined, followed by the bottom-up updates of R_f and R_s.
Furthermore, given the label indicator matrices R_f, R, and R_s, the corresponding centroid matrices U_f, {U_v}_{v=1}^V, and U_s can be updated, and so can the corresponding distortion matrices D_f, {D_v}_{v=1}^V, and D_s, which will also make the objective function of (2.5) decrease.
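As a rough illustration of one R-step assignment in the spirit of (2.14)–(2.16), followed by a D-step centroid refresh, the following NumPy sketch updates a single clustering level; the variable names and sizes are hypothetical, and only the plain squared-Euclidean case is shown rather than the full multi-context objective.

import numpy as np

def r_step(H):
    """One-hot label indicators: r[m, n] = 1 iff m = argmin_k H[k, n] (cf. (2.14)-(2.16))."""
    M, N = H.shape
    R = np.zeros((M, N))
    R[np.argmin(H, axis=0), np.arange(N)] = 1.0
    return R

def d_step(X, R):
    """Refresh centroids given assignments: each column of U is the mean of its assigned samples."""
    counts = np.maximum(R.sum(axis=1), 1.0)          # avoid division by zero for empty clusters
    return (X @ R.T) / counts                        # d x M centroid matrix

# Toy usage with hypothetical data: 5-dimensional samples, 3 clusters
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 40))                         # d x N data matrix
U = X[:, rng.choice(40, 3, replace=False)]           # d x M initial centroids
H = ((X[None, :, :] - U.T[:, :, None]) ** 2).sum(axis=1)   # M x N squared-distance costs
R = r_step(H)
U = d_step(X, R)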
We summarize the proposed visual pattern discovery method with multi-context-aware clustering (MCAC) in Algorithm 1. This algorithm is convergent since the solution spaces of R, R_f, and R_s are discrete and finite, and the objective function (2.5) is monotonically decreasing at each step. Clearly, the proposed MCAC degenerates to the visual pattern discovery method with spatial context-aware

clustering (SCAC) [11] if there is only one type of feature and we set λ_f = 0 in (2.5) to remove the Q_f term. The complexity of the proposed algorithm is similar to that of k-means clustering, since our method only needs a finite number of k-means clustering runs.

2.3 Experiments

In the experiments, we set M_v = M_f, v = 1, 2, . . . , V for the proposed MCAC. Besides, to help parameter tuning, we let λ_f = β_f |Q_Σ^0 / Q_f^0| and λ_s = β_s |Q_Σ^0 / Q_s^0|, where Q_X^0 (X ∈ {Σ, f, s}) is the initial value of the corresponding term Q_X in (2.5), and the nonnegative constants β_f and β_s are auxiliary parameters that balance the influences from feature co-occurrences and spatial co-occurrences, respectively.

2.3.1 Spatial Visual Pattern Discovery

Given a single image, we detect visual primitives X = {x_n}_{n=1}^N and use one or more (e.g., V types of) features to depict each of them. Next, we apply spatial K-NN grouping to build the spatial context group database {G_n^{(s)}}_{n=1}^N. After that, we conduct spatial pattern discovery using SCAC and the proposed MCAC. The results are shown in Figs. 2.2 and 2.3.
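A minimal sketch of how such spatial context groups could be built from primitive locations with a K-d tree is given below; the K value and array names are illustrative and not the authors' implementation.

import numpy as np
from scipy.spatial import cKDTree

def build_spatial_groups(positions, K=8):
    """For each visual primitive, return the indices of its K nearest spatial neighbors
    (excluding itself), i.e., one spatial context group G_n^(s) per primitive."""
    tree = cKDTree(positions)
    # query K+1 neighbors because the nearest returned point is the primitive itself
    _, idx = tree.query(positions, k=K + 1)
    return [row[1:].tolist() for row in idx]

# Toy usage: 2604 primitives with (x, y) image coordinates (random here)
positions = np.random.default_rng(2).uniform(0, 500, size=(2604, 2))
groups = build_spatial_groups(positions, K=8)
print(len(groups), len(groups[0]))  # 2604 groups of 8 neighbors each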
As shown in Fig. 2.2, the test image is a mono-colored LV monogram fabric image.
Because of cloth warping, the monogram patterns are deformed, which makes pattern
discovery more challenging. We detect 2604 image patches as visual primitives and
use SIFT features to describe them [3]. To build spatial context groups, K -NN with
K = 8 is applied. The other parameters for SCAC are set as M_1 = 20, M_s = 4, and β_s = 1.
In Fig. 2.2, we use different colors to indicate different (4 in total) discovered spatial
patterns. It is interesting to notice that SCAC can locate the monogram patterns of
different spatial structures. In comparison, without considering spatial contexts of
visual primitives, k-means clustering cannot obtain satisfactory results.
A comparison between SCAC and MCAC is shown in Fig. 2.3, where 422 image patches [3] are extracted. In SCAC, SIFT features [3] are used to describe these patches, while in MCAC the patches are represented by SIFT features [3] and color histograms (CHs) [2]. Both methods construct spatial context groups by K-NN with K = 12 and aim to detect three categories of spatial patterns: human faces, text logos, and background edges. We highlight the instances of each discovered spatial pattern. The 1st column shows the results of SCAC, with parameters M_1 = 10, M_s = 3, and β_s = 0.8. The results in the 2nd column are based on MCAC, with parameters M_v = 10, v = 1, 2, M_f = 10, M_s = 3, β_f = 1.5, and β_s = 0.8. The results show that the discovered patterns are more accurate when using MCAC. In particular, there is more confusion between face patterns and edge patterns when using SCAC than when using MCAC.

Fig. 2.2 Pattern discovery from a mono-colored LV monogram picture. [2014] IEEE. Reprinted,
with permission, from Ref. [8]

Fig. 2.3 Pattern discovery from a colored group photograph. [2014] IEEE. Reprinted, with
permission, from Ref. [8]

2.3.2 Image Region Clustering Using Multiple Contexts

To evaluate how much feature contexts and spatial contexts can improve the clustering performance, we perform image region clustering on the MSRC-V2 dataset [10]. The ground-truth labeling of MSRC-V2 is provided by [4]. As shown in Fig. 2.4, we collect five region compositions for the experiments. To distinguish different region segmentations, multiple features have to be fused. Taking Fig. 2.5 as an example, while the color feature can distinguish sheep and cow, it cannot distinguish aeroplane, boat, or bicycle. Therefore, we describe each region segmentation with the following three features: color histogram (CH), texton histogram (TH) [2], and pyramid of HOG (pHOG) [1]. The feature dimensions of CH, TH, and pHOG are 69, 400, and 680, respectively. Given an image region, all other regions in the same image are considered to be in its spatial context group. Each scene category has its own region

Fig. 2.4 Sample images of five region compositions: sheep+grass, cow+grass, aeroplane+grass+sky, boat+water, and bicycle+road

Fig. 2.5 Illustration of different features used to distinguish different region segmentations.
[2013] IEEE. Reprinted, with permission, from Ref. [9]

Fig. 2.6 Class disambiguation by using spatial contexts. [2014] IEEE. Reprinted, with permis-
sion, from Ref. [8]

compositions, and our goal is to cluster image regions by leveraging the spatial co-occurrence patterns. For example, visual features may suffer from the confusion between the sheep class and the road class, as shown in Fig. 2.6, where the sheep regions are mislabeled as the road class. However, by exploring the spatial contexts of image regions, the proposed MCAC is expected to better distinguish the two classes. Specifically, grass regions favor labeling their co-occurring image regions as the sheep class, and similarly, the bicycle regions with correct labels can support the co-occurring road regions.

Fig. 2.7 Confusion matrices of clustering on four categories of regions. [2014] IEEE. Reprinted,
with permission, from Ref. [8]

For evaluation, we first experiment on a subset of images with two region pairs
that often appear together: sheep+grass and bicycle+road. Sample images are
shown in the leftmost and rightmost columns of Fig. 2.4. Each region pair has 27
image instances. There are in total 31 sheep regions, 32 grass regions, 27 bicycle
regions, and 32 road regions. Because the spatial contexts of a region are the regions
occurring in the same image, the spatial contextual relations only appear between
regions of sheep and grass or regions of bicycle and road. We show the
confusion matrices of k-means clustering and our multi-context-aware clustering in
Fig. 2.7. The parameters used are as follows: k = 4 for k-means clustering; and M_v = 4, v = 1, 2, 3, M_f = 4, M_s = 2, β_f = 3.5, β_s = 1 for MCAC. We observe that k-means clustering easily mislabels bicycle as sheep when using TH features. This is because TH features encode the texture of regions, and sheep regions have texture similar to that of bicycle regions. When using CH features, it is easy to mislabel sheep regions as road regions because of their similar colors. Also, with similar shape features, quite a few sheep regions are mislabeled as the bicycle class when using pHOG features. Besides the limited description ability of a single type of feature, because k-means does not consider the spatial dependencies among regions, it also causes confusion among different classes. By considering the feature co-occurrences
of CH, TH and pHOG, and the spatial co-occurrences of sheep and grass regions,
as well as bicycle and road regions, the proposed MCAC can well improve the

Table 2.1 Results of image region clustering on the MSRC-V2 subset, sample images of which
are shown in Fig. 2.4. Based on Ref. [8]
Method Error(%)
k-means clustering using TH 44.31
k-means clustering using CH 55.21
k-means clustering using pHOG 47.63
k-means clustering using TH+CH+pHOG 38.39
MCAC using all features 29.86

clustering results on individual features and finally reduce the confusion among the
region classes. Specifically, our method can leverage the grass regions to correct
the confused sheep regions and vice versa. A similar improvement can be observed
for bicycle and road.
In the above experiment, we show the advantage of the proposed MCAC in dealing with image regions that have clear spatial contexts. However, Fig. 2.4 shows that image regions may also have ambiguous spatial contexts, which will likewise be used to evaluate the proposed method. Specifically, we collect 30 sheep+grass, 29 cow+grass, 30 aeroplane+grass+sky, 31 boat+water, and 30 bicycle+road images. The numbers of sheep, grass, cow, sky, aeroplane, boat, water, bicycle, and road regions are 34, 104, 34, 53, 30, 47, 39, 30, and 51, respectively. Notice that in this challenging dataset, different image regions may share the same spatial context. For example, grass occurs in three different scenes: sheep+grass, cow+grass, and aeroplane+grass+sky.
The results of k-means clustering and MCAC are shown in Table 2.1, where the same 10% of seeds per category from the ground truth are randomly chosen for initialization. The clustering error rate of the proposed MCAC is 29.86%. It brings a considerable improvement over the best result (i.e., 33.65%) obtained by k-means clustering on the individual features or the concatenated multiple features. We can make similar observations in terms of average precision and average recall. In k-means clustering, we set k = 9 as there are 9 different types of image regions. The parameters used in MCAC are M_v = 9, v = 1, 2, 3, M_f = 9, M_s = 5, β_f = 3.5, and β_s = 1.
Some representative clustering results of the proposed MCAC are shown in Fig. 2.8. Despite large intra-class variations, our method can still obtain a satisfactory clustering result by using both spatial and feature contexts. For example, the cow regions appear with different colors and from different perspectives. We also note that there may be water regions in some sheep+grass and cow+grass region compositions. This small number of water regions is mislabeled as the grass class because of their shared preference for cow/sheep contexts. Moreover, because their feature appearance and spatial contexts are similar, there is still confusion between a few regions of sheep and cow, bicycle and sheep, boat and aeroplane, water and sky, boat and bicycle, and water and road. Nevertheless, such mislabeled regions are only a minority.

Fig. 2.8 Exemplar clustering results of MCAC. [2014] IEEE. Reprinted, with permission, from
Ref. [8]

2.4 Summary of this Chapter

The structure and content variations of complex visual patterns greatly challenge most existing methods for discovering meaningful visual patterns in images. We propose a novel pattern discovery method that composes low-level visual primitives, e.g., local image patches or regions, into high-level visual patterns with spatial structure. Instead of ignoring the spatial dependencies among visual primitives and simply performing k-means clustering to obtain the visual vocabulary, we explore spatial contexts and discover the co-occurrence patterns to resolve the ambiguities among visual primitives. To solve the regularized k-means clustering, an iterative top-down/bottom-up procedure is developed. The proposed self-learning procedure can iteratively refine the pattern discovery results and is guaranteed to converge. Furthermore, we explore feature contexts and utilize the co-occurrence patterns among multiple types of features to handle the content variations of visual patterns. By

doing so, our method can leverage multiple types of features to further improve the
performance of clustering and pattern discovery. The experiments on spatial visual
pattern discovery and image region clustering validate the advantages of the proposed
method.

References

1. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: Proceedings of the International Conference on Image and Video Retrieval, pp. 401–408 (2007)
2. Lee, Y., Grauman, K.: Object-graphs for context-aware visual category discovery. IEEE Trans. Pattern Anal. Mach. Intell. 34(2), 346–358 (2012)
3. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
4. Malisiewicz, T., Efros, A.: Improving spatial support for objects via multiple segmentations. In: Proceedings of British Machine Vision Conference, vol. 2 (2007)
5. Russell, B., Freeman, W., Efros, A., Sivic, J., Zisserman, A.: Using multiple segmentations to discover objects and their extent in image collections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1605–1614 (2006)
6. Su, Y., Jurie, F.: Visual word disambiguation by semantic contexts. In: Proceedings of IEEE International Conference on Computer Vision, pp. 311–318 (2011)
7. Tuytelaars, T., Lampert, C., Blaschko, M., Buntine, W.: Unsupervised object discovery: a comparison. Int. J. Comput. Vis. 88(2), 284–302 (2010)
8. Wang, H., Yuan, J., Wu, Y.: Context-aware discovery of visual co-occurrence patterns. IEEE Trans. Image Process. 23(4), 1805–1819 (2014)
9. Weng, C., Wang, H., Yuan, J.: Hierarchical sparse coding based on spatial pooling and multi-feature fusion. In: Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1–6 (2013)
10. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary. In: Proceedings of IEEE International Conference on Computer Vision (2005)
11. Yuan, J., Wu, Y.: Context-aware clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
12. Yuan, J., Wu, Y.: Mining visual collocation patterns via self-supervised subspace learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 42(2), 1–13 (2012)
13. Yuan, J., Wu, Y., Yang, M.: From frequent itemsets to semantically meaningful visual patterns. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 864–873 (2007)
Chapter 3
Hierarchical Sparse Coding for Visual
Co-occurrence Discovery

Abstract In this chapter, we investigate soft assignments instead of hard assign-


ments used in Chap. 2 and propose a hierarchical sparse coding method to learn
representative mid-level visual phrases. Given multiple types of low-level visual
primitive features, we first learn their sparse codes, respectively. Then, we cast these
sparse codes into mid-level visual phrases by spatial pooling in spatial space. Besides
that, we also concatenate the sparse codes of multiple feature types to discover fea-
ture phrases in feature space. After that, we further learn the sparse codes for the
formed visual phrases in spatial and feature spaces, which can be more representative
compared with the low-level sparse codes of visual primitive features. The superior
results on various tasks of visual categorization and pattern discovery validate the
effectiveness of the proposed approach.

Keywords Visual phrase learning · Hierarchical sparse coding · Spatial pooling · Multi-feature fusion · Back-propagation

3.1 Introduction

The bag-of-words (BoW) model [21] is one of the most popular image representation
methods for solving visual recognition problems. It utilizes k-means clustering to
quantize local features of visual primitives into visual words so that local features
of an image can be pooled into a histogram to form a global image representation.
We have also improved the BoW model for visual pattern discovery by integrating
spatial context information and multi-feature evidences of visual primitives into the
process of k-means clustering in Chap. 2. Compared with the BoW model, however, a more biologically plausible image representation method is the sparse coding algorithm, which is inspired by the V1 cortex in the human brain [18]. Sparse coding representation methods have gained much popularity due to their state-of-the-art performance in many multimedia and computer vision applications [1, 3, 19, 20, 26,
28]. Despite previous successes, due to the semantic gap between low-level features
and high-level concepts [15], it is difficult for traditional sparse coding of low-level
features to learn representative and discriminative visual patterns.


To address the above issue, we follow Chap. 2 to learn mid-level visual phrases
by incorporating spatial context information and multi-feature evidences of visual
primitives. The main difference lies in that Chap. 2 only learns the hard quantization
codes for visual patterns from the unsupervised k-means algorithm, while we aim
in this chapter to learn more representative sparse codes for visual phrases from
image data. Given multiple types of low-level visual primitive features, we first learn
their sparse codes respectively by conventional sparse coding algorithms. Then, we
cast these visual primitive sparse codes of a local spatial neighborhood into mid-
level visual phrases by spatial pooling. Along with the spatial pooling in the spatial
space, we also fuse the multiple types of visual primitive sparse codes by feature
concatenation in the feature space. After that, we further learn the sparse codes for
the visual phrases which can be more representative compared with the low-level
visual primitive sparse codes. To improve the discriminativeness of the visual phrase
representations, we can train the aforementioned hierarchical sparse coding from
each category of image data.
We optimize the codebooks for both visual primitive and visual phrase features
in one unified framework combined with the back-propagation method. The experi-
ments on image pattern discovery, image scene clustering, and scene categorization
justify the advantages of the proposed algorithm.

3.2 Spatial Context-Aware Multi-feature Sparse Coding

Since the proposed hierarchical sparse coding is spatial context-aware and enables multi-feature fusion, we refer to it as spatial context-aware multi-feature sparse coding and present it in detail in this section.

3.2.1 Learning Spatial Context-Aware Visual Phrases

We begin to introduce the mid-level visual phrase learning method by considering


the spatial neighborhood of low-level visual primitives. As shown in Fig. 3.1, we
construct the mid-level visual phrases, e.g., image patterns, from low-level visual
primitives, e.g., local image patches or regions, of a local spatial neighborhood.
We follow the traditional descriptor–coding–pooling pipeline in the first layer to encode local visual primitives. Then, in the second layer, we discover spatial context-aware and multi-feature fused visual phrases. We use a sparse coding method to encode visual phrases so that we can then create global image representations based on these sparse-coded visual phrases. Compared with coding individual visual primitives, a visual phrase has a more complex structure and conveys richer information, and thus can be more representative.
In the following, we will discuss our spatial context-aware visual phrase learning
algorithm step by step as shown in Fig. 3.1.

Fig. 3.1 The proposed discriminative visual phrase learning algorithm via spatial context-aware
multi-feature fusion sparse coding, in which the main variables to be optimized are the classifier
weights matrix W, the visual phrase codebook U and sparse codes V, and the visual primitive
codebook B and sparse codes C. Based on Ref. [23]

Visual Primitive Sparse Coding. We start from the visual primitive sparse coding for images. An image is represented by a set of local descriptors D = [d_1, d_2, . . . , d_M] ∈ R^{P×M}, where each column vector d_i represents a visual primitive. Given a codebook B ∈ R^{P×K_1}, where K_1 is the dictionary size of the codebook, the sparse coding representation C ∈ R^{K_1×M} of the descriptor set D can be calculated as follows:

C = arg min_C Q_1 = arg min_C ∥D − BC∥_2^2 + λ_1 ∥C∥_1,   (3.1)

where ∥·∥_2 is the Frobenius norm of a matrix and ∥·∥_1 is the ℓ_1 norm.
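As a rough illustration (not the authors' implementation, which uses the SPAMS toolbox), the sparse codes of (3.1) can be computed with scikit-learn's sparse_encode; the sizes and the regularization value below are placeholders.

import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
P, M, K1 = 128, 200, 64                 # descriptor dim, #primitives, codebook size (hypothetical)
D = rng.normal(size=(P, M))             # local descriptors, one per column
B = rng.normal(size=(P, K1))
B /= np.linalg.norm(B, axis=0)          # unit-norm codebook atoms

# sparse_encode expects row-wise samples, so transpose: rows of D.T are descriptors
C = sparse_encode(D.T, B.T, algorithm="lasso_lars", alpha=0.1).T   # K1 x M sparse codes
print(C.shape, np.mean(C != 0))         # code matrix shape and its density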
Spatial Pooling. In order to incorporate the spatial context information of the low-level visual primitives, we pool the sparse codes of the visual primitives in a local spatial neighborhood by k-NN or ε-NN. The spatial pooling process is illustrated in Fig. 3.2. We consider two commonly used spatial pooling methods, average pooling [21] and max pooling [26]. Assume that z_j is the jth spatially pooled visual phrase and c_i is the sparse code of the local descriptor d_i; then average pooling is given in (3.2):

z_j = (1/|S(j)|) Σ_{i∈S(j)} c_i,   (3.2)

where S(j) denotes the set of local descriptors contained in the jth visual phrase. Max pooling is given in (3.3):

z_j = max_{i∈S(j)} (c_i),   (3.3)

where the max operation is element-wise. As discussed in [5], the max pooling method tends to produce more discriminative representations when soft coding methods are used, while the average pooling method, on the contrary, works better when a hard quantization method is applied.
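A minimal sketch of the two pooling operators over a given neighborhood, with hypothetical array shapes and index sets:

import numpy as np

def average_pool(C, S_j):
    """Average pooling of (3.2): mean of the sparse codes indexed by the neighborhood S_j."""
    return C[:, S_j].mean(axis=1)

def max_pool(C, S_j):
    """Max pooling of (3.3): element-wise maximum of the sparse codes indexed by S_j."""
    return C[:, S_j].max(axis=1)

# Toy usage: K1-dimensional codes for M primitives, one neighborhood of 8 indices
K1, M = 64, 200
C = np.abs(np.random.default_rng(3).normal(size=(K1, M)))
S_j = [0, 5, 9, 12, 20, 33, 41, 57]      # indices of primitives in the jth visual phrase
z_avg, z_max = average_pool(C, S_j), max_pool(C, S_j)
print(z_avg.shape, z_max.shape)          # (64,) each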
Visual Phrase Sparse Coding. After the spatial pooling, we have obtained the visual phrase descriptor set Z = [z_1, z_2, . . . , z_M] ∈ R^{K_1×M}, where each column z_j is a feature vector describing the jth visual phrase. It is worth noting that the spatial context information of the low-level visual primitive features has been incorporated in Z after the spatial pooling. Similar to the sparse coding of visual primitive features, for visual phrase features we can also calculate the sparse codes V ∈ R^{K_2×M} of the descriptor set Z by (3.4), i.e.,

V = arg min_V Q_2 = arg min_V ∥Z − UV∥_2^2 + λ_2 ∥V∥_1,   (3.4)

where U ∈ R^{K_1×K_2} is the given visual phrase codebook and K_2 is the dictionary size.
Visual Phrase Codebook Learning. As our target is to learn representative visual
phrase sparse codes which require high-quality codebooks, we now describe how to

optimize the two codebooks B and U for visual primitive and visual phrase features, respectively. To optimize the visual phrase codebook U, we fix all other variables in (3.4) except for U and then solve (3.5), as discussed in [9]:

U = arg min_U ∥Z − UV∥_2^2,   (3.5)
s.t. ∥u_i∥_2^2 ≤ 1, i = 1, . . . , K_2.
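One way to approximate the constrained update of (3.5) (a sketch, not necessarily the solver of [9]) is a least-squares dictionary update followed by projecting each atom back into the unit ℓ_2 ball:

import numpy as np

def update_phrase_codebook(Z, V, eps=1e-8):
    """Least-squares update of U in (3.5), then project columns onto the unit l2 ball."""
    # U = Z V^T (V V^T)^{-1}, computed via a pseudo-inverse for numerical safety
    U = Z @ np.linalg.pinv(V)
    norms = np.linalg.norm(U, axis=0)
    scale = np.maximum(norms, 1.0)          # only shrink atoms whose norm exceeds 1
    return U / (scale + eps)

# Toy usage with hypothetical sizes K1 = 64, K2 = 32, M = 200
rng = np.random.default_rng(4)
Z = rng.normal(size=(64, 200))
V = rng.normal(size=(32, 200))
U = update_phrase_codebook(Z, V)
print(U.shape, np.max(np.linalg.norm(U, axis=0)))   # (64, 32), max column norm <= 1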

Back-Propagation. In order to optimize the visual primitive codebook B, however, we have to compute the gradient of Q_2 w.r.t. B using the chain rule as follows:

∂Q_2/∂B = Σ_j (∂Q_2/∂z_j) Σ_i (∂z_j/∂c_i)(∂c_i/∂B).   (3.6)

From (3.4), we can easily compute ∂Q_2/∂z_j as shown in (3.7):

∂Q_2/∂z_j = 2(z_j − U v_j).   (3.7)

According to the different spatial pooling methods applied in (3.2) and (3.3), we have different back-propagation results. When the average pooling method in (3.2) is used, (3.6) becomes

∂Q_2/∂B = Σ_j (∂Q_2/∂z_j) Σ_{i∈S(j)} (1/|S(j)|) (∂c_i/∂B).   (3.8)

When the max pooling method in (3.3) is applied, (3.6) becomes

∂Q_2/∂B = Σ_j (∂Q_2/∂z_j) ⊙ sign(y_i^max) ⊙ (∂c_i^max/∂B),   (3.9)

where ⊙ is the element-wise product and c_i^max is obtained as follows:

c_i^max = max_{i∈S(j)} (c_i).   (3.10)

According to (3.6) and (3.9), in order to obtain ∂Q_2/∂B, we need to calculate ∂c_i/∂B.
Visual Primitive Codebook Learning. Since c_i is not directly linked to B according to (3.1), we have to compute ∂c_i/∂B by the implicit differentiation method. First, we calculate the gradient with respect to c_i at its minimum c_i^* for (3.1), as used in [6]:

2(B^T B c_i − B^T d_i)|_{c_i = c_i^*} = −λ_1 sign(c_i)|_{c_i = c_i^*}.   (3.11)


It is worth noting that (3.11) is only correct when c_i = c_i^*. For convenience, in the following we will assume the condition c_i = c_i^* without explicitly showing it in the equations. Then, we calculate the gradient with respect to B on both sides of (3.11) and obtain

∂{2(B^T B c_i − B^T d_i)}/∂b_mn = ∂{−λ_1 sign(c_i)}/∂b_mn,   (3.12)

where b_mn is the element in the mth row and nth column of the codebook B. Note that the right-hand side of (3.12) is not well defined at zero due to the non-continuous property of sign(c_i); therefore, we choose the nonzero coefficients of c_i to form c̃_i, select the corresponding codebook bases B̃ according to c̃_i, and obtain

∂{2(B̃^T B̃ c̃_i − B̃^T d_i)}/∂b_mn = 0.   (3.13)

By expanding (3.13), we can further obtain

B̃^T B̃ (∂c̃_i/∂b_mn) + (∂(B̃^T B̃)/∂b_mn) c̃_i − ∂(B̃^T d_i)/∂b_mn = 0,   (3.14)

which leads to the final result for ∂c̃_i/∂B̃, specifically,

∂c̃_i/∂b_mn = (B̃^T B̃)^{-1} (∂(B̃^T d_i)/∂b_mn − (∂(B̃^T B̃)/∂b_mn) c̃_i).   (3.15)

In practice, due to the sparse solution of c_i, the selected B̃ has far fewer bases than the descriptor dimension. Therefore, (B̃^T B̃)^{-1} can be well conditioned.
The VPL-SC algorithm. To summarize our spatial context-aware visual phrase
learning algorithm, we combine the previously discussed sparse coding and codebook
learning steps for both the visual primitive features and the visual phrase features
and show the proposed VPL-SC algorithm in Algorithm 2. It is worth noting that
our target is to learn representative visual phrase sparse codes V; therefore, we need
to update codebooks B and U via back-propagation. Once the codebooks B and U
are updated, the corresponding sparse codes C and V can be computed according
to (3.1) and (3.4). In the experiments, given the codebooks, we use the SPAMS toolbox1 to compute the sparse codes. To update the codebooks in our algorithms, we use the stochastic gradient descent method until the objective functions converge.

1 SPAMS toolbox. http://spams-devel.gforge.inria.fr/.



Algorithm 2: Visual Phrase Learning via Spatial Context-aware sparse coding (VPL-SC)
input : visual primitive descriptors set D, spatial neighborhood structure S for spatial
pooling ;
output: visual primitive sparse codes C, visual phrase sparse codes V
init B by k-means on D ;
init C by (3.1) ;
init Z by (3.2) or (3.3) ;
init U by k-means on Z ;
init V by (3.4) ;
while Q 2 is decreasing do
update U by (3.5) ;
update B by (3.6) ;
update C by (3.1) ;
update Z by (3.2) or (3.3) ;
update V by (3.4) ;
return C, V

3.2.2 Learning Multi-feature Fused Visual Phrases

In Sect. 3.2.1, we have discussed the mid-level visual phrase learning algorithm,
VPL-SC, with a single type of low-level visual primitive features. Now let us consider
fusing different types of visual primitive features together to obtain more descriptive
visual phrase features.
Assume that we have T different types of visual primitive descriptor sets D = {D^(1), D^(2), . . . , D^(T)}. For each descriptor set D^(t), we can obtain the corresponding codebook B^(t), the sparse codes C^(t) by (3.1), and the spatially pooled representations Z^(t) by average pooling in (3.2) or max pooling in (3.3). After that, we can concatenate all the Z^(t) as follows:

z_i = [z_i^(t)]_{t=1}^{T},   (3.16)

where [·] is the vector concatenation operator.
After the concatenation, the new descriptor set Z for the visual phrases contains
both spatial context information and multi-feature evidences, which can be more
descriptive than VPL-SC that only uses a single type of visual primitive features.
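A minimal sketch of the late-fusion concatenation in (3.16), assuming each feature type has already been sparse-coded and spatially pooled:

import numpy as np

def fuse_phrase_descriptors(Z_list):
    """Late fusion of (3.16): stack the pooled sparse codes of all feature types column-wise,
    so each visual phrase descriptor is the concatenation [z^(1); ...; z^(T)]."""
    return np.vstack(Z_list)

# Toy usage: T = 2 feature types with codebook sizes 64 and 32, M = 200 visual phrases
rng = np.random.default_rng(5)
Z_sift, Z_color = rng.random((64, 200)), rng.random((32, 200))
Z = fuse_phrase_descriptors([Z_sift, Z_color])
print(Z.shape)   # (96, 200): spatial context plus multi-feature evidence per phrase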
In order to update each B^(t) and C^(t), the back-propagation (3.6) becomes

∂Q/∂B^(t) = Σ_j (∂Q/∂z_j^(t)) Σ_i (∂z_j^(t)/∂c_i^(t))(∂c_i^(t)/∂B^(t)),   (3.17)

where each ∂Q/∂z_j^(t) is a component of ∂Q/∂z_j.

Fig. 3.2 Illustration of spatial pooling and multi-feature fusion. [2013] IEEE. Reprinted, with
permission, from Ref. [23]

Although multi-feature fusion can be exploited as early fusion by concatenating


visual primitive descriptors, we use late fusion by concatenating visual primitive
sparse codes. This is due to the consideration that early fusion method forces differ-
ent visual primitive descriptor sets to share one common codebook, which can be
problematic if different visual primitives have different codebook sizes. For example,
the codebook size of color histogram features can differ from the codebook size of
shape features. On the contrary, late fusion method allows training different code-
books for different visual primitive descriptor sets and then fusing different sparse
codes for the visual phrases to share multi-feature evidences. The benefits of late
fusion will be further explored in the experiments.
The multi-feature fusion process discussed above is illustrated in Fig. 3.2. In summary, we show the proposed spatial context-aware and multi-feature fused visual phrase learning algorithm VPL-SC-MF in Algorithm 3.

3.3 Experiments

3.3.1 Spatial Visual Pattern Discovery

In the first experiment, to illustrate the effectiveness of the proposed method in terms
of encoding spatial context information for discovering visual phrases, we evaluate
the proposed VPL-SC on an LV pattern image shown in Fig. 3.3. From the image,


Fig. 3.3 Illustration of an LV pattern image: (a) original image; (b) visual patterns contained in
original image; (c) colors used to show the visual primitives (e.g., SIFT points detected by [14])
located at different visual patterns. [2013] IEEE. Reprinted, with permission, from Ref. [23]

we first extract in total 2985 SIFT points [14] as the visual primitives, upon which
we compute the visual primitive sparse codes by (3.1) and the visual phrase sparse
codes by the proposed VPL-SC. In the experiment, we construct each visual phrase descriptor using the average pooling method over the 8 nearest points around each SIFT point. After learning the sparse codes, we perform the k-means algorithm to cluster all the visual primitives/visual phrases into 4 image patterns. The results are shown in Table 3.1, where we use the different colors shown in Fig. 3.3 to plot the SIFT points located at different image patterns.
From the results in Table 3.1, we can see that, on the one hand, visual primitive
sparse codes can hardly distinguish the SIFT points stemming from different visual
patterns in the LV pattern image. As shown in column (a), SIFT points that represent
the same visual patterns may be separated into different clusters (e.g., the 1st row
and the 3rd row), while a certain cluster may contain SIFT points that belong to
different visual patterns (e.g., the 3rd row). On the other hand, using visual phrase
sparse codes we can discover exactly the 4 visual patterns in the LV bag image, as
shown in column (b). This experiment demonstrates that the proposed VPL-SC can
utilize the spatial context information to discover higher-level visual patterns in the
image. It is interesting to note that similar to the results obtained from Chap. 2, the
proposed spatial context-aware approach shows the advantages in discovering image
patterns again.

Algorithm 3: Visual Phrase Learning via Spatial Context-aware Multi-feature


Fusion sparse coding (VPL-SC-MF)
input : visual primitive descriptors sets {D(t) }, spatial neighborhood structure S for spatial
pooling ;
output: visual primitive sparse codes {C(t) }, visual phrase sparse codes V ;
for t=1:T do
init B(t) by k-means on D(t) ;
init C(t) by (3.1) ;
init Z(t) by (3.2) or (3.3) ;
init Z by (3.16) ;
init U by k-means on Z ;
init V by (3.4) ;
while Q 2 is decreasing do
update U by (3.5) ;
for t=1:T do
update B(t) by (3.17) ;
update C(t) by (3.1) ;
update Z(t) by (3.2) or (3.3) ;
update Z by (3.16) ;
update V by (3.4) ;
return {C(t) }, V

3.3.2 Scene Clustering

In the second experiment, to demonstrate the effectiveness of the proposed method


in multi-feature fusion, we perform image scene clustering on the MSRC-V2
dataset [24]. Following Chap. 2, we select a collection of 150 images from 5 scene
categories: sheep, cow, aeroplane, boat, and bicycle. Each image contains several of the following 9 region segmentations: grass, cow, sheep, sky, aeroplane, water, bicycle, road, and boat. Sample images are shown in Fig. 2.4. The ground-truth labeling of
each region segmentation is provided by [16]. Each region segmentation is described
with three features: color histogram (CH), texton histogram (TH) [10], and pyramid
of HOG (pHOG) [4].
In the experiment, we consider region segmentations as visual primitives and
the whole images as visual phrases in our algorithms. We use region segmentations
in the same image as spatial neighbors for max pooling. After learning the visual phrase sparse codes, we perform the k-means algorithm (k = 5) and evaluate the clustering performance by clustering accuracy.
Comparing Multi-feature Fusion Results. Table 3.2 shows the final clustering
accuracy results. From the table, we can see that the proposed VPL-SC achieves a slightly better performance of 70% on the concatenated feature TH+CH+pHOG than the 68% on the best individual feature TH. However, when the proposed VPL-SC-MF is applied, it significantly improves the performance over the best individual feature

Table 3.1 (a) Clustering results on sparse codes obtained by (3.1); (b) Clustering results on sparse
codes obtained by VPL-SC. We plot each SIFT point by the color defined in Fig. 3.3. [2013]
IEEE. Reprinted, with permission, from Ref. [23]

Table 3.2 Clustering accuracy results on the MSRC-V2 dataset


Feature VPL-SC (ℓ_1) k-means (Chap. 2) VPL-SC (ℓ_2,1)
CH 59.7% 44.79% 55.1%
TH 68.0% 55.69% 61.8%
pHOG 60.0% 52.37% 58.2%
TH+CH+pHOG 70.0% 61.61% 64.5%
VPL-SC-MF (ℓ_1) MCAC (Chap. 2) VPL-SC-MF (ℓ_2,1)
Multi-feature fusion 78.7% 70.14% 72.2%

TH, from 68 to 78.7%. That is, the proposed multi-feature late fusion algorithm VPL-SC-MF can be more effective than VPL-SC, which uses a single feature type or adopts multi-feature early fusion to learn the visual phrase sparse codes.
We also list the results from Chap. 2, which combines spatial context and multi-
feature information into a regularized k-means method (i.e., MCAC) to discover
mid-level visual phrases. From the results, we can see that VPL-SC successfully
outperforms the k-means algorithm used in Chap. 2 on all the individual features
TH, CH, and pHOG and the concatenated feature TH+CH+pHOG. Moreover, VPL-
SC-MF also significantly outperforms MCAC in Chap. 2 from 70.14 to 78.7%.
Comparing Visual Phrase Learning Results. To evaluate the performance of
using sparse coding method for the visual phrase layer, we also list the comparison
results in the first and second columns in Table 3.2, where in the first column we
use sparse coding method for the visual phrase layer and in the second column we
only use the concatenation of raw features from visual primitive layer. As shown in
the table, VPL-SC that uses the second-layer sparse coding outperforms the algo-
rithm that uses only the concatenation of raw features from visual primitive layer in
Chap. 2. For example, VPL-SC using the second-layer sparse coding improves the
performance from 44.79 to 59.7% on CH feature, 55.69 to 68.0% on TH feature,
52.37 to 60.0% on pHOG feature, and 61.61 to 70.0% on CH+TH+pHOG feature.
The experiment results have shown the advantages of using the second-layer sparse
coding for learning visual phrases.
We also compare the performances of VPL-SC and VPL-SC-MF using an ℓ_2,1 regularization term instead of the ℓ_1 regularization term in (3.4). As can be seen, the results of using the ℓ_1 term outperform those of using the ℓ_2,1 term, which indicates that the ℓ_2,1 term might be suited for feature selection [27], but is not necessarily an optimal choice for producing sparse codes.

3.3.3 Scene Categorization

In the third experiment, to illustrate the discriminativeness of the proposed algo-


rithms, we perform scene categorization on the 15-scene dataset.

Table 3.3 Sample images of the 15-scene dataset

bedroom living room suburb industrial

kitchen coast forest highway

inside city mountain open country street

tall building office store

The 15-scene dataset was gradually collected by [8, 17]. It contains in total 4485 images from 15 categories of outdoor and indoor scenes, including bedroom, living room, suburb, industrial, kitchen, coast, forest, highway, inside city, mountain, open country, street, tall building, office, and store. For each category, there are about 216 to 410 images of size about 300 × 250 pixels. In Table 3.3, we show some example images from the 15-scene dataset.
In the experiments, we use 10 random splits of the dataset, and for each split,
we use 100 images from each category for training and the rest for testing. For the
visual primitive layer, we extract dense SIFT [26] and dense edge-SIFT [25] as local descriptor sets on 16 × 16 pixel patches computed over a grid with a spacing of 8 pixels. A codebook of size 1024 is trained upon the visual primitive layer. For the visual

Table 3.4 Classification accuracy results on the 15-scene dataset


Algorithm Classification accuracy (%)
Kernel SPM [8] 81.40
Kernel codebook [22] 76.67
Localized soft assignment [13] 82.70
LCSR [13] 82.7
Object bank [11] 80.9
Geometric phrase pooling [25] 85.13
ℓ_p-norm pooling [7] 83.20
Kernel Descriptors [2] 86.7
ScSPM [26] 80.28
Macrofeatures [5] 84.9
Max-margin dictionary [12] 82.7
VPL-SC w/o codebook update 81.9
VPL-SC w/ codebook update 83.2
VPL-SC-MF w/o codebook update 83.8
VPL-SC-MF w/ codebook update 85.5

phrase layer, we apply k-NN at 4 scales, i.e., k = 4, 8, 12, 16, around each visual primitive descriptor for max pooling to construct the visual phrase layer descriptors. The codebook size of the visual phrase layer is also 1024. For the classifier layer, we use max pooling upon the spatial pyramid of 4 × 4, 2 × 2, and 1 × 1 subregions to obtain the global image features for training, as used in [8, 26].
We run VPL-SC on the dense SIFT features, and VPL-SC-MF on both the dense SIFT and dense edge-SIFT features, to learn visual primitive and visual phrase sparse codes, respectively. After that, we further apply max pooling and the spatial pyramid matching method upon the visual primitive sparse codes of different k-NN scales (k = 1, 4, 8, 12, 16) to form global image representations for classification. Since different sparse codes have different discriminative power, we use a multiple kernel learning algorithm with RBF kernels to train the final classifier.
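A minimal sketch of max pooling over a spatial pyramid of 4 × 4, 2 × 2, and 1 × 1 subregions, given sparse codes and normalized patch coordinates (names and shapes are illustrative, not the exact pipeline used here):

import numpy as np

def spm_max_pool(codes, xy, levels=(4, 2, 1)):
    """Max-pool K-dim codes inside each cell of a spatial pyramid and concatenate all cells.
    codes: K x M sparse codes; xy: M x 2 patch coordinates normalized to [0, 1)."""
    K, M = codes.shape
    pooled = []
    for g in levels:                                   # g x g grid at this pyramid level
        cell = np.minimum((xy * g).astype(int), g - 1) # grid cell index of each patch
        cell_id = cell[:, 0] * g + cell[:, 1]
        for c in range(g * g):
            mask = cell_id == c
            pooled.append(codes[:, mask].max(axis=1) if mask.any() else np.zeros(K))
    return np.concatenate(pooled)                      # K * (16 + 4 + 1) dimensional

# Toy usage
rng = np.random.default_rng(6)
feat = spm_max_pool(np.abs(rng.normal(size=(1024, 500))), rng.random((500, 2)))
print(feat.shape)   # (1024 * 21,) = (21504,)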
Table 3.4 shows the final accuracy results. We find that the performance of our methods is very competitive with the best performance obtained by kernel descriptors [2], which use kernel approximations of the visual primitive descriptors. From the table, we can also see that the proposed VPL-SC outperforms the previous sparse coding work [26] on the visual primitive layer by about 3%, which justifies the discriminative power of our learned visual phrase sparse codes. When compared with the Macrofeatures work [5], which learned mid-level features on multiple visual primitive descriptors in a local spatial neighborhood to encode the spatial context information, the proposed VPL-SC-MF also shows superior recognition performance by about 0.6%. Thanks to the proposed visual phrase sparse code learning and multi-feature fusion, VPL-SC-MF achieves a superior accuracy of 85.5%, compared with the max-margin dictionary learning method on the visual primitive features in [12]. The

experiments on the 15-scene dataset validate the advantages of the proposed visual
phrase learning method that combines both spatial pooling and multi-feature fusion
techniques.

3.4 Summary of this Chapter

We propose to learn discriminative mid-level visual phrase features via spatial


context-aware multi-feature sparse coding, upon low-level visual primitive features.
With the help of a labeled image dataset, we optimize the two-layer sparse codes, as
well as the two-layer codebooks via back-propagation. Since we have utilized the
spatial context information, multi-feature information, and also the image category
information, representative and discriminative sparse codes of visual phrases can be
obtained. Experiments on image pattern discovery and scene recognition justify the
effectiveness of the proposed algorithms.

References

1. Bengio, S., Pereira, F., Singer, Y., Strelow, D.: Group sparse coding. In: Proceedings of
Advances in Neural Information Processing Systems (2009)
2. Bo, L., Ren, X., Fox, D.: Kernel descriptors for visual recognition. In: Proceedings of Advances
in Neural Information Processing Systems (2010)
3. Bo, L., Ren, X., Fox, D.: Multipath sparse coding using hierarchical matching pursuit. In:
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)
4. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In:
Proceedings of ACM International Conference on Image and Video Retrieval (2007)
5. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In:
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2010)
6. Bradley, D.M., Bagnell, J.: Differentiable sparse coding. In: Proceedings of Advances in Neural
Information Processing Systems (2008)
7. Feng, J., Ni, B., Tian, Q., Yan, S.: Geometric lp-norm feature pooling for image classification.
In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011)
8. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for
recognizing natural scene categories. In: Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition (2006)
9. Lee, H., Battle, A., Raina, R., Ng, A.: Efficient sparse coding algorithms. In: Proceedings of
Advances in Neural Information Processing Systems (2006)
10. Lee, Y.J., Grauman, K.: Object-graphs for context-aware visual category discovery. IEEE Trans. Pattern Anal. Mach. Intell. 34(2), 346–358 (2012)
11. Li, L.J., Su, H., Xing, E.P., Li, F.F.: Object bank: a high-level image representation for scene
classification and semantic feature sparsification. In: Proceedings of Advances in Neural Infor-
mation Processing Systems (2010)
12. Lian, X.C., Li, Z., Lu, B.L., Zhang, L.: Max-margin dictionary learning for multiclass image
categorization. In: Proceedings of European Conference on Computer Vision (2010)
13. Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. In: Proceedings of IEEE
International Conference on Computer Vision (2011)

14. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
15. Lu, Y., Zhang, L., Tian, Q., Ma, W.: What are the high-level concepts with small semantic gaps?
In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2008)
16. Malisiewicz, T., Efros, A.: Improving spatial support for objects via multiple segmentations.
In: Proceedings of British Machine Vision Conference (2007)
17. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial
envelope. Int. J. Comput. Vis. (2001)
18. Olshausen, B., Field, D.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996)
19. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.: Self-taught learning: transfer learning from
unlabeled data. In: Proceedings of International Conference on Machine Learning (2007)
20. Ranzato, M., Boureau, Y., LeCun, Y.L.: Sparse feature learning for deep belief networks. In:
Proceedings of Advances in Neural Information Processing Systems (2007)
21. Sivic, J., Zisserman, A.: Video google: a text retrieval approach to object matching in videos.
In: Proceedings of IEEE International Conference on Computer Vision (2003)
22. van Gemert, J., Geusebroek, J., Veenman, C., Smeulders, A.: Kernel codebooks for scene
categorization. In: Proceedings of European Conference on Computer Vision (2008)
23. Weng, C., Wang, H., Yuan, J.: Hierarchical sparse coding based on spatial pooling and multi-feature fusion. In: Proceedings of IEEE International Conference on Multimedia and Expo (2013)
24. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary.
In: Proceedings of IEEE International Conference on Computer Vision (2005)
25. Xie, L., Tian, Q., Wang, M., Zhang, B.: Spatial pooling of heterogeneous features for image classification. IEEE Trans. Image Process. 23(5), 1994–2008 (2014)
26. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding
for image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (2009)
27. Yang, Y., Shen, H.T., Ma, Z., Huang, Z., Zhou, X.: L2,1-norm regularized discriminative
feature selection for unsupervised learning. In: Proceedings of International Joint Conference
on Artificial Intelligence (2011)
28. Yu, K., Lin, Y., Lafferty, J.: Learning image representations from the pixel level via hierar-
chical sparse coding. In: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (2011)
Chapter 4
Feature Co-occurrence for Visual Labeling

Abstract Due to the difficulties in obtaining labeled visual data, there has been an
increasing interest to label a limited amount of data and then propagate the initial
labels to a large amount of unlabeled data. In this chapter, we propose a transductive
label propagation algorithm by leveraging the advantages of feature co-occurrence
patterns in visual disambiguation. We formulate the label propagation problem by intro-
ducing a smooth regularization that ensures similar feature co-occurrence patterns
share the same label. To optimize our objective function, we propose an alternating
method to decouple feature co-occurrence pattern discovery and transductive label
propagation. The effectiveness of the proposed method is validated by both synthetic
and real image data.

Keywords Feature co-occurrence pattern discovery · Visual labeling · Semi-supervision · Transductive spectral learning · Propagation

4.1 Introduction

In Chaps. 2 and 3, we have studied how to capture visual patterns across multi-
ple feature modalities for a better unsupervised clustering. This chapter will further
present a novel semi-supervised visual labeling method based on multi-feature learn-
ing. Specifically, this multi-feature learning problem is targeted at leveraging a small
amount of labeled data to transfer the initial labels to a vast amount of unlabeled data.
Most existing multi-feature learning approaches rely on the agreement among dif-
ferent feature types to improve the performance, i.e., the decision of a data sample is
preferred to be consistent across different feature types. However, as different feature
types may have different data characteristics and distributions, a forced agreement
among different feature types may not bring a satisfying result.
To handle the different data characteristics among multiple feature types, we propose to respect the data distribution and allow each feature type to have its own clustering result. This can faithfully reflect the data characteristics in different


feature types, e.g., color feature space can be categorized into a number of typical
colors, while texture feature space categorized into a different number of texture
patterns. To integrate the clustering results from different feature types, we follow
Chap. 2 to quantize each data sample into a co-occurrence of feature patterns, e.g., a
composition of typical color and texture patterns. Such a treatment has two advan-
tages. First, instead of forcing different feature types to agree with each other, we
compose multiple feature types to reveal the compositional pattern across different
feature types, and thus it can naturally combine multiple features. Comparing with a
direct concatenation of multiple types of features, the feature co-occurrence patterns
encode the latent compositional structure among multiple feature types, thus having
a better representation power. Moreover, as it allows different clustering results in
different feature types, the feature co-occurrence patterns can be more flexible. Sec-
ond, the discovered feature co-occurrence patterns enable us to propagate labels in
the same group of feature co-occurrence patterns.
To enable label propagation through feature co-occurrence patterns, we propose
a transductive learning formulation with three objectives, namely the good quality
of clustering in individual feature types, the label smoothness of data samples in
terms of feature co-occurrence patterns, and the fitness to the labels provided by
the training data. The optimization of this objective function is complicated as the
clustering results in different feature types and the formed co-occurrence patterns
influence each other under the transductive learning formulation. We thus propose
an iterative optimization approach that can decouple these factors. During iterations,
the discovery of feature co-occurrence patterns and the labeling smoothness of data
samples will help each other, leading to a better transductive learning. To evaluate
our method, we conduct experiments on a synthetic dataset, as well as object and
action recognition datasets. The comparison with related methods such as [3, 14,
19] shows promising results that our proposed method can well handle the different
data characteristics of multiple feature types.
We explain the proposed method in Fig. 4.1. There are four data classes repre-
sented by two feature modalities, i.e., texture and color. The texture modality forms
two texture patterns, chessboard and brick; while the color modality forms two color
patterns, green and blue. All data samples belong to one of the four compositional pat-
terns: green brick (hexagon), blue chessboard (triangle), green chessboard (square),
and blue brick (circle). Clearly, the four data classes cannot be distinguished in either
the texture or the color feature space alone. For example, the square and triangle classes share the same texture attribute but differ in color, while the hexagon and square classes share the same color but differ in texture. However, each class can
be easily distinguished by a co-occurrence of the texture and color pattern, e.g., the
hexagon class composes brick texture and green color. As a result, the unlabeled
data samples of the same co-occurrence feature pattern can be labeled as the same
class as the labeled data sample.

Fig. 4.1 Label propagation of unlabeled data by the discovery of the co-occurrence patterns among
different types of clusters. [2015] IEEE. Reprinted, with permission, from Ref. [13]

4.2 Multi-feature Collaboration for Transductive Learning

We study collaborative multi-feature fusion in a transductive learning framework, where the labeled data can transfer their labels to the unlabeled data. Consider a collection of partially labeled multi-class data X = (X_l, X_u). The labeled inputs X_l = {x_i}_{i=1}^{l} are associated with known labels Y_l = {y_i}_{i=1}^{l}, where y_i ∈ L = {1, 2, . . . , M}. The unlabeled data X_u = {x_i}_{i=l+1}^{N} are with missing labels Y_u = {y_i}_{i=l+1}^{N}, where y_i ∈ L, and the task is to infer Y_u. A binary matrix Y ∈ {1, 0}^{N×M} encodes the label information of X, where Y_{ij} = 1 if x_i has a label y_i = j and Y_{ij} = 0 otherwise. We set Y_{ij} = 0 initially for the unlabeled data y_i ∈ Y_u. We assume each x_i ∈ X is represented by V types/modalities of features {f_i^{(v)}}_{v=1}^{V}, where f_i^{(v)} ∈ R^{d_v}.

4.2.1 Spectral Embedding of Multi-feature Data

As spectral embedding can effectively capture the data clustering structure [12], we
leverage it to study the data distribution in each feature type.
At first, each feature type F^{(v)} = {f_i^{(v)}}_{i=1}^N of X defines an undirected graph G_v = (X, E, W_v), in which the set of vertices is X and the set of edges connecting pairs of vertices is E = {e_{ij}}. Each edge e_{ij} is assigned a weight w_{ij}^{(v)} that represents the similarity between x_i and x_j. The matrix W_v = (w_{ij}^{(v)}) ∈ R^{N×N} denotes the similarity or kernel matrix of X in this feature type. Following spectral clustering, we use the following function to compute the graph similarities:

w_{ij}^{(v)} = exp( −dist^2(f_i^{(v)}, f_j^{(v)}) / (2σ^2) ),   (4.1)

where dist(f_i^{(v)}, f_j^{(v)}) denotes the distance between a pair of features, and σ is the bandwidth parameter that controls how fast the similarity decreases. By summing the weights of the edges connected to x_i, we can obtain the degree of this vertex, d_i^{(v)} = Σ_{j=1}^{N} w_{ij}^{(v)}. Let D_v ∈ R^{N×N} be the vertex degree matrix formed by placing {d_i^{(v)}}_{i=1}^N on the diagonal. Then we can write the normalized graph Laplacian L_v ∈ R^{N×N} as

L_v = I_N − D_v^{-1/2} W_v D_v^{-1/2},   (4.2)

where I_N is an identity matrix of order N.
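A minimal sketch of (4.1)–(4.2), building the RBF similarity graph and its normalized Laplacian for one feature type (the bandwidth value and sizes are placeholders):

import numpy as np
from scipy.spatial.distance import cdist

def normalized_laplacian(F, sigma=1.0):
    """F: N x d feature matrix for one feature type. Returns (W, L) of (4.1) and (4.2)."""
    W = np.exp(-cdist(F, F, "sqeuclidean") / (2.0 * sigma ** 2))   # RBF similarities
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(F)) - D_inv_sqrt @ W @ D_inv_sqrt               # I - D^{-1/2} W D^{-1/2}
    return W, L

# Toy usage
F = np.random.default_rng(7).normal(size=(100, 16))
W, L = normalized_laplacian(F, sigma=2.0)
print(L.shape)   # (100, 100)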


After the above preprocessing of each feature type, we perform spectral clustering to group the feature points of both labeled and unlabeled data into clusters. Assume there are M_v clusters in the vth feature type. The spectral clustering on this feature type is to minimize the spectral embedding cost [7]:

Q_type(U_v) = tr(U_v^T L_v U_v),   (4.3)

subject to U_v^T U_v = I_{M_v}, where tr(·) denotes the matrix trace, U_v ∈ R^{N×M_v} is the real-valued cluster indicator matrix of the M_v clusters [12], and I_{M_v} is an identity matrix of order M_v. By using the Rayleigh–Ritz theorem [6], we can obtain the solution of U_v, which consists of the first M_v eigenvectors corresponding to the M_v smallest eigenvalues of L_v, i.e., r_i^{(v)}, i = 1, 2, . . . , M_v, denoted as:


U_v = [r_1^{(v)}, r_2^{(v)}, . . . , r_{M_v}^{(v)}] = eig(L_v, M_v).   (4.4)

By using (4.4), we can independently perform spectral embedding in different feature


types. In other words, we do not have to force the clustering in different feature spaces
to agree with each other.
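A rough sketch of eig(L_v, M_v) in (4.4) using a symmetric eigensolver (assuming SciPy ≥ 1.5 for subset_by_index); it is illustrative only and omits any subsequent clustering step:

import numpy as np
from scipy.linalg import eigh

def spectral_embedding(L, M_v):
    """Return U_v: the M_v eigenvectors of L with the smallest eigenvalues (cf. (4.4))."""
    # eigh returns eigenvalues in ascending order; select the first M_v eigenvectors
    _, vecs = eigh(L, subset_by_index=[0, M_v - 1])
    return vecs                      # N x M_v soft cluster indicator matrix

# Toy usage, continuing from the Laplacian sketch above
L = np.eye(100)                      # placeholder Laplacian; use normalized_laplacian(...) in practice
U_v = spectral_embedding(L, M_v=5)
print(U_v.shape)                     # (100, 5)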

4.2.2 Embedding Co-occurrence for Data Representation

We have obtained V label indicator matrices {U_v}_{v=1}^{V} derived from the V types of features by (4.4) in the above section. To integrate them, we follow Chap. 2 and concatenate all label indicator matrices into T_f ∈ R^{(Σ_{v=1}^{V} M_v)×N}, and

T_f = [U_1, U_2, . . . , U_V]^T.   (4.5)

The nth column of T_f actually represents the embedding co-occurrence of x_n, which conveys the clustering structure information across multiple types of features without forcing clustering agreement among different feature types. Compared with the hard clustering indicators used in Chap. 2, the soft embedding relaxation is more flexible in capturing the clustering structures of individual feature types and in tolerating noisy features [12]. With the embedding co-occurrence representations T_f for the samples in X, we introduce the multi-feature similarity graph G_f = (X, E, W_f) based on T_f. By Laplacian embedding, the resulting soft cluster indicators {U_v}_{v=1}^{V} can be considered to obey linear similarities [4], and so is their concatenation T_f. Therefore, we define the similarity matrix W_f ∈ R^{N×N} as a linear kernel:

W_f = T_f^T T_f = Σ_{v=1}^{V} U_v U_v^T.   (4.6)

From (4.6), we can see that W_f is a sum of the linear kernels of the soft cluster indicators of multiple feature types. Therefore, it will be less sensitive to poor individual feature types. Note that although the entries of the matrix W_f are not necessarily all nonnegative, W_f is positive semi-definite. One can also add to W_f a rank-1 matrix with identical entries to make every entry of W_f nonnegative. We will omit this manipulation in the following as it does not affect the solution to the problem proposed in (4.9).
According to W_f, we can obtain the degree matrix D_f ∈ R^{N×N} by

D_f = diag(W_f 1),   (4.7)



where 1 RN is an all-one vector. We then have the normalized Laplacian as:


1/2 1/2
Lf = I N Df Wf Df . (4.8)

With Lf , we encode the smoothness of the multi-feature similarity graph. It will help
us to assign the same label to data samples of similar feature co-occurrence patterns.
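The construction of the co-occurrence representation and its graph, (4.5)–(4.8), amounts to stacking the per-feature embeddings. A minimal sketch, assuming `U_list` holds the $V$ matrices $U_v$ returned by the previous step:

```python
import numpy as np

def cooccurrence_laplacian(U_list):
    """Embedding co-occurrence T_f (4.5), similarity W_f (4.6), degrees D_f (4.7),
    and normalized Laplacian L_f (4.8)."""
    T_f = np.vstack([U.T for U in U_list])      # (sum_v M_v) x N
    W_f = T_f.T @ T_f                           # equals sum_v U_v U_v^T
    d_f = W_f @ np.ones(W_f.shape[0])           # diagonal of D_f; assumed positive
    # (the text notes that a constant rank-1 shift can be added to W_f if needed)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_f))
    L_f = np.eye(W_f.shape[0]) - D_inv_sqrt @ W_f @ D_inv_sqrt
    return W_f, d_f, L_f
```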

4.2.3 Transductive Learning with Feature Co-occurrence Patterns

After building the similarity graph $\mathcal{G}_f$ with multiple features, it is still a non-trivial task to build a smooth connection between the feature clustering structures of the multiple feature types and the label predictions of unlabeled data. To address this problem, we introduce a soft label matrix $U_f \in \mathbb{R}^{N \times M}$ for feature co-occurrence patterns to assist the transition. Different from the hard class labels $Y \in \{0, 1\}^{N \times M}$, $U_f$ is a relaxed real matrix. Taking all of these into account, we propose to minimize the spectral embedding costs of the individual feature types, the labeling smoothness term for finding co-occurrence patterns, and the fitting penalty between the hard class labels $Y$ and the soft pattern labels $U_f$ together in the following objective function:

$$Q(\{U_v\}_{v=1}^{V}, U_f, Y) = \sum_{v=1}^{V} Q_{\text{type}}(U_v) + \lambda\, Q_{\text{smooth}}(U_f, \{U_v\}_{v=1}^{V}) + \mu\, Q_{\text{fit}}(U_f, Y)$$
$$= \sum_{v=1}^{V} \mathrm{tr}(U_v^T L_v U_v) + \lambda\, \mathrm{tr}(U_f^T L_f U_f) + \mu\, \mathrm{tr}\{(U_f - SY)^T (U_f - SY)\}, \qquad (4.9)$$

subject to $U_v^T U_v = I_{M_v}$, $v = 1, 2, \ldots, V$; $U_f \in \mathbb{R}^{N \times M}$; $Y \in \{0, 1\}^{N \times M}$ and $\sum_{j=1}^{M} Y_{ij} = 1$, with balance parameters $\lambda$ and $\mu$. In (4.9), $U_v^T U_v = I_{M_v}$ is the requirement of unique embedding; $\sum_{j=1}^{M} Y_{ij} = 1$ enforces a unique label assignment for each vertex; and $S \in \mathbb{R}^{N \times N}$ is a normalization term that weakens the influence of noisy labels and balances class biases. Similar to [14], the diagonal elements of $S$ are filled by the class-normalized node degrees: $s = \sum_{j=1}^{M} \frac{Y_{\cdot j} \odot D_f \mathbf{1}}{Y_{\cdot j}^T D_f \mathbf{1}}$, where $\odot$ denotes the Hadamard product; $Y_{\cdot j}$ denotes the $j$th column of $Y$; $\mathbf{1} \in \mathbb{R}^N$ is an all-one vector.


More specifically, as discussed in Sect. 4.2.1, the spectral embedding objective over the multiple feature types, $\sum_{v=1}^{V} Q_{\text{type}}(U_v)$, reveals the data distributions in the multiple feature types without forcing clustering agreement. In addition, to allow the soft pattern labels $U_f$ for $\mathcal{X}$ to be consistent on closely connected vertices in the multi-feature similarity graph $\mathcal{G}_f$, we regularize our objective with the following smoothing function:

$$Q_{\text{smooth}}(U_f, \{U_v\}_{v=1}^{V}) = \mathrm{tr}(U_f^T L_f U_f), \qquad (4.10)$$

Algorithm 4: Feature Co-occurrence Pattern Discovery for Transductive Spectral Learning (FCPD-TSL)

Input: labeled data $\{\mathcal{X}_l, Y_l\}$; unlabeled data $\mathcal{X}_u$; $V$ types of features $\{F^{(v)}\}_{v=1}^{V}$; cluster numbers of individual feature types $\{M_v\}_{v=1}^{V}$; class number $M$; parameters $\lambda$ and $\mu$
Output: labels on unlabeled data $Y_u$

1: Initialization: initial label matrix $Y$; normalized graph Laplacians of individual feature types $L_v$, $v = 1, 2, \ldots, V$
2: repeat
   // Spectral embedding
3:   $U_v \leftarrow \mathrm{eig}(L_v, M_v)$, $v = 1, 2, \ldots, V$  (4.4)
   // Generate feature co-occurrence patterns
4:   $T_f \leftarrow [U_1, U_2, \ldots, U_V]^T$  (4.5)
   // Build multi-feature similarity graph Laplacian
5:   $W_f \leftarrow T_f^T T_f$  (4.6)
6:   $L_f \leftarrow I_N - D_f^{-1/2} W_f D_f^{-1/2}$  (4.8)
   // Compute gradient w.r.t. class-normalized labels
7:   $\nabla_{(SY)} Q \leftarrow 2[\lambda P L_f P + \mu (P - I_N)^2] SY$  (4.15)
   // Reset unlabeled data
8:   $\tilde{\mathcal{X}}_u \leftarrow \mathcal{X}_u$
   // Gradient search for unlabeled data labeling
9:   repeat
10:     $(i^*, j^*) \leftarrow \arg\min_{(i, j):\, x_i \in \tilde{\mathcal{X}}_u,\, j \in \{1, 2, \ldots, M\}} \nabla_{(SY)} Q$;
11:     $Y_{i^*, j^*} \leftarrow 1$;
12:     $y_{i^*} \leftarrow j^*$;
13:   until $\tilde{\mathcal{X}}_u \leftarrow \tilde{\mathcal{X}}_u \setminus x_{i^*} = \emptyset$
   // Update soft class labels of unlabeled data
14:   $U_f \leftarrow PSY$  (4.13)
   // Regularize the graph Laplacian of each feature type
15:   $L_v \leftarrow L_v - \lambda D_f^{-1/2} U_f U_f^T D_f^{-1/2}$, $v = 1, 2, \ldots, V$  (4.18)
16: until $Q$ is not decreasing

where $L_f$ is defined by (4.8) and is related to $\{U_v\}_{v=1}^{V}$. Furthermore, to prevent overfitting, we allow occasional disagreement between the soft class labels $U_f$ and the hard class labels $Y$ on the dataset $\mathcal{X}$. Thus, we minimize the fitting penalty:

$$Q_{\text{fit}}(U_f, Y) = \mathrm{tr}\{(U_f - SY)^T (U_f - SY)\}. \qquad (4.11)$$

Regarding (4.9), it is worth noting that the three terms of this function are correlated with each other. We thus cannot minimize $Q$ by minimizing the three terms separately. Moreover, the binary integer constraint on $Y$ also challenges the optimization. In Sect. 4.2.4, we show how to decouple the dependencies among them and propose our algorithm to solve this optimization problem.

4.2.4 Collaboration Between Pattern Discovery and Label Propagation

In this section, we decouple the dependencies among the terms of (4.9) to solve the objective function. More specifically, we fix the soft feature clustering results $\{U_v\}_{v=1}^{V}$ of the individual feature types to optimize $Q$ over the class labeling results, i.e., the soft pattern labels $U_f$ and the hard class labels $Y$ together. Similarly, we fix the class labeling results, i.e., the soft pattern labels $U_f$ and the hard class labels $Y$, to optimize $Q$ over the soft feature clustering results $\{U_v\}_{v=1}^{V}$ of the individual feature types. In the class labeling update step, we solve $U_f$ in analytical form and then optimize $Q$ over $Y$ using a gradient-based greedy search approach. In the feature clustering update step, we optimize $Q$ over $U_v$, $v = 1, 2, \ldots, V$ separately.

The closed form of $U_f$. Since $Q$ is quadratic w.r.t. $U_f$, similar to [14], we can set the partial derivative to zero to obtain the analytical solution of $U_f$ w.r.t. $Y$ and $\{U_v\}_{v=1}^{V}$. We then have:

$$\frac{\partial Q}{\partial U_f} = \lambda L_f U_f + \mu (U_f - SY) = 0, \qquad (4.12)$$

which implies

$$U_f = \left(\tfrac{\lambda}{\mu} L_f + I_N\right)^{-1} SY = PSY, \qquad (4.13)$$

where $P = (\tfrac{\lambda}{\mu} L_f + I_N)^{-1}$, which is related to $\{U_v\}_{v=1}^{V}$ according to (4.8). The soft pattern labels $U_f$ smooth the transition from the feature clustering results of the multiple feature types $\{U_v\}_{v=1}^{V}$ to the prediction of the hard class labels $Y$ for the dataset $\mathcal{X}$. We can then substitute the analytical solution of $U_f$ from (4.13) into (4.9) and optimize $Q$ over $Y$.
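In code, the update (4.13) is a single matrix inversion (or linear solve). The sketch below uses `lam` and `mu` for the two balance parameters, whose symbols are reconstructed here; the function name is ours.

```python
import numpy as np

def soft_pattern_labels(L_f, S, Y, lam=1.0, mu=1.0):
    """Closed-form soft labels U_f = ((lam/mu) L_f + I)^(-1) S Y, Eq. (4.13)."""
    N = L_f.shape[0]
    P = np.linalg.inv((lam / mu) * L_f + np.eye(N))   # propagation matrix P
    U_f = P @ S @ Y
    return U_f, P
```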
Optimize $Q$ over $Y$. Given $\{U_v\}_{v=1}^{V}$, we use the gradient-based greedy search approach [14] to handle the binary integer optimization. It is worth noting that searching along the gradient of the hard class labels $Y$ and of the class-normalized labels $SY$ is in fact equivalent. Therefore,

$$Y^{\text{update}}(\{U_v\}_{v=1}^{V}) = \arg\min_{Y} \nabla_Y Q = \arg\min_{Y} \nabla_{(SY)} Q, \qquad (4.14)$$

where the gradient of $Q$ over $SY$ is:

$$\nabla_{(SY)} Q = 2[\lambda P L_f P + \mu (P - I_N)^2] SY. \qquad (4.15)$$

Equation (4.14) shows how to leverage the feature clustering structures of the multiple types of features and the labeled data to predict the labels of unlabeled data.
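The gradient (4.15) and the greedy search of Algorithm 4 (lines 7–13) can be sketched as follows. As in the printed algorithm, the gradient is computed once per outer iteration; `unlabeled` is a boolean mask over the $N$ samples, and `P` is assumed to be built with the same `lam` and `mu` as above.

```python
import numpy as np

def greedy_label_update(P, L_f, S, Y, unlabeled, lam=1.0, mu=1.0):
    """Greedy hard-label assignment guided by the gradient of Q w.r.t. SY (4.15)."""
    N = P.shape[0]
    grad = 2.0 * (lam * P @ L_f @ P + mu * (P - np.eye(N)) @ (P - np.eye(N))) @ S @ Y
    Y = Y.copy()
    remaining = set(np.flatnonzero(unlabeled))
    while remaining:
        rows = np.array(sorted(remaining))
        # pick the (sample, class) pair with the smallest gradient entry
        r, c = np.unravel_index(np.argmin(grad[rows, :]), (len(rows), Y.shape[1]))
        i = rows[r]
        Y[i, :] = 0.0
        Y[i, c] = 1.0            # hard assignment y_i <- c
        remaining.remove(i)
    return Y
```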

Optimize $Q$ over $U_v$, $v = 1, 2, \ldots, V$. We propose to update the data clustering results by the data class labeling results, which, to the best of our knowledge, has not been studied before. To this end, we fix $\{U_i\}_{i \neq v}$, $U_f$ and $Y$ and obtain an equivalent minimization function $J$ for minimizing $Q$ (4.9), where

$$J(U_v, U_f, Y, \{U_i\}_{i \neq v}) = \sum_{v=1}^{V} \mathrm{tr}\{U_v^T (L_v - \lambda D_f^{-1/2} U_f U_f^T D_f^{-1/2}) U_v\}, \qquad (4.16)$$

subject to $U_v^T U_v = I_{M_v}$. However, the partial derivative of $D_f$ w.r.t. $U_v$ is intractable since there is a diagonalization operation in (4.7). We therefore use the values of $\{U_v\}_{v=1}^{V}$ at the current iteration to initialize $D_f$. Then, given $D_f$, the optimization turns out to minimize the following objective:

$$Q_{\text{type}}^{\text{new}}(U_v, Y, \{U_i\}_{i \neq v}) = \mathrm{tr}\{U_v^T (L_v - \lambda D_f^{-1/2} U_f U_f^T D_f^{-1/2}) U_v\}, \qquad (4.17)$$

subject to $U_v^T U_v = I_{M_v}$. It becomes a spectral clustering with a regularized graph Laplacian:

$$L_v^{\text{new}} = L_v - \lambda D_f^{-1/2} U_f U_f^T D_f^{-1/2}. \qquad (4.18)$$

By the Rayleigh–Ritz theorem [6], we can update $U_v$ as the first $M_v$ eigenvectors corresponding to the $M_v$ smallest eigenvalues of $L_v^{\text{new}}$:

$$U_v^{\text{update}}(U_v, Y, \{U_i\}_{i \neq v}) = \mathrm{eig}(L_v^{\text{new}}, M_v). \qquad (4.19)$$

Equation (4.19) shows how to tune the feature clustering result of each feature
type Uv , v = 1, 2, . . . , V by learning from the known data class labels and the
feature clustering results of the other feature types. It is worth noting that, at the
beginning, our method does not require the clustering agreement among different
feature types. However, by further optimizing the objective, individual feature types
will be regularized by known data class labels, and each individual feature type
will be influenced by other feature types. In fact, the regularized graph Laplacian
(4.18) in each feature type has become a multi-feature Laplacian representation. Such
multi-feature Laplacian representations should gradually agree with each other.
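The feature-clustering update of (4.18)–(4.19) is again a small spectral embedding, now of the label-regularized Laplacian. A sketch, with `d_f` the diagonal of $D_f$ and `lam` the reconstructed smoothness weight:

```python
import numpy as np

def update_feature_embedding(L_v, U_f, d_f, num_clusters, lam=1.0):
    """Regularized Laplacian L_v^new (4.18) and its first M_v eigenvectors (4.19)."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_f))
    L_new = L_v - lam * D_inv_sqrt @ U_f @ U_f.T @ D_inv_sqrt
    L_new = (L_new + L_new.T) / 2.0          # symmetrize for numerical safety
    _, eigvecs = np.linalg.eigh(L_new)
    return eigvecs[:, :num_clusters]
```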
We show our complete solution in Algorithm 4, where we refer to the proposed method as Feature Co-occurrence Pattern Discovery for Transductive Spectral Learning (FCPD-TSL). The complexity within each iteration of our method mainly lies in $V$ eigen-decompositions ($O(N^3)$ each [2]) and one update of the soft pattern label matrix $U_f$ and the hard class label matrix $Y$ ($O(N^2)$ [16]). We can further consider applying the supernode method [5] or bilateral random projections [20] to speed up the computation of the eigen-decompositions.

4.3 Experiments

4.3.1 Experimental Setting

In the experiments, the regularization parameters are both set to 1. Specifically, in the proposed FCPD-TSL, we set $\lambda = 1$ and $\mu = 1$, as we observe the results are not very sensitive to them. For a fair comparison, we set $C = 1$ in RWMV [19] and set $\mu = 1$ in GTAM [14]. As suggested in [19], the graph combination parameters in RWMV are set equally, i.e., $\alpha_i = 1/V$, $i = 1, 2, \ldots, V$. For the synthetic data, UCI digits [1], UC merced land uses [18] and body motions [10], we use the Gaussian kernel (4.1) with Euclidean distances to compute the pairwise image similarities. For Oxford flowers [8], although the original features are not available, the $\chi^2$ distance matrices of the individual visual features are provided by [8, 9]. Therefore, instead of using Euclidean distance, we choose the $\chi^2$ distance to build the Gaussian kernel as the similarity measure for Oxford flower images. Besides, we randomly pick labeled samples and run 10 rounds for performance evaluation on each real dataset.
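When only a precomputed distance matrix is available, as for the Oxford flowers, the Gaussian kernel is built directly from those distances. A possible sketch; the mean-distance bandwidth heuristic is an assumption of ours, not a choice stated in the text:

```python
import numpy as np

def kernel_from_distances(D, sigma=None):
    """Gaussian kernel from a precomputed (e.g., chi-square) distance matrix."""
    if sigma is None:
        sigma = D.mean()                         # assumed bandwidth heuristic
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))  # Gaussian form of Eq. (4.1)
```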

4.3.2 Label Propagation on Synthetic Data

We synthesize a toy dataset with two types of features in Fig. 4.2. Each type of features
is described by a 2-dimensional feature space. The dataset has four classes labeled
by 1, 2, 3, and 4, respectively. The labeled data are highlighted using different
colors. Each class has 200 samples. Feature type #1 has two clusters: Above moon
and Below moon. Feature type #2 also has two clusters: Left moon and Right moon.
It is worth noting that the feature clusters are mixed across different classes. In
feature type #1, both classes 1 and 2 share cluster A; and both classes 3 and 4 share
cluster B. In feature type #2, both classes 2 and 4 share cluster L; and both classes 1 and
3 share cluster R. Therefore, it is infeasible to classify the data by using a single feature
type. In addition, a direct concatenation of features from multiple feature types will diminish the differences among samples and thus cannot distinguish all samples of different classes. For example, using GTAM [14], the concatenated features obtain 92.81% accuracy but still cannot disambiguate several samples. For general multi-feature fusion approaches such as RWMV [19], the requirement that the data categorization results of the individual feature types agree with each other does not hold for the toy data. Hence the accuracy of RWMV reaches only 48.11%.
In contrast, by utilizing the feature co-occurrence patterns among multiple feature
types, the proposed FCPD-TSL can learn a favorable clustering, and the accuracy is
100%. Specifically, class 1 exhibits the co-occurrence of cluster A in feature type #1
and cluster R in feature type #2; class 2 exhibits the co-occurrence of cluster A in
feature type #1 and cluster L in feature type #2; class 3 exhibits the co-occurrence

Fig. 4.2 Classification on synthetic toy data with two feature types. Different markers, i.e., 1, 2, 3, and 4, indicate the four different classes. Shaded markers highlight the labeled data. The first column shows the synthetic toy data. The last three columns show the classification results of RWMV [19], GTAM [14] and the proposed FCPD-TSL. [2015] IEEE. Reprinted, with permission, from Ref. [13]

of cluster B in feature type #1 and cluster R in feature type #2; and class 4 exhibits
the co-occurrence of cluster B in feature type #1 and cluster L in feature type #2.

4.3.3 Digit Recognition

To evaluate how multiple feature types influence handwritten digit recognition, we test the multi-feature digit dataset [11] from the UCI machine learning repository [1]. It consists of features of handwritten numerals (0–9) extracted from a collection of Dutch utility maps. There are 200 samples in each class, so the dataset has a total of 2,000 samples. These digits are represented by six types of features: (1) 76-dimensional Fourier coefficients of the character shapes (fou); (2) 64-dimensional Karhunen–Loève coefficients (kar); (3) 240-dimensional pixel averages in 2 × 3 windows (pix); (4) 216-dimensional profile correlations (fac); (5) 47-dimensional Zernike moments (zer); and (6) 6-dimensional morphological features (mor). All features are concatenated to generate 649-dimensional features. As the source image dataset is not available [1], we show sampled images rendered from the 240-dimensional pixel features in Fig. 4.3.
In this experiment, the first 50 samples from each digit class are labeled for
transductive learning. The classification results on the remaining 1500 unlabeled
samples are used for evaluation. For each class, we randomly pick labeled data from
the 50 labeled candidates and vary the size from 2 to 20. The accuracy comparison
results are shown in Fig. 4.4, including the proposed FCPD-TSL, GTAM [14] (on
the best single feature type, the worst single feature type and the concatenations of
all feature types), and RWMV [19] (on all feature types).
The various performances of individual feature types show there is a substantial
disagreement among feature types in this dataset. The concatenation of all the six
feature types performs better than the worst single feature but worse than the best
single feature when using GTAM. This also shows that feature concatenation can be
easily affected by the bad feature types, thus not the best choice for multi-feature
transductive learning. By a linear combination of similarity matrices of the six feature
types [19], the performance of RWMV can be close to that of GTAM on the best
single feature type, but is still affected by the poor feature types. The best performance
is achieved by the proposed FCPD-TSL, which benefits from learning the feature
co-occurrence patterns. In Fig. 4.4, we show the results of FCPD-TSL with 100
clusters per feature type. Because we do not force individual feature types to have
the same clustering structure, the feature co-occurrence patterns faithfully reflect the

Fig. 4.3 Sample images of UCI handwritten digit dataset



[Fig. 4.4 plots accuracy (y-axis, 0.2–0.9) versus the number of labeled samples per class (x-axis, 2–20) for GTAM (best single view), GTAM (worst single view), GTAM (feature concatenation), RWMV, and FCPD-TSL (100 clusters per view).]

Fig. 4.4 Performance comparison on UCI handwritten digits. [2015] IEEE. Reprinted, with
permission, from Ref. [13]

Table 4.1 Performance of the proposed FCPD-TSL on UCI handwritten digits under different cluster numbers per feature type. The size of labeled data is 20. [2015] IEEE. Reprinted, with permission, from Ref. [13]

# Clusters per feature type | Accuracy | # Clusters per feature type | Accuracy
5 | 0.870 ± 0.012 | 50 | 0.970 ± 0.001
10 | 0.925 ± 0.002 | 100 | 0.966 ± 0.013
20 | 0.958 ± 0.001 | 200 | 0.936 ± 0.035

data distribution characteristics. Moreover, as discussed in Sect. 4.2.3, the feature co-
occurrence patterns are less sensitive to poor feature types when performing graph
transduction. Therefore, the proposed FCPD-TSL achieves a noticeable performance
improvement by combining all the individual feature types, despite some poor feature
types and the disagreement among different feature types.
We also study the impact of the cluster number in each feature type. The perfor-
mance comparison is shown in Table 4.1, in which the number of clusters per feature
type varies from 5 to 200, with the size of labeled samples per class equal to 20.
With the increase of the cluster number per feature type, the accuracy first increases and then decreases. This is because either under-clustering or over-clustering will hinder the investigation of the data distributions in the multiple feature types. Despite that, there is still a wide range of effective over-clustering settings that produce informative feature clusters and boost the performance of graph transduction. For example, when

the cluster number per feature type is between 10 and 200, the labeling accuracies
of unlabeled data all reach more than 90%.

4.3.4 Object Recognition

The proposed approach can also combine different visual features for object recognition. The Oxford flower dataset is used for this experiment; it is composed of 17 flower categories, including Buttercup, Coltsfoot, Daffodil, Daisy, Dandelion, Fritillary, Iris, Pansy, Sunflower, Windflower, Snowdrop, LilyValley, Bluebell, Crocus, Tigerlily, Tulip, and Cowslip. Each category contains 80 images. We show one flower image for each class in Fig. 4.5. In the experiment, we use the seven pairwise distance matrices provided with the dataset. These matrices are precomputed from seven types of image appearance features [8, 9]. Using these pairwise distances, we compute the similarities between pairs of images according to (4.1).
We label the first 30 samples per class and use them for transductive learning.
The classification performance on the remaining 850 unlabeled samples is used for
evaluation. We compare the proposed FCPD-TSL with GTAM [14] (on the best
single feature type, the worst single feature type) and RWMV [19] (on all feature
types) w.r.t. mean value and standard deviation of classification accuracies in Fig. 4.6.
For each class, we randomly pick labeled data from the 30 labeled candidates and
vary the size from 2 to 20. In Fig. 4.7, we show the confusion matrices of compared
methods when there are 20 labeled data samples for each class. Because we do not
have the original features, we do not compare the results of feature concatenation.
As shown in Fig. 4.6, the individual types of features all show poor performance. Moreover, the best and worst single feature types are confused on different flower classes (Fig. 4.7a, b), resulting in a large performance gap. Therefore, there are serious disagreements among the different feature types. In this case, the effectiveness of the linear combination of similarity matrices in reducing the classification confusion caused by different feature types is limited. Comparing Fig. 4.7a–c, we can see that the confusion matrix generated by RWMV is only a slight smoothing over the different feature types. Hence RWMV brings only a little gain compared with the best single feature type (Fig. 4.6). In contrast, the confusion matrices in Fig. 4.7d, e show that FCPD-TSL can adequately alleviate the classification confusion using either 17 clusters

Fig. 4.5 Sample images of Oxford 17-category flower dataset



Fig. 4.6 Performance comparison on Oxford 17-category flowers. [2015] IEEE. Reprinted, with
permission, from Ref. [13]

or 100 clusters per feature type. The performances consequently show significant improvements over GTAM on the individual types of features and over RWMV on all feature types. As mentioned in Sect. 4.3.3, because it better explores the feature clustering structures of the individual feature types, the proposed FCPD-TSL with 100 clusters per feature type performs better than with 17 clusters per feature type.

4.3.5 Body Motion Recognition

In video data, appearance and motion features complement each other for body
motion description and recognition. Therefore, in this section, we combine these two feature types for video recognition. We experiment on the recent body motion
dataset, which is included in UCF101 [10] and contains 1910 videos in total, with 16
categories of human body motion actions: Baby Crawling, Blowing Candles, Body
Weight Squats, Handstand Pushups, Handstand Walking, Jumping Jack, Lunges, Pull
Ups, Push Ups, Rock Climbing Indoor, Rope Climbing, Swing, Tai Chi, Trampoline
Jumping, Walking with a Dog, and Wall Pushups. For each category, one sample
action is shown in Fig. 4.8. Each video is represented by dense appearance trajectories based on Histograms of Oriented Gradients (HOG) and dense motion trajectories based on Motion Boundary Histograms (MBH) [15].
We label the first 50 samples per class for transductive learning. For each class,
we randomly pick the labeled data from the 50 candidates and vary the size from 2

[Fig. 4.7 panels: a GTAM (best single type of features); b GTAM (worst single type of features); c RWMV (all features); d FCPD-TSL (all features) with 17 clusters per feature type; e FCPD-TSL (all features) with 100 clusters per feature type. Each panel is a 17 × 17 confusion matrix over the flower classes (actual class vs. predicted class).]

Fig. 4.7 Confusion matrix comparison on Oxford 17-category flowers. [2015] IEEE. Reprinted,
with permission, from Ref. [13]

Fig. 4.8 Sample videos of human body motion dataset



Table 4.2 Performance comparison on human body motion videos. [2015] IEEE. Reprinted, with permission, from Ref. [13]

# labeled per class | GTAM with HOG | GTAM with MBH | GTAM with feature concat | RWMV with all feature types | FCPD-TSL (16 clusters per feature type) | FCPD-TSL (50 clusters per feature type) | FCPD-TSL (100 clusters per feature type)
20 | 0.088 ± 0.004 | 0.140 ± 0.007 | 0.104 ± 0.007 | 0.078 ± 0.007 | 0.340 ± 0.042 | 0.464 ± 0.040 | 0.511 ± 0.026
17 | 0.087 ± 0.003 | 0.135 ± 0.008 | 0.101 ± 0.008 | 0.080 ± 0.011 | 0.332 ± 0.040 | 0.465 ± 0.032 | 0.509 ± 0.022
14 | 0.088 ± 0.004 | 0.133 ± 0.013 | 0.103 ± 0.009 | 0.082 ± 0.012 | 0.320 ± 0.029 | 0.439 ± 0.046 | 0.488 ± 0.031
11 | 0.090 ± 0.004 | 0.135 ± 0.013 | 0.107 ± 0.007 | 0.097 ± 0.024 | 0.301 ± 0.039 | 0.416 ± 0.050 | 0.474 ± 0.025
8 | 0.089 ± 0.008 | 0.132 ± 0.014 | 0.102 ± 0.012 | 0.101 ± 0.030 | 0.261 ± 0.036 | 0.381 ± 0.039 | 0.424 ± 0.028
5 | 0.081 ± 0.012 | 0.118 ± 0.019 | 0.099 ± 0.019 | 0.089 ± 0.037 | 0.234 ± 0.034 | 0.353 ± 0.026 | 0.395 ± 0.037
2 | 0.081 ± 0.012 | 0.132 ± 0.029 | 0.103 ± 0.023 | 0.075 ± 0.019 | 0.197 ± 0.047 | 0.302 ± 0.038 | 0.317 ± 0.034

Fig. 4.9 Sample images of UC merced 21-category land use dataset

to 20. The classification performance on the remaining 1110 unlabeled samples is


used for evaluation. Again, we compare the proposed approach with GTAM [14] (on
individual feature types and feature concatenation) and RWMV [19] (on all feature
types) in Table 4.2.
Comparing the first two columns of Table 4.2, we can see that motion features
perform better than appearance features in human body motion classification. The
3rd and 4th columns show that the approaches of GTAM on feature concatena-
tion and RWMV that uses all feature types usually perform better than GTAM on
the poorer feature type, but still cannot compete against GTAM on the better fea-
ture type. Therefore, they are not suitable to handle appearance and motion feature
fusion. In contrast, the proposed FCPD-TSL using 16 clusters per feature type (as
shown in the 5th column) improves GTAM on the best single feature type. To further
investigate clustering structures of individual feature types sufficiently, we over-
cluster individual types of features and obtain 50 or 100 clusters per feature type.
The results are shown in the last two columns of Table 4.2. This process brings a
significantly improved performance in all labeled data sizes, which further verifies
the effectiveness of FCPD-TSL in fusing appearance and motion features.

4.3.6 Scene Recognition

To further evaluate the proposed FCPD-TSL, we conduct a scene recognition experiment on the UC merced land use dataset [18] and compare with one more recent method [3] in addition to GTAM and RWMV. This dataset contains 21 classes of aerial orthoimagery: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each class has 100 images with resolution 256 × 256. We show one sample image for each class in Fig. 4.9. For each image, we extract SIFT features over 16 × 16 patches with a spacing of 6 pixels. By applying locality-constrained linear coding (LLC) [17] to all SIFT features extracted from this dataset, and running spatial pyramid max pooling on images with 1 × 1, 2 × 2, and 4 × 4 sub-regions, we generate 3 scales of image representations with dimensionalities of 1 × 1024, 2 × 2 × 1024, and 4 × 4 × 1024 as three feature

Table 4.3 Performance comparison on UC merced land use images. [2015] IEEE. Reprinted, with permission, from Ref. [13]

# labeled per class | GTAM [14] | GTG [3] | RWMV [19] | FCPD-TSL (21 clusters per feature type) | FCPD-TSL (50 clusters per feature type) | FCPD-TSL (100 clusters per feature type)
20 | 0.334 ± 0.018 | 0.379 ± 0.012 | 0.304 ± 0.010 | 0.357 ± 0.020 | 0.485 ± 0.023 | 0.554 ± 0.023
17 | 0.331 ± 0.019 | 0.373 ± 0.018 | 0.298 ± 0.016 | 0.337 ± 0.028 | 0.484 ± 0.020 | 0.527 ± 0.023
14 | 0.340 ± 0.017 | 0.380 ± 0.018 | 0.293 ± 0.017 | 0.325 ± 0.029 | 0.458 ± 0.028 | 0.511 ± 0.035
11 | 0.334 ± 0.026 | 0.371 ± 0.020 | 0.290 ± 0.017 | 0.315 ± 0.028 | 0.452 ± 0.018 | 0.488 ± 0.025
8 | 0.333 ± 0.031 | 0.368 ± 0.022 | 0.291 ± 0.026 | 0.293 ± 0.026 | 0.409 ± 0.039 | 0.463 ± 0.037
5 | 0.320 ± 0.022 | 0.350 ± 0.018 | 0.276 ± 0.021 | 0.274 ± 0.027 | 0.372 ± 0.044 | 0.400 ± 0.036
2 | 0.310 ± 0.038 | 0.314 ± 0.021 | 0.243 ± 0.034 | 0.270 ± 0.043 | 0.314 ± 0.031 | 0.343 ± 0.067

types. The image representations with different scales result in different types of
features.
We select the first 40 samples per class as the labeled data pool and vary the number
(from 2 to 20) of labeled samples from the pool. The classification performance on the
remaining 1260 unlabeled samples is reported for evaluation. Besides GTAM [14]
and RWMV [19], we also compare with graph transduction game (GTG) [3] in
Table 4.3. For GTAM or GTG, we separately perform it on each single feature type or on the feature concatenation and report the best performance obtained. For RWMV and the proposed FCPD-TSL, we report the results of multi-feature fusion. As can be seen from the 1st to the 4th columns, GTG generally outperforms GTAM and RWMV and performs better than our method with 21 clusters per feature type. However, by appropriately increasing the number of clusters per feature type, the classification performance of FCPD-TSL can be considerably enhanced, as shown in the last two columns of Table 4.3. The results further justify the benefit of the proposed FCPD-TSL and especially the effectiveness of the collaboration between clustering and classification. Overall, the performance gain depends on the spectral clustering results of the individual features, as well as the complementarity among the multiple features.

4.4 Summary of this Chapter

The different data characteristics and distributions among multiple feature types challenge many existing multi-feature learning methods. Instead of iteratively updating each individual feature type and forcing different feature types to agree with each other, we allow each feature type to perform data clustering on its own and then quantize each data sample into a co-occurrence of feature patterns across different feature
types. Relying on feature co-occurrence pattern discovery, we propose a transduc-
tive spectral learning approach, so that data labels can be transferred based on similar
feature co-occurrence patterns. To transfer the labels from the labeled data to unla-
beled data under our transductive learning formulation, we develop an algorithm
that can iteratively refine the spectral clustering results of individual feature types
and the labeling results of unlabeled data. The experiments on both synthetic and
real-world image/video datasets highlight the advantages of the proposed method to
handle multi-feature fusion in transductive learning.

References

1. Bache, K., Lichman, M.: UCI Machine Learning Repository (2013). http://archive.ics.uci.edu/ml
2. Demmel, J.W., Marques, O.A., Parlett, B.N., Vömel, C.: Performance and accuracy of LAPACK's symmetric tridiagonal eigensolvers. SIAM J. Sci. Comput. 30(3), 1508–1526 (2008)
3. Erdem, A., Pelillo, M.: Graph transduction as a noncooperative game. Neural Comput. 24(3), 700–723 (2012)
4. Kumar, A., Rai, P., Daumé III, H.: Co-regularized multi-view spectral clustering. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1413–1421 (2011)
5. Liu, J., Wang, C., Danilevsky, M., Han, J.: Large-scale spectral clustering on graphs. In: Proceedings of International Joint Conference on Artificial Intelligence (2013)
6. Lütkepohl, H.: Handbook of Matrices. Wiley, New Jersey (1996)
7. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. Proc. Adv. Neural Inf. Process. Syst. 2, 849–856 (2001)
8. Nilsback, M., Zisserman, A.: A visual vocabulary for flower classification. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2, 1447–1454 (2006)
9. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of Indian Conference on Computer Vision, Graphics & Image Processing (2008)
10. Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild (2012). arXiv:1212.0402
11. van Breukelen, M.P.W., Tax, D.M.J., den Hartog, J.E.: Handwritten digit recognition by combined classifiers. Kybernetika 34, 381–386 (1998)
12. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
13. Wang, H., Yuan, J.: Collaborative multifeature fusion for transductive spectral learning. IEEE Trans. Cybern. 45(3), 466–475 (2015)
14. Wang, J., Jebara, T., Chang, S.: Graph transduction via alternating minimization. In: Proceedings of International Conference on Machine Learning, pp. 1144–1151 (2008)
15. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3169–3176 (2011)
16. Wang, J., Jebara, T., Chang, S.F.: Semi-supervised learning using greedy max-cut. J. Mach. Learn. Res. 14(1), 771–800 (2013)
17. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2010)
18. Yang, Y., Newsam, S.: Spatial pyramid co-occurrence for image classification. In: Proceedings of IEEE International Conference on Computer Vision (2011)
19. Zhou, D., Burges, C.: Spectral clustering and transductive learning with multiple views. In: Proceedings of International Conference on Machine Learning, pp. 1159–1166 (2007)
20. Zhou, T., Tao, D.: Bilateral random projections. In: IEEE International Symposium on Information Theory, pp. 1286–1290. IEEE (2012)
Chapter 5
Visual Clustering with Minimax
Feature Fusion

Abstract To leverage multiple feature types for visual data analytics, various meth-
ods have been presented in Chaps. 2–4. However, all of them require extra infor-
mation, e.g., the spatial context information and the data label information. It is
often difficult to obtain such information in practice. Thus, pure multi-feature fusion
becomes critical, where we are given nothing but the multi-view features of data. In
this chapter, we study multi-feature clustering and propose a minimax formulation
to reach a consensus clustering. Using the proposed method, we can find a universal
feature embedding, which not only fits each feature view well, but also unifies differ-
ent views by minimizing the pairwise disagreement between any two of them. The
experiments with real image and video data show the advantages of the proposed
multi-feature clustering method when compared with existing methods.

Keywords Multi-feature clustering · Universal feature embedding · Regularized data-cluster similarity · Hyper parameter · Minimax optimization

5.1 Introduction

Although the methods proposed in Chaps. 2–4 can handle multiple features of visual data, they either leverage spatial context information or require a small amount of data labels. So an obvious question arises: what if we have no extra information but only the multiple features for visual clustering? To address this problem, we introduce a minimax feature fusion method in this chapter.
As mentioned in Chap. 4, spectral clustering has been shown to be remarkably effective in handling challenging data distributions [7, 10]; we thus work on feature embedding and fusion for spectral clustering. Relying on this, we aim to find a universal feature embedding, which not only fits each feature modality well, but also unifies different modalities by minimizing the pairwise disagreement between any two of them. As a result, two types of losses need to be minimized: the unary embedding cost terms for each feature modality, and the pairwise disagreement cost terms for each pair of feature modalities. All the costs constitute a triangular matrix as shown in Fig. 5.1. For each feature modality, we measure the unary cost by Laplacian embedding.


Fig. 5.1 The main idea of minimax feature fusion: each feature modality has a unary cost ($Q_{ii}$, $i = 1, 2, \ldots, M$) for data clustering, while each pair of feature modalities has a pairwise cost ($Q_{ij}$, $i < j$) for clustering consensus. The problem then becomes (1) how to measure each cost and (2) how to balance the different costs

To measure the pairwise disagreement costs, instead of using the consistency of the data distributions of different feature types, we project the Laplacian embedding of each feature type to a regularized data-cluster similarity matrix using the latent universal feature embedding, and compute the pairwise Frobenius distance between pairs of regularized data-cluster similarity matrices. In this way, we are able to measure modality disagreements at the clustering level. To reconcile different feature modalities and reduce their disagreements, we propose to minimize the maximum loss with a novel minimax formulation, which has the following advantages:

• It has only one hyper parameter, while all fusing weights can be automatically determined with minimax optimization.
• It reaches a harmonic consensus by weighting the cost terms differently during minimax optimization, such that the disagreements among different feature modalities can be effectively reconciled.

Following Chap. 4, we still use the same four datasets for evaluation. The superior clustering performance on image and video data compared with the state of the art validates that the proposed method can effectively fuse heterogeneous feature modalities for multi-view clustering.

5.2 Minimax Optimization for Multi-feature Spectral Clustering

5.2.1 Spectral Embedding for Regularized Data-Cluster Similarity Matrix

Given $N$ data samples $\mathcal{X} = \{x_n\}_{n=1}^{N}$ and the corresponding feature descriptors $\mathcal{F} = \{f_n\}_{n=1}^{N}$ of a specific feature type in a $d$-dimensional space, i.e., $f_n \in \mathbb{R}^d$ for $n = 1, 2, \ldots, N$, one can follow Sect. 4.2.1 to compute the feature embedding $U$. We then follow [5] to obtain the Data-Data Similarity Matrix by the inner product:

$$Z(U) = UU^T. \qquad (5.1)$$

Let $V \in \mathbb{R}^{N \times K}$ be the final cluster indicator matrix agreed among the multiple feature types. We define the Regularized Data-Cluster Similarity Matrix as the projection of $Z$ onto $V$:

$$P_V(U) = Z(U)\, V = UU^T V. \qquad (5.2)$$

Compared to the original data-cluster similarity matrix $U$, the regularized data-cluster similarity matrix $P_V(U)$ measures the data-cluster similarity of each data sample with respect to the final clustering solution $V$. In the following, we relax the final clustering solution $V$ to be a real-valued universal feature embedding with orthonormal constraints: $V^T V = I$. As a result, the projection (5.2) is invariant under self-projection:

$$P_V(V) = VV^T V = V. \qquad (5.3)$$
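The projections (5.1)–(5.3) are plain matrix products, and the self-projection invariance is easy to verify numerically. A minimal sketch (function name and test sizes are ours):

```python
import numpy as np

def project_onto_V(U, V):
    """Regularized data-cluster similarity P_V(U) = U U^T V, Eq. (5.2)."""
    return U @ (U.T @ V)

# Check the invariance (5.3) for a random orthonormal V.
N, K = 50, 4
V, _ = np.linalg.qr(np.random.randn(N, K))
assert np.allclose(project_onto_V(V, V), V)
```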

5.2.2 Minimax Fusion

Suppose we have $M$ different types of features in total. Our motivation is to encourage the regularized data-cluster similarity matrices to be similar between any two feature types, e.g., type $i$ and type $j$. Therefore, we propose to minimize the following disagreement measure:

$$D_V(U_i, U_j) = \| P_V(U_i) - P_V(U_j) \|_F^2. \qquad (5.4)$$

Instead of forcing the pairwise data-data similarity matrices to agree between two feature types as in [5], we relax the constraint to data-cluster similarity matrices for noise suppression. Besides that, we propose an additional requirement that the two feature embeddings $U_i$ and $U_j$ in (5.4) should accommodate the universal feature embedding $V$. Thus, $D_V(U_i, V)$ and $D_V(U_j, V)$ should also be minimized. We further extend (5.4) into (5.5) to measure the disagreement among $U_i$, $U_j$ and $V$:

$$Q_{ij} = \frac{1}{2}\left[ D_V(U_i, U_j) + D_V(U_i, V) + D_V(U_j, V) \right] = \mathrm{tr}\left\{ V^T \left( I - \mathrm{sym}(U_i U_i^T U_j U_j^T) \right) V \right\}, \qquad (5.5)$$

where $\mathrm{sym}(A) = (A + A^T)/2$ for any square matrix $A$. To derive (5.5), we use the trace expansion of the Frobenius norm, as well as the linearity and cyclicity properties of the matrix trace. Now let

$$L_{ij} = I - \mathrm{sym}(U_i U_i^T U_j U_j^T), \qquad (5.6)$$

then the disagreement (5.5) becomes

$$Q_{ij} = \mathrm{tr}(V^T L_{ij} V). \qquad (5.7)$$
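In code, each pairwise disagreement reduces to one symmetrized matrix product. A sketch:

```python
import numpy as np

def pairwise_disagreement(U_i, U_j, V):
    """L_ij = I - sym(U_i U_i^T U_j U_j^T) (5.6) and Q_ij = tr(V^T L_ij V) (5.7)."""
    A = U_i @ U_i.T @ U_j @ U_j.T
    L_ij = np.eye(U_i.shape[0]) - 0.5 * (A + A.T)
    Q_ij = np.trace(V.T @ L_ij @ V)
    return L_ij, Q_ij
```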

In addition, we also need to minimize the unary cost of the spectral embedding of each feature type, for $1 \le i \le M$ (see (4.3)):

$$Q_{ii} = \mathrm{tr}(U_i^T L_i U_i), \qquad (5.8)$$

where $L_i$ denotes the normalized Laplacian matrix of a specific feature type and $U_i$ is the corresponding Laplacian embedding.

Therefore, for $1 \le i \le j \le M$, we need to minimize both the pairwise disagreement costs defined by (5.7) and the unary spectral embedding costs defined by (5.8): $\sum_{j \ge i} \sum_{i=1}^{M} Q_{ij}$. However, as the pairwise costs $\{Q_{ij}\}_{i<j}$ and the unary costs $\{Q_{ii}\}$ have different properties, they cannot simply be fused using the same weight. Moreover, even for the same type of costs, assigning equal weights may not be the optimal choice either, as a poor feature modality or two opposing feature modalities may introduce a larger embedding or disagreement cost. Instead of assigning equal weights, we consider a weighted combination of $\{Q_{ij}\}_{i \le j}$. However, directly minimizing the minimal weighted sum of costs may not be preferable, because it would assign small weights to large disagreements and large weights to small disagreements. As a result, only those feature types with smaller disagreements and less complementary information would be highlighted with larger weights, and we would be prevented from exploring and fusing complementary information among feature types with larger disagreements. Thus, we instead prefer to assign a larger penalty weight to a $Q_{ij}$ of higher cost, which enables us to concentrate more on minimizing it, so that not only the overall cost can be reduced, but also the consensus can be

reached by suppressing high values of the individual costs $Q_{ij}$. To achieve these goals, we propose the following optimization problem:

$$\min_{\{U_i\}_{i=1}^{M},\, V} \;\max_{\{\alpha_{ij}\}_{j \ge i}^{M}} \;\sum_{j \ge i} \sum_{i=1}^{M} \alpha_{ij}^{\gamma} Q_{ij}$$
$$\text{subject to}\quad \alpha_{ij} \in \mathbb{R}_+,\; \sum_{j \ge i} \sum_{i=1}^{M} \alpha_{ij} = 1, \qquad (5.9)$$
$$U_i \in \mathbb{R}^{N \times K},\; U_i^T U_i = I,$$
$$V \in \mathbb{R}^{N \times K},\; V^T V = I,$$

where $\gamma \in [0, 1)$ is a parameter to control the distribution of the weights $\alpha_{ij}$. When $\gamma = 0$, it is a special case with equal weights.
This optimization problem (5.9) aims to achieve multi-view fusion by minimizing the maximum weighted disagreement costs. On the one hand, maximizing the overall cost w.r.t. the weight variables will highlight the $Q_{ij}$ of high cost, i.e., large disagreement or high embedding cost. On the other hand, minimizing the overall weighted cost w.r.t. the embeddings can further reduce the highlighted costs. Moreover, it is worth noting that the proposed objective function has only one parameter $\gamma$. Instead of manually selecting the weights $\alpha_{ij}$ for $Q_{ij}$, the proposed objective function optimizes the fusing weights as well.

5.2.3 Minimax Optimization

In theory, it is possible to apply root-finding methods, e.g., Newton's method, to solve for the saddle point of the constrained minimax problem (5.9) via a Lagrangian transformation [9]. However, the number of unknown variables together with the introduced multipliers will be as many as $O(N^2)$. It is intractable to successively update and solve such a large-scale linear system as required by Newton's method.

Instead of using Newton's method, we propose an alternative solution using minimax optimization. This is motivated by the fact that the objective function $Q = \sum_{j \ge i} \sum_{i=1}^{M} \alpha_{ij}^{\gamma} Q_{ij}$ is concave w.r.t. each of $\{\alpha_{ij}\}_{j \ge i}^{M}$ and convex w.r.t. each of $\{U_i\}_{i=1}^{M}$ and $V$. Since $Q$ is differentiable, it is easy to reach a local optimum of the objective w.r.t. one variable while fixing the others. Therefore, we propose to alternately update $V$, $\{U_i\}_{i=1}^{M}$ and $\{\alpha_{ij}\}_{j \ge i}^{M}$. Such an update is generally able to boost the embedding performance, as studied in Sect. 5.3.

Initialization. We initialize each $U_i$ using the Laplacian embedding of the corresponding feature type and assign equal weights to the different costs.

Minimization: optimizing V. To minimize the objective function $Q$, it can be transformed into (5.10) using the linearity of the matrix trace:

Algorithm 5: Minimax Optimization for Multi-feature Spectral Clustering (MOMSC)

Input: number of feature types $M$; Laplacian matrix $L_i$ for the $i$th feature type; number of clusters $K$; parameter $\gamma$
Output: data clustering assignment indexes $Y \in \mathbb{R}^N$

// Initialization
1: $\alpha_{ij} \leftarrow \left( \sum_{q \ge p} \sum_{p=1}^{M} 1 \right)^{-1}$, $1 \le i \le j \le M$
2: for $i \in [1, M]$ do
3:   $L_{\text{reg},i} \leftarrow L_i$  (Sect. 4.2.1)
4:   $U_i \leftarrow \arg\min_{U} \mathrm{tr}(U^T L_{\text{reg},i} U)$ s.t. $U^T U = I$
5:   $Q_{ii} \leftarrow \mathrm{tr}(U_i^T L_i U_i)$
6: end for
// Main loop
7: repeat
8:   for $i \in [1, M]$ and $j \in [i + 1, M]$ do
9:     $L_{ij} \leftarrow I - \mathrm{sym}(U_i U_i^T U_j U_j^T)$  (5.6)
10:  end for
11:  $L_V \leftarrow \sum_{j=i+1}^{M} \sum_{i=1}^{M} \alpha_{ij}^{\gamma} L_{ij}$  (5.11)
12:  $V \leftarrow \arg\min_{U} \mathrm{tr}(U^T L_V U)$ s.t. $U^T U = I$
13:  $Q_{ij} \leftarrow \mathrm{tr}(V^T L_{ij} V)$
14:  for $i \in [1, M]$ and $j \in [i + 1, M]$ do
15:    $\alpha_{ij} \leftarrow Q_{ij}^{\frac{1}{1-\gamma}} \Big/ \sum_{q \ge p} \sum_{p=1}^{M} Q_{pq}^{\frac{1}{1-\gamma}}$  (5.15)
16:  end for
17:  for $i \in [1, M]$ do
18:    $L_{\text{reg},i} \leftarrow \alpha_{ii}^{\gamma} L_i - \sum_{j \ne i} \alpha_{ij}^{\gamma}\, \mathrm{sym}(U_j U_j^T V V^T)$  (5.13)
19:    $U_i \leftarrow \arg\min_{U} \mathrm{tr}(U^T L_{\text{reg},i} U)$ s.t. $U^T U = I$
20:    $Q_{ii} \leftarrow \mathrm{tr}(U_i^T L_i U_i)$
21:  end for
22: until $Q$ (5.9) converges or the maximum number of iterations is reached
// Discrete solution
23: return $Y \leftarrow$ k-means clustering on the rows of $V$

$$Q = \mathrm{tr}(V^T L_V V) + \sum_{i=1}^{M} \alpha_{ii}^{\gamma} Q_{ii}, \qquad (5.10)$$

where

$$L_V = \sum_{j=i+1}^{M} \sum_{i=1}^{M} \alpha_{ij}^{\gamma} L_{ij}, \qquad (5.11)$$

and only the first term is related to $V$. Under the orthonormal constraints on $V$ in (5.9), $V$ can be updated by performing spectral embedding of $L_V$, i.e., by seeking the

first K smallest eigenvectors of LV . This optimization actually minimizes weighted


pairwise disagreements to generate a universal embedding of consensus.
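The V-update therefore aggregates the weighted pairwise Laplacians and takes the K eigenvectors with the smallest eigenvalues. A sketch, where `L_pairs` maps each pair (i, j), i < j, to $L_{ij}$, `alpha` holds the current weights, and `gamma` is the weighting parameter whose symbol is reconstructed here:

```python
import numpy as np

def update_universal_embedding(L_pairs, alpha, gamma, K, N):
    """Universal embedding V from L_V = sum_{i<j} alpha_ij^gamma L_ij (5.10)-(5.11)."""
    L_V = np.zeros((N, N))
    for (i, j), L_ij in L_pairs.items():
        L_V += (alpha[(i, j)] ** gamma) * L_ij
    _, eigvecs = np.linalg.eigh((L_V + L_V.T) / 2.0)   # symmetrize, then eigendecompose
    return eigvecs[:, :K]
```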
Minimization: optimizing $U_i$. To minimize the objective function $Q$, it can be transformed into (5.12) using the linearity and cyclicity of the matrix trace, where we let $\alpha_{ij} = \alpha_{ji}$ for $j < i$:

$$Q = \mathrm{tr}(U_i^T L_{\text{reg},i} U_i) + C_i, \qquad (5.12)$$

where

$$L_{\text{reg},i} = \alpha_{ii}^{\gamma} L_i - \sum_{j \ne i} \alpha_{ij}^{\gamma}\, \mathrm{sym}(U_j U_j^T V V^T), \qquad (5.13)$$

and

$$C_i = \sum_{j \ge h,\, j \ne i,\, h \ne i} \alpha_{hj}^{\gamma} Q_{hj} + K \sum_{j \ne i} \alpha_{ij}^{\gamma}. \qquad (5.14)$$

Since $C_i$ is not related to $U_i$, under the orthonormal constraints on $U_i$ in (5.9), $U_i$ can be updated by performing spectral embedding of $L_{\text{reg},i}$, i.e., by seeking its first $K$ smallest eigenvectors. From (5.13), we can see that the optimization of each unary feature embedding takes into account the embeddings of the other features as well as the universal feature embedding through Laplacian regularization. We are thus able to gradually reconcile the disagreements among different features in this way.
Maximization: optimizing $\alpha_{ij}$. Given the embeddings, the problem becomes a maximization problem w.r.t. $\alpha_{ij}$. Applying the Lagrange multiplier method, we can obtain the closed form of $\alpha_{ij}$ as

$$\alpha_{ij} = \frac{Q_{ij}^{\frac{1}{1-\gamma}}}{\sum_{q \ge p} \sum_{p=1}^{M} Q_{pq}^{\frac{1}{1-\gamma}}}. \qquad (5.15)$$

Because $\gamma \in [0, 1)$, formula (5.15) shows that larger costs will be assigned larger weights. As a result, larger disagreements will be suppressed across heterogeneous features in the process of total cost minimization. Further analyzing (5.15), we can see that when $\gamma \to 0$, the different weights come close to each other; when $\gamma \to 1$, the weight of the largest cost tends to 1, while the other weights approach 0; $0 < \gamma < 1$ achieves a trade-off weighting. We will also discuss the influence of the parameter $\gamma$ in the experiments.
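The weight update itself is a one-line normalization of the current costs. A sketch under the reconstructed form of (5.15), with `Q` mapping each pair (i, j), i ≤ j, to its current cost and `gamma` in [0, 1):

```python
def update_weights(Q, gamma):
    """Closed-form weights alpha_ij proportional to Q_ij^(1/(1-gamma)), Eq. (5.15)."""
    expo = 1.0 / (1.0 - gamma)          # requires gamma < 1
    raw = {ij: q ** expo for ij, q in Q.items()}
    total = sum(raw.values())
    return {ij: r / total for ij, r in raw.items()}
```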
We refer to the proposed method as Minimax Optimization for Multi-feature Spectral Clustering (MOMSC) and show the complete solution in Algorithm 5. As can be seen, the computational complexity within each iteration of this algorithm mainly lies in $M + 1$ eigen-decompositions (Lines 12 and 19), each of which can be solved in time $O(N^3)$ [2] or efficiently approximated by bilateral random projections [15].

Due to the complicated min–max iteration, we do not have a complete theoretical analysis to justify the convergence of Algorithm 5. We will, however, discuss its convergence empirically in Sect. 5.3.5.

5.3 Experiments

5.3.1 Datasets and Experimental Setting

To evaluate the proposed MOMSC method, we follow Chap. 4 to conduct experi-


ments on three image datasets: UCI Digits [6], Oxford flowers [8], UC merced land
uses [14], and one video dataset: body motions [11]. We summarize the feature
descriptors in Table 5.1 for the used datasets, including feature type IDs and feature
dimensions.

5.3.2 Baseline Algorithms

We compare the proposed MOMSC method with the below baselines.


Spectral Clustering with Single Feature Type (SC(#)): running spectral clus-
tering [7] with graph Laplacian derived from a single feature type.
Feature Concatenation for Spectral Clustering (FCSC): running spectral clus-
tering [7] with graph Laplacian derived from concatenated features of all feature
types.
Kernel Averaging Spectral Clustering (KASC): averaging the normalized kernel matrices derived from the individual feature types, followed by applying spectral clustering [7] with the corresponding Laplacian. The kernel normalization is obtained by $(\mathrm{ker})^{\frac{1}{\mathrm{dim}}}$, where $\mathrm{ker}$ denotes a feature kernel matrix and $\mathrm{dim}$ denotes the feature dimension.
Centroid Co-regularized Spectral Clustering (CRSC): pushing all spectral
embeddings of different feature types close to a centroid embedding using data-
data similarity matrices (5.1) [5], followed by k-means clustering with the centroid
embedding. We set the parameters in this algorithm to be 0.01 as suggested.
Pairwise Co-regularized Spectral Clustering (PRSC): pushing pairwise spec-
tral embeddings of different feature types close to each other using data-data similar-
ity matrices (5.1) [5], followed by k-means clustering with embedding concatenation.
We set the parameter in this algorithm to be 0.01 as suggested.
Multi-Modal Spectral Clustering (MMSC): learning a shared graph Laplacian from different feature types for spectral embedding [1], followed by multi-view clustering with NMF, in which the initialization comes from the embedding result [3]. We report the best results by tuning the parameter of this algorithm in the range from $10^{-2}$ to $10^{2}$ with incremental step $10^{0.2}$, as suggested.

Table 5.1 The image/video datasets used in the experiments and the used feature descriptors. [2014] IEEE. Reprinted, with permission, from Ref. [13]

Feature type | Body motions (Feature, Dimension) | Oxford flowers (Feature, Dimension) | UC merced land uses (Feature, Dimension) | UCI digits (Feature, Dimension)
1 | HOG, 4000 | Color, 500 | LLC 1 × 1, 1024 | FOU, 76
2 | MBH, 4000 | Shape, 1000 | LLC 2 × 2, 4096 | FAC, 216
3 | – | Texture, 700 | LLC 4 × 4, 16384 | KAR, 64
4 | – | pHOG, 680 | – | PIX, 240
5 | – | GIST, 512 | – | ZER, 47
6 | – | Color histogram, 784 | – | MOR, 6

Affinity Aggregation Spectral Clustering (AASC): aggregating affinities of dif-


ferent feature types with optimized weights [4], followed by applying normalized
cut [10] with corresponding Laplacian.

5.3.3 Evaluation Metrics

Following Chaps. 2 and 3, we still use clustering accuracy (ACC) as a metric to evaluate clustering performance. In addition, we use normalized mutual information (NMI) [12] as another metric, which allows us to take the prior distribution of the data classes into account for evaluation. Denoting the set of ground-truth classes by $C$ and the set of clustering groups by $\bar{C}$, the NMI is defined by

$$\mathrm{NMI} = \frac{I(C, \bar{C})}{\sqrt{H(C)\, H(\bar{C})}}, \qquad (5.16)$$

where

$$I(C, \bar{C}) = \sum_{c_i \in C,\, \bar{c}_j \in \bar{C}} p(c_i, \bar{c}_j) \log_2 \frac{p(c_i, \bar{c}_j)}{p(c_i)\, p(\bar{c}_j)} \qquad (5.17)$$

is the mutual information of $C$ and $\bar{C}$, and

$$H(C) = -\sum_{c_i \in C} p(c_i) \log_2 p(c_i), \qquad H(\bar{C}) = -\sum_{\bar{c}_j \in \bar{C}} p(\bar{c}_j) \log_2 p(\bar{c}_j) \qquad (5.18)$$

define the entropies of $C$ and $\bar{C}$, respectively. Similar to ACC, one can see from (5.16) that NMI ranges from 0 to 1, and a higher NMI value indicates a better clustering result.
When evaluating a clustering algorithm with ACC and NMI, we run the algorithm 10 times with randomly initialized clustering and report the mean performance ± standard deviation.

5.3.4 Experimental Results

To compare the proposed MOMSC with the baseline methods, we fix the parameter $\gamma$ of MOMSC to 0.33 in Algorithm 5. The results are evaluated by ACC and NMI, as shown in Tables 5.2 and 5.3. We can see that MOMSC generally outperforms the compared methods. This is because MOMSC benefits from effectively unveiling and fusing the complementary information of heterogeneous features through the minimax optimization of (5.9).

Table 5.2 ACC comparisons of various baselines with the proposed MOMSC. Based on Ref. [13]

Method | Body motions | Oxford flowers | UC merced land uses | UCI digits
SC (1) | 0.273 ± 0.009 | 0.343 ± 0.013 | 0.381 ± 0.011 | 0.679 ± 0.044
SC (2) | 0.312 ± 0.010 | 0.404 ± 0.016 | 0.364 ± 0.020 | 0.631 ± 0.048
SC (3) | – | 0.257 ± 0.010 | 0.387 ± 0.010 | 0.692 ± 0.087
SC (4) | – | 0.093 ± 0.004 | – | 0.710 ± 0.053
SC (5) | – | 0.336 ± 0.007 | – | 0.569 ± 0.021
SC (6) | – | 0.234 ± 0.006 | – | 0.420 ± 0.028
FCSC | 0.289 ± 0.010 | – | 0.352 ± 0.020 | 0.638 ± 0.016
KASC | 0.301 ± 0.013 | 0.370 ± 0.012 | 0.294 ± 0.010 | 0.709 ± 0.042
PRSC [5] | 0.275 ± 0.007 | 0.419 ± 0.013 | 0.368 ± 0.022 | 0.769 ± 0.049
CRSC [5] | 0.317 ± 0.020 | 0.449 ± 0.019 | 0.395 ± 0.011 | 0.770 ± 0.036
MMSC [1] | 0.266 ± 0.008 | 0.416 ± 0.015 | 0.123 ± 0.012 | 0.731 ± 0.011
AASC [4] | 0.250 ± 0.011 | 0.410 ± 0.028 | 0.226 ± 0.009 | 0.683 ± 0.047
MOMSC | 0.322 ± 0.015 | 0.493 ± 0.039 | 0.404 ± 0.021 | 0.800 ± 0.102

Table 5.3 NMI comparisons of various baselines with the proposed MOMSC. Based on Ref. [13]

Method | Body motions | Oxford flowers | UC merced land uses | UCI digits
SC (1) | 0.263 ± 0.005 | 0.371 ± 0.009 | 0.459 ± 0.012 | 0.649 ± 0.015
SC (2) | 0.342 ± 0.004 | 0.425 ± 0.007 | 0.440 ± 0.017 | 0.622 ± 0.016
SC (3) | – | 0.239 ± 0.009 | 0.449 ± 0.006 | 0.652 ± 0.042
SC (4) | – | 0.242 ± 0.003 | – | 0.660 ± 0.027
SC (5) | – | 0.397 ± 0.005 | – | 0.500 ± 0.009
SC (6) | – | 0.269 ± 0.006 | – | 0.469 ± 0.008
FCSC | 0.302 ± 0.008 | – | 0.410 ± 0.011 | 0.650 ± 0.006
KASC | 0.328 ± 0.009 | 0.403 ± 0.009 | 0.349 ± 0.009 | 0.668 ± 0.016
PRSC [5] | 0.263 ± 0.005 | 0.435 ± 0.008 | 0.447 ± 0.010 | 0.728 ± 0.020
CRSC [5] | 0.335 ± 0.012 | 0.461 ± 0.009 | 0.468 ± 0.006 | 0.713 ± 0.011
MMSC [1] | 0.305 ± 0.010 | 0.427 ± 0.012 | 0.265 ± 0.007 | 0.675 ± 0.008
AASC [4] | 0.292 ± 0.008 | 0.422 ± 0.014 | 0.291 ± 0.007 | 0.649 ± 0.018
MOMSC | 0.352 ± 0.010 | 0.484 ± 0.022 | 0.482 ± 0.015 | 0.785 ± 0.049

To further demonstrate the advantage of the proposed MOMSC and provide deeper
insights from the results, we study how different regularization methods influence
the performance of each feature embedding, as well as the performance of the fusion
result. As shown in Fig. 5.2, we compare the related work PRSC and CRSC with
MOMSC on the body motion and Oxford flower datasets.

[Fig. 5.2 panels: a PRSC, b CRSC, c MOMSC on body motions (NMI of the HOG, MBH, and Fusion embeddings over 10 iterations); d PRSC, e CRSC, f MOMSC on Oxford flowers (NMI of the Color, Shape, Texture, and Fusion embeddings over 10 iterations).]

Fig. 5.2 Iteration comparisons of PRSC, CRSC, and the proposed MOMSC w.r.t. NMI performance
of each feature embedding and the fusion result on the body motion and Oxford flower datasets.
[2014] IEEE. Reprinted, with permission, from Ref. [13]

PRSC adopts pairwise regularization among different modality-specific Lapla-


cian embeddings using data-data similarity matrices. From Fig. 5.2a and d, we can see that PRSC is sensitive to the poorer feature types, e.g., the HOG features in the body motion dataset and the color/texture features in the Oxford flower dataset. In the initializa-
tion, i.e., the first iteration, the fusion result approaches/exceeds the performance
of the best feature type. However, with more iterations, the regularization may lead
to a worse result. In such a case, it is not a good choice for multi-view clustering.
Similarly, CRSC also leverages data-data similarity matrices to perform regular-
ization. But it aims to force each modality-specific Laplacian embedding toward a
consensus embedding. From Fig. 5.2b and e, we can see that CRSC is not very sen-
sitive to poor feature types. However, it cannot effectively enhance the poor feature
types either. As shown in Fig. 5.2e, successive regularization does not improve the
bad performances of color and texture features. In such a case, CRSC is unable to
bring different modality-specific Laplacian embeddings close enough, which finally
influences the consensus result. On the contrary, we do not directly use data-data sim-
ilarity matrices for regularization. Instead, we relax each data-data similarity matrix
to a regularized data-cluster similarity matrix (using (5.2)) for the proposed regular-
ization framework. As a result, we can see from Fig. 5.2c and f, although MOMSC
may not perform better than the best feature modality initially (Fig. 5.2c), the pro-

posed regularization can gradually reduce disagreements among different feature


embeddings to refine the performance, and finally enhance the fusion result.

5.3.5 Convergence Analysis

To evaluate the convergence of the proposed MOMSC, Fig. 5.3 shows how the NMI performance and the objective function value change over the iterations, with $\gamma = 0.33$. For each subfigure, the top row shows the NMI values, while the bottom row shows the objective values. As can be seen, the objective function first moves down, then upwards, and flattens within a small number of iterations. After fewer than 20 iterations, the algorithm converges to a saddle point, which meets the minimax optimization of the objective function (5.9). Besides, it is interesting to notice that, although the objective function value changes very little after several iterations, the performance can still benefit from the min–max iterations, e.g., the results shown in Fig. 5.3b, c. This further verifies the effectiveness of the proposed algorithm.
To further understand the optimization of MOMSC, we use the UCI handwritten digits dataset to plot the change of the weights and the costs Q_ij of all feature pairs over iterations in Fig. 5.4. As can be seen in Fig. 5.4a, the unary costs and pairwise costs are equally weighted at the beginning. After one iteration, the cost distribution is plotted in Fig. 5.4b. From the heterogeneous pairwise costs, we can see that the different feature types highly disagree with each other. According to Eq. (5.15), larger weights are associated with higher costs. As a result, the weight distribution is positively correlated with the cost distribution, as shown by the lines with triangle markers in Fig. 5.4a, b. Along the iterations, each unary cost cannot become smaller than the cost obtained by the initial independent Laplacian embedding. Nevertheless, the proposed minimax optimization can leverage large weights to suppress high unary costs, so the unary costs increase only slightly. In the meantime, the pairwise costs are reduced due to the overall cost minimization in each minimax iteration. Finally, all the pairwise costs are reduced to small values, which leads to a consensus among the different feature types.
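Eq. (5.15) itself is not reproduced here, so the toy function below only mimics the qualitative behavior described above, namely that larger costs receive larger (normalized) weights; the power-style rule and the parameter gamma are our own stand-ins, not the actual update.

```python
import numpy as np

def reweight_by_cost(costs, gamma=2.0):
    """Assign larger weights to larger costs (the maximization side of the
    minimax scheme), normalized so that the weights sum to one."""
    costs = np.asarray(costs, dtype=float)
    w = costs ** (1.0 / (gamma - 1.0)) if gamma > 1.0 else costs
    return w / w.sum()

# toy usage: the highest pairwise cost receives the largest weight
print(reweight_by_cost([0.2, 0.5, 1.8]))
```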

5.3.6 Sensitivity of Parameters

In the objective function (5.9), we have only one parameter, which controls the fusion weights of the different feature types. Figure 5.5 shows that the best choice of this parameter is around 0.3 for the experimental datasets. As another factor, different feature orders give rise to different initializations of the proposed algorithm, which may result in different local optima.
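If one wishes to reproduce such a sensitivity study, a minimal sweep can be scripted as below. It assumes a hypothetical wrapper momsc_cluster(features, param) around Algorithm 5 (not part of the book's code) and ground-truth labels for scoring with NMI via scikit-learn.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def sweep_fusion_parameter(features, labels, momsc_cluster,
                           grid=np.linspace(0.1, 0.9, 9)):
    """Try several values of the fusion parameter and report NMI for each.
    `momsc_cluster(features, param)` is a hypothetical wrapper that returns
    predicted cluster assignments for the multi-feature input."""
    scores = {}
    for param in grid:
        pred = momsc_cluster(features, param)
        scores[round(float(param), 2)] = normalized_mutual_info_score(labels, pred)
    return scores
```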
Fig. 5.3 Convergence study for the proposed MOMSC. For each subfigure, the top row is for NMI performance and the bottom row is for objective value, both plotted against the iteration number: (a) Body Motions; (b) Oxford Flowers; (c) UC Merced Land Uses; (d) UCI Digits

To study how the input order of feature types influences clustering performance in the proposed MOMSC, we run Algorithm 5 using different feature input orders on the Oxford flower dataset. We enumerate all six possible permutations of three feature
types, as shown in the first column of Table 5.4. The corresponding feature IDs are
given in Table 5.1. As shown in Table 5.4, we obtain the lowest performance when
the 2nd feature type is input before the other two feature types. It is worth noting that,
however, the results in Tables 5.2 and 5.3 show that the 2nd feature type achieves the best performance in spectral clustering compared with the other feature types. In such a case, if we begin with the best feature type followed by the poorer ones, the fusion may not achieve its best performance. Therefore, it is not advisable to let the strongest feature type be the first input, as it may suppress the performance gain of the weaker features.
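The order study itself is easy to script. The sketch below reuses the hypothetical momsc_cluster wrapper from the earlier sweep sketch and simply enumerates all input orders before scoring each run with NMI.

```python
from itertools import permutations
from sklearn.metrics import normalized_mutual_info_score

def evaluate_feature_orders(feature_dict, labels, momsc_cluster, param=0.3):
    """Run clustering once per input order of the feature types and record
    the resulting NMI, mirroring the protocol behind Table 5.4."""
    results = {}
    for order in permutations(sorted(feature_dict)):   # e.g. ('1', '2', '3')
        ordered_feats = [feature_dict[k] for k in order]
        pred = momsc_cluster(ordered_feats, param)
        key = ''.join(str(k) for k in order)
        results[key] = normalized_mutual_info_score(labels, pred)
    return results
```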

Fig. 5.4 The change of the weights and the costs Q_ij of all feature pairs with the minimax optimization of the proposed MOMSC on the dataset of UCI handwritten digits

Fig. 5.5 NMI performance of the proposed MOMSC with different values of its parameter. © 2014 IEEE. Reprinted, with permission, from Ref. [13]

Table 5.4 Performance of flower clustering with different feature input orders using the proposed MOMSC

Feature order    ACC              NMI
123              0.493 ± 0.039    0.484 ± 0.022
321              0.481 ± 0.034    0.482 ± 0.017
312              0.498 ± 0.034    0.498 ± 0.023
231              0.458 ± 0.026    0.468 ± 0.013
213              0.458 ± 0.017    0.471 ± 0.009
132              0.506 ± 0.028    0.497 ± 0.019

5.4 Summary of this Chapter

Multi-view clustering is a challenging problem, as it is difficult to find a clustering result agreeable to all feature modalities. To find the consensus, we explore a loss function consisting of both a unary term based on the cost of the Laplacian embedding of each individual feature modality and a pairwise disagreement term between any pair of feature modalities. To optimize the objective function, we propose a minimax formulation by minimizing the maximum loss, which has only one parameter. Our multi-view clustering results on four image and video datasets show superior performance when compared with state-of-the-art methods.
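Since (5.9) is not restated in this summary, the schematic below captures the structure just described in our own notation; the symbol α_ij stands in for the weight variables, whose exact symbols and constraints follow Sect. 5.2 rather than this sketch.

\[
\min_{U_1,\ldots,U_M}\ \max_{\boldsymbol{\alpha}\in\Delta}\ \sum_{1\le i\le j\le M}\alpha_{ij}\,Q_{ij}\bigl(U_i,U_j\bigr)
\]

Here Q_ii is the Laplacian embedding cost of the i-th feature modality, Q_ij with i < j is the disagreement cost between the embeddings U_i and U_j, and Δ denotes the feasible set of weights governed by the single fusion parameter; the inner maximization concentrates weight on the currently largest costs, while the outer minimization reduces them.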

References

1. Cai, X., Nie, F., Huang, H., Kamangar, F.: Heterogeneous image feature integration via multi-modal spectral clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1977–1984 (2011)
2. Demmel, J.W., Marques, O.A., Parlett, B.N., Vömel, C.: Performance and accuracy of LAPACK's symmetric tridiagonal eigensolvers. SIAM J. Sci. Comput. 30(3), 1508–1526 (2008)
3. Ding, C., Li, T., Jordan, M.I.: Nonnegative matrix factorization for combinatorial optimization: spectral clustering, graph matching, and clique finding. In: Proceedings of IEEE International Conference on Data Mining, pp. 183–192 (2008)
4. Huang, H.C., Chuang, Y.Y., Chen, C.S.: Affinity aggregation for spectral clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 773–780 (2012)
5. Kumar, A., Rai, P., Daumé III, H.: Co-regularized multi-view spectral clustering. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1413–1421 (2011)
6. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
7. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. Proc. Adv. Neural Inf. Process. Syst. 2, 849–856 (2001)
8. Nilsback, M., Zisserman, A.: A visual vocabulary for flower classification. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2, 1447–1454 (2006)
9. Qi, L., Sun, W.: An iterative method for the minimax problem. In: Minimax and Applications, Nonconvex Optimization and Its Applications, pp. 55–67. Springer, Heidelberg (1995)
10. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
11. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild (2012). arXiv preprint arXiv:1212.0402
12. Strehl, A., Ghosh, J.: Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3(3), 583–617 (2002)
13. Wang, H., Weng, C., Yuan, J.: Multi-feature spectral clustering with minimax optimization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4106–4113 (2014)
14. Yang, Y., Newsam, S.: Spatial pyramid co-occurrence for image classification. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1465–1472 (2011)
15. Zhou, T., Tao, D.: Bilateral random projections. In: IEEE International Symposium on Information Theory, pp. 1286–1290 (2012)
Chapter 6
Conclusion

Abstract Over the past dozen years, visual pattern discovery has received increasing attention, especially in the communities of computer vision and data mining. This book provides a systematic study of visual pattern discovery problems, from unsupervised to semi-supervised approaches. This chapter concludes the book and suggests worthy directions for further research.

Keywords Spatial co-occurrence pattern discovery · Feature co-occurrence pattern discovery · Context-aware clustering · Hierarchical sparse coding · Transductive label propagation · Minimax fusion

To discover spatial and feature co-occurrence patterns for visual data analytics, we proposed four effective approaches. In Chap. 2, we leveraged multiple features and spatial contexts of visual data to discover feature co-occurrence patterns and spatial co-occurrence patterns from local visual primitives. In Chap. 3, we proposed an improved hierarchical sparse coding method for visual co-occurrence pattern discovery; it uses soft assignments instead of the hard assignments used in Chap. 2 and thus achieves superior performance. In Chap. 4, we further studied feature co-occurrence patterns in a transductive learning framework to propagate visual labels based on similar patterns. In Chap. 5, without spatial context or data label information, we proposed a spectral clustering method based purely on multi-feature embedding.
Regarding these proposed approaches, using state-of-the-art features such as the Fisher Vector [11–13], the Vector of Locally Aggregated Descriptors (VLAD) [5], and Convolutional Neural Network (CNN)-based features [6, 8, 14] could further enhance the discovery of visual patterns. What matters is that, with these features, we are able to obtain a favorable result of k-means clustering (Chap. 2), sparse coding (Chap. 3), or spectral embedding (Chaps. 4 and 5). Meanwhile, when performing multi-feature transduction (Chap. 4), enough labeled data will be helpful.
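As a small illustration of this point, the sketch below swaps hand-crafted descriptors for pooled CNN activations before a k-means step. The choice of ResNet-18, the pooled 512-d layer, and the cluster count are our own assumptions (and require a recent torchvision), not the pipeline used in Chap. 2.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.cluster import KMeans

# pretrained backbone used as a generic image feature extractor
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # keep the 512-d pooled features
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def extract_features(pil_images):
    """Stack preprocessed images into a batch and return CNN features."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return backbone(batch).numpy()

# feats = extract_features(list_of_pil_images)   # images supplied by the user
# labels = KMeans(n_clusters=10, n_init=10).fit_predict(feats)
```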
There are also a few potential applications to study in future work: (i) In many computer vision applications, it is important to appropriately fuse different features to perform better recognition or detection, such as using appearance and motion features for action recognition [1, 19], or RGB and depth features for object recognition [4]. Therefore, we can investigate how the proposed methods can
benefit these tasks. (ii) It is possible to explore the applications of image search by semi-supervised visual pattern classification, and of social relationship mining [18] by visual co-occurrence discovery. (iii) Besides the image domain and feature domain,
our proposed approaches also have the potential to mine recurring patterns among
any other multiple domains. In particular, we are able to mine the hidden but massive links (i.e., patterns) in heterogeneous information networks such as the Conference-
Author Network [17] and Image-Rich Information Networks formed by social image
websites [7].
Several open issues in visual pattern discovery need to be addressed in further study, including:

• Interpreting visual patterns and effectively measuring their quality. Interpretation and quality measures are crucial to visual pattern discovery. Despite a few successes in explaining visual patterns [2, 9, 21], we still need a deeper investigation of the spatial co-occurrences, geometric associations, and visual appearance of individual primitives in order to better understand and utilize visual patterns.
• Selecting representative and discriminative patterns. Mining representative and discriminative patterns is a non-trivial problem, as the two goals sometimes contradict each other. However, depending on the application, it is interesting to develop methods that can find such visual patterns. Some efforts have been made along this line recently, e.g., local frequent histograms [3], discriminative doublets [15], and mid-level deep patterns [10].
• Effectively combining bottom-up and top-down approaches for visual pattern discovery. Bottom-up methods can assemble visual primitives with a similar spatial layout into a specific visual pattern, while top-down methods can model pattern mixtures over visual primitives [16, 20]. How to combine the strengths of bottom-up methods and top-down methods for visual pattern discovery will be an interesting research topic.

References

1. Cai, X., Nie, F., Huang, H., Kamangar, F.: Heterogeneous image feature integration via multi-modal spectral clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1977–1984 (2011)
2. Faktor, A., Irani, M.: Clustering by composition – unsupervised discovery of image categories. In: Proceedings of European Conference on Computer Vision, pp. 474–487 (2012)
3. Fernando, B., Fromont, E., Tuytelaars, T.: Mining mid-level features for image classification. Int. J. Comput. Vis. 108(3), 186–203 (2014)
4. Han, J., Shao, L., Xu, D., Shotton, J.: Enhanced computer vision with Microsoft Kinect sensor: a review. IEEE Trans. Cybern. 43(5), 1318–1334 (2013)
5. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3304–3311 (2010)
6. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding (2014). arXiv:1408.5093
7. Jin, X., Luo, J., Yu, J., Wang, G., Joshi, D., Han, J.: iRIN: image retrieval in image-rich information networks. In: Proceedings of International World Wide Web Conference, pp. 1261–1264 (2010)
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
9. Li, C., Parikh, D., Chen, T.: Automatic discovery of groups of objects for scene understanding. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2012)
10. Li, Y., Liu, L., Shen, C., Hengel, A.V.D.: Mining mid-level visual patterns with deep CNN activations. Int. J. Comput. Vis. 121(3), 344–364 (2017)
11. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
12. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: Proceedings of European Conference on Computer Vision, pp. 143–156 (2010)
13. Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the Fisher vector: theory and practice. Int. J. Comput. Vis. 105(3), 222–245 (2013)
14. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: integrated recognition, localization and detection using convolutional networks (2013). arXiv:1312.6229
15. Singh, S., Gupta, A., Efros, A.: Unsupervised discovery of mid-level discriminative patches. In: Proceedings of European Conference on Computer Vision (2012)
16. Sun, M., Hamme, H.V.: Image pattern discovery by using the spatial closeness of visual code words. In: Proceedings of IEEE International Conference on Image Processing, pp. 205–208. Brussels, Belgium (2011)
17. Sun, Y., Han, J., Zhao, P., Yin, Z., Cheng, H., Wu, T.: RankClus: integrating clustering with ranking for heterogeneous information network analysis. In: Proceedings of the International Conference on Extending Database Technology, pp. 565–576 (2009)
18. Wang, G., Gallagher, A., Luo, J., Forsyth, D.: Seeing people in social context: recognizing people and social relationships. In: Proceedings of European Conference on Computer Vision, pp. 169–182 (2010)
19. Xu, C., Tao, D., Xu, C.: Large-margin multi-view information bottleneck. IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1559–1572 (2014)
20. Zhao, G., Yuan, J., Hua, G.: Topical video object discovery from key frames by modeling word co-occurrence prior. IEEE Trans. Image Process. (2015)
21. Zhu, S., Guo, C., Wang, Y., Xu, Z.: What are textons? Int. J. Comput. Vis. 62(1), 121–143 (2005)