
Studies in Computational Intelligence 1082

Avik Hati
Rajbabu Velmurugan
Sayan Banerjee
Subhasis Chaudhuri

Image Co-segmentation
Studies in Computational Intelligence

Volume 1082

Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new develop-
ments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design methods
of computational intelligence, as embedded in the fields of engineering, computer
science, physics and life sciences, as well as the methodologies behind them. The
series contains monographs, lecture notes and edited volumes in computational
intelligence spanning the areas of neural networks, connectionist systems, genetic
algorithms, evolutionary computation, artificial intelligence, cellular automata, self-
organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems.
Of particular value to both the contributors and the readership are the short publica-
tion timeframe and the world-wide distribution, which enable both wide and rapid
dissemination of research output.
Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
Avik Hati · Rajbabu Velmurugan · Sayan Banerjee ·
Subhasis Chaudhuri

Image Co-segmentation
Avik Hati
Department of Electronics and Communication Engineering
National Institute of Technology Tiruchirappalli
Tiruchirappalli, Tamilnadu, India

Rajbabu Velmurugan
Department of Electrical Engineering
Indian Institute of Technology Bombay
Mumbai, Maharashtra, India

Sayan Banerjee
Department of Electrical Engineering
Indian Institute of Technology Bombay
Mumbai, Maharashtra, India

Subhasis Chaudhuri
Department of Electrical Engineering
Indian Institute of Technology Bombay
Mumbai, Maharashtra, India

ISSN 1860-949X    ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-981-19-8569-0    ISBN 978-981-19-8570-6 (eBook)
https://doi.org/10.1007/978-981-19-8570-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

Image segmentation is a classical and well-known problem in image processing, where an image is partitioned into non-overlapping regions. Such regions may be
objects or meaningful parts of a scene. It is usually a challenging task to perform
image segmentation and automatically extract objects without high-level knowledge
of the object category. Instead, if we have two or more images containing a common
object of interest, jointly trying to segment the images to obtain the common object
will help in automating the segmentation process. This is referred to as the problem
of image co-segmentation. This monograph explores several approaches to perform
robust co-segmentation of images.
The problem of co-segmentation is not as well researched as segmentation. For
us, the motivation for understanding image co-segmentation arose from considering
the problem of identifying videos with similar content and also retrieving images
by searching for a similar image, even before deep learning became popular. We
realized that earlier approaches had considered saliency of an object in an image as
one of the cues for co-segmentation. However, realizing various restrictive issues
with this approach, we started exploring other methods that can perform robust
co-segmentation. We believe that a good representation for the foreground and back-
ground in an image is essential, and hence use a graph representation for the images,
which helped in both unsupervised and supervised approaches. This way we could use
and extend graph matching algorithms that can be made more robust. This could also
be done in the deep neural network framework, extending the strength of the model to
supervised approaches. Given that graph-based approaches for co-segmentation have not been sufficiently explored in the literature, we decided to bring out this monograph
on co-segmentation.
In this monograph, we present several methods for co-segmentation that were
developed over a period of seven years. Most of these methods use the power of
superpixels to represent images and graphs to represent connectedness among them.
Such representations could exploit efficient graph matching algorithms that could
lead to co-segmentation. However, there were several challenges in developing such
algorithms which are brought out in the chapters of this monograph. The challenges
both in formulating and implementing such algorithms are illustrated with analytical


and experimental results. In the unsupervised framework, one of the analytical chal-
lenges relates to the statistical mode detection in a multidimensional feature space.
While a solution is discussed in the monograph, this is one of the problems still
considered to be a challenge in machine learning algorithms.
After presenting unsupervised approaches, we present supervised approaches to
solve the problem of co-segmentation. These methods lead to better performance when large, sufficiently labeled image datasets are available. However, with fewer training images, these methods do not perform well. Hence, in the monograph, we present some recent techniques such as few-shot learning to address the problem of having access to fewer samples during training for co-segmentation. In most of the methods
presented, the problem of co-segmenting a single object across multiple images is
presented. However, the problem of co-segmenting multiple objects across multiple
images is still a challenging problem. We believe the approaches presented in this monograph will help researchers address such co-segmentation problems in a less constrained setting.
Most of the methods presented are good references for practicing researchers. In
addition, the primary target group for this monograph is graduate students in electrical
engineering, computer science, or mathematics who have interest in image processing
and machine learning. Since co-segmentation is a natural extension of segmentation,
the monograph briefly describes topics that would be required for a smooth transition
from segmentation problems. The later chapters in the monograph, including one on a data-deprived method, will be useful for students in the area of machine learning.
Overall, the chapters can help practitioners to consider the use of co-segmentation
in developing efficient image or video retrieval algorithms.
We strongly believe that the monograph will be useful for several readers and
welcome any suggestions or comments.

Mumbai, India
July 2022

Avik Hati
Rajbabu Velmurugan
Sayan Banerjee
Subhasis Chaudhuri

Acknowledgements The authors would like to acknowledge partial support provided by National
Centre for Excellence in Internal Security (NCETIS), IIT Bombay. Funding support in the form of
JC Ghosh Fellowship to the last author is also gratefully acknowledged. We also acknowledge the
contributions of Dr. Feroz Ali and Divakar Bhat in developing some of the methods discussed in
this monograph. The authors thank the publisher for accommodating our requests and supporting
the development of this monograph. We also thank our families for their support throughout this
endeavor.
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Image Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Image Saliency and Co-saliency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Basic Components of Co-segmentation . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Organization of the Monograph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.1 Co-segmentation of an Image Pair . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Robust Co-segmentation of Multiple Images . . . . . . . . . . . 16
1.4.3 Co-segmentation by Superpixel Classification . . . . . . . . . . 16
1.4.4 Co-segmentation by Graph Convolutional Neural
Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.5 Conditional Siamese Convolutional Network . . . . . . . . . . . 18
1.4.6 Co-segmentation in Few-Shot Setting . . . . . . . . . . . . . . . . . 18
2 Survey of Image Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Unsupervised Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Markov Random Field Model-Based Methods . . . . . . . . . 21
2.1.2 Saliency-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.3 Other Co-segmentation Methods . . . . . . . . . . . . . . . . . . . . . 23
2.2 Supervised Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Semi-supervised Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Deep Learning-Based Methods . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Co-segmentation Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Superpixel Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Two-class Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Multiclass Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Subgraph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Nonlinear Activation Functions . . . . . . . . . . . . . . . . . . . . . . 45


3.4.2 Pooling in CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


3.4.3 Regularization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.4 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.5 Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Graph Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.7 Few-shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 Maximum Common Subgraph Matching . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Co-segmentation for Two Images . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.1 Image as Attributed Region Adjacency Graph . . . . . . . . . . 60
4.2.2 Maximum Common Subgraph Computation . . . . . . . . . . . 62
4.2.3 Region Co-growing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4 Common Background Elimination . . . . . . . . . . . . . . . . . . . 71
4.3 Multiscale Image Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Extension to Co-segmentation of Multiple Images . . . . . . . . . . . . 81
5 Maximally Occurring Common Subgraph Matching . . . . . . . . . . . . . 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2.1 Mathematical Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2.2 Multi-image Co-segmentation Problem . . . . . . . . . . . . . . . 91
5.2.3 Overview of the Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Superpixel Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.1 Feature Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.2 Coarse-level Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Hole Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 Common Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.1 Latent Class Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4.2 Region Growing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.1 Quantitative and Qualitative Analysis . . . . . . . . . . . . . . . . . 110
5.5.2 Multiple Class Co-segmentation . . . . . . . . . . . . . . . . . . . . . 115
5.5.3 Computation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6 Co-segmentation Using a Classification Framework . . . . . . . . . . . . . . 123
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2 Co-segmentation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2.1 Mode Estimation in a Multidimensional
Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2.2 Discriminative Space for Co-segmentation . . . . . . . . . . . . . 130
6.2.3 Spatially Constrained Label Propagation . . . . . . . . . . . . . . 136

6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142


6.3.1 Quantitative and Qualitative Analyses . . . . . . . . . . . . . . . . . 142
6.3.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.3 Analysis of Discriminative Space . . . . . . . . . . . . . . . . . . . . 146
6.3.4 Computation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7 Co-segmentation Using Graph Convolutional Network . . . . . . . . . . . 151
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2 Co-segmentation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.2.1 Global Graph Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 Graph Convolution-Based Feature Computation . . . . . . . . . . . . . . 154
7.3.1 Graph Convolution Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.2 Analysis of Filter Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.4 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4.1 Network Training and Testing Strategy . . . . . . . . . . . . . . . . 160
7.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.5.1 Internet Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.5.2 PASCAL-VOC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8 Conditional Siamese Convolutional Network . . . . . . . . . . . . . . . . . . . . . 167
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.2 Co-segmentation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.2.1 Conditional Siamese Encoder-Decoder Network . . . . . . . . 171
8.2.2 Siamese Metric Learning Network . . . . . . . . . . . . . . . . . . . 173
8.2.3 Decision Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.2.4 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.2.5 Training Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.3.1 PASCAL-VOC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.3.2 Internet Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.3.3 MSRC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.3.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9 Few-shot Learning for Co-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.2 Co-segmentation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
9.2.1 Class Agnostic Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . 187
9.2.2 Directed Variational Inference Cross-Encoder . . . . . . . . . . 192
9.3 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.3.1 Encoder-Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.3.2 Channel Attention Module (ChAM) . . . . . . . . . . . . . . . . . . 194
9.3.3 Spatial Attention Module (SpAM) . . . . . . . . . . . . . . . . . . . . 194
9.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.4.1 PMF Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . 195
9.4.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9.4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
About the Authors

Avik Hati is currently an Assistant Professor at National Institute of Technology Tiruchirappalli, Tamilnadu. He received his B.Tech. Degree in Electronics and
Communication Engineering from Kalyani Government Engineering College, West
Bengal in 2010 and M.Tech. Degree in Electronics and Electrical Engineering from
the Indian Institute of Technology Guwahati in 2012. He received his Ph.D. degree
in Electrical Engineering from the Indian Institute of Technology Bombay in 2018.
He was a Postdoctoral Researcher at the Pattern Analysis and Computer Vision
Department of Istituto Italiano di Tecnologia, Genova, Italy. He was an Assistant
Professor at Dhirubhai Ambani Institute of Information and Communication Tech-
nology (DA-IICT), Gandhinagar from 2020 to 2022. He joined National Institute
of Technology Tiruchirappalli in 2022. His research interests include image and
video co-segmentation, subgraph matching, saliency detection, scene analysis, robust
computer vision, adversarial machine learning.

Rajbabu Velmurugan is a Professor in the Department of Electrical Engineering, Indian Institute of Technology Bombay. He received his Ph.D. in Electrical and
Computer Engineering from Georgia Institute of Technology, USA, in 2007. He was
with L&T, India, from 1995 to 1996 and with The MathWorks, USA, from 1998 to 2001. He
joined IIT Bombay in 2007. His research interests are broadly in signal processing,
inverse problems with application in image and audio processing such as blind
deconvolution and source separation, low-level image processing and video anal-
ysis, speech enhancement using multi-microphone arrays, and developing efficient
hardware systems for signal processing applications.

Sayan Banerjee received his B.Tech. degree in Electrical Engineering from the
West Bengal University of Technology, India, in 2012 and M.E. degree in Electrical
Engineering from Jadavpur University, Kolkata, in 2015. Currently, he is completing
doctoral studies at the Indian Institute of Technology Bombay. His research areas
include image processing, computer vision, and machine learning.


Prof. Subhasis Chaudhuri received his B.Tech. degree in Electronics and Electrical
Communication Engineering from the Indian Institute of Technology Kharagpur in
1985. He received his M.Sc. and Ph.D. degrees, both in Electrical Engineering, from
the University of Calgary, Canada, and the University of California, San Diego,
respectively. He joined the Department of Electrical Engineering at the Indian Insti-
tute of Technology Bombay, Mumbai, in 1990 as an Assistant Professor and is
currently serving as K. N. Bajaj Chair Professor and Director of the Institute. He
has also served as Head of the Department, Dean (International Relations), and
Deputy Director. He has also served as a Visiting Professor at the University of
Erlangen-Nuremberg, Technical University of Munich, University of Paris XI, Hong
Kong Baptist University, and National University of Singapore. He is a Fellow of
IEEE and the science and engineering academies in India. He is a Recipient of the
Dr. Vikram Sarabhai Research Award (2001), the Swarnajayanti Fellowship (2003),
the S. S. Bhatnagar Prize in engineering sciences (2004), GD Birla Award (2010),
and the ACCS Research Award (2021). He is Co-author of the books Depth from
Defocus: A Real Aperture Imaging Approach, Motion-Free Super-Resolution, Blind
Image Deconvolution: Methods and Convergence, and Kinesthetic Perception: A
Machine Learning Approach, all published by Springer, New York (NY). He is an
Associate Editor for the International Journal of Computer Vision. His primary areas
of research include image processing and computational haptics.
Acronyms

CNN Convolutional neural network


GCN Graph convolutional network
LCG Latent class graph
LDA Linear discriminant analysis
MCS Maximum common subgraph
MOCS Maximally occurring common subgraph
RAG Region adjacency graph
RCG Region co-growing
x Vector
X Matrix
I Image
n1 × n2 Image dimension
N Number of images
C Cluster
K Number of clusters
σ Standard deviation or sigmoid function (depending on context)
λ Eigenvalue
H Histogram
P Probability or positive set
p(·) Probability density function
χ Regularizer in cost function
t Threshold
s, r Superpixels or regions
f, x Feature vectors
d(·) Feature distance
S(·) Feature similarity function
S Feature similarity matrix
G = (V, E) Graph
V Set of nodes in a graph
u, v Nodes in a graph
E Set of edges in a graph


e Edge in a graph
H Subgraph
VH Set of nodes in a subgraph
Ḡ Set of graphs
W Product graph
UW Set of nodes in a product graph
N (·) Neighborhood
F Foreground object
O(·) Order of computation
Q Compactness
Q Scatter matrix
Y, L Label matrix
L Label
L Loss
ω Weights in a linear combination
Chapter 1
Introduction

Image segmentation is the problem of partitioning an image into several non-overlapping regions where every region represents a meaningful object or a part of the
scene captured by the image. The example image in Fig. 1.1a can be divided into three
coarse segments: one object (‘cow’) and two regions (‘field and water body’) shown
in Fig. 1.1b–d, respectively. This problem is well researched in the area of computer
vision and forms the backbone of several applications. Given low-level features, e.g.,
color and texture, it is very difficult to segment an image with complex structure into
meaningful objects or regions without any knowledge of high-level features, e.g.,
shape, size or category of objects present in the image. Unlike the example image in
Fig. 1.1a, the foreground and background regions in the example image of Fig. 1.1e
cannot be easily segmented from low-level features alone because the foreground
and background contain regions of different textures. Although the human visual system can easily do this segmentation, the lack of high-level information (e.g., the presence of a house) makes the image segmentation problem in computer vision very challenging. Existing image segmentation techniques can be classified into three categories,
which may be applied depending on the difficulty level of the task.
• Unsupervised segmentation: It is not aided by any additional information or prop-
erty of the image regions.
• Semi-supervised segmentation: Users input some information regarding different image segments in the form of (i) foreground/background scribbles or (ii) additional images containing regions with similar features.
• Fully supervised segmentation: A segmentation model is learned from the ground-
truth available with the training data.



Fig. 1.1 Image segmentation. a Input image and b–d the corresponding segmentation results. e
An image that is difficult to segment using only low-level features. Image Courtesy: Source images
from the MSRC dataset [105]

1.1 Image Co-segmentation

In this monograph, we discuss the image co-segmentation problem [78, 103] which is
the process of finding objects with similar features from a set of two or more images.
This is illustrated using an image pair in Fig. 1.2. With many media-sharing websites, a large volume of image data is now available to researchers over the internet. For example, several people often capture images of the same or a similar scene from different viewpoints or at different time instants and upload them. In such a scenario, co-
segmentation can be used to find the object of common interest. Figure 1.3 shows a set
of five images captured at different locations by different people containing a common
object 'tiger', and it is quite apparent that the photographers were interested in the 'tiger'. Given two images, an image similarity measure based on just global statistics may lead to a wrong conclusion if the image pair contains a similar background covering a large area

Fig. 1.2 Co-segmentation of an image pair. a, b Input image pair and c, d the common object in
them. Image Courtesy: Source images from the image pair dataset [69]

but small, unrelated objects of interest. Detection of co-segmented objects gives
a better measure of image similarity. Co-segmentation can also be used to discard
images that do not contain co-occurring objects from a group of images for database
pruning. Hence, co-segmentation has several applications, and it has attracted much
attention. This problem can be solved using either completely unsupervised or fully
supervised techniques. Methods from both categories work with the assumption that
it is known that every image in the image set contains at least one object of a common
class. In this context, it is worth mentioning that we may need to co-segment an image set where only a majority of the images contain the common object and the
rest need not. This is a challenging problem, and this monograph discusses several
approaches to solve this problem. It may be noted that in co-segmentation, we extract
the common object without knowledge of its class information, i.e., co-segmentation does not perform pattern recognition. For the image set in Fig. 1.3, co-segmentation yields a set
of binary masks that can be used to extract the common object (‘tiger’) from the
respective images. But, it does not necessarily have to recognize the common object
as a ‘tiger’. Thus, co-segmentation can be used as a preprocessing step for object
recognition.
The co-segmentation problem is not limited to finding only a single common
object. It is quite common for the image set to be co-segmented to have multiple
common objects. Figure 1.4 shows an example of an image set containing two com-

Fig. 1.3 Co-segmentation of more than two images. Column 1: Images retrieved by a child from
the internet, when asked to provide pictures of a tiger. Column 2: The common object quite apparent
from the given set of images. Image Courtesy: Source images from the internet

mon objects ‘kite’ and ‘bear’ present in different images. Thus not having to identify
the class of object(s) provides a greater flexibility in designing a general-purpose co-
segmentation scheme. Moreover, the different common objects need not be present
in all the images. Further, the image subsets containing different common objects

Fig. 1.4 Multiple class co-segmentation. Column 1 shows a set of six images that includes images
from two classes: ‘kite’ and ‘bear’. Column 2 shows the common object in ‘kite’ class. Column 3
shows the common object in ‘bear’ class. Image Courtesy: Source images from the iCoseg dataset
[8]

can be overlapping. The example in Fig. 1.5 shows a set of four images containing
three common objects ‘butterfly’ (present in all images), ‘red flower’ (present in two
images) and ‘pink flower’ (present in two images). The co-segmentation results can
be used in image classification, object recognition, image annotation etc., justifying
the importance of image co-segmentation in computer vision.

Fig. 1.5 Co-segmentation where multiple common objects are present in overlapping image sub-
sets. Row 1 shows a set of four images. Row 2 shows common object ‘butterfly’ present in all
images. Rows 3,4 show ‘red flower’ and ‘pink flower’ present in Images-1, 2 and Images-3, 4,
respectively. Image Courtesy: Source images from the FlickrMFC dataset [59]

Given an image, humans are more attentive to certain objects present in the image.
Hence in most applications of co-segmentation, we are interested in finding the com-
mon object rather than common background regions in different images. Figure 1.6
shows a set of four images that contain a common foreground (‘cow’) as well as
common background regions (‘field’ and ‘water body’). In this monograph, co-
segmentation is restricted only to common foreground objects (‘cow’) while ‘field’
and ‘water body’ are ignored. It may be noted that detection of common background
regions also has some, although limited, applications including semantic segmenta-
tion [3], scene understanding [44].
It is worth mentioning that a related problem is image co-saliency [16], which
measures the saliency of co-occurring objects in an image set. There one can make
use of computer vision algorithms that are able to detect attentive objects in images.
We briefly describe image saliency detection [14] next.

Fig. 1.6 Example of an image set containing common foreground (‘cow’) as well as common
background regions (‘field’ and ‘water body’). Image Courtesy: Source images from the MSRC
dataset [105]


Fig. 1.7 Salient object detection. a Input image and b the corresponding salient object shown in
the image and c the cropped salient object through a tightly fitting bounding box. Image Courtesy:
Source image from the MSRA dataset [26]

1.2 Image Saliency and Co-saliency

The human visual system can easily locate objects of interest and identify important information in large volumes of real-world data from complex scenes or cluttered backgrounds by interpreting them in real time. The mechanism behind this remarkable ability of the human visual system is researched in neuroscience and psychology, and
saliency detection is used to study this. Saliency is a measure of importance of
objects or regions in an image or important event in a video scene that capture
our attention. The salient regions in an image are different from the rest of the
image in certain features (e.g., color, spatial or frequency). Saliency of a patch in an

Fig. 1.8 Difference between salient object and foreground object. a Input image [105], b foreground objects and c the salient object. Image Courtesy: Source image from the MSRC dataset [105]

image depends on its uniqueness with respect to other patches in the image i.e., rare
patches (in terms of features) are likely to be more salient than frequently occurring
patches. In the image shown in Fig. 1.7a, the ‘cow’ is distinct among the surrounding
background (‘field’) in terms of color and shape features, hence it is the salient
object (Fig. 1.7b). It is important to distinguish between salient object extraction and
foreground segmentation from an image. In Fig. 1.8a, both the ‘ducks’ are foreground
objects (Fig. 1.8b), but only the ‘white duck’ is the salient object (Fig. 1.8c) since the
color of the ‘black duck’ is quite similar to ‘water’ unlike the ‘white duck’. Moreover,
the salient object need not always be distinct only from the background. In Fig. 1.9,
‘red apple’ is the salient object which stands out from the rest of the ‘green apples’.
Hence, a salient object, in principle, can also be detected from a set of objects of the
same class. It is also possible that an image contains more than one salient object.
For example, in the image shown in Fig. 1.10, there are four salient objects (‘bowling
pins’). Here, the four objects may not be equally salient and some may be more salient
than others. Hence, we need to formulate a mathematical definition of saliency [97,
120, 151] and assign a saliency value to every object or pixel in an image to find its
relative attention. This, however, is beyond the scope of this book.
Analogous to the enormous amount of incoming information from the retina that must be processed by the human visual system, a large number of images are publicly available for use in many real-time computer vision algorithms that also have high computational complexity. Saliency detection methods
offer efficient solutions by identifying important regions in an image so that oper-
ations can be performed only on those regions. This reduces complexity in many
image and vision applications that work with large image databases and long video
sequences. For example, saliency detection can aid in video summarization [52,
79], image segmentation and co-segmentation [19, 43, 45, 90], content-based image


Fig. 1.9 Salient object detection among a set of objects of the same class. a Input image and b the
corresponding salient object. Image Courtesy: Source image from the HFT dataset [70]


Fig. 1.10 Multiple salient object detection. a Input image and b the corresponding salient objects.
Image Courtesy: Source image from the MSRA dataset [26]

retrieval [23, 76], image classification [58, 112], object detection and recognition [93,
109, 113, 133], photo collage [42, 137], dominant color detection [140], person re-
identification [149], advertisement design [83, 101] and image editing for content-
aware image compression, retargeting, resizing, image cropping/thumbnailing (see
Fig. 1.7c), and adaptive image display on mobile devices [2, 37, 47, 82, 94, 96, 122].
Since saliency detection can be used to extract highly attentive objects from an
image, we may make use of it to find the common object(s) from multiple images.
This is evident from the example image set in Fig. 1.11 where the common object
‘balloon’ is salient in all the images and it can be extracted through saliency value
matching across images. This extraction of common and salient objects is known as
image co-saliency. But this scenario is limited in practice. For example in Fig. 1.12,
although ‘dog’ is the common object, it is not salient in Image 3. Here, saliency alone
will fail to detect the common object (‘dog’) in all the images. Hence, saliency is not
always suitable for robust co-segmentation. Therefore, we have restricted the discussions
in this monograph to image co-segmentation methods that do not use image saliency.
In Chap. 10, we will discuss this with more examples. We also show that the image

sets in Figs. 1.11 and 1.12 can be co-segmented through feature matching without
using saliency in Chaps. 6 and 4, respectively.

1.3 Basic Components of Co-segmentation

In this monograph, we will describe co-segmentation methods in both unsupervised and supervised settings. While all the unsupervised methods discussed in the mono-
graph utilize image superpixels, the supervised methods may use either superpixels
or pixels. Further, graph-based representation of images forms the basis of some of
the supervised and unsupervised frameworks discussed. Some methods are based on
classification of image pixels or superpixels. Next, we briefly describe these repre-
sentations for a better understanding of the problem being discussed.
Superpixels: Since the images to be co-segmented may have a large size, there
is a high computational complexity associated with pixel-based operations. Hence
to reduce the complexity, it is efficient to decompose every image into groups of
pixels, and perform operations on them. A common basic image primitive used
by researchers is the rectangular patch. But, pixels within a rectangular patch may
belong to multiple objects and may not be similar in features, e.g., color. So, the
natural patch, called superpixel, is a good choice to represent image primitives.
Unlike rectangular patches, use of superpixels helps to retain the shape of an object
boundary. The simple linear iterative clustering (SLIC) [1] is an accurate method to
oversegment the input images into non-overlapping superpixels (see Fig. 1.13a, b).
Each superpixel contains pixels from a single object or region and is homogeneous
in color. Typically, superpixels are attributed with appropriate features for further
processing (Chaps. 4, 5, 6 and 7).
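As a concrete illustration of this representation, the short Python sketch below oversegments an image into SLIC superpixels with scikit-image and attaches a simple mean-color attribute to each superpixel. This is only a minimal example of the idea; the file name and the feature choice are placeholders, not the features used in later chapters.

```python
# Minimal sketch: SLIC superpixels with a mean-color attribute per superpixel.
# Assumes scikit-image and numpy are installed; "input.jpg" is a placeholder path.
import numpy as np
from skimage import io, segmentation

image = io.imread("input.jpg")                      # H x W x 3 RGB image
labels = segmentation.slic(image, n_segments=200,   # label map of superpixels
                           compactness=10, start_label=0)

# Attribute every superpixel with its mean RGB color (one possible feature).
features = np.zeros((labels.max() + 1, 3))
for s in range(labels.max() + 1):
    features[s] = image[labels == s].mean(axis=0)
print(labels.shape, features.shape)
```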
Region adjacency graphs (RAG): A global representation of an image can be
obtained from the information contained in the local image structure using some
local conditional density function. Since the rectangular patches are row and column
ordered, one could have defined a random field, e.g., Markov random field. But here,
we choose superpixels as image primitives. As no natural ordering can be specified
for superpixels, a random field model cannot be defined on the image lattice. Graph
representations allow us to handle this kind of superpixel-neighborhood relation-
ship (see Fig. 1.13c). Hence, in this monograph, we use graph based approaches,
among others, for co-segmentation. In a graph representation of an image, each node
corresponds to an image superpixel. A node pair is connected using an edge if the
corresponding superpixels are spatially adjacent to each other in the image spaces.
Hence, this is called a region adjacency graph (G ). Detailed explanation of graph
formulation is provided later. Chapters 4, 5 and 7 of this monograph describe graph-
based methods, both unsupervised and supervised, where an object is represented as
a subgraph (H) of the graph representing the image (see Fig. 1.13d, e).
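To make the graph formulation concrete, the following sketch builds a region adjacency graph from a superpixel label map using networkx, connecting two superpixels whenever their labels touch horizontally or vertically. This is an illustrative construction under that adjacency assumption, not the exact attributed-graph definition used in Chaps. 4, 5 and 7.

```python
# Minimal sketch: build a region adjacency graph (RAG) from a superpixel label map.
# Assumes numpy and networkx; `labels` is an H x W integer array (e.g., SLIC output).
import numpy as np
import networkx as nx

def build_rag(labels):
    G = nx.Graph()
    G.add_nodes_from(np.unique(labels))
    # Horizontally and vertically adjacent pixels with different labels
    # imply that the two superpixels share a boundary.
    right = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    down = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.vstack([right, down]):
        if a != b:
            G.add_edge(int(a), int(b))
    return G

# Example: a toy 2 x 4 label map with three superpixels.
toy = np.array([[0, 0, 1, 1],
                [0, 2, 2, 1]])
print(sorted(build_rag(toy).edges()))   # [(0, 1), (0, 2), (1, 2)]
```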
Convolutional neural networks (CNN): Over the past decade, several deep
learning-based methods have been deployed for a range of computer vision tasks
including visual recognition and image segmentation [108]. The basic unit in such

Fig. 1.11 Saliency detection on an image set (shown in Column 1) with the common foreground
(‘balloon’, shown in Column 2) being salient in all the images. Image Courtesy: Source images
from the iCoseg dataset [8]

Fig. 1.12 Saliency detection on an image set (shown in top row) where the common foreground
(‘dog’, shown in bottom row) is not salient in all the images. Image Courtesy: Source images from
the MSRC dataset [105]

deep learning architectures is the CNN which learns to obtain semantic features from
images. Given a sufficiently large dataset of labeled images for training, these learned
features have been shown to outperform hand-crafted features used in the unsuper-
vised methods. We will demonstrate this in Chaps. 7, 8 where co-segmentation is
performed using a model learned utilizing labeled object masks. Further, among
the deep learning-based co-segmentation methods described in this monograph, the
method in Chap. 9 focuses on utilizing lesser amount of training data to mimic several
practical situations where sufficient amount of labeled data is not available.

1.3.1 The Problem

The primary objectives of this monograph are to


• design computationally efficient image co-segmentation algorithms so that they
can be used on image sets of large cardinality, without compromising on accuracy,
and
• ensure robustness of the co-segmentation algorithm in the presence of outlier
images (in the image set) which do not contain the common object present in the
majority of the images.

Fig. 1.13 Graph representation of an image. a Input image, b its superpixel segmentation and c the
corresponding region adjacency graph whose nodes are drawn at the centroids of the superpixels. d
The subgraph representing e the segmented object (‘flower’). Image Courtesy: Source image from
the HFT dataset [70]

Since the common object present in multiple images is constituted by a set of pix-
els (or superpixels), there must be high feature similarities among these pixels (or
superpixels) from all the images, and hence, we need to find matches. Since we are
working with natural images, this poses three challenges apart from high computa-
tions associated with finding correspondences.
• The common object in the images may have different pose,
• they may have different sizes (see Fig. 1.14) and

• they may have been captured by different cameras under different illumination
conditions.
The co-segmentation methods described in this monograph aim to overcome these
challenges.

1.4 Organization of the Monograph

In this monograph, we describe three unsupervised and three supervised methods for image co-segmentation. The monograph is organized as follows. In Chaps. 2
and 3, we describe existing works on co-segmentation and provide mathematical
background on the tools that will be used in co-segmentation methods described
in subsequent chapters. Then in Chap. 4, we describe an image co-segmentation
algorithm for image pairs using maximum common subgraph matching and region
co-growing. In Chap. 5, we explain a computationally efficient unsupervised image
co-segmentation method for multiple images using a concept called latent class
graph. In Chap. 6, we demonstrate a solution of the image co-segmentation problem
in a classification setup, although in an unsupervised manner, using discriminant
feature-based label propagation. We describe a graph convolutional network (GCN)-
based co-segmentation in Chap. 7, as first of the supervised methods. In Chap. 8,
we describe a siamese network to do co-segmentation of a pair of images. More
recently, supervised methods are trying to use fewer data during training, and few-
shot learning is one such approach. A co-segmentation method under the few-shot
setting is discussed in Chap. 9. Finally in Chap. 10, we conclude with discussions
and possible future directions in image co-segmentation. We next provide a brief
overview of the above six co-segmentation methods.

1.4.1 Co-segmentation of an Image Pair

To co-segment an image pair using graph-based approaches, we need to find pairwise correspondences among nodes of the corresponding graphs, i.e., superpixels across
the image pair. Since the common object present in the image pair may have different
pose and different size, we may have one-to-one, one-to-many or many-to-many
correspondences. In Chap. 4, we describe a method where
• the maximum common subgraph (MCS) of the graph pair (G1 and G2) is first computed, which provides an approximate result that detects the common objects partially (a minimal product-graph sketch of this step follows this list).
• A region co-growing method can simultaneously grow the seed superpixels (i.e.,
nodes in the MCS) in both images to obtain the common objects completely.
• A progressive method by combining the two stages, MCS computation and region
co-growing, significantly improves the computation time.
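As referenced in the first bullet above, one common way to approximate MCS computation between two attributed graphs is to build an association (product) graph whose nodes are compatible node pairs and whose cliques correspond to mutually consistent matches, and then take a largest clique. The sketch below follows this idea; the node attribute and compatibility test are hypothetical placeholders, and this is not the exact algorithm of Chap. 4.

```python
# Minimal sketch: approximate maximum common subgraph (MCS) matching of two
# attributed graphs via an association (product) graph and clique search.
# Assumes networkx; the "color" attribute and the compatibility test are
# hypothetical stand-ins for the superpixel features used in Chap. 4.
import itertools
import networkx as nx

def mcs_matches(G1, G2, compatible):
    W = nx.Graph()  # association graph: nodes are candidate correspondences
    for u, v in itertools.product(G1.nodes, G2.nodes):
        if compatible(G1.nodes[u], G2.nodes[v]):
            W.add_node((u, v))
    for (u1, v1), (u2, v2) in itertools.combinations(W.nodes, 2):
        if u1 != u2 and v1 != v2 and \
           G1.has_edge(u1, u2) == G2.has_edge(v1, v2):
            W.add_edge((u1, v1), (u2, v2))  # structurally consistent pair
    # A largest clique of W gives a set of mutually consistent matches.
    return max(nx.find_cliques(W), key=len, default=[])

G1 = nx.path_graph(3)
G2 = nx.path_graph(4)
for G in (G1, G2):
    nx.set_node_attributes(G, "green", "color")
same_color = lambda a, b: a["color"] == b["color"]
print(mcs_matches(G1, G2, same_color))  # e.g. three matched node pairs
```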

Fig. 1.14 Co-segmentation of image pairs with the common objects having different size and pose.
a, b and e, f two input image pairs, and c, d and g, h the corresponding common objects, respectively.
Image Courtesy: Source images from the image pair dataset [69]

1.4.2 Robust Co-segmentation of Multiple Images

To co-segment a set of N (> 2) images (see Fig. 1.3 for example), finding MCS of
N graphs involves very high computations. Moreover, the node correspondences
obtained from this MCS must lead to consistent matching of corresponding super-
pixels across the image set, and this is very difficult. Further, it is quite common
for the image set to contain a few outlier images that do not share the common object
present in majority of the images (see Fig. 1.15). This makes co-segmentation even
more challenging. In Chap. 5, we describe an efficient method that can handle these
challenges.
• First a latent class graph (LCG) is built by combining all the graphs Gi . In par-
ticular, we need to compute pairwise MCS sequentially until all graphs have been
included. This LCG (H L ) contains information of all graphs and its cardinality is
limited by

\[
|H_L| = \sum_i |H_i| \;-\; \sum_i \sum_{j>i} \bigl|\mathrm{MCS}(H_i, H_j)\bigr| \;+\; \sum_i \sum_{j>i} \sum_{k>j} \bigl|\mathrm{MCS}(H_i, H_j, H_k)\bigr| \;-\; \cdots \tag{1.1}
\]
where Hi ⊆ Gi is obtained through joint clustering of all image superpixels (see
Chap. 5).
• A maximally occurring common subgraph (MOCS) matching algorithm finds the
common object completely by using the LCG as a reference graph.
• We show in Chap. 5 that MOCS can handle the problem of outliers and reduce
the required number of graph matchings from O(N 2^{N-1}) to O(N). We also show
that this formulation can perform multiclass co-segmentation (see Fig. 1.4 for
example).

1.4.3 Co-segmentation by Superpixel Classification

Crowd-sourced images are generally captured under different camera illumination and compositional context. These variations make feature selection difficult, which in turn makes it very difficult to extract the common object(s) accurately. Hence, discriminative features that better distinguish between the background and the common foreground are required. In Chap. 6, co-segmentation is formulated as a foreground–
background classification problem where superpixels belonging to the common
object across images are labeled as foreground and the remaining superpixels are
labeled as background in an unsupervised manner.
• First a novel statistical mode detection method is used to initially label a set of
superpixels as foreground and background.
• Using the initially labeled superpixels as seeds, labels of the remaining superpixels
are obtained, thus finding the common object completely. This is achieved through
a novel feature iterated label propagation technique.
1.4 Organization of the Monograph 17

Fig. 1.15 Multi-image co-segmentation in the presence of outlier images. Top block shows a set of
six images that includes an outlier image (image-5). Bottom block shows that the common object
(‘kite’) is present only in five of the six images. Image Courtesy: Source images from the iCoseg
dataset [8]

• There may be some feature variation among the superpixels belonging to the common object, and this may lead to incorrect labeling. Hence, a modified linear discriminant analysis (LDA) is designed to compute discriminative features that significantly improve the labeling accuracy (a sketch of standard LDA follows this list).
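As referenced in the last bullet, the sketch below runs standard LDA from scikit-learn on synthetic two-class superpixel features to illustrate what a discriminative projection looks like; the modified LDA of Chap. 6 differs from this off-the-shelf version, so treat it purely as an illustration of the idea of a discriminative feature space.

```python
# Minimal sketch: a discriminative feature space via standard LDA (scikit-learn).
# The data below is synthetic; Chap. 6 uses a modified LDA, not this exact recipe.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic superpixel features: two classes (background = 0, foreground = 1).
X_bg = rng.normal(loc=0.0, scale=1.0, size=(100, 5))
X_fg = rng.normal(loc=1.5, scale=1.0, size=(100, 5))
X = np.vstack([X_bg, X_fg])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(n_components=1)  # at most C - 1 components
Z = lda.fit_transform(X, y)          # discriminative 1-D projection
print(Z.shape)                       # (200, 1)
print(lda.score(X, y))               # training accuracy of the induced classifier
```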

1.4.4 Co-segmentation by Graph Convolutional Neural Network

Notwithstanding the usefulness of different unsupervised co-segmentation algorithms, the effectiveness of these approaches is reliant on appropriate choice of
hand-crafted features for the task. When sufficient annotated data is available, how-
ever, we can compute learned features using deep learning methods which do away
with manual feature selection. Hence, we next explore co-segmentation methods
based on deep neural networks. In Chap. 7, we discuss an end-to-end foreground–
background classification framework using a graph convolutional neural network.
• In this framework, each image pair is jointly represented as a weighted graph by
exploiting both intra-image and inter-image feature similarities.
• The model then uses graph convolution operations to learn features as well as classify each superpixel into the common foreground or the background class, thus achieving co-segmentation (a minimal sketch of the propagation rule follows this list).
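As referenced in the second bullet, the sketch below implements the widely used graph convolution propagation rule H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W) on a toy graph with numpy. It shows only the standard building block; the filters, architecture and training used in Chap. 7 are described there.

```python
# Minimal sketch: one graph convolution layer, H' = relu(D^-1/2 (A+I) D^-1/2 H W).
# Toy adjacency and random weights; not the exact filters used in Chap. 7.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)     # 4-node path graph
H = rng.normal(size=(4, 8))                   # input node features
W = rng.normal(size=(8, 2))                   # learnable weights (fixed here)
print(gcn_layer(A, H, W).shape)               # (4, 2): per-node output features
```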

1.4.5 Conditional Siamese Convolutional Network

In Chap. 8, we shift to a framework based on standard convolutional neural networks that directly works on image pixels instead of superpixels. It consists of a metric learning network, a decision network and a conditional siamese encoder-decoder network.
• The metric learning network’s job is to identify an optimal latent feature space in
which samples of the same class are closer together and those of different classes
are separated.
• The encoder-decoder network estimates the co-segmentation masks for the image pair (a weight-sharing sketch follows this list).
• The decision network determines whether input images include common objects
or not, based on the extracted characteristics. As a result, the model can handle
outliers in the input set.
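As referenced in the second bullet, the PyTorch sketch below shows the weight-sharing idea behind a siamese encoder: one shared convolutional module processes both images of a pair so that their features lie in a common space. The layer sizes are arbitrary, and the conditional decoder, metric learning network and decision network of Chap. 8 are not reproduced here.

```python
# Minimal sketch: a siamese encoder applying one shared CNN to both images.
# Layer sizes are arbitrary; Chap. 8 adds a conditional decoder, a metric
# learning network and a decision network on top of such shared features.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # shared weights for both inputs
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )

    def forward(self, img_a, img_b):
        feat_a = self.encoder(img_a)             # same module, hence same weights
        feat_b = self.encoder(img_b)
        return feat_a, feat_b

model = SiameseEncoder()
a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
fa, fb = model(a, b)
print(fa.shape, fb.shape)                        # torch.Size([1, 32, 8, 8]) each
```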

1.4.6 Co-segmentation in Few-Shot Setting

Fully supervised methods perform well when a large training dataset is available. How-
ever, collecting sufficient training samples is not always easy, and for some tasks,
it may be almost impossible to achieve. Hence, a framework for multi-image co-
segmentation that uses a meta-learning technique is required in such scenarios, and
is discussed in Chap. 9.
• We discuss a directed variational inference cross encoder, which is an encoder-
decoder network that learns a continuous embedding space to provide superior
similarity learning.

• It is a class agnostic technique that can generalize to new classes with only a
limited number of training samples.
• To address the limited sample size problem in co-segmentation with small datasets
like iCoseg and MSRC, a few-shot learning strategy is also discussed.
Having introduced the problem of co-segmentation and a brief description of
approaches, in the next chapter we review the related literature on image co-
segmentation and relevant datasets.
Chapter 2
Survey of Image Co-segmentation

In this chapter, we first review the literature related to unsupervised image co-
segmentation. Then we review available supervised co-segmentation methods.

2.1 Unsupervised Co-segmentation

Image co-segmentation methods aim to extract the common object present in more
than one image by simultaneously segmenting all the images. The co-segmentation
problem was first explored by Rother et al. [103], who considered the case of
two images, and it was followed by the methods in [49, 88, 130]. Subsequently,
researchers have actively worked on co-segmentation of more than two images [78,
105] because of its many practical applications. Recently, the focus has shifted to
multiple class co-segmentation. Some of these works include the methods in [18,
56, 57, 59, 60, 78, 136] that jointly segment the images into multiple classes to find
common foreground objects of different classes.

2.1.1 Markov Random Field Model-Based Methods

Early methods in [49, 88, 103, 130] provide a solution for co-segmentation of two-
images by histogram matching in a Markov random field (MRF) model-based energy
minimization framework. These methods extend the MRF model-based single-image
segmentation technique [15] to co-segmentation. The energy function to be mini-
mized can be written as:

\[
E_t(\mathbf{y}) = \sum_{k=1}^{2} \Biggl( \sum_{i \in I_k} E_u(y_i) + \sum_{(i \in I_k,\; j \in \mathcal{N}(i))} E_p(y_i, y_j) \Biggr) + \chi_h \, E_h(H_1, H_2), \tag{2.1}
\]

where yi ∈ {0, 1} is the unknown label (0 for common foreground and 1 for back-
ground) of pixel-i, N (·) denotes neighborhood, χh is a weight and H1 , H2 are the
histograms of the common foreground region in the image pair. The first two terms in
Eq. (2.1) together correspond to standard MRF model-based single-image segmen-
tation cost function: E u (yi ) is the unary term that can be computed using histogram
or foreground–background Gaussian mixture model, and

\[
E_p(y_i, y_j) = \mathcal{S}(i, j)\, |y_i - y_j| \tag{2.2}
\]

is the pairwise term that ensures smoothness within and distinction between the
segmented foreground and background by feature similarity S (·). The third term
E h (·) in Eq. (2.1) measures the histogram distance of the unknown common object
regions in the image pair, and it is responsible for inter-image region matching.
Rother et al. [103] used the L1-norm of the histogram difference to compute E_h(·), and proposed an approximation method called the submodular-supermodular procedure since optimizing a cost function with the L1-norm is difficult, whereas Mukherjee et al. [88] replaced the L1-norm by the L2-norm for approximation. However, the optimiza-
tion problem in both methods is computationally intensive. Hochbaum and Singh [49]
rewarded foreground histogram consistency by using inner product to measure E h (·),
instead of minimizing foreground histogram difference to simplify the optimization.
Moreover, prior information about foreground and background colors has been used
in [49, 88] to compute the unary term E u (·), whereas Vicente et al. [130] ignored the
unary term. Instead, they considered separate models for the common foreground,
background-1 and background-2, and added a constraint that all pixels belonging
to a histogram bin must have the same label. The methods in [49, 88, 103, 130] perform well only for common objects with highly similar appearance on different backgrounds.
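To make the form of Eqs. (2.1) and (2.2) concrete, the sketch below evaluates such an energy for a given binary labeling of two small grayscale images, using normalized intensity histograms for the unary term and an L1 histogram distance for E_h. The histogram size, the similarity S(i, j) and the weight are illustrative choices, and the actual methods minimize an energy of this type (e.g., via graph cuts) rather than merely evaluating it.

```python
# Minimal sketch: evaluating an MRF co-segmentation energy of the form (2.1)
# for a candidate labeling of two grayscale images. Histogram size, the
# similarity S(i, j) and the weight chi_h are illustrative choices only.
import numpy as np

def histogram(img, mask, bins=16):
    h, _ = np.histogram(img[mask], bins=bins, range=(0.0, 1.0))
    return h / max(h.sum(), 1)

def energy(images, labels, chi_h=1.0, bins=16):
    total = 0.0
    fg_hists = []
    for img, lab in zip(images, labels):          # sum over the two images
        fg, bg = histogram(img, lab == 0, bins), histogram(img, lab == 1, bins)
        b = np.minimum((img * bins).astype(int), bins - 1)
        # Unary term: negative log-likelihood under the fg/bg histograms.
        p = np.where(lab == 0, fg[b], bg[b])
        total += -np.log(p + 1e-8).sum()
        # Pairwise term: S(i, j) * |y_i - y_j| over 4-connected neighbors,
        # with S(i, j) = exp(-(I_i - I_j)^2) as one simple similarity.
        for d_img, d_lab in ((np.diff(img, axis=0), np.diff(lab, axis=0)),
                             (np.diff(img, axis=1), np.diff(lab, axis=1))):
            total += (np.exp(-d_img ** 2) * np.abs(d_lab)).sum()
        fg_hists.append(fg)
    # Inter-image term E_h: L1 distance between foreground histograms.
    return total + chi_h * np.abs(fg_hists[0] - fg_hists[1]).sum()

rng = np.random.default_rng(0)
imgs = [rng.random((32, 32)) for _ in range(2)]
labs = [(img > 0.5).astype(int) for img in imgs]   # a candidate labeling
print(energy(imgs, labs))
```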

2.1.2 Saliency-Based Methods

The methods in [22, 54, 63, 105, 139] first compute image saliency, and use it
as a key component in their co-segmentation methods. Rubinstein et al. [105] use
salient pixels in the images as seeds to find inter-image pixel correspondences using
SIFT-flow. These are used to initialize an iterative algorithm for optimization of
a cost function involving correspondence and histogram similarity and match the
common regions across images. The clustering-based method in [63] can extract
only the salient foregrounds from multiple images. However, it should be noted that
the common object may not always be salient in all constituent images.

Recently, co-saliency-based methods [16, 19, 21, 39, 69, 73, 77, 124, 126] have
also been used for co-segmentation. These methods detect common, salient objects
from the image set. Typically, most methods [19, 69] define co-saliency of a region
or superpixel in an image as a combination of its single-image saliency value and the
average of its highest feature similarities with the regions in the remaining images.
The method in [19] extracts the co-salient object through MRF model-based labeling
using salient pixels for label initialization. Cao et al. [16] combine outputs of multiple
saliency detection methods. The weight for a method is computed from the histogram
vectors of salient regions detected by that method in all images. Since dependent
vectors indicate that the corresponding regions are co-salient, weight is inversely
proportional to the low-rank approximation error of the combined histogram matrix.
Tsai et al. [126] jointly compute co-saliency values and co-segmentation labels of
superpixels. They build a combined graph of all superpixels in the image set, and
optimize a cost similar to Eq. (2.1) to obtain the superpixel labels. Co-saliency of each
superpixel is obtained as a weighted sum of an ensemble of single-image saliency
maps, and the weights are learned using the unary term. The cost also includes a cou-
pling term to make the co-saliency and co-segmentation results coherent so that a
common object superpixel has a high co-saliency value and vice-versa. Liu et al. [77]
hierarchically segment all the images. Coarse-level segments are used to compute
object prior, and fine-level segments are used to compute saliency. Co-saliency is
computed using global feature similarity and saliency similarity among fine-level
segments. Tan et al. [124] used a bipartite graph to compute feature similarity. Fu et
al. [39] cluster all the pixels of all images into a certain number of clusters, and find
saliency of every cluster using center bias of clusters, distribution of image pixels in
every cluster and inter-cluster feature distances. Co-saliency methods can detect the
common object from an image set only if they are highly salient in the respective
images. Since most image sets to be co-segmented do not satisfy this criterion (exam-
ples shown in Chap. 1 and more examples to be shown in Chap. 10), co-saliency
can be applied to a limited number of image sets and it cannot be generalized for all
kinds of datasets. Hence, we use co-segmentation for detection of common objects
in this monograph.

2.1.3 Other Co-segmentation Methods

Joulin et al. [56] formulated co-segmentation as a two-class clustering problem using


a discriminative clustering method. They extended this work for multiple classes
in [57] by incorporating spectral clustering. As their kernel matrix is defined for all
possible pixel pairs of all images, the complexity goes up rapidly with the number
of images. Kim et al. [60] used anisotropic diffusion to optimize the number and
location of image segments. As all the images are segmented into an equal num-
ber of clusters, oversegmentation may become an issue in a set of different types
of images. Furthermore, this method cannot co-segment heterogeneous objects. An
improvement to this method has been proposed in [59] using supervision such as
bounding box or pixelwise annotations for different foreground classes in some
selected images of the set. However, the methods in [56, 57, 60] cannot determine
the common object class automatically, and this has to be selected manually. The
scale invariant co-segmentation method in [89] solves co-segmentation for different
sized common foreground objects where instead of minimizing the distance between
histograms of common foregrounds, it constrains them to have low entropy and to
be linearly dependent. Lee et al. [64] and Collins et al. [29] employed random walk
for co-segmentation. The idea is to perform a random walk from each image pixel
to a set of user specified seed points (foreground and background). The walk being
biased by image intensity gradients, each pixel is labeled as foreground if the pixel-
specific walk reaches a foreground seed first, and vice-versa for a background label.
Tao et al. [125] use shape consistency among common foreground regions as a cue
for co-segmentation. But changes in pose and viewpoint, which are quite common
for natural images, inherently result in changes in the shape of the common object,
thus making the method invalid in such cases. The method in [136] performs pair-
wise image matching, resulting in high computational complexity. Being a pairwise
method, it does not produce consistent co-segmentation across the entire dataset, and
requires further optimization techniques to ensure consistency.
Meng et al. [84] and Chen et al. [22] split the input image set into simple and
complex subsets. Both the subsets contain a common object of same class. But,
co-segmentation from the simple subset (foreground and background are homoge-
neous and they are well separated) is easier compared to the complex subset (fore-
ground is not homogeneous and background is cluttered, and there is less contrast
between them). Subsequently, they use results obtained from the simple subset for co-
segmentation of the complex one. In their co-segmentation energy function, Meng
et al. [84] split the common foreground feature distance/similarity term E h (·) of
Eq. (2.1) into two components: similarity between region pairs present in the same
image group and across different image groups. This allows matching between only a few inter-group region pairs, instead of forcing matching between all pairs. Chen et
al. [22] assume that the salient foregrounds in the simple image set are well sep-
arated from the background, and the well-segmented object masks are used as a
segmentation prior in order to segment more difficult images. Li et al. [68] proposed
to improve co-segmentation results of existing methods by repairing bad segments
using information propagation from good segments as a post-processing step.
Sun and Ponce [123] proposed to learn discriminative part detectors for each class
using a support vector machine model. The category label information of images is
used as the supervision, and the learned part detectors of each class discriminate that
class from the background. Then, co-segmentation is performed by feeding the object cue obtained from the part detectors into the discriminative clustering framework of [56].
Rubio et al. [106] first compute objectness score of pixels and image segments, and
use them to form the unary term of Eq. (2.1), and incorporate an energy term in
E t (·) that considers the similarity values of every two inter-image matched region
pairs. In order to obtain high-level features, semi-supervised methods in [71, 74,
85, 86, 131, 146] compute region proposals from images using pretrained networks,
whereas Quan et al. [99] use CNN features. The graph-based method in [85] includes
high-level information like object detection, which is also a complex problem. They
construct a directed graph by representing the computed local image regions (gen-
erated using saliency and object detection) as nodes, and sequentially connecting
edges among nodes of consecutive images. Then the common object is detected as
the set of nodes on the shortest path in the graph. Vicente et al. [131] used proposal
object segmentations to train a random forest regressor for co-segmentation. This
method relies heavily on the accuracy of individual segmentation outputs as it is
assumed that one segment contains the complete object. Li et al. [71] extended the
proposal selection-based co-segmentation methods in [85, 131] by a self-adaptive
feature selection strategy. Quan et al. [99] build a graph from all superpixels of all
images, and use a ranking algorithm and boundary prior to segment the common
objects such that the nodes corresponding to the common foreground are assigned
high rank scores.
In the first part of this monograph, we discuss image co-segmentation methods
based on unsupervised frameworks. Hence, these methods do not involve CNN fea-
tures or region proposals or saliency. Every image is segmented into superpixels, a
region adjacency graph (RAG) is constructed, and the nodes in the graph (superpix-
els) are attributed with only low-level and mid-level features, e.g., color, HOG and
SIFT features. These co-segmentation algorithms are based on graph matching and
superpixel classification. The graph-based framework performs maximum common
subgraph matching of the RAGs obtained from the image set to be co-segmented. In
the classification framework, discriminative features are computed using a modified
linear discriminant analysis (LDA) to classify superpixels as background and the
common foreground.

2.2 Supervised Co-segmentation

2.2.1 Semi-supervised Methods

There has been a small volume of work on semi-supervised co-segmentation that


tries to simplify the co-segmentation problem by including user interaction for seg-
mentation. Batra et al. [8] have proposed interactive co-segmentation for a group
of similar images using scribbles drawn by users. This method guides the user to
draw scribbles in order to refine segmentation boundary in the uncertain regions.
This method is quite similar to the method in [130], but they consider only one back-
ground model for the entire image set. Similar semi-supervised methods have been
proposed in [29, 33, 142].

2.2.2 Deep Learning-Based Methods

Very little work has been done on the application of deep learning to solve the
co-segmentation problem. Recently, the methods in [20, 72] suggested end-to-end
training of deep siamese encoder-decoder networks for co-segmentation. The CNN-
based encoder is responsible for learning object features, and the decoder performs
the segmentation task. The siamese nature of the network allows an image pair to
be input simultaneously, and the segmentation loss is computed for the two images
jointly. To capture the inter-image similarity, Li et al. [72] compute the spatial cor-
relation of the encoder feature (say, shape C × H × W ) pair obtained from the two
images such that high correlation values identify the common object pixel locations
in the respective images. Using the two correlation results (shape H × W × H W
each) as seed, the decoder generates the segmentation masks through deconvolu-
tion operation. Different from this, Chen et al. [20] simply concatenate the encoder
feature pair, which is decoded to obtain the common object masks. The common
class is identified by fusing channel attention measures of the image pair, and the
objects are localized using spatial attention maps. Specifically, the encoder features
are modulated by the attention measures before feeding them to the decoder. More
recently, the methods in [67, 147] consider more than two images simultaneously for
training their co-segmentation networks. Li et al. [67] use a co-attention recurrent
neural network (RNN) that learns a prototype representation for the image set. The
RNN is trained in the traditional manner by taking the images as input in any random
sequence. In addition, the update gate model of the RNN unit involves both channel
and spatial attention so that common object information is infused in the group repre-
sentation. This prototype is then combined with the encoder feature of each image to
obtain the common object masks. Zhang et al. [147] use an extra classifier to learn the
common class. Specifically, the fused channel attention measures obtained from all
images in the set are combined to (i) predict the co-category label, (ii) modulate the
encoder feature maps through a semantic modulation subnetwork. Further, a spatial
modulation subnetwork learns to jointly separate the foreground and the background
in all images to obtain a coarse localization of the common objects.
The methods in [25, 50] perform multiple image co-segmentation using deep
learning, while training the network in an unsupervised manner, that is, without using
any ground-truth mask. Hsu et al. [50] train a generator to estimate co-attention
maps that highlight the objects approximately. Then feature representations of the
approximate foreground and background regions obtained from the co-attention maps
are used to train the generator using a co-attention loss that aims to reduce the inter-
image object distance and increase the intra-image figure-ground discrepancy for all
image pairs. To ensure that the co-attention maps capture complete objects instead
of object parts, a mask loss is used. It is formulated as a binary cross-entropy loss
between the predicted co-attention maps and the best fit object proposals. Thus, the
mask loss and the co-attention loss complement each other. Given that the object
proposals are computed from a pretrained model, and the ones that best fit the co-
attention map are used to refine the co-attention map itself, the method may lead to
a solution of incomplete objects. The method in [25] is built on the model of Li et


al. [72], discussed earlier. Since ground-truth masks are not used, the segmentation
loss, which is typically computed between the segmentation maps predicted by the
decoder pair and the corresponding ground-truth, is replaced by the co-attention
loss of Hsu et al. [50], where the figure-ground is estimated from the segmentation
predictions. In addition, a geometric transformation is learned using consistency
losses that align the feature map pair and the segmentation prediction pair. However,
these methods are not capable of efficiently handling outliers in the image set since
the network is always trained with image sets in which all images contain the common
object.
In the second part of this monograph, we discuss image co-segmentation methods
based on deep learning. The first method uses superpixel and region adjacency graph
representation of images. Then a graph convolutional network performs foreground–
background classification of the nodes. The other two methods do not use any super-
pixel representation. Instead, they directly classify image pixels using convolutional
neural networks. Specifically, object segmentation masks are obtained using encoder-
decoder architectures. Network training is done in a fully supervised manner as well as in a few-shot setting where the number of labeled images is small. At the same time,
negative samples (image pairs without any common object and image sets containing
outlier images) are also used so that the model can identify outliers.

2.3 Co-segmentation Datasets

We provide a summary of the datasets used in this monograph for co-segmentation.


The image pair dataset [69] contains 105 pairs with a single common object present in the majority of the pairs. The common objects present in the two images of a pair are visually very similar. However, the objects themselves are not homogeneously featured in most cases. Faktor et al. [36] created a dataset by collecting images from the PASCAL-VOC dataset. It consists of 20 classes with an average of 50 images per
class. The classes are: person, bird, cat, cow, dog, horse, sheep, airplane, bicycle,
boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa and
tv/monitor. The images within a class have significant appearance and pose vari-
ations. It is a challenging dataset to work with due to the variations and presence
of background clutter. In particular, the images capture scenes from both indoor
and outdoor. The Microsoft Research Cambridge (MSRC) [131] subdataset consists
of the following classes: cow, plane, car, sheep, cat, dog and bird. Each class has
10 images. The iCoseg dataset was introduced by Batra et al. [8] for interactive
segmentation. It contains a total of 643 images of 38 homogeneous classes. This dataset
contains a varying number of images of the same object instance under very different
viewpoints and illumination, articulated or deformable objects like people, complex
backgrounds and occlusions. Even though the image pair dataset, as the name sug-
gests, already contains pairs for co-segmentation, the other three datasets mentioned
do not. Hence, for co-segmentation on these datasets, image sets are constructed by
grouping multiple images from each class.
The multiple image co-segmentation methods discussed in this monograph also
consider outlier contaminated image sets. Given a set of N images, they find the
common object from M(≤ N ) images. Since such datasets are not available in abun-
dance, the iCoseg dataset has been used to create a much larger dataset by embedding
each class with several outlier images randomly chosen from other classes. Each set
may have up to 30% of the data as outlier images. This dataset is large, with 626 sets containing a total of 11,433 non-unique images, where each set
contains 5–50 images. Similarly, outlier contaminated image sets have been created
using images from the Weizmann horse dataset [13], the flower dataset [92] and the
MSRC dataset. The MIT object discovery dataset of internet images or the Internet
dataset [105] has three classes: car, horse and airplane with 100 samples per class.
Though the number of classes is small, this dataset has high intra-class variation
and is relatively large. Every class also contains a small, variable number of outlier
images.
Chapter 3
Mathematical Background

In this chapter, we describe some concepts that will be instrumental in developing


the co-segmentation algorithms of this monograph. We begin with the superpixel
segmentation algorithm in Sect. 3.1 as superpixels are the basic component in major-
ity of the chapters. In Sect. 3.2, we explain binary and multiclass label propagation
algorithms that aid in classification of samples (e.g., superpixels) by assigning differ-
ent labels (e.g., foreground and background) to them, thus facilitating segmentation.
Then in Sect. 3.3, we describe the maximum common subgraph computation algo-
rithm, which is applied in the graph matching-based co-segmentation algorithms.
The deep learning-based co-segmentation methods of this monograph are based on
convolutional neural networks (CNNs), and we describe them briefly in Sect. 3.4. Next
in Sect. 3.5, we provide a brief introduction to graph convolutional neural network
which is a variant of CNNs, specifically designed to be applied on graphs. We include
a short discussion on variational inference and few-shot learning as these will be used
in a co-segmentation method discussed in the monograph.

3.1 Superpixel Segmentation

Pixel-level processing of images of large spatial size comes with high computational
requirements. Hence, oversegmentation techniques are often used to divide an image
into many non-overlapping segments or atomic regions. Then these segments are used
as image primitives instead of pixels for further processing of the image. In addition
to the reduction in computations, these segments also provide better local features
than pixels alone because a group of pixels provide more context. This makes them
meaningful substitutes for pixels. Currently, the superpixel segmentation algorithm is
the most common and popular choice for oversegmentation, whose resulting atomic
regions are called image superpixels.


Fig. 3.1 Superpixel segmentation. a Input image. b 500 superpixels with compactness Q = 20. c 500 superpixels with Q = 40. d 1000 superpixels with Q = 40. Image Courtesy: Source image from the internet

Each superpixel is a group of spatially contiguous pixels. Hence, unlike image


pixels or rectangular patches, superpixels do not form any rigid grid structure.
However, the pixel values inside a superpixel are similar, thus making it homoge-
neous. Figure 3.1 shows an example of superpixel-based oversegmentation. It can be
observed that the pixels constituting each superpixel belong to a single object and the
superpixel contours follow the object boundary. Thus, the irregular shape of superpix-
els represents both foreground object and background region parts more appropriately
than what regular-shaped rectangular patches can do. This allows superpixels to be
the building blocks for several computer vision algorithms such as PASCAL-VOC
Challenge for visual recognition [145], depth estimation [152], segmentation [75]
and object localization [40].
A good superpixel segmentation algorithm should be efficient in terms of compu-
tation and memory requirements. Further, the superpixels should follow the object
and region boundaries, thus enabling the computer vision algorithms that use them
to achieve high accuracy. The following oversegmentation algorithms have been
designed to generate superpixels directly or have been adapted for superpixels:
simple linear iterative clustering (SLIC) [1], turbopixel [66], mean shift [30], the
watershed method [132], the normalized cuts algorithm [114], the agglomerative
clustering-based segmentation method of Felzenszwalb and Huttenlocher [38], the
image patch stitching-based superpixel computation method of Veksler et al. [128],
and the superpixel generation method of Moore et al. [87] by imposing a grid con-
formability criterion. Here, we will discuss the SLIC superpixel algorithm in detail
since it is more efficient and often outperforms the rest of the approaches. It pro-
duces superpixels of relatively uniform size, which are regularly shaped to a certain
degree, and they follow the boundary adherence property with high accuracy. Hence,
we have used it in the co-segmentation algorithms for this monograph.
The SLIC algorithm is an adaptation of the well-known k-means clustering algo-
rithm. Given an image, it clusters the pixels, and each cluster represents a super-
pixel. For a color image, the clustering of pixels is performed using their five-
dimensional feature vectors: CIE L, a, b values and X , Y coordinate values, i.e.,
f = [L , a, b, X, Y ]T . For a grayscale image, f = [L , X, Y ]T is used as the feature.
Thus, in addition to the pixel intensities, the locations of pixels are also used. This ensures that the pixels belonging to a cluster are spatially cohesive. Further, to con-
strain all pixels in a cluster to form a single connected component, the pixel-to-cluster
assignment is done within a specified spatial window, as described next.
To oversegment an n1 × n2 image into nS superpixels, the clusters are initialized to be P × P non-overlapping windows where P = ⌊√(n1 n2 / nS)⌋. In each window, the smallest gradient pixel in the 3 × 3 neighborhood of its center pixel is set as a cluster center.
Assignment step: Each pixel-i is assigned to a cluster C if the corresponding cluster center c has the shortest feature distance among all cluster centers in a 2P × 2P window $N_i^{2P}$:

$$c = \arg\min_{c \in N_i^{2P}} d(\mathbf{f}_i, \mathbf{f}_c) \qquad (3.1)$$

where d(·) is the feature distance. Restricting the search space to Ni2P reduces
the number of distance calculations. So the algorithm requires less computation
compared to the standard k-means clustering which compares with all data points
(pixels) for finding the minimum distance. It can be observed that for a pixel, the
assigned cluster center is one among its eight spatially closest cluster centers. So,
computational complexity of SLIC for an image with N pixels is O (N ).
Cluster update step: After each pixel is assigned to a cluster, each cluster center is updated as the average feature of all pixels in that cluster, giving an updated fc.
These two steps are repeated iteratively until the shift in the cluster center locations
is below a certain threshold. The final number of clusters after convergence of the
clustering is the number of generated superpixels, which may be slightly different
from n S .
Now, we explain the distance measure used in Eq. (3.1). Each pixel feature vector
f consists of pixel intensity and chromaticity information L, a, b, and the pixel
coordinates X , Y . However, (L , a, b) and (X, Y ) components exhibit different ranges
of values. Hence, the distance measure d(·) must be designed as a combination of
two distance measures: color distance d1 (·) and spatial distance d2 (·), which are
computed as the Euclidean distances between [L i , ai , bi ]T and [L c , ac , bc ]T , and
[X i , Yi ]T and [X c , Yc ]T , respectively. Then, these two measures are combined as:
$$d(\mathbf{f}_i, \mathbf{f}_c) = \sqrt{(d_1/t_1)^2 + (d_2/t_2)^2} \qquad (3.2)$$

where t1 and t2 are two normalization constants. Here, the arguments of d1 (·) and
d2 (·) have been dropped for simplicity. Since the search space is restricted to a
2P × 2P window, the maximum possible spatial distance between a pixel and the
cluster center obtained using Eq. (3.1) is P. So, t2 can be set to P. Thus, Eq. (3.2)
can be rewritten as:

$$d(\mathbf{f}_i, \mathbf{f}_c) = \sqrt{d_1^2 + Q\,(d_2/P)^2} \qquad (3.3)$$

where Q = t_1^2 is a compactness factor that acts as a weight for the two distances. Note
that the denominator t1 has been dropped because it does not impact the arg min
operation in Eq. (3.1). A small value of Q gives more weight to the color distance and
the resulting superpixels better adhere to the object boundaries (Fig. 3.1b), although
the superpixels are less compact and regular in shape and size. The converse is
true when Q is large (Fig. 3.1c). In CIELab color space, the typical values of Q ∈
[1, 40] [1]. It is possible that even after the convergence of the algorithm, there are
some isolated pixels that are not connected to the superpixels they have been assigned
to. As a post-processing, such a pixel is assigned to the spatially nearest superpixel.
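As a usage illustration, the following sketch generates SLIC superpixels with scikit-image, which provides an implementation of the algorithm described above. The library and the image path are assumptions of this sketch; the parameters n_segments and compactness play the roles of n_S and Q, respectively.

```python
import numpy as np
from skimage import io, color, segmentation

image = io.imread('input.jpg')                 # placeholder path for an RGB image

# n_segments corresponds to n_S and compactness to Q in Eq. (3.3)
labels = segmentation.slic(image, n_segments=500, compactness=20)

# quick visual checks: mean color per superpixel and superpixel boundaries
mean_color = color.label2rgb(labels, image, kind='avg')
with_boundaries = segmentation.mark_boundaries(image, labels)

print('number of superpixels generated:', len(np.unique(labels)))
```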

3.2 Label Propagation

Often datasets contain many samples, which are unlabeled, and only a small number
of them are labeled. This may occur due to the large cardinality of datasets or the lack of annotations. Let X = Xa ∪ Xb be a set of n data sample features where
Xa = {x1 , x2 , . . . , xm } and Xb = {xm+1 , xm+2 , . . . , xn } are the labeled and unlabeled
sets, respectively. Each xi ∈ Xa has a label L (xi ) ∈ {1, 2, . . . , K }, which indicates
that every sample belongs to one of the K classes. Label propagation techniques
are designed to predict labels of the unlabeled samples from the available labeled
samples. Hence, it is a type of semi-supervised learning. This has several applications
including classification, image segmentation and website categorization.

Semi-supervised learning methods [12, 53, 55, 150] learn a classifying function
from the available label information and the combined space of the labeled and
unlabeled samples. This learning step incorporates a local and a global labeling
consistency assumption. (i) Local consistency: neighboring samples should have the
same label, and (ii) global consistency: samples belonging to a cluster should have
the same label. Then this classifying function assigns labels to the unlabeled samples.
We first describe this process for two classes (K = 2) and then for multiple classes
(K > 2).

3.2.1 Two-class Label Propagation

We explain the two-class label propagation process by performing foreground–


background segmentation of an image. Let each image pixel-i be represented by its feature vector xi. A certain subset of the pixels is labeled as foreground (L(xi) = 1)
or background (L (xi ) = 2), and they constitute the set Xa . Let p F (x) and p B (x)
be two distributions obtained from the foreground and background labeled features,
respectively. Subsequently, all pixels in the image are classified as either foreground
(yi → 1) or background (yi → 0) by minimizing the following cost function:
$$E(\mathbf{y}) = \sum_{i=1}^{n}(1 - y_i)^2\, p_F(x_i) + \sum_{i=1}^{n} y_i^2\, p_B(x_i) + \sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij}\,(y_i - y_j)^2, \qquad (3.4)$$

where yi ∈ [0, 1] is the likelihood of pixel-i being foreground, and S ∈ Rn×n is a
feature similarity matrix. One can use any appropriate measure to compute S for
a particular task. For example, Si j can be calculated as the negative exponential of
the normalized distance between xi and x j . Here, p F (x) and p B (x) act as the prior
probability of a pixel x being foreground or background, respectively. In Eq. (3.4),
minimization of the first term forces yi to take a value close to 1 for foreground
pixels because a pixel-i with a large p F (xi ) is more likely to belong to foreground.
For similar reasons, minimization of the second term forces yi to take a value close to
0 for background pixels since they have a large p B (xi ). Observe that these two terms
together attain the global consistency requirement mentioned earlier. The third term
maintains a smoothness in labeling by forcing neighboring pixels to have the same
label. If two pixels are close in the feature space, Si j will be large, and the resulting yi
and y j will be close in order to minimize the third term in Eq. (3.4), and consequently
the local consistency requirement is satisfied. Here, diagonal elements of S are set to
0 to avoid self-bias. Till now, neighborhood has been considered only in the feature
space. However, one may also consider spatial neighborhood by (i) setting Si j = 0 if
pixel-i is not in the neighborhood of pixel- j, or (ii) scaling Si j by the inverse of the
spatial distance between pixel-i and pixel- j, or even (iii) by certain post-processing
when (i) and (ii) are not applicable. We will describe one such method in Chap. 6.
As another example, consider an undirected graph where each node
represents a sample xi and the neighborhood is defined by the presence of edges. It


may be noted that, irrespective of the initial labels L (xi ), the design of Eq. (3.4)
will assign labels to all pixels in Xa as well as in Xb . However, labels of most pixels
in Xa will not change since features xi ∈ Xa have been used to compute the prior
probabilities and they have a large p F (xi ) (if L (xi ) = 1, i.e., yi → 1) or p B (xi ) (if
L (xi ) = 2, i.e., yi → 0). Thus, the first two terms in Eq. (3.4) jointly act as a data
fitting term.
Label propagation is evident in the third term of Eq. (3.4) as an unlabeled pixel- j
is likely to have the same label as of pixel-i if Si j is large. In the first two terms,
label propagation occurs indirectly through the prior probabilities, which are com-
puted from the labeled pixel features xi ∈ Xa . The final labeling yi is influenced by
p F (x) and p B (x) as explained earlier. Minimization of E can be performed by first
expressing it using vector–matrix notation:
$$
\begin{aligned}
E(\mathbf{y}) &= \sum_{i=1}^{n} p_F(x_i) - 2\,\mathbf{y}^T \mathbf{p}_F + \mathbf{y}^T \mathbf{P}_F\, \mathbf{y} + \mathbf{y}^T \mathbf{P}_B\, \mathbf{y} + \mathbf{y}^T(\mathbf{D} - \mathbf{S})\,\mathbf{y} \\
&= \sum_{i=1}^{n} p_F(x_i) - 2\,\mathbf{y}^T \mathbf{p}_F + \mathbf{y}^T(\mathbf{P}_F + \mathbf{P}_B + \mathbf{D} - \mathbf{S})\,\mathbf{y},
\end{aligned} \qquad (3.5)
$$

where $\mathbf{P}_F$ and $\mathbf{P}_B$ are diagonal matrices with $\{p_F(x_i)\}_{i=1}^{n}$ and $\{p_B(x_i)\}_{i=1}^{n}$ as diagonal elements, respectively. $\mathbf{D}$ is also a diagonal matrix with $D_{ii} = \sum_j S_{ij}$. The cost $E$ can be minimized with respect to $\mathbf{y}$ to obtain the solution as:

$$-2\,\mathbf{p}_F + 2(\mathbf{P}_F + \mathbf{P}_B + \mathbf{D} - \mathbf{S})\,\mathbf{y} = 0 \;\;\Rightarrow\;\; \mathbf{y} = (\mathbf{P}_F + \mathbf{P}_B + \mathbf{D} - \mathbf{S})^{-1}\mathbf{p}_F. \qquad (3.6)$$

This solution will yield yi ∈ [0, 1], and it can be thresholded to obtain the label for
each pixel-i as either foreground or background.
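A minimal NumPy sketch of the closed-form solution in Eq. (3.6) is given below; the similarity matrix S and the priors p_F and p_B are assumed to be computed beforehand for the task at hand. Solving the linear system avoids forming the explicit inverse.

```python
import numpy as np

def two_class_label_propagation(S, p_fg, p_bg, threshold=0.5):
    """Solve Eq. (3.6): y = (P_F + P_B + D - S)^{-1} p_F, then threshold.

    S    : (n, n) feature similarity matrix with zero diagonal
    p_fg : (n,) foreground prior p_F(x_i) for every sample
    p_bg : (n,) background prior p_B(x_i) for every sample
    """
    D = np.diag(S.sum(axis=1))                      # D_ii = sum_j S_ij
    A = np.diag(p_fg) + np.diag(p_bg) + D - S       # P_F + P_B + D - S
    y = np.linalg.solve(A, p_fg)                    # likelihood of being foreground
    return (y > threshold).astype(int), y           # hard labels and soft scores
```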

3.2.2 Multiclass Label Propagation

In the two-class label propagation process, each sample xi is classified as either of the
two classes based on the obtained value of yi . However, in the case of multiple classes
(K > 2), the initial label information of samples xi ∈ Xa is represented using L i ∈
R1×K , where L ik = 1 if xi has label k and L ik = 0 otherwise. Since the samples xi ∈
Xb are unlabeled, for them L ik = 0, ∀k. Let us denote L = [L_1^T, L_2^T, . . . , L_n^T]^T ∈
Rn×K . It is evident that (i) for an unlabeled sample xi , the i-th row L i is all-zero,
and (ii) for a labeled sample xi , the index of value 1 in L i specifies its initial label.
Similar to the two-class process, the goal of the multiclass label propagation process
is to obtain a final label matrix containing labels of all samples in X . Following
the design of L, let us denote it as Y = [Y_1^T, Y_2^T, . . . , Y_n^T]^T. The label information
of each sample xi is obtained from Yi ∈ R1×K , whose elements Yik determine the
likelihood of the sample belonging to class-k. Thus, the final label to be assigned to
xi is obtained as:

$$\mathcal{L}(x_i) = \arg\max_{k} Y_{ik}. \qquad (3.7)$$

To calculate Y, a label propagation cost function Emulti (Y) can be formulated as:
$$E_{\mathrm{multi}}(\mathbf{Y}) = \chi \sum_{i=1}^{n} \|Y_i - L_i\|^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij} \left\| \frac{1}{\sqrt{D_{ii}}}\, Y_i - \frac{1}{\sqrt{D_{jj}}}\, Y_j \right\|^2, \qquad (3.8)$$

where χ is a regularization parameter that weighs the two terms. The first term is
the data fitting term that ensures that the final labeling is not far from the initial
labeling. Unlike in Eq. (3.4), multiple prior probabilities are not computed here for
different classes. Similar to Eq. (3.4), the second term in Eq. (3.8) satisfies the local
consistency requirement and ensures smoothness in labeling. Yi and Yj are further normalized by √Dii and √Djj, respectively, to incorporate similarities of xi and xj
with their respective neighboring samples. To obtain the optimal Y, the cost Emulti is
minimized with respect to Y as [150]:
$$2\chi(\mathbf{Y} - \mathbf{L}) + 2(\mathbf{I} - \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2})\mathbf{Y} = 0 \;\;\Rightarrow\;\; \mathbf{Y} = \mu(\mathbf{I} - \omega_l \tilde{\mathbf{S}})^{-1}\mathbf{L}, \qquad (3.9)$$

where μ = χ /(1 + χ ), ωl = 1/(1 + χ ) and S̃ = D−1/2 SD−1/2 . Then labels can be
computed using Eq. (3.7).
The above regularization framework can also be expressed as an iterative algo-
rithm [150] where the initial label matrix L gets iteratively updated to finally obtain
the optimal Y at convergence. Let Y(0) = L and the label update equation for t ≥ 1:

$$\mathbf{Y}^{(t)} = \omega_l\, \tilde{\mathbf{S}}\, \mathbf{Y}^{(t-1)} + (1 - \omega_l)\,\mathbf{L}, \qquad (3.10)$$

where 0 < ωl < 1 is a regularization parameter. The first term updates $\mathbf{Y}^{(t-1)}$ to $\mathbf{Y}^{(t)}$ using the normalized similarity matrix $\tilde{\mathbf{S}}$. To understand the label propagation, consider $\omega_l = 1$ and $\mathbf{Y}^{(t)} = \tilde{\mathbf{S}}\mathbf{Y}^{(t-1)}$. Thus, $Y_{ik}^{(t)} = \sum_{j=1}^{n} \tilde{S}_{ij}\, Y_{jk}^{(t-1)}$. This illustrates that if sample xi has a large similarity with sample xj, the likelihood of xj belonging to class-k, i.e., $Y_{jk}^{(t-1)}$, influences $Y_{ik}^{(t)}$, i.e., the likelihood of xi also belonging to class-k. Thus, label propagation occurs from xj to its neighbor xi. The second term in Eq. (3.10) ensures that the final label matrix is not far from the initial label matrix L. To obtain the optimal label matrix $\mathbf{Y}^* = \lim_{t\to\infty} \mathbf{Y}^{(t)}$ at convergence, we rewrite
Eq. (3.10) using recursion as:
$$\mathbf{Y}^{(t)} = (\omega_l \tilde{\mathbf{S}})^{t-1}\mathbf{L} + (1 - \omega_l)\sum_{i=0}^{t-1}(\omega_l \tilde{\mathbf{S}})^{i}\,\mathbf{L}. \qquad (3.11)$$
Fig. 3.2 Example of maximum common subgraph of two graphs G1 and G2 . The set of nodes
V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 } and edges in the maximum common subgraphs H1
and H2 of G1 and G2 , respectively, are highlighted (in blue)

Since the eigenvalues of $\tilde{\mathbf{S}}$ lie in [0, 1], we have (i) $\lim_{t\to\infty}(\omega_l \tilde{\mathbf{S}})^{t-1} = 0$ and (ii) $\lim_{t\to\infty}\sum_{i=0}^{t-1}(\omega_l \tilde{\mathbf{S}})^{i} = (\mathbf{I} - \omega_l \tilde{\mathbf{S}})^{-1}$. Hence,

$$\mathbf{Y}^* = \lim_{t\to\infty} \mathbf{Y}^{(t)} = (1 - \omega_l)(\mathbf{I} - \omega_l \tilde{\mathbf{S}})^{-1}\mathbf{L}, \qquad (3.12)$$

which is proportional to the closed-form solution in Eq. (3.9).
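The closed form of Eq. (3.12) (equivalently Eq. (3.9)) translates directly into a few lines of NumPy; as before, the similarity matrix S and the initial label matrix L are assumed given.

```python
import numpy as np

def multiclass_label_propagation(S, L, omega=0.9):
    """Return the predicted class of every sample using Eqs. (3.12) and (3.7).

    S : (n, n) similarity matrix with zero diagonal
    L : (n, K) initial label matrix (one-hot rows for labeled samples,
        all-zero rows for unlabeled samples)
    """
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S_tilde = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]    # D^{-1/2} S D^{-1/2}
    n = S.shape[0]
    Y = (1.0 - omega) * np.linalg.solve(np.eye(n) - omega * S_tilde, L)
    return Y.argmax(axis=1)                                    # Eq. (3.7)
```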

3.3 Subgraph Matching

In this section, we briefly describe the maximum common subgraph (MCS) com-
putation for any two graphs G1 = (V1 , E1 ) and G2 = (V2 , E2 ). Here, Vi = {vki } and
Ei = {ekli } for i = 1, 2 denote the set of nodes and edges, respectively. Typically,
each node is attributed with a label (e.g., digits or strings) or a representative vector
depending on the task that is being solved using graphs, whereas edges represent a
certain association among the nodes as may be specified in the dataset under consid-
eration. The MCS corresponds to the largest pair of subgraphs H1 in G1 and H2 in G2 such that the nodes in the resulting H1 have a high similarity in their attributes with the nodes in H2. Further, the nodes in both H1 and H2 should be cohesive through
edge connectivity. The maximum common subgraphs for an example graph pair G1
and G2 are highlighted in Fig. 3.2.
Since this is a computationally very demanding task, we use a product graph-based solution to compute the MCS. To find the MCS, we first build a product graph W (also known as the vertex product graph) from the graphs G1 and G2 based on their inter-
graph attribute similarities. If labels are used as attributes, one may consider that
a node v 1 ∈ G1 is similar to a node v 2 ∈ G2 when their labels match exactly, i.e.,
L (v 1 ) = L (v 2 ). In case of attribute vectors, one may compare the corresponding
vector distance with a predecided threshold to conclude if a specific node pair matches
or not. A node in a product graph [61] is denoted as a 2-tuple (vk1 , vl2 ) with vk1 ∈ G1
and vl2 ∈ G2 . Let us call it a product node to differentiate it from single graph nodes.
We define the set of product nodes U W of the product graph W as:
$$U^W = \left\{ (v_k^1, v_l^2) \;\middle|\; v_k^1 \in V_1,\; v_l^2 \in V_2,\; \mathcal{L}(v_k^1) = \mathcal{L}(v_l^2) \right\}, \text{ considering attribute labels,} \qquad (3.13)$$

or

$$U^W = \left\{ (v_k^1, v_l^2) \;\middle|\; v_k^1 \in V_1,\; v_l^2 \in V_2,\; d(v_k^1, v_l^2) < t_G \right\}, \text{ considering attribute vectors,} \qquad (3.14)$$
where tG is a threshold. In W, an edge is added between two product nodes $(v_{k_1}^1, v_{l_1}^2)$ and $(v_{k_2}^1, v_{l_2}^2)$ with $k_1 \neq k_2 \wedge l_1 \neq l_2$ if
C1. $e_{k_1 k_2}^1$ exists in G1 and $e_{l_1 l_2}^2$ exists in G2, or
C2. $e_{k_1 k_2}^1$ is not present in G1 and $e_{l_1 l_2}^2$ is not present in G2,
where ∧ stands for the logical AND operation. In the case of product nodes $(v_{k_1}^1, v_{l_1}^2)$ and $(v_{k_1}^1, v_{l_2}^2)$ (i.e., $k_1 = k_2$), an edge is added if $e_{l_1 l_2}^2$ exists. As edges in the product
graph W represent matching, the edges in its complement graph W C and the product
nodes which they are incident on, represent non-matching, and such product nodes
are essentially the minimum vertex cover (MVC) of W C . The MVC of a graph is
the smallest set of vertices required to cover all the edges in that graph [31]. The set
of product nodes (U M ⊆ U W ) other than this MVC represents the matched product
nodes that form the maximal clique of W in the literature [17, 61]. Let V1H ⊆ V1
and V2H ⊆ V2 be the set of nodes in the corresponding common subgraphs H1 in G1
and H2 in G2 , respectively, with

V1H = {vk1 |(vk1 , vl2 ) ∈ U M } and (3.15)

V2H = {vl2 |(vk1 , vl2 ) ∈ U M }, (3.16)

and they correspond to the matched nodes in G1 and G2 , respectively. Note H1 and
H2 are induced subgraphs.
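The product-graph construction and the clique view of the matched product nodes can be sketched with networkx as follows. The node-matching test (label equality or an attribute-distance threshold) is supplied by the caller, and the case of two product nodes sharing a node of G2 is treated symmetrically to the k1 = k2 case described above, which is an assumption of this sketch; it is an illustration, not the exact implementation used later in the monograph.

```python
import networkx as nx

def maximum_common_subgraph(G1, G2, match):
    """Build the product graph W (Eqs. (3.13)/(3.14), conditions C1-C2) and
    return the common subgraphs induced by a maximum clique of W."""
    W = nx.Graph()
    W.add_nodes_from((u, v) for u in G1 for v in G2 if match(u, v))
    nodes = list(W)
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            (k1, l1), (k2, l2) = nodes[a], nodes[b]
            if k1 != k2 and l1 != l2:
                # C1 and C2: the edge is present in both graphs or absent in both
                if G1.has_edge(k1, k2) == G2.has_edge(l1, l2):
                    W.add_edge(nodes[a], nodes[b])
            elif k1 == k2 and G2.has_edge(l1, l2):
                W.add_edge(nodes[a], nodes[b])
            elif l1 == l2 and G1.has_edge(k1, k2):
                # assumed symmetric counterpart of the k1 == k2 case
                W.add_edge(nodes[a], nodes[b])
    if W.number_of_nodes() == 0:
        return G1.subgraph([]), G2.subgraph([])
    U_M = max(nx.find_cliques(W), key=len)      # matched product nodes
    V1H = {k for k, _ in U_M}
    V2H = {l for _, l in U_M}
    return G1.subgraph(V1H), G2.subgraph(V2H)
```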
Step-by-step demonstration of MCS computation is shown in Figs. 3.3, 3.4, 3.5,
3.6, 3.7 and 3.8 using three example graph pairs. The graphs G1 and G2 in Fig. 3.3
have node set

$$V_1 = \{v_1^1, v_2^1, v_3^1, v_4^1, v_5^1, v_6^1, v_7^1, v_8^1\}, \quad V_2 = \{v_1^2, v_2^2, v_3^2, v_4^2, v_5^2, v_6^2, v_7^2, v_8^2, v_9^2\},$$

and the edge information is captured by the binary adjacency matrices
$$
A_1 = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 1 & 1\\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 1\\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 1\\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 1\\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 1\\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1\\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 0
\end{bmatrix}, \quad
A_2 = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1\\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1\\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1\\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0\\
1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1\\
1 & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 0
\end{bmatrix}.
$$

Fig. 3.3 Maximum common subgraph computation. a, b Two graphs G1 and G2 . c Product
nodes U W of the product graph W are added based on inter-graph similarities among nodes in
G1 and G2 . d Edges in W are added based on the conditions. e The complement graph (W C ) of the
product graph W shows that its minimum vertex cover is v11 , v62 . f The nodes in the complement
set (U M ) of the MVC constitute the MCS. g, h The subgraphs H1 and H2 of G1 and G2 , respectively

The product node set obtained using Eq. (3.13) or Eq. (3.14) is

$$U^W = \{(v_1^1, v_1^2), (v_2^1, v_2^2), (v_7^1, v_8^2), (v_8^1, v_9^2), (v_1^1, v_6^2)\}.$$

Here, $v_1^1 \in G_1$ matched with both $v_1^2, v_6^2 \in G_2$. Then edges are added between the product nodes using the conditions C1 and C2. Specifically, the edge between $(v_2^1, v_2^2)$ and $(v_7^1, v_8^2)$ exists due to condition C2 and the remaining edges exist due to condition C1 (Fig. 3.3d). The minimum vertex cover of the complement graph is

Fig. 3.4 Maximum common subgraph computation for the same two graphs G1 and G2 of Fig. 3.3
considering different inter-graph node similarities. a Product nodes U W of the product graph W .
b Edges in W . c The complement graph (W C ) of W shows that its minimum vertex cover is
v11 , v62 . d The nodes in the complement set (U M ) of the MVC constitute the MCS. e The set of
nodes V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 } and edges in the subgraphs H1 and H2 of G1
and G2 , respectively, are highlighted

$\{(v_1^1, v_6^2)\}$ since it covers all the edges in $W^C$ (Fig. 3.3e). The complement set of the MVC is

$$U^M = U^W \setminus \mathrm{MVC} = \{(v_1^1, v_1^2), (v_2^1, v_2^2), (v_7^1, v_8^2), (v_8^1, v_9^2)\},$$

and it provides the nodes of the resulting subgraphs as $V_1^H = \{v_1^1, v_2^1, v_8^1, v_7^1\}$ and $V_2^H = \{v_1^2, v_2^2, v_9^2, v_8^2\}$. Figure 3.4 shows an example of a graph pair with the same set of nodes and edges as in Fig. 3.3, but having different node attributes, and resulting in a different product node set, given as:

$$U^W = \{(v_1^1, v_1^2), (v_2^1, v_8^2), (v_7^1, v_2^2), (v_8^1, v_9^2), (v_1^1, v_6^2)\}.$$

However, we observe that the resulting subgraphs are the same. Figure 3.5 shows
another similar example graph pair, but having different node attributes from both
Figs. 3.3 and 3.4. Here, the complement of the product graph contains a sin-

Fig. 3.5 Maximum common subgraph computation for the same two graphs G1 and G2 of Fig. 3.3
considering different inter-graph node similarities. This example demonstrates the possibility of
non-unique minimum vertex cover (MVC). a Product nodes U W of the product graph W . b Edges
in W . c The complement graph (W C ) of W shows that its MVC is either v61 , v62 or v71 , v82 ,
and d, e corresponding nodes in the complement set (U M ) of the MVC constitute the MCS. f, g
Two possible subgraph pairs with the set of nodes V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 },
or V1H = {v11 , v21 , v81 , v61 }, V2H = {v12 , v22 , v92 , v62 }

Fig. 3.6 Maximum common subgraph computation for the same two graphs G1 and G2 of Fig. 3.3
considering different inter-graph node similarities. a Product nodes U W of the product graph W . b
Edges in W . c The complement graph (W C ) of W shows that its minimum vertex cover is either
{ v11 , v62 , v31 , v22 } or { v11 , v62 , v11 , v12 }. d The nodes in the complement set (U M ) of the MVC
{ v11 , v62 , v31 , v22 } constitute the MCS. e The set of nodes V1H = {v11 , v81 , v71 }, V2H = {v12 , v92 , v82 }
and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted

gle edge, and this creates an ambiguity in the MVC. The MVC can be chosen as either $\{(v_6^1, v_6^2)\}$ or $\{(v_7^1, v_8^2)\}$. These choices result in maximum common subgraphs with node sets either $V_1^H = \{v_1^1, v_2^1, v_8^1, v_7^1\}$ and $V_2^H = \{v_1^2, v_2^2, v_9^2, v_8^2\}$ (Fig. 3.5f), or $V_1^H = \{v_1^1, v_2^1, v_8^1, v_6^1\}$ and $V_2^H = \{v_1^2, v_2^2, v_9^2, v_6^2\}$ (Fig. 3.5g). One may choose either
result depending on the task in hand. Figures 3.6, 3.7 and 3.8 show more examples.
It can be observed in Fig. 3.7d that the product graph W is a complete graph, hence,
the MVC is an empty set resulting in U M = U W . Thus, all product nodes constitute
the MCS.

Fig. 3.7 Maximum common subgraph computation. a, b Two graphs G1 and G2 . c Product
nodes U W of the product graph W are added based on inter-graph similarities among nodes in
G1 and G2 . d Edges in W . e Since W is a complete graph; its complement (W C ) does not have
any edges. Hence, the minimum vertex cover of W C is an empty set. f The nodes in the com-
plement set (U M ) of the MVC constitute the MCS. g The set of nodes V1H = {v11 , v21 , v31 , v81 , v71 },
V2H = {v12 , v22 , v42 , v92 , v82 } and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are
highlighted

Fig. 3.8 Maximum common subgraph computation. a, b Two graphs G1 and G2 . c Product
nodes U W of the product graph W are added based on inter-graph similarities among nodes in
G1 and G2 . d Edges in W . e The complement graph (W C ) of the product graph W shows that its
minimum vertex cover is either v11 , v12 or v51 , v22 . f The nodes in the complement set (U M ) of the
MVC v11 , v12 constitute the MCS. g The set of nodes V1H = {v81 , v41 , v51 , v61 }, V2H = {v92 , v32 , v22 , v62 }
and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted

3.4 Convolutional Neural Network

A convolutional neural network (CNN) is one of the variants of neural networks.


It typically operates on images and extracts semantic features using convolution
filters. The primary efficacy of CNN is that it automatically learns those filters by
optimizing a task specific loss function computed over a set of training samples.
CNN was primarily developed for classification purposes, and when compared to
other classification methods, the amount of preprocessing required by a CNN is
significantly less. While basic approaches require handcrafting of filters, CNN can
learn these filters and feature extractors with enough training. The architecture of a
CNN is inspired by the organization of the visual cortex and is akin to the connectivity
pattern of neurons in the human brain. Individual neurons can only react to stimuli
in a limited region of the visual field called the receptive field. A number of similar
fields can be stacked on top of each other to span the full visual field.
Benefits of applying CNNs to image data in place of fully connected networks
(FCN) are manifold. First, a convolutional layer (CL) in CNNs extracts meaningful
features by preserving the spatial relations within the image. On the other hand, a
fully connected (FC) layer in FCNs, despite being a universal function approximator,
remains poor at identifying and generalizing the raw image pixels. Another important
aspect is that a CL summarizes the image and yields more concise feature maps for
the next CL in the CNN pipeline (also known as network architecture). To this end,
CNNs provide dimensionality reduction and reduce the computational complexity.
This is not possible with FC layers. Lastly, FCNs enforce static image sizes due to
their inherent properties, whereas CNNs permit us to work on arbitrary sized images,
especially in the fully convolutional way.
Suppose, image I ∈ RC×M×N is a tensor with C channels and M × N pixels.
Let h ∈ RC×K ×L be the convolution kernel of a filter, where K < M and L < N .
Typically, K and L are chosen to be equal and odd. Now, the convolution operation
at a pixel (i, j) of the image I with respect to the kernel h can be defined as:
$$F[i, j] = \sum_{c=1}^{C}\sum_{k=0}^{K-1}\sum_{l=0}^{L-1} h[c, k, l]\; I[c, i-k, j-l], \qquad (3.17)$$

where F ∈ R M×N is an output feature map obtained after the first convolutional
layer. The kernel slides over the entire tensor to produce values at all pixel locations.
It may be noted that, the same spatial size of I and F is ensured by padding zeros
to boundaries of I before convolution. Without zero-padding, the output spatial size
will be less than that of the input. Typically, instead of using a single filter, a set of
Co filters {h} are used, and outputs of all the filters are concatenated channelwise to
obtain the consolidated feature map F ∈ RCo ×M×N . Next, this feature map is input
to the second convolutional layer, which uses a different set of filters and produces
an output. This process is then repeated for subsequent convolutional layers in the
CNN. The output of the final CL can be used as the image feature in a range of computer vision problems. Typically, any CNN designed for the task of classification or regression requires a set of fully connected layers to follow the final CL (Fig. 3.9).

Fig. 3.9 CNN architecture. The network in this example has seven layers. The first four layers are the convolutional layers and the last three layers are fully connected layers. Figure courtesy: Internet
In some cases, the convolution operation is not performed at each and every point
of the input tensor. Specifically, this approach is adopted when the input image has
pixelwise redundancy. In this context, the stride of a convolution operation is defined
as the number of pixels the convolution kernel slides to reach the next point of
operation. For example, if δw and δh are the strides across width and height of the
image I , then after the operation at a certain point (i, j) as shown in Eq. (3.17), the
next operations will be performed at points (i + δh , j), (i, j + δw ) and (i + δh , j +
δw ). Thus, the output feature map shape will be Co × ((M + P − K )/δh + 1) ×
((N + P − L)/δw + 1) where P is the number of zero rows and columns padded.
The value of the stride can be determined from the amount of information one may
want to drop. Thus, the stride is a hyperparameter. Further, the kernel size is also a hyperparameter. If the kernel size is very large, the kernel accumulates a large
amount of contextual information at every point, which might be very useful for
some complex tasks. However, this increases the space and time complexity. On the
other hand, reducing the kernel size essentially simplifies the complexity, but also
reduces the network’s expressive quality.
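A direct (and deliberately slow) NumPy implementation of the strided, zero-padded convolution of Eq. (3.17) is sketched below; it is intended only to make the index bookkeeping and the output-size formula explicit, and uses a total padding of 2*pad (pad rows and columns on each side).

```python
import numpy as np

def conv2d(I, h, stride=(1, 1), pad=0):
    """Strided 2-D convolution of a C x M x N tensor with a bank of Co filters
    of shape Co x C x K x L (direct evaluation of Eq. (3.17))."""
    C, M, N = I.shape
    Co, Ci, K, L = h.shape
    assert Ci == C
    dh, dw = stride
    Ip = np.pad(I, ((0, 0), (pad, pad), (pad, pad)))        # zero padding
    Mo = (M + 2 * pad - K) // dh + 1                        # output height
    No = (N + 2 * pad - L) // dw + 1                        # output width
    F = np.zeros((Co, Mo, No))
    for o in range(Co):
        flipped = h[o, :, ::-1, ::-1]                       # flip -> true convolution
        for i in range(Mo):
            for j in range(No):
                patch = Ip[:, i * dh:i * dh + K, j * dw:j * dw + L]
                F[o, i, j] = np.sum(patch * flipped)
    return F

# example: 3-channel 32x32 input, four 3x3 kernels, stride 1, one-pixel zero padding
out = conv2d(np.random.rand(3, 32, 32), np.random.rand(4, 3, 3, 3), stride=(1, 1), pad=1)
print(out.shape)   # (4, 32, 32)
```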
Similar to FCNs, a sequence of convolutional layers can also be viewed as a graph
with image pixels (in the input layer) and points in feature maps (in the subsequent
layers) as nodes and kernel values as edge weights. However, the edge connectivity
is sparse since the kernel size is much smaller than the image size. An FC layer is
essentially a CL when (i) the input layer vector is reshaped to a 1 × M × N array
and (ii) C number of filters with convolution kernels of size 1 × M × N are applied
at the center point only, without any sliding. The resulting C × 1 output map will be
the same as the output of the FC layer.

3.4.1 Nonlinear Activation Functions

In modern neural networks, nonlinear activation functions are used to enable the
network to build complicated mappings between inputs and outputs, which are critical
for learning and modeling complex and higher dimensional data distributions. Being
linear, convolutional layers and FC layers alone cannot achieve this. Hence, they are
typically followed by nonlinear activation layers in the network.
One of the most widely used nonlinear activation functions in the neural network
community is the sigmoid, defined as follows:
$$\sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad (3.18)$$

where x denotes the output of a CL or an FC layer. Another classical nonlinear


activation function is the hyperbolic tangent which is used whenever the intermediate
activations are required to be zero-centered. It is defined as:
$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}. \qquad (3.19)$$

However, it can be deduced from both the equations that whenever x becomes
extremely large or small, the functions saturate. As a result, the gradient at those points becomes almost zero. This phenomenon is called the vanishing gradi-
ent problem, which can inhibit the learning, and hence the optimization does not
converge at all.
The issue of vanishing gradient becomes more prominent and devastating with
increasing number of layers in neural networks. Therefore, almost all the modern deep
learning models do not use sigmoid or hyperbolic tangent-based activation functions.
Instead, the rectified linear unit (ReLU) nonlinearity is used, which produces a constant gradient for all positive inputs, independent of the scale of the input. As a result, the
network converges faster than the sigmoid and hyperbolic tangent-based networks.
It is defined as:

$$\mathrm{ReLU}(x) = \begin{cases} 0, & \text{for } x < 0 \\ x, & \text{for } x \geq 0 \end{cases} \qquad (3.20)$$

Furthermore, since ReLU introduces sparsity in a network, the computational load becomes significantly less than with the sigmoid or hyperbolic tangent functions. This makes it preferable for deeper networks. However, it should be noted that when inputs are zero or negative, the function's gradient becomes zero, and the network is unable to learn through backpropagation. Therefore, this nonlinearity can sometimes become a curse rather than a blessing, depending upon the task at hand.
To avoid this problem, a small positive slope is added in the negative area. Thus,
backpropagation is possible even for negative input values. This variant of ReLU is
called the leaky ReLU. It is defined as:
$$\mathrm{Leaky\ ReLU}(x) = \begin{cases} 0.01x, & \text{for } x < 0 \\ x, & \text{for } x \geq 0 \end{cases} \qquad (3.21)$$
With this concept, more flexibility can be achieved by introducing a scale factor α
to the negative component, and this is called the parametric ReLU, defined as:
$$\mathrm{Parametric\ ReLU}(x) = \begin{cases} \alpha x, & \text{for } x < 0 \\ x, & \text{for } x \geq 0 \end{cases} \qquad (3.22)$$

Here, the slope α of the negative component is a learnable parameter of the function. As a result, backpropagation can be used to determine its most appropriate value.
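The activation functions of Eqs. (3.18)-(3.22) are one-liners in NumPy; a small sketch is given below for reference.

```python
import numpy as np

def sigmoid(x):                           # Eq. (3.18)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                              # Eq. (3.19)
    return np.tanh(x)

def relu(x):                              # Eq. (3.20)
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):            # Eq. (3.21)
    return np.where(x < 0, slope * x, x)

def parametric_relu(x, alpha):            # Eq. (3.22); alpha is learned via backpropagation
    return np.where(x < 0, alpha * x, x)

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(leaky_relu(x))
```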

3.4.2 Pooling in CNN

The output feature map of convolutional layers has the drawback of recording the
exact position of features in the input. This means that even little changes in the
feature’s position in the input image will result in a different feature map. Recrop-
ping, rotation, shifting and other minor changes to the input image can cause this.
Downsampling is a typical signal processing technique for solving this issue. This
is done by reducing the spatial resolution of an input signal, keeping the main or
key structural features but removing the fine detail that may not be as valuable to the
task.
In CNNs, the commonly used downsampling mechanism is called pooling. It is
essentially a filter, with kernel size 2 × 2 and stride 2 in most networks. Different
from convolution filter, this kernel chooses either (i) the maximum value, or (ii)
the average value from every patch of the feature map it overlaps with, and these
values constitute the output. These two methods are known as max-pooling and
average pooling, respectively. The standard practice is to have the pooling layer
after the convolutional and nonlinearity layers. If a stride value 2 is considered, the
pooling layer halves the spatial dimensions of a feature map. Thus, if a feature map
F ∈ RCo ×M×N obtained after a convolutional and, say, ReLU layer is passed through
a pooling layer, the resulting output is $\tilde{F} \in \mathbb{R}^{C_o \times M/2 \times N/2}$ (assuming even M and N), given as:

$$\tilde{F}[c, i, j] = \max\{F[c, 2i-1, 2j-1],\, F[c, 2i-1, 2j],\, F[c, 2i, 2j-1],\, F[c, 2i, 2j]\} \qquad (3.23)$$

or

$$\tilde{F}[c, i, j] = \frac{1}{4}\left(F[c, 2i-1, 2j-1] + F[c, 2i-1, 2j] + F[c, 2i, 2j-1] + F[c, 2i, 2j]\right) \qquad (3.24)$$
Figure 3.10 shows an example where a 4 × 4 patch of a feature map is pooled, both
max and average, to obtain a 2 × 2 output. This change in shape is also depicted in
Fig. 3.9 (after first, third and fourth layers). In both strategies, there is no external
parameter involved. Thus, the pooling operation is specified, rather than learned. In addition to making a model invariant to small translation, pooling also makes the training computation and memory efficient due to the reduction in feature map size.

Fig. 3.10 Max and average pooling operations over a 4 × 4 feature map with a kernel of size 2 × 2 and stride 2
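For completeness, the 2 × 2, stride-2 max and average pooling of Eqs. (3.23) and (3.24) can be written compactly in NumPy (even spatial dimensions are assumed, as in the text):

```python
import numpy as np

def pool2x2(F, mode='max'):
    """2x2 pooling with stride 2 over a Co x M x N feature map (Eqs. (3.23)-(3.24))."""
    Co, M, N = F.shape
    blocks = F.reshape(Co, M // 2, 2, N // 2, 2)       # group pixels into 2x2 blocks
    if mode == 'max':
        return blocks.max(axis=(2, 4))
    return blocks.mean(axis=(2, 4))

F = np.random.rand(8, 32, 32)
print(pool2x2(F, 'max').shape, pool2x2(F, 'avg').shape)   # (8, 16, 16) (8, 16, 16)
```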

3.4.3 Regularization Methods

In order to effectively limit the number of free parameters in a CNN so that overfitting
can be avoided, it is necessary to enforce regularization over the parameters. A
traditional method is to first formulate the problem in a Bayesian setting and then
introduce zero mean Gaussian or Laplacian prior over the parameters in the network
while calculating its posterior. That is called L 2 or L 1 regularization depending upon
the nature of the prior.
In larger networks, while learning the weights, i.e., the filter kernel values, it is
possible that some connections will be more predictive than others. As the network
is trained iteratively through backpropagation over multiple epochs, in such scenar-
ios, the stronger connections are learned more, while the weaker ones are ignored.
Only a certain percentage of the connections gets trained, and thus only the corre-
sponding weights are learned properly and the rest cease taking part in learning. This
phenomenon is called co-adaptation [119], and it cannot be prevented with the tradi-
tional L 1 or L 2 regularization. The reason for this is that they also regularize based
on the connections’ prediction abilities. As a result, they approach determinism in
selecting and rejecting weights. Hence, the strong becomes stronger and the weak
becomes weaker. To avoid such situations, dropout has been proposed [119].
Dropout: To understand the efficacy of the dropout regularization, let us consider
the simple case of an FCN with a single layer. It takes an input x ∈ Rd and has the
weight vector w ∈ Rd . If t is the target, considering linear activation at the output
neuron, the loss can be written as:
$$\mathcal{L} = 0.5\left(t - \sum_{i=1}^{d} w_i x_i\right)^2 \qquad (3.25)$$

Now, let us introduce dropout to the above neuron, with a random mask δi ∼ Bernoulli(p) attached to each weight wi. In the context of that neuron, it signifies that each parameter wi is retained in training with probability p and dropped with probability 1 − p. The loss function at the same neuron with dropout can be written as:
 2

d
Lr = 0.5 t − δi wi xi (3.26)
i=1

The expectation of ‘gradient of the loss with dropout’ [4] can be written in terms of the ‘gradient of the loss without dropout’ as follows:

E[∂L_r/∂w_i] = ∂L/∂w_i + w_i p(1 − p) x_i^2    (3.27)

Thus, minimizing the dropout-based loss in Eq. (3.26) is effectively the same as optimizing a regularized network whose loss can be written as:

L̃_r = 0.5 (t − Σ_{i=1}^{d} p w_i x_i)^2 + 0.5 p(1 − p) Σ_{i=1}^{d} w_i^2 x_i^2    (3.28)

While training a network using this regularization, in every iteration, a random fraction (1 − p) of the weight parameters is not considered for updation. However, the randomness associated with the selection of the weights considered for updation ensures that, after a sufficient number of iterations when the network converges, all the weight parameters are learned properly.
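
The regularizing effect of dropout can be verified numerically. The sketch below (a toy illustration; the retention probability p and the values of w, x and t are arbitrary assumptions) draws Bernoulli masks for the single-neuron loss of Eq. (3.26) and compares the Monte Carlo average with the closed-form expectation, which is exactly the weight-scaled squared error plus data-dependent penalty of Eq. (3.28).

import numpy as np

rng = np.random.default_rng(0)

w = np.array([0.5, -0.3, 0.8])      # toy weights (assumed)
x = np.array([1.0, 2.0, -1.0])      # toy input (assumed)
t, p = 0.7, 0.8                     # target and retention probability (assumed)

# Monte Carlo estimate of the expected dropout loss of Eq. (3.26):
# each weight is kept with probability p and dropped otherwise
delta = rng.binomial(1, p, size=(100000, w.size))
mc_loss = np.mean(0.5 * (t - (delta * w * x).sum(axis=1)) ** 2)

# closed-form expectation: the regularized loss of Eq. (3.28)
expected = 0.5 * (t - p * np.sum(w * x)) ** 2 + 0.5 * p * (1 - p) * np.sum(w**2 * x**2)

print(mc_loss, expected)   # the two values should agree closely
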
Batch normalization: In deep networks, the distribution of features varies across different layers at different points of time during training. As a result, the independent and identically distributed (i.i.d.) assumption over the input data does not hold [51]. This phenomenon in deep neural networks is called covariate shift [51], and it significantly slows down convergence since the network needs time to adapt to the continuously shifting data distribution. In order to reduce the effect of this shift, the features can be standardized batchwise at each layer by using the empirical mean and variance of those features computed over the concerned batch [51]. Since the normalization is performed in each batch, this method is called batch normalization.
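
A minimal sketch of this batchwise standardization follows (illustrative only; the batch shape, the constant eps, and the scalar gamma/beta — which are learnable per-feature parameters in practice — are assumptions).

import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    # X: a batch of features with shape (batch_size, num_features).
    # Standardize each feature using the empirical batch statistics,
    # then apply a learnable scale (gamma) and shift (beta).
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

X = np.random.randn(32, 8) * 5.0 + 3.0   # a batch with shifted statistics
Y = batch_norm(X)
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))  # approx. zeros and ones
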

3.4.4 Loss Functions

A neural network is a parametric function where weights and biases are the learnable
parameters. In order to learn those parameters, the network relies on a set of training
samples with ground-truth annotations. The network predicts outputs for those sam-
ples and the predicted outputs are compared with the corresponding ground-truth to
compute the prediction error. The function to measure this prediction error is known, in the deep neural network community, as the loss function. It takes two input arguments,
the predicted output and the corresponding ground-truth, and provides the deviation
between them, which is called the loss. The network then tries to minimize the loss
or the prediction error by updating its parameters. Next, we discuss loss functions
commonly used for training a CNN and the optimization process using them.
The cross-entropy loss, or log loss, is used to measure the prediction error of a
classification model which predicts the probabilities of each sample belonging to
different classes. The loss (L_CE) for a sample is defined as:

L_CE = − Σ_{k=1}^{K} y_j(k) log(ŷ_j(k)),    (3.29)

where K is the number of classes, y j (k) ∈ {0, 1} and ŷ j (k) ∈ [0, 1] are the ground-
truth and the predicted probability of sample- j belonging to class-k, respectively. It
should be noted that softmax activation is applied to the logits (i.e., the final FC layer
outputs) to transform the individual class score into a class probability before the
LC E computation. It can be seen that minimizing LC E is equivalent to maximizing
the log likelihood of correct predictions.
The binary cross-entropy loss (L BC E ) is a special type of cross-entropy loss where
the loss is computed only over two classes, positive and negative classes, defined as:

L_BCE = −[y_j log(ŷ_j) + (1 − y_j) log(1 − ŷ_j)],    (3.30)

where the ground-truth label y j = {1, 0} for positive and negative class, respectively,
and ŷ j ∈ [0, 1] is the model’s estimated probability that sample- j belongs to the
positive class, which is obtained by applying the sigmoid nonlinearity to the logits.
The focal loss is another type of cross-entropy loss that weighs the contribution
of each sample to the loss based on the predictions. The rationale is that, if a sample
can be easily classified correctly by the CNN, its estimated correct class probability
will be high. Thus, the contribution of the easy samples to the loss overwhelms that
of the hard samples whose estimated correct class probabilities are low. Hence, the
loss function should be formulated to reduce the contribution of easy samples. With
this strategy, the loss focuses more on the hard samples during training. The focal
loss (L F L ) for binary classification is defined as:

L_FL = −[(1 − ŷ_j)^γ y_j log(ŷ_j) + (ŷ_j)^γ (1 − y_j) log(1 − ŷ_j)],    (3.31)


where (1 − ŷ_j)^γ, with the focusing parameter γ > 0, is a modulating factor to reduce the influence of easy samples in the loss.
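
The three losses can be written compactly as follows (an illustrative NumPy sketch; the small eps added for numerical stability, the value γ = 2 and the toy probabilities are assumptions). The last line shows the focusing behavior: a confidently classified positive sample contributes far less than a hard one.

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot ground-truth, y_hat: predicted class probabilities (Eq. 3.29)
    return -np.sum(y * np.log(y_hat + eps))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Eq. (3.30): y in {0, 1}, y_hat = predicted probability of the positive class
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def focal_loss(y, y_hat, gamma=2.0, eps=1e-12):
    # Eq. (3.31): down-weights easy samples via the modulating factors
    return -((1 - y_hat) ** gamma * y * np.log(y_hat + eps)
             + y_hat ** gamma * (1 - y) * np.log(1 - y_hat + eps))

y_onehot = np.array([0.0, 1.0, 0.0])
probs = np.array([0.2, 0.7, 0.1])            # softmax output of the logits
print(cross_entropy(y_onehot, probs))        # -log(0.7)

# an easy positive (0.95) contributes far less to the focal loss than a hard one (0.55)
print(focal_loss(1, 0.95), focal_loss(1, 0.55))
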

3.4.5 Optimization Methods

With a loss function in hand, the immediate purpose of a deep neural network is to
minimize the loss by updating its parameters. Since a convolutional neural network includes a large number of convolutional and fully connected layers, and thus has a large number of learnable parameters, minimizing the loss function with respect to that vast set of parameters is not straightforward. Employing an inef-
fective optimization approach can have a substantial impact on the training process,
jeopardizing the network’s performance and training duration. In this section, we
will discuss some common optimization methods and their pros and cons.
Consider f cnn (·; wcnn ) is a CNN parametrized by a set of weights wcnn , and it is
composed of a sequence of functions, each of which represents a specific layer with
its own set of parameters as:

f_cnn(·; w_cnn) = f_L(f_{L−1}(· · · f_1(·; w_1) · · · ; w_{L−1}); w_L).    (3.32)

The network consists of L layers where each layer-i has its own set of parameters
wi . The parameters wi represent the weights of convolution filters and biases in a
convolutional layer, and the weights and biases of fully connected operation in a
fully connected layer. During training, if the network’s prediction is ŷ for a sample x
with ground-truth y, the loss L is computed as some distance measure (Sect. 3.4.4)
between y and ŷ = f cnn (x; wcnn ). Thus, the loss function can be parameterized by
wcnn as L(·; wcnn ). Further, it is designed to be continuous and differentiable at each
point, allowing it to be minimized directly using the gradient descent optimization,
which minimizes the loss by updating wcnn in the opposite direction of the gradient
of the loss function ∇wcnn L(·; wcnn ) as:
w_cnn^{(t+1)} = w_cnn^{(t)} − η ∇_{w_cnn} L(·; w_cnn^{(t)}),    (3.33)

where the learning rate η determines the size of the steps we take to reach a (local)
minimum. In other words, we follow the direction of the slope of the surface created
by the objective function L(·; wcnn ) downhill until we reach a valley.
There exist three types of gradient descent approaches: batch gradient descent, stochastic gradient descent and mini-batch gradient descent, where each variant is classified based upon the amount of data utilized to compute the gradient of the objective
function. However, it should be noted that there is a fine trade-off between the accu-
racy of the parameter updation and the total training time of the model, which is
discussed in detail next.
Batch gradient descent: Let {(x1 , y1 ), . . . , (xn , yn )} be the training dataset. Batch
gradient descent estimates the average of ∇wcnn L(·; wcnn ) over the entire training set

and uses that mean gradient to update the parameters in each iteration or epoch as:

w_cnn^{(t+1)} = w_cnn^{(t)} − η (1/n) Σ_{j=1}^{n} ∇_{w_cnn} L(y_j, ŷ_j; w_cnn^{(t)})    (3.34)

For convex loss functions, batch gradient descent is guaranteed to converge to the
global minimum. However for non-convex loss functions, there is no such guarantee.
Furthermore, since the average gradient is computed over the entire training set, it
results in a stable gradient and a stable convergence to an optimum. However in
practice, the entire training set may be too large to fit in memory at once, necessitating the use of additional memory.
Stochastic gradient descent: In contrast to batch gradient descent, stochastic gra-
dient descent (SGD) performs parameter updation for each training sample (x j , y j )
as:
w_cnn^{(t+1)} = w_cnn^{(t)} − η ∇_{w_cnn} L(y_j, ŷ_j; w_cnn^{(t)})    (3.35)

This type of parameter updation enables SGD to escape local minima if the optimizer gets trapped in one; thus, it is able to arrive at a better minimum over time. Due to the intrinsically high variance of the gradient computed for each sample, the noisier gradient computation is better suited to a loss surface with a large number of local minima. However, excessively frequent traversals of the loss surface may impair the optimizer's ability to maintain a decent minimum once it is discovered. In such situ-
ations, selecting an appropriate learning rate becomes critical so that the movement
can be stabilized as necessary. Compared to batch gradient descent, larger datasets
can be processed through SGD since it stores a single sample at a time in memory for
optimization. It is also computationally faster because it processes only one sample
at a time, making it suitable to perform optimization in an online fashion.
Mini-batch gradient descent: This algorithm is a hybrid of stochastic and batch
gradient descent. In each epoch, the training set is randomly partitioned into multiple
mini-batches, each of which contains a specified number of training samples (say,
m). A single mini-batch is passed through the network at a time, and the average of
∇wcnn L(·; wcnn ) over it is calculated to update the weights as:

w_cnn^{(t+1)} = w_cnn^{(t)} − η (1/m) Σ_{j=1}^{m} ∇_{w_cnn} L(y_j, ŷ_j; w_cnn^{(t)})    (3.36)

This method establishes a trade-off between batch gradient descent and SGD and
hence has the advantages of both. As compared to SGD, it reduces the variance of
parameter updates, potentially resulting in steadier convergence. Furthermore, by
choosing an appropriate mini-batch size, training data can easily be fit into memory.
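
The update of Eq. (3.36) is illustrated below on a toy least-squares problem rather than a CNN (the data, learning rate, batch size and epoch count are all assumed for the example); the structure of the loop — shuffle, split into mini-batches, average the gradient over each batch, update — is the same in the deep learning setting.

import numpy as np

rng = np.random.default_rng(0)

# toy linear regression: y = X w_true + noise, squared-error loss
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(5)                 # parameters to learn
eta, m, epochs = 0.1, 32, 20    # learning rate, mini-batch size, epochs (assumed)

for _ in range(epochs):
    idx = rng.permutation(len(X))                 # random partition into mini-batches
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        Xb, yb = X[batch], y[batch]
        grad = Xb.T @ (Xb @ w - yb) / len(batch)  # average gradient over the mini-batch
        w = w - eta * grad                        # update of Eq. (3.36)

print(np.round(w, 2))           # close to w_true
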
Considering various parameter optimization methods as stated in Eqs. (3.34–
3.36), the immediate question is how the gradient of the loss computed at the output
of a network can influence the intermediate layer parameters in order to update them.

The method of backpropagation executes this task as follows. Considering only the
weight parameters wl in an intermediate layer-l, from Eq. (3.35), we can write

wl(t+1) = wl(t) − η∇wl L(y j , ŷ j ; wl(t) ) . (3.37)

From Eq. (3.32), the gradient of the loss with respect to the parameters of this layer
can be computed using chain rule as:

∇_{w_l} L(·; w_l^{(t)}) = ∇_{w_L} L(·; w_L^{(t)}) ∇_{w_{L−1}} f_{L−1}(·; w_{L−1}) . . . ∇_{w_l} f_l(·; w_l),    (3.38)

that is, the loss gradient can be propagated till the desired layer, and hence this method
is called backpropagation.

3.5 Graph Convolutional Neural Network

Many important real-world datasets such as social networks, knowledge graphs,


protein-interaction networks, the World Wide Web, to name a few, come in the form
of graphs or networks. Yet, until recently, very little attention has been devoted to the
generalization of neural network models to such structured datasets. Given a graph
with n nodes (Fig. 3.11a), a graph signal f in ∈ Rn is constructed by concatenating the
node attributes. The graph convolution operation on this graph signal can be defined
as:
H̄ = h̄_0 I + h̄_1 A + h̄_2 A^2 + · · · + h̄_L A^L,    (3.39)

f out = H̄ f in , (3.40)

where H̄ is the convolution filter, {h̄ l } is the set of filter taps to be learned, I is identity
matrix, A ∈ Rn×n is the binary or weighted adjacency matrix, and f out ∈ Rn is the
output graph signal.
As mentioned in [121], representing a graph convolution filter as a polynomial
of the adjacency matrix serves two purposes. (1) Since the filter parameters (i.e.,
the scalar multipliers h̄ l ) are shared at different nodes, the filter becomes linear and
shift-invariant. (2) Different degrees of the adjacency matrix ensure involvement
of higher-order neighbors for feature computation, which apparently increases the
filter’s receptive field and induces more contextual information into the computed
feature. To reduce the number of parameters, VGG network-like architectures can be used, where instead of a single large filter, a series of small convolution filters is used along different layers. One such filter is given as:

H = h 0 I + h 1 A. (3.41)

Fig. 3.11 Graph CNN outcome. a An input graph with 14 nodes and scalar node attributes resulting in the input graph signal f_in ∈ R^14. b Convolution operation resulting in updated node attributes (three-dimensional) with the output graph signal f_out ∈ R^{14×3}. The adjacency relationship among the nodes remains unchanged

As one moves toward higher layers, this implicitly increases the receptive field of filters by involving L-hop neighbors. Therefore, a cascade of L layers of such filter banks eventually makes Eq. (3.41) work as Eq. (3.39).
To further improve the convolution operation, one may split the adjacency matrix
A into multiple slices (say, T numbers), where each slice At carries the adjacency
information of a certain set of nodes and corresponding features. For example, they
can be designed to encode the information of relative orientations of different neigh-

bors with respect to the node where the convolution is centered at. Considering this
formulation, Eq. (3.41) can be rewritten as:

H = h_0 I + h_{1,1} A_1 + h_{1,2} A_2 + . . . + h_{1,T} A_T,    (3.42)

where A = Σ_{t=1}^{T} A_t.
It may be noted that the network can take a graph of any size as the input. For
the case of D-dimensional node attributes, the input graph signal f in ∈ Rn×D . This
is exactly the same as in traditional CNN, where the convolution operation does
not impose any constraint on the image size and pixel connectivity. Like traditional
CNN, at any layer, graph convolution is performed on each channel dimension of the graph signal (f_in^{(j)} ∈ R^n, for j = 1, 2, . . . , u) coming from the preceding layer separately, and then the convolution results are added up to obtain a single output signal (f_out^{(i)} ∈ R^n) as:

f_out^{(i)} = Σ_{j=1}^{u} H^{(i,j)} f_in^{(j)}, for i = 1, 2, . . . , p,    (3.43)

where p is the number of filters at that layer, {H^{(i,j)}}_{j=1}^{u} constitute filter-i, and the output signal has p channels, i.e., f_out ∈ R^{n×p}. The output graph signal (Fig. 3.11b)
obtained after multiple convolution layers can be passed through fully connected
layers for classification or regression tasks.
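
A small sketch of the one-hop filter of Eq. (3.41) combined channel-wise as in Eq. (3.43) is given below (illustrative only; the chain graph, the random node attributes and the random filter taps are assumptions).

import numpy as np

def graph_conv(f_in, A, H_taps):
    # f_in: (n, u) input graph signal, A: (n, n) adjacency matrix,
    # H_taps: (p, u, 2) filter taps so that H(i, j) = h0*I + h1*A  (Eq. 3.41)
    n, u = f_in.shape
    p = H_taps.shape[0]
    I = np.eye(n)
    f_out = np.zeros((n, p))
    for i in range(p):
        for j in range(u):
            h0, h1 = H_taps[i, j]
            H = h0 * I + h1 * A            # one-hop polynomial filter
            f_out[:, i] += H @ f_in[:, j]  # sum over input channels (Eq. 3.43)
    return f_out

n = 6
A = np.zeros((n, n))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]   # a small chain graph (assumed example)
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

f_in = np.random.rand(n, 2)            # 2-dimensional node attributes
H_taps = np.random.rand(3, 2, 2)       # 3 output channels, 2 input channels, 2 taps each
print(graph_conv(f_in, A, H_taps).shape)   # (6, 3)
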

3.6 Variational Inference

Inference in probabilistic models is frequently intractable. To solve this issue, existing


algorithms use subroutines to sample random variables to offer approximate solutions
to the inference. Most sample-based inference algorithms are basically different
instances of Markov chain Monte Carlo (MCMC) methods where Gibbs sampling and
Metropolis–Hastings are the two most widely used MCMC approaches. Although they are guaranteed to discover a globally optimum solution given enough time, in practice, with only a finite amount of time available, it is difficult to discern how near they are to a good solution. Furthermore, choosing the right sampling approach for a particular problem is more of an art than a science.
The variational family of algorithms addresses these challenges by casting inference as an optimization problem. Let us say we have an intractable probability distribution p. Variational approaches attempt to solve an optimization problem over a class of tractable distributions Q to discover the q ∈ Q that is most similar to p, which can then be used as a representative of p. The following are
the key distinctions between sampling and variational techniques:
• Variational techniques, unlike sampling-based methods, nearly never identify the
globally optimal solution.

• We will, however, always be able to tell if they have converged, and it is possible
to put bounds on their accuracy in some circumstances.
• In reality, variational inference approaches scale better and are better suited to
approaches such as stochastic gradient optimization, parallelization over several
processors and GPU acceleration.
Suppose x is a data point which has a latent feature z. For most of the statistical
inference task, one might be interested to find out the distribution of the latent feature
given the observation. This can be obtained using Bayes’ rule as:

p(z | x; θ) = p(z; θ) p(x | z; θ) / ∫ p(x | z; θ) p(z; θ) dz,    (3.44)

where p(x | z; θ ), p(z; θ ) and p(z | x; θ ) are the likelihood, prior and the posterior
distributions with parameter θ , respectively.
Now the above computation is intractable as it involves a possibly high-dimensional integral, the marginal likelihood p(x) = ∫ p(x | z) p(z) dz; as a result, Bayes' rule is difficult to apply in general. As mentioned before, the goal of variational inference is to recast this integration problem as one of optimization, which involves taking derivatives rather than integrating, because the former is easier and generally faster.
The primary objective of variational inference is to obtain an approximate distri-
bution of the posterior p(z | x). But instead of sampling, it tries to find a distribution
q̂(z | x; φ) from a parametric family of distributions Q such that it can best approx-
imate the posterior as:

q̂(z | x; φ) = arg min_{q(z|x;φ)∈Q} KL[q(z | x; φ) || p(z | x; θ)]    (3.45)

where KL[·||·] stands for Kullback–Leibler divergence and is defined as:



KL[q(z | x; φ) || p(z | x; θ)] = ∫ q(z | x; φ) log ( q(z | x; φ) / p(z | x; θ) ) dz.    (3.46)

Now the objective function in Eq. (3.45) cannot be solved directly since the computation still requires the marginal likelihood. It can be observed that:

KL[q(z | x) || p(z | x)] = E_{q(z|x)}[log (q(z | x) / p(z | x))]
                         = E_{q(z|x)}[log q(z | x)] − E_{q(z|x)}[log p(z | x)]
                         = E_{q(z|x)}[log q(z | x)] − E_{q(z|x)}[log (p(x | z) p(z) / p(x))]
                         = E_{q(z|x)}[log q(z | x)] − E_{q(z|x)}[log p(z)] + E_{q(z|x)}[log p(x)] − E_{q(z|x)}[log p(x | z)]
                         = KL[q(z | x) || p(z)] − E_{q(z|x)}[log p(x | z)] + log p(x)    (3.47)

In the above equation, the term log p(x) can be ignored as it does not depend upon the optimizer q(z | x). As a result, minimizing KL[q(z | x) || p(z | x)] is equivalent to maximizing E_{q(z|x)}[log p(x | z)] − KL[q(z | x) || p(z)], which is called the evidence lower bound (ELBO) since it acts as a lower bound on the evidence p(x), as shown in Eq. (3.48). This is possible because the KL divergence is always a non-negative quantity.

log p(x) ≥ E_{q(z|x)}[log p(x | z)] − KL[q(z | x) || p(z)]    (3.48)
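
The bound in Eq. (3.48) can be evaluated for a toy conjugate model as follows (all modeling choices here — a standard Gaussian prior, a unit-variance Gaussian likelihood and a Gaussian q — are assumptions made only for illustration). The first call uses the exact posterior, so its ELBO should be larger (equal to log p(x) up to Monte Carlo error) than that of the mismatched q.

import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative assumptions):
#   prior p(z) = N(0, 1), likelihood p(x | z) = N(z, 1),
#   variational posterior q(z | x) = N(mu, sigma^2).
def elbo(x, mu, sigma, num_samples=10000):
    z = rng.normal(mu, sigma, size=num_samples)              # samples from q(z | x)
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2  # log p(x | z)
    expected_log_lik = log_lik.mean()                        # E_q[log p(x | z)]
    # closed-form KL between the Gaussian q(z | x) and the N(0, 1) prior
    kl = 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
    return expected_log_lik - kl                             # the bound in Eq. (3.48)

x = 1.5
# For this conjugate toy model the exact posterior is N(x/2, 1/2).
print(elbo(x, mu=x / 2, sigma=np.sqrt(0.5)))   # tight bound
print(elbo(x, mu=0.0, sigma=1.0))              # looser bound for a mismatched q
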

3.7 Few-shot Learning

Humans can distinguish new object classes based on a small number of examples.
The majority of machine learning algorithms, on the other hand, require thousands
of samples to reach comparable performance. In this context, few-shot learning tech-
niques have been designed that aim to classify a novel class using a small number
of training samples from that class. One-shot learning is the extreme case where
each class has only one training example. In several applications, few-shot learning
is extremely useful where training examples are scarce (e.g., rare disease cases) or
when the expense of labeling data is too high.
Let us consider a classification task of N categories where the training dataset has
a small number (say, K ) of samples per class. Any classifier trained using standard
approaches will be highly parameterized. Hence, it will generalize poorly and will
not be able to distinguish test samples from the N classes. As the training data is
insufficient to constrain the problem, one possible solution is to gain experience from
other similar problems having large training datasets. To this end, few-shot learning
is characterized as a meta-learning problem in an N -way-K -shot framework. In the
classical learning framework, we learn how to classify from large training data, and
we evaluate the results using test data. In the meta-learning framework, we learn
how to learn to classify using a very small set of training samples. Here several tasks
are constructed to mimic the few-shot scenario. So for N -way-K -shot classification,
each task includes batches of N classes with K training examples of each, randomly
chosen from a larger dataset different from the one that we finally want to perform
classification on. These batches are known as the support set for the task and are
used for learning how to solve that task. In addition, there are further examples of
the same N classes that constitute a query set, and they are used to evaluate the
performance on this task. Each task can be completely non-overlapping in terms of
classes involved; we may never see the classes from one task in any of the other
tasks. The idea is that the network repeatedly sees instances (tasks) during training
in a manner that matches the structure of the final few-shot task, i.e., few training
samples per class, but involves different classes.
At each step of meta-learning, the model parameters are updated based on a
randomly selected training task. The loss function is determined by the classification
performance on the query set of this training task, based on the knowledge gained from its support set [117]. Let {D_1^s, D_2^s, . . . , D_N^s} and {D_1^q, D_2^q, . . . , D_N^q} denote the support set and the query set, respectively, for a task. Here, D_i^s = {(x_j^s, y_j^s)}_{j=1}^{K} and D_i^q = {(x_j^q, y_j^q)}_{j=1}^{K′} contain examples from class-i, where x^s (or x^q) and y^s ∈ {1, 2, . . . , N} (or y^q) denote a sample and the corresponding ground-truth label.
Let f θ (·) be the embedding function to be learned, parameterized by θ , that obtains
feature representation of samples. The mean representation of class-i is obtained as:

c_i = (1/K) Σ_{(x_j^s, y_j^s) ∈ D_i^s} f_θ(x_j^s).    (3.49)

Given a query sample x q , a probability distribution over the classes is computed as:

p_θ(y = i | x^q) = exp(−d(f_θ(x^q), c_i)) / Σ_{i′=1}^{N} exp(−d(f_θ(x^q), c_{i′})),    (3.50)

where d(·) is an appropriate distance function, e.g., cosine distance or Euclidean dis-
tance. The model is learned by minimizing the negative log-probability considering
each query sample’s true class label, and the corresponding loss is given as:

L = − (1/(N K′)) Σ_{i=1}^{N} Σ_{(x_j^q, y_j^q) ∈ D_i^q} log p_θ(y = y_j^q | x_j^q)    (3.51)

Since the network is presented with a different task at each time step, it must learn
how to discriminate data classes in general, rather than a particular subset of classes.
To evaluate the few-shot learning performance, a set of test tasks with both support
and query sets are constructed, which contain only unseen classes that were not in
any of the training tasks. For each test task, we can measure its performance on the query set {D̃_1^q, D̃_2^q, . . . , D̃_N^q} based on the knowledge provided by the corresponding support set {D̃_1^s, D̃_2^s, . . . , D̃_N^s}. Similar to Eq. (3.49), the mean representation for
a test support class-i is computed as:

c̃_i = (1/K) Σ_{(x̃_j^s, ỹ_j^s) ∈ D̃_i^s} f_θ(x̃_j^s).    (3.52)

Given a query sample x̃ q , its class label is predicted as:

ỹ = arg min_{i∈{1,2,...,N}} d(f_θ(x̃^q), c̃_i).    (3.53)
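
Equations (3.49)–(3.53) amount to a nearest-prototype classifier in the embedding space. The sketch below illustrates this with pre-computed toy embeddings standing in for f_θ outputs (the 2-D embeddings, class layout and Euclidean distance are assumptions for the example; in practice the embeddings come from the learned network).

import numpy as np

def prototypes(support_feats):
    # support_feats: dict class -> (K, d) array of embedded support samples; Eq. (3.49)
    return {c: feats.mean(axis=0) for c, feats in support_feats.items()}

def class_probabilities(query_feat, protos):
    # Eq. (3.50): softmax over negative (Euclidean) distances to the prototypes
    classes = list(protos.keys())
    d = np.array([np.linalg.norm(query_feat - protos[c]) for c in classes])
    e = np.exp(-d - np.max(-d))
    return dict(zip(classes, e / e.sum()))

def classify(query_feat, protos):
    # Eq. (3.53): assign the query to the class of the nearest prototype
    classes = list(protos.keys())
    dists = [np.linalg.norm(query_feat - protos[c]) for c in classes]
    return classes[int(np.argmin(dists))]

rng = np.random.default_rng(0)
# a toy 3-way-5-shot episode with 2-D embeddings
support = {c: rng.normal(loc=c, scale=0.3, size=(5, 2)) for c in range(3)}
protos = prototypes(support)
query = rng.normal(loc=1, scale=0.3, size=2)   # drawn near class 1
print(classify(query, protos), class_probabilities(query, protos))
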

Till now, we have discussed class-based meta-learning. However, we need the learn-
ing to be class agnostic for tasks such as co-segmentation that do not involve any
semantic information. In Chap. 9, we discuss a class agnostic meta-learning scheme
for solving the co-segmentation problem.
Chapter 4
Maximum Common Subgraph Matching

4.1 Introduction

Foreground segmentation from a single image without supervision is a difficult prob-


lem. One would not know what constitutes the object of interest. If an additional
image containing a similar foreground is provided, both images can possibly be seg-
mented simultaneously with a higher accuracy using co-segmentation. We study this aspect, i.e., whether, given multiple images without any other additional information, it is possible to fuse useful information for the purpose of segmentation. Thus, given a set of images (say, crowd-sourced images), the objects of common interest in those images are to be jointly segmented as co-segmented objects [103, 105, 106] (see Fig. 1.3).

4.1.1 Problem Formulation

In this chapter, we demonstrate the commonly understood graph matching algorithm


for foreground co-segmentation. We set up the problem as a maximum common
subgraph (MCS) computation problem. We find a solution to MCS of two region
adjacency graphs (RAG) obtained from an image pair and then perform region co-
growing to obtain the complete co-segmented objects.
In a standard MCS problem, typically, certain labels are assigned as the node
attributes. Thus, given a pair of graphs, the inter-graph nodes can be matched exactly
as discussed in Sect. 3.3. But in natural images, we expect some variations in
attributes, i.e., features of similar objects or regions (e.g., color, texture, size). So
in the discussed approach, for an inter-graph node pair to match, the attributes need
not be exactly equal. They are considered to match if the attribute difference is within
a certain threshold (Eq. 3.14). The key aspects of this chapter are as follows.
• The MCS-based matching algorithm allows co-segmentation of multiple common
objects.

• Region co-growing helps to detect common objects of different sizes.


• An efficient use of the MCS algorithm followed by region co-growing can co-
segment high-resolution images without increasing computations.
We describe the co-segmentation algorithm initially for two images in Sects. 4.2
and 4.3. Then we show its extension to multiple images in Sect. 4.5. Comparative
results are provided in Sect. 4.4.

4.2 Co-segmentation for Two Images

In the co-segmentation task for two images, we are interested in finding the objects of
interest that are present in both the images and have similar features. The flow of the
co-segmentation algorithm is shown in Fig. 4.1, which is detailed in the subsequent
sections. First each image (Fig. 4.2a, b) is segmented into superpixels using SLIC
method [1]. Then a graph is obtained by representing the superpixels of an image as
nodes, and every node pair corresponding to a spatially adjacent superpixel pair is
connected by an edge. Superpixel segmentation describes the image at a coarse-level
through a limited number (n S ) of nodes of the graph. An increase in n S increases the
computation during graph matching drastically. So, it is efficient to use superpixels
as nodes instead of pixels as this cuts down the number of nodes significantly. As an
image is a group of connected components (i.e., objects, background), and each such
component is constituted by a set of contiguous superpixels, this region adjacency
graph (RAG) representation of images is favorable in the co-segmentation method
for obtaining the common objects.

4.2.1 Image as Attributed Region Adjacency Graph

Let an image pair I1, I2 be represented using two RAGs G1 = (V1, E1) and G2 = (V2,
E2 ), respectively. Here, Vi = {vki } and Ei = {ekli } for i = 1, 2 denote the set of nodes
and edges, respectively. Any appropriate feature can be used as the node attribute. The
experiments performed in this chapter consider two features for each node: (i) CIE
Lab mean color and (ii) rotation invariant histogram of oriented gradient (HoG) of
the pixels within the corresponding superpixel. HoG features are useful to capture the
image texture, and they help to distinguish superpixels that may have similar mean
color in spite of being completely different in color. Further, rotation invariant HoG
features can match similar objects with different orientation. If an image is rotated,
the gradient direction at every pixel is also changed by the same angle. If h denotes
the histogram of directions of gradients computed at all pixels within a superpixel,
the values in the vector h will be shifted as a function of the rotation angle. In order
to achieve rotation invariance, the values in the computed HoG (h) are circularly
shifted with respect to the index of the maximum value in it. To incorporate both

Fig. 4.1 Co-segmentation algorithm using a block diagram. Input image pair (I1 , I2 ) is represented as region adjacency graphs (RAGs) G1 and G2 . The maximum
common subgraph (MCS) of the RAGs yields the node sets V1H and V2H , which form the initial matched regions in I1 and I2 , respectively. These are iteratively
(index-(t)) co-grown using inter-image feature similarity between the nodes in them to obtain the final matched regions V1H ∗ and V2H ∗ . Figure courtesy: [48]

features in the algorithm, the feature similarity S f (·) between nodes vk1 in G1 and vl2
in G2 is computed as a weighted sum of the corresponding color and HoG feature
similarities, denoted as Sc (·) and Sh (·), respectively.
     
S_f(v_k^1, v_l^2) = 0.5 S_c(v_k^1, v_l^2) + 0.5 S_h(v_k^1, v_l^2).    (4.1)
 
Here, the similarity S_h(v_k^1, v_l^2) is computed as the additive inverse of the distance d_kl (e.g., Euclidean distance measure) between the corresponding HoG features h(v_k^1) and h(v_l^2). Prior to computing S_h(·), each d_kl is normalized with respect to the maximum pairwise distance over all node pairs as:

d_kl = d_kl / max_{k′,l′ : v_{k′}^1 ∈ G_1, v_{l′}^2 ∈ G_2} d_{k′l′}.    (4.2)

The same strategy is adopted for computing the color similarity measure Sc (·). Next,
we proceed to obtain the MCS between the two RAGs to obtain the common objects
as explained next.
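
The node attributes and the similarity of Eqs. (4.1)–(4.2) can be sketched as follows (illustrative only; the toy feature dimensions are assumptions, and the additive inverse of the normalized distance is interpreted as 1 − d so that similarities lie in [0, 1], consistent with the threshold range used later).

import numpy as np

def rotation_invariant_hog(h):
    # circularly shift the orientation histogram so that its peak is at index 0
    return np.roll(h, -int(np.argmax(h)))

def similarity_matrix(feat1, feat2):
    # pairwise Euclidean distances between node features of the two graphs,
    # normalized by the maximum distance (Eq. 4.2), turned into similarities
    d = np.linalg.norm(feat1[:, None, :] - feat2[None, :, :], axis=2)
    d = d / d.max()
    return 1.0 - d

def node_similarity(color1, color2, hog1, hog2):
    # Eq. (4.1): equally weighted color and (rotation-invariant) HoG similarities
    Sc = similarity_matrix(color1, color2)
    Sh = similarity_matrix(np.array([rotation_invariant_hog(h) for h in hog1]),
                           np.array([rotation_invariant_hog(h) for h in hog2]))
    return 0.5 * Sc + 0.5 * Sh

# toy features for 4 and 5 superpixels (values are assumed placeholders)
rng = np.random.default_rng(0)
color1, color2 = rng.random((4, 3)), rng.random((5, 3))
hog1, hog2 = rng.random((4, 9)), rng.random((5, 9))
S_f = node_similarity(color1, color2, hog1, hog2)
print(S_f.shape)   # (4, 5): similarity of every node pair across the two images
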

4.2.2 Maximum Common Subgraph Computation

To solve the co-segmentation problem of an image pair, we first need to find super-
pixel correspondences from one image to the other. Then the superpixels within
the objects of similar features across images can be matched to obtain the com-
mon objects. However, since any prior information about the objects is not used in
this unsupervised method, the matching becomes exhaustive. Colannino et al. [28]
showed that the computational complexity of such matching is O((|G_1| + |G_2|)^3)
assuming a minimum cost many-to-many matching algorithm. Further, the resulting
matched regions may contain many disconnected segments. Each of these segments
may be a group of superpixels or even a single superpixel, and such matching may not
be meaningful. To obtain a meaningful match, wherein the connectivity among the
superpixels in the matched regions is maintained, we describe a graph-based approach
to jointly segment the complete objects from an image pair. In this framework, the
objective is to obtain the maximum common subgraph (MCS) that represents the
co-segmented objects. The MCS corresponds to the common subgraphs H1 in G1
and H2 in G2 (see Fig. 4.3 for illustration). However, H1 and H2 may not be identical
as (i) G1 and G2 are region adjacency graphs with feature vectors as node attributes,
and (ii) the common object regions in both the images need not undergo identical
superpixel segmentation. Hence, unlike in a standard MCS finding algorithm, many-
to-one matching must be permitted here. Complications arising from many-to-one
node matching can be reduced by restricting the number of nodes in one image that
can match to a node in the other image, and that number (say, τ ) can be chosen based
on the inter-image (superpixel) feature similarities in Eq. (4.1). Following the work


Fig. 4.2 Co-segmentation stages using an image pair. a, b Input images and their SLIC segmenta-
tion. c, d The matched nodes i.e., superpixels across images (shown in same color) obtained through
MCS computation and the corresponding e, f object regions in the images. g, h Co-segmented
objects obtained after performing region co-growing on the initially matched regions in (e, f).
Figure courtesy: [48]

of Madry [80], it is possible to show that the computation complexity reduces to O((τ(|G_1| + |G_2|))^{10/7}) when the matching is restricted to a maximum of τ nodes only.
We begin the MCS computation by building two product graphs W12 and W21
from the RAGs G1 and G2 based on the similarity values S f (·) following a similar
strategy described in Sect. 3.3. Here, we only describe the steps to compute W12 and
the subsequent steps in MCS computation, considering G1 as the reference graph
which is being matched to G2 . The steps involving W21 (i.e., the reference graph G2
being matched to G1 ) are identical. To find the product nodes of W12 , a threshold tG
(0 ≤ tG ≤ 1) is selected for node matching, as node features do not need to match
exactly for natural images. Further, to enforce the constraint τ , we need to find the
τ largest similar nodes in G_2 for every node v_k^1 ∈ V_1. Let V_2^{(k)} be the ordered list of nodes {v_l^2} in V_2 such that {S_f(v_k^1, v_l^2)}_{∀l} are in descending order of magnitude. The product node set U_{12}^W of the product graph W_{12} is obtained as:

U_{12}^W = ⋃_{∀k} { (v_k^1, u_l ∈ V_2^{(k)}) | S_f(v_k^1, u_l) > t_G }_{l=1,2,...,τ}    (4.3)

Similarly, we can compute U_{21}^W by keeping V_2 as the reference. It is evident from Eq. (4.3) that restricting one node in one graph to match to at most τ nodes in the other graph leads to U_{12}^W ≠ U_{21}^W, resulting in H_1 ≠ H_2 (i.e., not commutative) as noted earlier.
Conversely, the two product graphs W12 and W21 would be identical in the absence
of τ . In Sect. 3.3, the parameter τ was not considered, and hence, a single product
graph W was considered. Let us analyze the effect of the two parameters tG and τ .
A large value of t_G and a small value of τ
• restrict the matching to only a few candidate superpixels, while still allowing a certain amount of inter-image variation in the common objects,
• ensure a fast computation during subgraph matching and
• reduce the product graph size as well as the possibility of spurious matching.
For example, the size of the product graph for many-to-many matching is O (|G1 ||G2 |).
The choice of τ in the matching process reduces the size to O (τ (|G1 | + |G2 |)), while
the additional use of the threshold t_G makes it O(ζτ(|G_1| + |G_2|)) with 0 < ζ ≪ 1.
This reduces the computation drastically. We will show in Sect. 4.2.3 that the soft
matches can be recovered during the region co-growing phase.
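
The construction of the product node set in Eq. (4.3) with the two parameters t_G and τ can be sketched as follows (an illustrative example; the similarity matrix is random, and U_{21}^W would be obtained symmetrically by transposing it).

import numpy as np

def product_nodes(S_f, tau, t_G):
    # S_f: (n1, n2) inter-image node similarities.  For every node k of G1,
    # pair it with its tau most similar nodes in G2 whose similarity
    # exceeds t_G, as in Eq. (4.3).
    U12 = []
    for k in range(S_f.shape[0]):
        order = np.argsort(-S_f[k])            # nodes of G2, most similar first
        for l in order[:tau]:
            if S_f[k, l] > t_G:
                U12.append((k, int(l)))
    return U12

rng = np.random.default_rng(0)
S_f = rng.random((6, 7))                       # toy similarity matrix
print(product_nodes(S_f, tau=2, t_G=0.8))      # candidate product nodes of W12
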
Then using the method described in Sect. 3.3, the MVC of the complement graph W_{12}^C is computed. The set of product nodes (U_{12}^M ⊆ U_{12}^W) other than this MVC represents the left matched product nodes that form the maximal clique of W_{12}. Similarly, we obtain the right matched product nodes U_{21}^M ⊆ U_{21}^W from W_{21}. Let U^M := U_{12}^M ∪ U_{21}^M. The sets of nodes V_1^H ⊆ V_1 and V_2^H ⊆ V_2 (see Fig. 4.2c, d) in the corresponding common subgraphs H_1 and H_2 are obtained from U^M using Eq. (3.15) and Eq. (3.16), respectively, and they correspond to the matched regions in I_1 and I_2, respectively (see Fig. 4.2e, f).

Fig. 4.3 Example of maximum common subgraph of two graphs G1 and G2 . The set of nodes
V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 } and edges in the maximum common subgraphs H1
and H2 of G1 and G2 , respectively, are highlighted (in blue). Figure courtesy: [48]

 
Fig. 4.4 Requirement of condition C2 of edge assignment between product graph nodes (v_1^1, v_1^2) and (v_3^1, v_3^2) obtained using Eq. (4.3). Here, condition C1 is not satisfied; however, condition C2 is satisfied, and an edge is added. It is easy to derive that the nodes in the MCS are (v_1^1, v_1^2) and (v_3^1, v_3^2). This shows that multiple disconnected but similar objects can be co-segmented. Figure courtesy: [48]

In Sect. 3.3, we discussed two conditions for connecting product nodes with edges,
and we analyze them here. As seen in Figs. 3.3–3.8, both conditions are necessary
for computing the maximal clique correctly. However, if multiple common objects
are present in the image pair, and they are not connected to each other, condition C1
alone cannot co-segment both. We illustrate using an example that condition C2 helps
to achieve this. In Fig. 4.4, let the disconnected nodes v11 and v31 in G1 be similar to
the disconnected nodes v12 and v32 in G2 , respectively. Here, the use of condition C1
alone will co-segment either (i) v11 and v12 , or (ii) v31 and v32 , but not both. But using
both conditions, we will be able to co-segment both (i) v11 and v12 , and (ii) v31 and v32 ,
which is the correct result.

4.2.3 Region Co-growing

In the MCS algorithm of Sect. 4.2.2, we used certain constraints on the choice
of similarity threshold tG and the maximal many-to-one matching parameter τ so
that the product graph size remains small and the subsequent computations reduce.
However, the subgraphs H1 and H2 obtained at the MCS output do not cover the
complete objects; i.e., the resulting V1H and V2H may not contain all the superpixels
constituting the common objects. So, we need to grow these matched regions to obtain

the complete co-segmented objects. In this section, we discuss an iterative method


that performs region growing in both images simultaneously based on neighborhood
feature similarities across the image pair.
Given V1H and V2H which constitute the common objects partially in the respective
images, for a superpixel pair in them that has matched, it is expected to find matching
of superpixels in their neighborhoods. Thus, we can perform region co-growing on
V1H and V2H using them as seeds to obtain the complete objects as:

(H_1^∗, H_2^∗) = F_RCG(H_1, H_2, I_1, I_2),    (4.4)

where F_RCG denotes the region co-growing function, and H_1^∗, H_2^∗ denote the subgraphs representing the complete objects. Further, F_RCG has the following benefits.
• Even if an image pair contains common objects of different size (and number of
superpixels), they are completely detected after region co-growing.
• Obtaining an MCS with a small product graph followed by region co-growing is
computationally less intensive than solving for MCS with a large product graph.
Any region growing method typically considers the neighborhood of the seed
region, and appends the superpixels (or pixels) from that neighborhood which is
similar (in some metric) to the seed. However, here instead of growing V1H and V2H
independently, it is more appropriate to grow them jointly because the co-segmented
objects must have commonality. Let N_{V_i^H} denote the set of neighbors of V_i^H, with

N_{V_i^H} = ⋃_{v∈V_i^H} {u ∈ N(v)}, for i = 1, 2,    (4.5)

where N(·) denotes the first-order neighborhood. To co-grow V_1^H and V_2^H, the set of nodes N_{s_i} ⊆ N_{V_i^H} having high inter-image feature similarity to the nodes in V_j^H is obtained, and V_i^H grows as:

V_i^{H,(t+1)} ← V_i^{H,(t)} ∪ N_{s_i}^{(t)}, for i = 1, 2.    (4.6)

To completely grow H_1, H_2 into H_1^∗, H_2^∗, this process is iterated until convergence. These iterations implicitly consider higher-order neighborhoods of V_1^H and V_2^H, which is necessary for their growth. In every iteration-t, V_i^{H,(t)} denotes the already matched regions (nodes), and N_{V_i^H}^{(t)} denotes the nodes in their first-order neighborhood. The region co-growing algorithm converges when

V_1^{H,(t)} = V_1^{H,(t−1)} and V_2^{H,(t)} = V_2^{H,(t−1)}.    (4.7)

After convergence, we denote V_1^{H∗} and V_2^{H∗} as the node sets (see Fig. 4.1) which constitute H_1^∗, H_2^∗ representing the common objects completely in I_1 and I_2, respectively (also see Fig. 4.2g, h).

Algorithm 1 Pairwise image co-segmentation algorithm
Input: An image pair I_1, I_2
Output: Common objects F_1, F_2 present in the image pair
1: Build RAGs G_1 = (V_1, E_1), G_2 = (V_2, E_2) from I_1, I_2
2: // MCS computation
3: Compute product graphs W_{12}, W_{21} using Eq. (4.3)
4: Compute minimum vertex covers of W_{12}^C, W_{21}^C and their complements U_{12}^M, U_{21}^M
5: U^M := U_{12}^M ∪ U_{21}^M
6: Find maximum common subgraphs H_1 ⊆ G_1, H_2 ⊆ G_2 and corresponding node sets V_1^H, V_2^H from U^M
7: // Region co-growing
8: t ← 1, V_1^{H,(t)} ← V_1^H, V_2^{H,(t)} ← V_2^H
9: while no convergence do
10:   N_{s_1}^{(t)} := ⋃_{v_l^2 ∈ V_2^{H,(t)}} { v_k^1 ∈ N_{V_1^{H,(t)}} | S_f(v_k^1, v_l^2) > t_G }
11:   N_{s_2}^{(t)} := ⋃_{v_k^1 ∈ V_1^{H,(t)}} { v_l^2 ∈ N_{V_2^{H,(t)}} | S_f(v_k^1, v_l^2) > t_G }
12:   Region growing in G_1: V_1^{H,(t+1)} ← V_1^{H,(t)} ∪ N_{s_1}^{(t)}
13:   Region growing in G_2: V_2^{H,(t+1)} ← V_2^{H,(t)} ∪ N_{s_2}^{(t)}
14:   t ← t + 1
15: end while
16: Obtain F_1, F_2 from V_1^{H∗}, V_2^{H∗}

The example in Fig. 4.5a–f shows that region co-growing helps to completely detect common objects of different sizes. The larger object has been only partially detected by the MCS step (Fig. 4.5e), and it is fully recovered after region co-growing (Fig. 4.5f). The co-segmentation algorithm is given as pseudocode in Algorithm 1.
Similarity metric. As discussed earlier, the neighborhood subset Ns2 is obtained
by analyzing the feature similarity between the node sets {vk1 ∈ V1H } and {vl2 ∈ NV2H }.
However, the node level similarity S f (vk1 , vl2 ) of Eq. (4.1) alone may not be enough.
Hence to further enhance co-growing, an additional neighborhood level similarity
is required which can be defined as the average feature similarity between their
neighbors (Nvk1 and Nvl2 ) that are already in the set of matched regions, i.e., in V1H
and V_2^H, respectively. A weighted feature similarity is computed as:

S_f(v_k^1, v_l^2) = ω_N S_f(v_k^1, v_l^2) + (1 − ω_N) S_f^N(v_k^1, v_l^2),    (4.8)

where ωN is an appropriately chosen weight, and S N f (·) is the neighborhood simi-


larity. Thus, the similarity measure for region co-growing has an additional measure
of neighborhood similarity compared to the measure used for graph matching in
Sect. 4.2.2. We illustrate this using example graphs next. Figure 4.6a shows two
graphs and their MCS output

V1H = {v11 , v21 , v31 , v51 },


V2H = {v12 , v22 , v32 , v52 },


Fig. 4.5 Effectiveness of region co-growing in co-segmenting objects of different size. a, d Input
images. b, e Similar object regions obtained using the MCS algorithm, where the larger object
(of image d) is not completely detected. c, f Complete co-segmented objects are obtained after
co-growing. Figure courtesy: [48]

with the correspondences v11 ↔ v12 , v21 ↔ v22 , v31 ↔ v32 , v51 ↔ v52 . While growing
V2H , we need to analyze the similarity between nodes in V1H and NV2H . For the
pair of a matched node v11 ∈ V1H and an unmatched neighboring node v42 ∈ NV2H ,
the weighted measure S f (v11 , v42 ) is computed considering their feature similarity
S f (v11 , v42 ) and the feature similarity between the respective (matched) neighboring
node pairs (v31 ∈ V1H ∩ Nv11 , v32 ∈ V2H ∩ Nv42 ) and (v51 ∈ V1H ∩ Nv11 , v52 ∈ V2H ∩ Nv42 ).
The neighboring nodes v21 ∈ V1H and v12 ∈ V2H are ignored since they have not been
matched to each other. If S f (v11 , v42 ) computed using Eq. (4.8) exceeds a certain
threshold tG , v42 is assigned to the set Ns2 , which is used in Eq. (4.6). Similarly while
growing V1H , the weighted feature similarity between v41 ∈ NV1H and the nodes in
V2H is computed (see Fig. 4.6c).
Now we discuss the formulation of S N f (·). If a node in Gi has few already matched
neighbors (i.e., neighbors from ViH ), it is less likely to be part of the foreground in Ii .
So, less importance should be given to it even if it has relatively high inter-image
feature similarities with the nodes within the object in I j . In Fig. 4.6a, the unmatched
node v42 ∈ NV2H has three matched neighboring nodes v12 , v32 and v52 , whereas in
Fig. 4.6c, the unmatched node v41 ∈ NV1H has one matched neighboring node v11 . The
neighborhood similarity measure S_f^N(v_k^1, v_l^2) is computed as:

S_f^N(v_k^1, v_l^2) = 1 − (Z)^{n_M}, where    (4.9)
Fig. 4.6 Region co-growing. a The set of nodes V_1^H and V_2^H at the MCS outputs of the graphs G_1 and G_2, where v_1^1, v_2^1, v_3^1, v_5^1 match v_1^2, v_2^2, v_3^2, v_5^2, respectively. The nodes in the MCSs are V_1^{H,(t)} and V_2^{H,(t)} (blue) at t = 1. To grow V_2^{H,(t)}, we compare feature similarities of each node, e.g., v_4^2 (red), in the neighborhood of V_2^{H,(t)} to all the nodes in V_1^{H,(t)}. b V_2^{H,(t+1)} (green) has been obtained by growing V_2^{H,(t)}, where v_4^2 has been included in the set due to its high feature similarity with v_1^1 and their neighbors. c To grow V_1^{H,(t)}, we compare feature similarities of each node, e.g., v_4^1 (red), in the neighborhood of V_1^{H,(t)} to all the nodes in V_2^{H,(t)}. d The set of matched nodes (purple) after iteration-1 of region growing, assuming no match has been found for v_4^1. Figure courtesy: [48]

n_M = Σ_{u_1∈V_1′} Σ_{u_2∈V_2′} 1(u_1, u_2), with V_1′ = N_{v_k^1} ∩ V_1^H, V_2′ = N_{v_l^2} ∩ V_2^H, and

Z = (1/n_M) Σ_{u_1∈V_1′} Σ_{u_2∈V_2′} (1 − S_f(u_1, u_2)) 1(u_1, u_2).    (4.10)

Here, Z denotes the average distance between the already matched pairs belonging
to the neighborhood of vk1 and vl2 . The indicator function 1 (u 1 , u 2 ) = 1 if the MCS
matching algorithm yields a match between nodes u 1 and u 2 , and 1 (u 1 , u 2 ) = 0
otherwise. It can be observed from Eq. (4.9) that S N f (·) increases as the number of
neighbors that have already been matched increases, as desired.
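
A sketch of Eqs. (4.8)–(4.10) is given below (illustrative only; returning zero when a candidate has no matched neighbors and the default weight ω_N = 0.5 are assumptions — the weight is actually estimated by relevance feedback as described next).

import numpy as np

def neighborhood_similarity(S_f, matched, nbrs1_k, nbrs2_l):
    # Eqs. (4.9)-(4.10): nbrs1_k / nbrs2_l are the already-matched neighbors of
    # node v_k^1 and of the candidate node v_l^2; matched[u1, u2] is 1 if the
    # MCS step matched u1 to u2, and S_f holds node-level similarities.
    n_M, dist_sum = 0, 0.0
    for u1 in nbrs1_k:
        for u2 in nbrs2_l:
            if matched[u1, u2]:
                n_M += 1
                dist_sum += 1.0 - S_f[u1, u2]
    if n_M == 0:
        return 0.0                      # no matched neighbors: no support (assumption)
    Z = dist_sum / n_M                  # average distance over matched neighbor pairs
    return 1.0 - Z ** n_M               # Eq. (4.9)

def weighted_similarity(S_f_node, S_N, w_N=0.5):
    # Eq. (4.8): combine node-level and neighborhood-level similarity
    return w_N * S_f_node + (1.0 - w_N) * S_N

S = np.array([[0.9, 0.2], [0.3, 0.8]])
M = np.array([[1, 0], [0, 1]])
S_N = neighborhood_similarity(S, M, nbrs1_k=[0, 1], nbrs2_l=[0, 1])
print(S_N, weighted_similarity(0.7, S_N))
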
Relevance Feedback. The weight ωN in Eq. (4.8) is used to provide relevance
to the two constituent similarity measures. Instead of using heuristics, relevance
feedback [81] can be used to quantify the importance of the neighborhood information
and to find ωN . This is an iterative method that uses a set of training image pairs.
In each iteration, users assign a score to the co-segmentation output (denoted as
Fω(t)N ) based on its quality. It is then compared with the scores of the co-segmentation
outputs (denoted as F1 and F0 , respectively) obtained using each of the constituent
similarity measures individually. The score difference is used to obtain the weights.
As one of the notable applications of relevance feedback, Rocchio [81] used it to modify the query terms in a document retrieval application. Rui et al. [107] have used
relevance feedback to find appropriate weights for combining image features for
content-based image retrieval. It may be mentioned here that in case one does have
access to the ground-truth, one can use this ground-truth while using the relevance
feedback instead of manual scoring.
User feedback is used to rate FωN of every training image pair as well as F1
and F0 obtained separately using ωN = 1 and ωN = 0, respectively (Algorithm 2).
Initially equal weight is assigned, i.e., ωN = 0.5 and Fω(1) N
is computed. Then weights
are iteratively updated by comparing user ratings for F1 and F0 with Fω(t)N until
convergence, as explained next. User feedback is used to assign scores θ1,k and θ0,k
to F1 and F0 , respectively, for every image pair k in a set of Nt training image pairs.
Then in each iteration t, users assign a score θk(t) to Fω(t)N for every image pair k. The
improvements π1(t) , π0(t) in Fω(t)N over F1 and F0 are computed based on the score
difference as:
π_i^{(t)} = Σ_{k=1}^{N_t} (θ_{i,k} − θ_k^{(t)}), i = 0, 1.    (4.11)

Although the use of more levels of score improves the accuracy, it becomes inconve-
nient for the users to assign scores. As a trade-off, we use seven levels of scores: −3,
−2, −1, 0, 1, 2, 3, where −3 and 3 indicate the worst and the best co-segmentation
outputs, respectively. To find the ratio of π1(t) , π0(t) , they must be positive. As they
are computed as difference between scores, they can be positive or negative. To

Algorithm 2 Estimation of the weight ω_N used in the weighted feature similarity computation using relevance feedback
1: Input: Co-segmentation outputs F_1 and F_0 obtained using ω_N = 1 and ω_N = 0 separately for every image pair-k in a training dataset of size N_t; score ∈ {−3, −2, −1, 0, 1, 2, 3}
2: Output: Updated weight ω_N
3: Initialization: t ← 1, ω_N^{(t)} ← 0.5
4: for k = 1 to N_t do
5:   θ_{1,k} ← score assigned to F_1 obtained from image pair-k
6:   θ_{0,k} ← score assigned to F_0 obtained from image pair-k
7: end for
8: loop
9:   for k = 1 to N_t do
10:     Run co-segmentation algorithm on image pair-k with ω_N^{(t)} as weight
11:     F_{ω_N}^{(t)} ← co-segmentation output obtained
12:     θ_k^{(t)} ← score assigned to F_{ω_N}^{(t)}
13:   end for
14:   Cumulative improvement π_i^{(t)} = Σ_{k=1}^{N_t} (θ_{i,k} − θ_k^{(t)}), i = 0, 1
15:   Normalization π_i^{(t)} ← π_i^{(t)} − min(π_1^{(t)}, π_0^{(t)}) + |min(π_1^{(t)}, π_0^{(t)})|, i = 0, 1
16:   ω_N^{(t+1)} = π_1^{(t)} / (π_1^{(t)} + π_0^{(t)})
17:   if converged in ω_N^{(t)} then
18:     ω_N ← ω_N^{(t+1)}
19:     break
20:   end if
21:   t ← t + 1
22: end loop

have positive values, we scale π1(t) , π0(t) (Algorithm 2). Then these improvements are
normalized to obtain weights ωi(t+1) for the next iteration as:
 
ω_i^{(t+1)} = π_i^{(t)} / (π_1^{(t)} + π_0^{(t)}), i = 0, 1.    (4.12)

After convergence, we obtain ωN = ω1 .

4.2.4 Common Background Elimination

In the co-segmentation problem, we are interested in common foreground segmenta-


tion and not in common background segmentation. However, an image pair may con-
tain similar background regions such as the sky, field or water body. In this scenario,
the co-segmentation algorithm, as described so far, will also co-segment the back-
ground regions since it is designed to capture the commonality across images. Thus,
such background superpixels should be ignored while building the product graphs and

during region co-growing. Further, discarding the background nodes will reduce the
product graph size and subsequent computations. In the absence of any prior informa-
tion, Zhu et al. [151] proposed a method to estimate the backgroundness probability of
superpixels using an unsupervised framework. This method is briefly described next.
Typically, we capture images keeping the objects of interest at the center of the
image. This is called the center bias. Hence, most superpixels at the image bound-
ary (B ) are more likely to be part of the background. Additionally, several non-
boundary superpixels also belong to the background, and we expect them to be
highly similar to B . Thus, a boundary connectivity measure of each superpixel v is
defined as:
C_B(v) = Σ_{v′∈B} S_f(v, v′) / Σ_{v′∈I} S_f(v, v′),    (4.13)

where S f (·) can be the feature similarity measure of Eq. (4.1) or any other appropriate
measure that may also incorporate spatial coordinates of the superpixels. Then, the
probability that a superpixel v belongs to the background is computed as:

P_B(v) = 1 − exp( −(1/2) (C_B(v))^2 ).    (4.14)

To identify the possible background superpixels, we can compute this probability for all superpixels in the respective images I_1 and I_2, and the superpixels with P_B(v_i) > t_B can be marked as background. Here, t_B is a threshold.
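
Equations (4.13)–(4.14) can be sketched as follows (illustrative only; the toy symmetric similarity matrix and boundary labels are assumptions), with the threshold choice of Sect. 4.4 used to flag the likely background superpixels.

import numpy as np

def background_probability(S_f, boundary_mask):
    # S_f: (n, n) superpixel-to-superpixel similarities within one image;
    # boundary_mask: boolean array marking superpixels touching the image boundary.
    # Eq. (4.13): boundary connectivity, Eq. (4.14): background probability.
    C_B = S_f[:, boundary_mask].sum(axis=1) / S_f.sum(axis=1)
    return 1.0 - np.exp(-0.5 * C_B ** 2)

rng = np.random.default_rng(0)
n = 8
S_f = rng.random((n, n))
S_f = 0.5 * (S_f + S_f.T)                      # make the toy similarities symmetric
boundary = np.array([True, True, True, False, False, False, False, True])

P_B = background_probability(S_f, boundary)
t_B = 0.75 * P_B.max()                         # threshold used in Sect. 4.4
print(np.where(P_B > t_B)[0])                  # superpixels treated as background
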

4.3 Multiscale Image Co-segmentation

The number of superpixels in an image containing a well-textured scene increases


with the image size, which in turn makes the region adjacency graph larger. To
maintain the computational efficiency of the co-segmentation algorithm for high-
resolution images, we discuss a method using a pyramidal representation of images,
where every image is downsampled into multiple scales. First, images at every scale
are oversegmented into superpixels keeping the average superpixel size fixed, and
RAGs are computed. Naturally, images at the coarsest level (i.e., the smallest resolu-
tion) contain the least number of superpixels (nodes). Hence, the maximum common
subgraph is computed at the coarsest level where the computation of the MCS match-
ing algorithm is the least. One could next perform region co-growing at this level,
and resize that output to the input image size. However, this would introduce object
localization error. To avoid this, the matched superpixels obtained from the MCS
at the coarsest level are mapped to the upper-level superpixels through pixel coor-
dinates, and region co-growing is performed there. This process of mapping and
co-growing is successively repeated at the remaining finer levels of the pyramid to
obtain the final result.

Let us explain this process using an example. The input images I1 and I2 are
successively downsampled (by 2) R times with I1,R and I2,R being the coarsest level
image pair, and let us denote I_{1,1} = I_1 and I_{2,1} = I_2. Let V_{1,R}^H and V_{2,R}^H be the set of
matched superpixels in I1,R and I2,R obtained using the MCS matching algorithm. To
find the matched superpixels in Ii,R−1 , every superpixel in Vi,R H
is mapped to certain
superpixels in Ii,R−1 based on the coordinates of the pixels inside the superpixels.
Since Ii,R−1 is larger than Ii,R , this mapping is one-to-many. To obtain the mapping
of a superpixel v ∈ V_{i,R}^H in I_{i,R−1}, let {(x_v, y_v)} denote the coordinate set of the pixels constituting v, and {(x̃_v, ỹ_v)} denote the twice-scaled coordinates. Now a superpixel u ∈ I_{i,R−1} is marked as a mapping of v if {(x_u, y_u)} has the highest overlap with {(x̃_v, ỹ_v)} among all superpixels in V_{i,R}^H, i.e., v → u if

v = arg max_{v ∈ V_{i,R}^H} |{(x̃_v, ỹ_v)} ∩ {(x_u, y_u)}|.    (4.15)

Then region co-growing can be performed on the mapped superpixels in I_{1,R−1} and I_{2,R−1}, as discussed in Sect. 4.2.3, to obtain the matched superpixel sets V_{1,R−1}^H in I_{1,R−1} and V_{2,R−1}^H in I_{2,R−1}. This process is to be repeated for subsequent levels to obtain the final matched superpixel sets V_{1,1}^H and V_{2,1}^H that constitute the co-segmented objects in I_{1,1} and I_{2,1}, respectively.

4.4 Experimental Results

In this section, we analyze the performance of several image pair co-segmentation


algorithms including the method described in this chapter (denoted as PR) by
performing experiments on images selected from five datasets: the image pair
dataset [69], the MSRC dataset [105], the iCoseg dataset [8], the flower dataset [92]
and the Weizmann horse dataset [13].
Let us begin by discussing the choice of different parameters in the PR method.
For an n 1 × n 2 image (at the coarsest level), the number of superpixels is chosen to be
N = min(100, n 1 n 2 /250). This limits the size of the graph to at most 100 nodes. The
maximal many-to-one matching parameter τ is limited to 2 as a trade-off between
the product graph size and possible reduction in the number of seed superpixels
for region co-growing. The inter-image feature similarity threshold tG in Eq. (4.3)
has been adaptively chosen to ensure that the size of the product graphs, W12 and
W21 , is at most 40–50 due to computational restrictions. In Sect. 4.2.4, the threshold
for background probability is set as tB = 0.75 max({PB (vi ), ∀vi ∈ I }) to ignore the
possible background superpixels in the co-segmentation algorithm.
We first visually analyze the results and then discuss the quantitative evaluation. Row A in Figs. 4.7 and 4.8 shows multiple image pairs containing a single common object and
multiple common objects, respectively. Co-segmentation results on these image pairs
using the methods PC [21], CIP [69], CBC [39], DSAD [60], UJD [105], SAW [16],
MRW [64] and PR are provided in Rows B-I, respectively. These results demonstrate

Fig. 4.7 Co-segmentation of image pairs containing single common object. Results obtained from
the image pairs in a–h of Row A using methods PC, CIP, CBC, DSAD, UJD, SAW, MRW and the
method described in this chapter (PR) are shown in Rows B-I, respectively. Ground-truth data is
shown in Row J. Figure courtesy: [48]

Fig. 4.7 (Continued): Co-segmentation of image pairs containing single common object

Fig. 4.8 Co-segmentation of image pairs containing multiple common objects. Results obtained
from the image pairs in a–d of Row A using methods PC, CIP, CBC, DSAD, UJD, SAW, MRW
are shown in Rows B-H, respectively. See next page for continuation. Figure courtesy: [48]

Fig. 4.8 (Continued): Co-segmentation of image pairs containing multiple common objects.
Results obtained from the image pairs in a–d of Row A using the method PR are shown in Row I.
Ground-truth data is shown in Row J. Figure courtesy: [48]

the superior performance of PR (Row I) when compared with the ground-truth (Row J).
Among these methods, PC, CIP, CBC and SAW are co-saliency detection methods.
For the input image pair in Fig. 4.7a, b, the methods PC, DSAD and UJD detect
only one of the two common objects (shown in Rows B, E, F). Most of the outputs
of PC, CIP, CBC and DSAD (shown in Rows B-E) contain discontiguous and spurious
objects. Further, in most cases the common objects are either under-segmented or
over-segmented. Although the method UJD yields contiguous objects, it very often
fails to detect any object in both images (Row F in Fig. 4.7a, c, e, h). However, PR
yields the entire object as a single entity with very little over- or under-segmentation.
More experimental results are shown in Figs. 4.9, 4.10 and 4.11.
The quality of the co-segmentation output is quantitatively measured using pre-
cision, recall and F-measure, as used in earlier works, e.g., [69]. These metrics are
computed by comparing the segmentation output mask with the ground-truth pro-
vided in the database as defined next. Precision (P) is the ratio of the number of
correctly detected co-segmented object pixels to the number of detected object pix-
els. It penalizes for classifying background pixels as object. Recall (R) is the ratio of
the number of correctly detected co-segmented object pixels to the number of object
pixels in the ground-truth image (G). It penalizes for not detecting all pixels of the
object. F-measure (FP R ) is the weighted harmonic mean of precision and recall,
computed as:
\[
P = \frac{\sum_{i=1}^{n_1 n_2} F(i)\, G(i)}{\sum_{i=1}^{n_1 n_2} F(i)} , \tag{4.16}
\]
\[
R = \frac{\sum_{i=1}^{n_1 n_2} F(i)\, G(i)}{\sum_{i=1}^{n_1 n_2} G(i)} , \tag{4.17}
\]
\[
F_{PR} = \frac{(1 + \omega_F) \times P \times R}{\omega_F \times P + R} , \tag{4.18}
\]


Fig. 4.9 Co-segmentation results of the methods PR, UJD and MRW on an image pair selected
from the MSRC dataset [105]. a, b Input image pairs (IN). c, d Co-segmentation outputs of UJD
and e, f that of MRW. g, h Co-segmentation outputs of PR. The extracted objects are shown on gray
background

where ω F = 0.3 (as in other works) to place more emphasis on precision. It is worth
mentioning here that Jaccard similarity is another commonly used metric for evalu-
ating segmentation results. However, we do not use it in this chapter because most
of the methods we considered here are co-saliency methods and F-measure is com-
monly used to evaluate saliency detection algorithms. In subsequent chapters, we
will use Jaccard similarity as the metric.
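For reference, the sketch below computes these three measures from binary NumPy masks following Eqs. (4.16)-(4.18); it is an illustrative snippet, not the evaluation code used to produce the reported tables.

import numpy as np

def precision_recall_fmeasure(pred_mask, gt_mask, omega_f=0.3):
    """Precision, recall and F-measure of Eqs. (4.16)-(4.18).

    pred_mask : (n1, n2) binary array, 1 for detected object pixels (F)
    gt_mask   : (n1, n2) binary array, 1 for ground-truth object pixels (G)
    """
    F = pred_mask.astype(bool)
    G = gt_mask.astype(bool)
    tp = np.logical_and(F, G).sum()              # correctly detected object pixels
    precision = tp / max(F.sum(), 1)             # Eq. (4.16)
    recall = tp / max(G.sum(), 1)                # Eq. (4.17)
    denom = omega_f * precision + recall
    f_pr = 0.0 if denom == 0 else (1 + omega_f) * precision * recall / denom   # Eq. (4.18)
    return precision, recall, f_pr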
Quantitative comparison of the methods MRW, UJD, SAW, CBC, CIP, DCC,
DSAD, PC and PR on the image pair dataset [69] and the MSRC dataset [105]

Fig. 4.10 Co-segmentation results of the methods PR, UJD and MRW on an image pair selected
from the Weizmann horse dataset [13]. a, b Input image pairs (IN). c, d Co-segmentation outputs
of UJD and e, f that of MRW. g, h Co-segmentation outputs of PR. The extracted objects are shown
on gray background

Fig. 4.11 Co-segmentation results of the methods PR, UJD and MRW on three image pairs selected
from the flower dataset [92]. a–f Input image pairs (IN). g–l Co-segmentation outputs of UJD and
m–r that of MRW. s–x Co-segmentation outputs of PR. The extracted objects are shown on black
background

Table 4.1 Precision (P), recall (R) and F-measure (FP R ) values of the method PR with the methods
MRW, UJD, SAW, CBC, CIP, DCC, DSAD, PC on the image pair dataset [69]
Metrics Methods
PR MRW UJD SAW CBC CIP DCC DSAD PC
P 0.841 0.701 0.573 0.913 0.897 0.836 0.515 0.549 0.519
R 0.811 0.907 0.701 0.674 0.614 0.620 0.823 0.371 0.217
FP R 0.817 0.719 0.575 0.798 0.788 0.752 0.542 0.428 0.358

are shown in Tables 4.1 and 4.2, respectively. Results show that the precision and recall
values of PR are very close, as they should be, while both remain very high. This indicates
that PR reduces both false positives and false negatives. While the methods CBC, SAW

Table 4.2 Mean precision (P), recall (R) and F-measure (FP R ) values of the method PR with the
methods MRW, UJD, SAW, CBC, CIP, DCC, DSAD, PC on images selected from ‘cow’, ‘duck’,
‘dog’, ‘flower’ and ‘sheep’ classes in the MSRC dataset [105]
Metrics Methods
PR MRW UJD SAW CBC CIP DCC DSAD PC
P 0.981 0.837 0.812 0.859 0.970 0.790 0.803 0.566 0.564
R 0.791 0.836 0.791 0.655 0.680 0.373 0.377 0.310 0.230
FP R 0.928 0.818 0.789 0.787 0.872 0.510 0.593 0.432 0.394

Table 4.3 Computation time (in seconds) required by the methods PR, MRW, UJD, as the image
pair size (86 × 128 and 98 × 128) increases by shown factors
Method   Increase in image size
         1 × 1     2 × 2     2² × 2²     2³ × 2³     2⁴ × 2⁴
MRW      32.65     51.63     78.83       163.61      820.00
UJD      1.80      6.00      25.20       107.40      475.80
PR       1.54      2.08      2.94        5.69        13.90

(Table 4.1) also have high precision values, the recall rate is significantly inferior.
Method MRW has a good recall measure, but its precision is quite low. In order to
compare the speed, we execute all the methods on the same system and report the
computation time required by each algorithm in Table 4.3. It shows that the
method PR is significantly faster than the methods MRW and UJD. The advantage of PR is
more noticeable as the image size increases.

4.5 Extension to Co-segmentation of Multiple Images

In this section, we describe an extension of the pairwise co-segmentation method to
an image set. Finding matches over multiple images instead of just an image pair
is more relevant in analyzing crowd-sourced images from an event or at a tourist
location. Simultaneous computation of the MCS of N graphs drastically grows the
product graph size to the order of O(ζ τ^(N−1) |G1|^(N−1)), assuming the same cardinality for
every graph for simplicity, making the algorithm computationally intractable. Hence, we describe
a different scheme to solve the multiple image co-segmentation problem using a hierarchical
setup where N − 1 pairwise co-segmentation tasks are solved over a binary
tree structured organization of the constituent images and results (see Figs. 4.12 and
4.13). For each task, there is a separate product graph of size O(ζ τ (|G1| + |G2|))
only, which demonstrates the computational efficiency of this scheme, to be elaborated
next.

Fig. 4.12 Hierarchical image co-segmentation scheme for N = 4 images. Input images I1 -I4 are
represented as graphs G1 -G4 . Co-segmentation of I1 and I2 yields MCS H1,1 . Co-segmentation of
I3 and I4 yields MCS H2,1 . Co-segmentation of H1,1 and H2,1 yields MCS H1,2 that represents the
co-segmented objects in images I1 -I4 . Thus, total N − 1 = 3 co-segmentation tasks are required.
Figure courtesy: [48]

Fig. 4.13 Hierarchical image co-segmentation scheme for N = 6 images. Here, total N − 1 = 5
co-segmentation tasks are required

Let I1, I2, …, IN be a set of N images, and G1, G2, …, GN denote the respective
RAGs. To co-segment them using this hierarchical approach, T = ⌈log2 N⌉ levels
of co-segmentation are required. Let Hj,l denote the j-th subgraph at level l.
First, the image pairs (I1, I2), (I3, I4), …, (IN−1, IN) are co-segmented independently.
Let H1,1, H2,1, …, HN/2,1 be the resulting subgraphs at level l = 1 (see
Fig. 4.12). Then the MCSs of the pairs (H1,1, H2,1), (H3,1, H4,1), …, (HN/2−1,1, HN/2,1)
are computed to obtain the corresponding co-segmentation maps H1,2, H2,2, … at
level l = 2. This process is repeated until the final co-segmentation map H1,T at
level l = T is obtained. Figures 4.12 and 4.13 show the block diagrams when considering
co-segmentation for four images (T = 2) and six images (T = 3), respectively.
The advantages with this approach are as follows.
• The computational complexity greatly reduces after the first level of operation as
|Hj,l| ≪ |Gi| at any level l, and the graph size reduces at every subsequent level.
• We need to perform co-segmentation at most N − 1 times for N input images; i.e.,
the complexity increases linearly with the number of images to be co-segmented.
• If at any level any MCS is null, we can stop the algorithm and conclude that all
the images in the set do not share a common object.
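The binary-tree organization can be written as a simple level-by-level reduction. The sketch below assumes a hypothetical routine cosegment_pair(a, b) that performs one pairwise co-segmentation task (returning the MCS-based result, or None when the MCS is empty); it only illustrates that N − 1 such tasks are needed.

def hierarchical_cosegmentation(items, cosegment_pair):
    """Binary-tree reduction over N inputs using N - 1 pairwise tasks.

    items          : list [G1, G2, ..., GN] of graphs (or previous MCS outputs)
    cosegment_pair : callable(a, b) -> MCS-based result, or None if the MCS is empty
    Returns the final result H_{1,T}, or None if any intermediate MCS is empty.
    """
    level = list(items)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):    # pair up neighbours at this level
            h = cosegment_pair(level[i], level[i + 1])
            if h is None:                        # no object common to all images so far
                return None
            next_level.append(h)
        if len(level) % 2 == 1:                  # carry an unpaired element upwards
            next_level.append(level[-1])
        level = next_level
    return level[0]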

Fig. 4.14 Image co-segmentation from four images. Co-segmentation of the image pair in (a, b)
yields outputs (e, f), and the pair in (c, d) yields outputs (g, h). These outputs are co-segmented to
obtain the final outputs (i, j, k, l). Notice how small background regions present in (e, f) have been
removed in (i, j) after the second round of co-segmentation. Figure courtesy: [48]

It may be noted that during the first level of co-segmentation, the images I1 , I2 ,
…, I N can be paired in any order. Further, due to the non-commutativity discussed
in Sect. 4.2.2, the MCS output at any level actually corresponds to two matched
subgraphs from the respective input graphs, and we may choose either of them as
H j,l for MCS computation at the next level. Figure 4.14 shows an example of co-
segmentation for four images. For the input image pairs I1 , I2 in (a, b) and I3 , I4
in (c, d), the co-segmentation outputs at level l = 1 are shown in (e, f) and in (g, h),
respectively. Final co-segmented objects (at level l = 2) are shown in (i, j) and (k, l).
More experimental results for multi-image co-segmentation are shown in Figs. 4.15,
4.16, 4.17, and 4.18.
To summarize this chapter, we have described a computationally efficient image
co-segmentation algorithm based on the concept of maximum common subgraph
matching. Computing MCS from a relatively small product graph, and performing

Fig. 4.15 Co-segmentation results obtained using the methods PR, UJD and MRW on four images
selected from the MSRC dataset [105]. a–d Input images (IN). e–h Co-segmentation outputs of
UJD and i–l that of MRW. m–p Co-segmentation outputs of PR. The extracted objects are shown
on gray background. The method UJD could not detect the common object (cow) from images (a),
(d). The method MRW includes a large part of the background in the output

region co-growing on the nodes (seeds) obtained from the MCS output are both efficient
operations. Further, incorporating them in a pyramidal co-segmentation framework makes the method
computationally very efficient for high-resolution images. This method can handle variations
in shape, size, orientation and texture of the common object among the constituent
images. It can also deal with the presence of multiple common objects, unlike some
of the methods analyzed in the results section.

Fig. 4.16 Co-segmentation results obtained using the methods PR, UJD and MRW on four images
selected from the iCoseg dataset [8]. a–d Input images (IN). e–h Co-segmentation outputs of UJD
and i–l that of MRW. m–p Co-segmentation outputs of PR. The extracted objects are shown on
black background

Fig. 4.17 Co-segmentation results obtained using the methods PR, UJD and MRW on six images
selected from the iCoseg dataset [8]. a–f Input images (IN). g–l Co-segmentation outputs of UJD
and m–r that of MRW. s–x Co-segmentation outputs of PR. The extracted objects are shown on
gray background. The method UJD could not find any common objects and yields significantly
different outputs with the change in the number of input images to be co-segmented

The extension of the pairwise co-segmentation method to multiple images may
not always yield the desired result. It requires the common object to be present in all
the images in the image set. It is evident from Fig. 4.13 that the resulting subgraphs at
every level form a shrinking subset of nodes. If there is at least one image that does not
contain the common object, then H1,T = {φ} (T = 3 in Fig. 4.13). Thus, this method
fails to detect the common object even in the images that do contain it. Hence, we explore a solution to
this problem in the next chapter through a multi-image co-segmentation algorithm
that can handle the presence of images without the common object.

Fig. 4.18 Co-segmentation results obtained using the methods PR, UJD and MRW on three chal-
lenging images selected from the MSRC dataset [105]. a–c Input images (IN). d–f Co-segmentation
outputs of UJD and g–i that of MRW. j–l Co-segmentation outputs of PR. m–o Ground-truth. The
extracted objects are shown on gray background
Chapter 5
Maximally Occurring Common
Subgraph Matching

5.1 Introduction

In image co-segmentation, we simultaneously segment the common objects present
in multiple images. In the previous chapter, we studied an algorithm for co-segmentation
of an image pair, its possible extension to multiple images and the
challenges involved. The images to be co-segmented are generally retrieved from the internet. Hence,
not all images in the set may contain the common object, as some of these images may
be totally irrelevant to the collected data set (see Fig. 5.1). The presence of such outlier
images in the database makes the co-segmentation problem even more difficult.
As mentioned earlier, it is possible that the set of crowd-sourced images to be
co-segmented contains some outlier images that do not at all share the common
object present in majority of the images in the set. Several methods including those
in [18, 56, 57, 60, 64, 78] do not consider this scenario. Rubinstein et al. [105] and
Wang et al. [139] do consider this. But the method of Rubinstein et al. [105], being a
saliency-based approach, misses out on all non-salient co-segments. Wang et al. [139]
proposed a supervised method for co-segmentation of image sets containing outlier
images. First, they learn an object segmentation model from a training image set of
same category. The outlier, if any, is rejected if it does not conform to the trained
model during co-segmentation. For an unsupervised scheme, to co-segment a set
of N images with the common object present in an unknown number M (M ≤ N)
of images, the order of image matching operations is O(N 2^(N−1)). In this
chapter, we show that this problem can be addressed by processing all the images
together, and solving it in linear time using a greedy algorithm. The discussed method
co-segments a large number (N) of images using only O(N) matching operations.
It can also detect multiple common objects present in the image set.


Fig. 5.1 Image co-segmentation. The input image set that includes an outlier image (second image
from right) is shown in the top row. The extracted common objects are shown in the bottom row.
Image courtesy: Source images from the iCoseg dataset [8]

5.2 Problem Formulation

In this section, we introduce relevant terminology and notations, and formulate the
problem of multiple image co-segmentation using a graph-based approach. Let I =
{I1 , I2 , . . . , I N } be the set of N images to be co-segmented. As in Chap. 4, every
image Ii is represented as an undirected graph Gi = (Vi , Ei ) where Vi is the set of
nodes (or vertices) and Ei is the set of edges.

5.2.1 Mathematical Definition

Given two graphs G1 and G2, the maximum common subgraph (MCS) G* = (V*, E*)
(see Fig. 5.2) is defined as:
\[
\mathcal{G}^{*} = \mathrm{MCS}(\mathcal{G}_1, \mathcal{G}_2) , \quad \text{and} \quad
\mathcal{V}^{*} = \arg\max_{\mathcal{V}} \{ |\mathcal{V}| : \mathcal{V} \in \mathcal{G}_1, \mathcal{V} \in \mathcal{G}_2 \} , \tag{5.1}
\]
where | · | denotes cardinality, G* ⊆ G1 and G* ⊆ G2. Obtaining the MCS is known to
be an NP-complete problem [61]. This definition of MCS can be extended to a set
of N graphs Ḡ = {G1, G2, . . . , GN} as G* = MCS(G1, G2, . . . , GN) and
\[
\mathcal{V}^{*} = \arg\max_{\mathcal{V}} \{ |\mathcal{V}| : \mathcal{V} \in \mathcal{G}_1, \mathcal{V} \in \mathcal{G}_2, \ldots, \mathcal{V} \in \mathcal{G}_N \} . \tag{5.2}
\]
We have seen in Chap. 4 that co-segmentation of N images can be performed by
solving O(N) NP-complete problems through hierarchical subgraph matching using
pairwise comparison, provided there is a non-empty set of nodes in every graph (image)
sharing the same node labels and edge connections across all the graphs, i.e.,
\[
\mathrm{MCS}(\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N) \neq \{\phi\} . \tag{5.3}
\]

Fig. 5.2 Maximum common subgraph (MCS). G1, G2 are the input graphs. {u1, u2, u4, u3} and
{v1, v2, v4, v3} are the set of nodes in the subgraphs that match and hence constitute the
MCS(G1, G2) shown in the bottom row

5.2.2 Multi-image Co-segmentation Problem

Let F1, F2, . . . , FN be the common object(s) present in I, and our objective is to
find them. It is possible that not every image in the set contains the common
object, i.e., it is permissible to have Fj = {φ} for some of the images. We refer to
such images as outlier images as they do not share the same object with the majority of
the images, leading to MCS(G1, G2, . . . , GN) = {φ}. Hence, we would like to find the
MCS from a subset of graphs (images) and maximize the cardinality of that subset.
At the same time, we need to set a minimum size (α) of the MCS as it represents
the unknown common object. So, we introduce the concept of maximally occurring
common subgraph (MOCS) Gα = (Vα, Eα), with |Vα| ≥ α, computed from a subset
of M (≤ N) graphs ḠM = {Gi1, Gi2, . . . , GiM} as:
\[
\mathcal{G}^{\alpha} = \mathrm{MOCS}(\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N) = \mathrm{MCS}(\mathcal{G}_{i_1}, \mathcal{G}_{i_2}, \ldots, \mathcal{G}_{i_M}) , \tag{5.4}
\]
where ik ∈ {1, 2, . . . , N}. M and ḠM are computed as:
\[
M = \arg\max_{n} \{\, n : \mathrm{MCS}(\mathrm{Sub}_n(\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N)) \neq \{\phi\} \ \text{and} \ |\mathcal{V}^{\alpha}| \geq \alpha \,\} \tag{5.5}
\]
\[
\bar{\mathcal{G}}_M = \mathrm{Sub}_M^{\alpha}(\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N) . \tag{5.6}
\]
Here Sub_n(·) means any combination of n such elements from the set of graphs
G1, G2, . . . , GN, whereas Sub_M^α(·) means the particular combination of M graphs
that maximizes Eq. (5.5), and α is the minimum number of nodes in Gα. Hence, Gα is
the solution to the general N-image co-segmentation problem in the presence of outlying
observations.
As discussed in Sect. 5.2.1, if the set of graphs (images) ḠM containing the common
object is known, then we need to find the MCS of M graphs, which requires solving
O(M) NP-complete problems. However, in the co-segmentation problem setting
where the input image set contains outlier images, both ḠM and M are unknown.
Hence, the solution to Eq. (5.4) requires solving O(M′) NP-complete problems, where
\[
M' = \sum_{i=2}^{N} \binom{N}{i}\, i ,
\]
which can be approximated as O(N 2^(N−1)). Hence, it is a computationally prohibitive
problem, and an appropriate greedy algorithm is required.
The key aspects of this chapter are as follows.
• We describe the concept of maximally occurring common subgraph (MOCS)
matching across N graphs and demonstrate how it can be approximately
solved in O(N) matching steps, thus achieving a significant reduction in
computation time compared to existing methods.
• We discuss a concept called latent class graph (LCG) to solve the above MOCS
problem.
• We demonstrate that the multi-image co-segmentation problem can be solved as an
MOCS matching problem, yielding an accurate solution that is robust in the presence
of outlier images.
We describe MOCS and LCG in subsequent sections.

5.2.3 Overview of the Method

In this section, we provide an overview of the co-segmentation method discussed in
this chapter. It is described in Fig. 5.3 using a block diagram and illustrated using an
image set in Fig. 5.5.
Coarse-level co-segmentation: First, RAGs are obtained by superpixel segmentation
of all the images in I. Let G_i = (V_i, E_i) be the graph representation of image I_i
with V_i = {v^i_j}_{∀j}, where v^i_j is the jth node (superpixel) in G_i, and
E_i = {e^i_{jl}}_{∀(j,l)∈N}, where (j, l) ∈ N denotes v^i_j ∈ N(v^i_l). Then image superpixels
are grouped into clusters C1, C2, . . . , CK (see Blocks 1, 2 in Fig. 5.5) such that
∪_{j=1}^{K} C_j = ∪_{i=1}^{N} V_i.
The cluster C having the most images with more than a certain number
(α) of superpixels (with a spatial constraint to be defined in Sect. 5.3.2) is selected
to serve as seeds to grow back the common object at a later stage. This yields a very
coarse co-segmentation of the images (details in Sect. 5.3).
Fine-level co-segmentation: The non-empty superpixel set belonging to cluster C
in every image I_i is represented as an RAG H_i ⊆ G_i such that
H_i = ({v^i_j ∈ C ∩ V_i}_{∀j}, {e^i_{jl}}_{∀(j,l)∈N}). Then a latent class graph H_L is constructed by combining the H_i's
based on node correspondences (see Block 4 in Fig. 5.5). This graph embeds the
feature (node attribute) similarity and spatial relationship (edges in graphs) among
superpixels, from all the images, belonging to that cluster. We obtain H_L = H_L^(N)
as:
\[
\mathcal{H}_L^{(1)} = \mathcal{H}_1 ; \quad \mathcal{H}_L^{(i)} = \mathcal{F}(\mathcal{H}_L^{(i-1)}, \mathcal{H}_i), \;\; \forall i = 2, 3, \ldots, N , \tag{5.7}
\]
where F(·) is the graph merging function defined in Sect. 5.4.1. This latent class
graph is used for region growing on every H_i to obtain a finer level of co-segmentation
and to obtain the complete common object H′_i (see Block 5 in Fig. 5.5):
\[
\mathcal{H}'_i = \mathcal{F}_R(\mathcal{H}_L, \mathcal{H}_i, I_i), \;\; \forall i = 1, 2, 3, \ldots, N , \tag{5.8}
\]
where F_R(·) is the region growing function described in Sect. 5.4.2.



Fig. 5.3 Block diagram of the multiple image co-segmentation method

It may be noted that in graph theory, graph matching is typically done by matching
nodes that have the same labels. But here image superpixels are represented as nodes
in the graph, and nodes within the common object in multiple images may not have
exactly the same features. Moreover, the common object may be of different size
(number of superpixels constituting the object) in different images. Hence, in the
solution to the MOCS problem in Eq. (5.4), the resulting common subgraph Gα ⊆ Gi
also has a different cardinality (≥ α) in different images Ii. This makes the problem
computationally challenging.
In Sect. 5.3, we describe the superpixel features and superpixel clustering, and
select clusters containing the common object. In Sect. 5.4, we describe the process
of obtaining the common object from the selected clusters. We conclude with exper-
imental results and discussions in Sect. 5.5.

5.3 Superpixel Clustering

First the input images are segmented into superpixels using the simple linear iterative
clustering algorithm [1] so that common objects across images can be identified by
matching inter-image superpixels. Assuming each object to be a group of superpixels,
we expect high feature similarity among all such superpixel groups corresponding
to common objects. For every superpixel s ∈ F j , there must be superpixels of sim-
ilar features present in other images Sub M−1 (F1 , F2 , . . . , F j−1 , F j+1 , . . . , F N ) to
maintain globally consistent matching among the superpixels. To solve this problem
of matching superpixel groups along with neighborhood constraints efficiently, every
image I can be represented as a region adjacency graph (RAG) G . Each node in G
represents a superpixel and it is attributed with the corresponding superpixel features
and spatial coordinates of the centroid. A pair of nodes is connected by an edge if
the corresponding superpixel pair is spatially adjacent.
As the image set I may contain a large number of images, the total number of
superpixels becomes very large and superpixel matching across images becomes
computationally prohibitive. Hence, the superpixels are grouped into clusters for
further processing as described next.
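For concreteness, such an RAG can be assembled with standard tools. The sketch below uses scikit-image SLIC superpixels and NetworkX, attributing each node with its mean colour and centroid and connecting spatially adjacent superpixels; it is only an illustration of the representation, not the implementation used in this chapter, and assumes an RGB input image.

import numpy as np
import networkx as nx
from skimage.segmentation import slic
from skimage.measure import regionprops

def build_rag(image, n_segments=200):
    """Region adjacency graph of one image: nodes are SLIC superpixels
    attributed with mean colour and centroid; edges connect adjacent ones."""
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=1)
    g = nx.Graph()
    for region in regionprops(labels):
        ys, xs = np.where(labels == region.label)
        g.add_node(region.label,
                   mean_color=image[ys, xs].mean(axis=0),
                   centroid=region.centroid)
    # adjacency: label pairs that touch horizontally or vertically
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        edge_mask = a != b
        for u, v in zip(a[edge_mask], b[edge_mask]):
            g.add_edge(int(u), int(v))
    return g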

5.3.1 Feature Computation

To obtain homogeneous regions through clustering of superpixels, we must include
spatial neighborhood constraints in addition to superpixel features. Since all superpixels
from all images are considered simultaneously for clustering, it is difficult to
design such spatial constraints without location prior. Hence, we want superpixel
features to be embedded with this spatial neighborhood information. So for each
superpixel s, first a feature vector f(s) is computed as a combination of low-level
features: (i) color histogram and mean color in the RGB color space and (ii) rotation
invariant histogram of oriented gradients (HOG). Then, features from the first-order
neighborhood N1(s) and second-order neighborhood N2(s) of every superpixel s
are combined to design a new feature h(s). For example, u4 ∈ N1(u1), u5 ∈ N2(u1),
v2 ∈ N1(v1) and v5 ∈ N2(v1) in Fig. 5.2. The most similar superpixels, s1 in N1(s)
and s2 in N2(s), to s are chosen as:
\[
s_i = \arg\min_{r} \{\, d_f(\mathbf{f}(s), \mathbf{f}(r)), \;\forall r \in \mathcal{N}_i(s) \,\} , \quad i = 1, 2 , \tag{5.9}
\]
where d_f(·) denotes the feature distance. One can use any appropriate distance measure
such as the Euclidean distance as d_f(·). Using the neighborhood superpixels,
h(s) is computed as:
\[
\mathbf{h}(s) = [\, \mathbf{f}(s) \;\; \mathbf{f}(s_1) \;\; \mathbf{f}(s_2) \,] , \tag{5.10}
\]
where [ ] denotes concatenation of vectors. This feature h(s) compactly contains
the information about the neighborhood of s. Hence, it serves as a better feature than
f(s) alone while grouping superpixels inside common objects in multiple images as
described next.
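A minimal sketch of Eqs. (5.9)-(5.10) follows, assuming the per-superpixel feature f(s) is stored as a 'feat' attribute on each node of the RAG built earlier and using the Euclidean distance for d_f(·).

import numpy as np

def neighborhood_feature(g, s):
    """h(s) = [f(s) f(s1) f(s2)] of Eq. (5.10) for node s of RAG g, where
    s1 (s2) is the most feature-similar node in the first (second) order
    neighbourhood of s, Eq. (5.9). Assumes every node has a 'feat' vector."""
    f_s = np.asarray(g.nodes[s]["feat"])
    n1 = set(g.neighbors(s))                                   # first-order neighbourhood
    n2 = {w for v in n1 for w in g.neighbors(v)} - n1 - {s}    # second-order neighbourhood
    parts = [f_s]
    for nbrs in (n1, n2):
        if nbrs:
            best = min(nbrs, key=lambda r: np.linalg.norm(f_s - np.asarray(g.nodes[r]["feat"])))
            parts.append(np.asarray(g.nodes[best]["feat"]))
        else:
            parts.append(f_s)     # fall back to f(s) if the neighbourhood is empty
    return np.concatenate(parts)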

5.3.2 Coarse-level Co-segmentation

To find the maximally occurring common object(s) in multiple images, we need to
find superpixels with similar features across images. As there is a large number of
superpixels in a large image set, a large number of matching operations is required,
which increases the computational complexity. This problem is alleviated
by clustering the superpixels using their features h(s) defined in Sect. 5.3.1. This
results in a coarse-level matching of superpixels where one cluster contains the
common object partially. Figure 5.4 shows an illustration. The number of clusters
(K) defines the coarseness of matching. A small value of K may not help much in
saving computations during graph matching, while a large value of K
may not help in picking up the common object in several of the images unless the
common objects in the images are very close in the feature space, which is rarely
the case. Any clustering technique such as k-means can be used to cluster the
superpixels.
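As an illustration of this pooled clustering, the sketch below stacks the h(s) vectors of all images, runs k-means jointly using scikit-learn, and returns per-image label arrays; the feature matrices and K are assumed inputs.

import numpy as np
from sklearn.cluster import KMeans

def cluster_superpixels(features_per_image, k=10, seed=0):
    """Jointly cluster superpixel features pooled over all images.

    features_per_image : list of (n_i, d) arrays, one per image
    Returns a list of per-image label arrays with values in 0..k-1.
    """
    pooled = np.vstack(features_per_image)        # all superpixels of all images together
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(pooled)
    out, start = [], 0
    for feats in features_per_image:
        out.append(labels[start:start + feats.shape[0]])
        start += feats.shape[0]
    return out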

Fig. 5.4 How common objects in constituent images form a single common cluster in the entire
pooled data. Each data point denotes a superpixel feature from any of the four images. Arrows
indicate the common object. Note that image-4 does not contribute to the designated cluster C and
hence is an outlier image, signifying absence of a common object

Each cluster contains superpixels from multiple images with similar features and
these superpixels constitute parts of the common object or background regions in each
image. One of these clusters should contain the superpixels of the common object
and our goal is to find this cluster leading to a coarse co-segmentation. Among the
images, there could be two different types of commonalities: the common foreground
object and (very often) similar background. Hence, these background superpixels are
first discarded from every cluster using the background probability measure P (s) of
Zhu et al. [151] before finding the cluster of interest. Background removal also helps
in discarding clusters containing only the common background. A superpixel s is
marked as background if P (s) > tB , where tB is an appropriately chosen threshold.
Further, all sub-images (for a given cluster) having fewer than α (see Eq. (5.5)) superpixels
are discarded. This means that the minimum size of the seeds for coarsely
co-segmented images to grow into an arbitrarily sized common object is set to α
superpixels. Thus, the number of non-empty sub-images in cluster C is the estimate
of M in Eq. (5.5).

Subsequently, to determine the cluster of interest C, we need to consider all (say,
n_ij) superpixels {s_ij(k), k = 1, 2, . . . , n_ij} in image I_i belonging to the jth
cluster. For an image with a segmentable object, which is typically compact, the
superpixels constituting it should be spatially close. Here, we introduce the notion of
compactness of a cluster. Let σ²_ij denote the spatial variance of the centroids of {s_ij(k)},
and the initial co-segmented area is obtained as:
\[
A_{ij} = \sum_{k=1}^{n_{ij}} \mathrm{area}(s_{ij}(k)) . \tag{5.11}
\]

Fig. 5.5 The co-segmentation method. Block 1 shows the input image set I1, I2, . . . , I10 that contains one majority class (bear) in I1, I2, . . . , I8 and two outlier
images I9, I10. Block 2 shows sub-images in clusters C1, C2, C3, C4, respectively, after removing background superpixels from them using the measure given
in Sect. 5.3.2. Cluster 3 is the computed cluster of interest C (using the compactness measure given in Eq. (5.12)) and it shows the partial objects obtained from
the coarse-level co-segmentation. This cluster has a single superpixel (encircled) from image I9 and it is discarded in subsequent stages as we have set α = 3 in
Eq. (5.5). Arrows in Block 2 show that holes in sub-images 2, 5, 6 and 7 of C3 will be filled by transferring superpixels from C2 and C4 using the method given
in Sect. 5.3.3. Block 3 shows the effect of hole filling (H.F.)



Fig. 5.5 (Continued): The co-segmentation method. Block 4 shows the latent class graph generation. Block 5 shows the complete common object(s) obtained
after fine-level co-segmentation (F.S.) i.e., latent class graph generation and region growing. Image courtesy: Source images from the iCoseg dataset [8]

We can define the compactness measure of a cluster using an inverse relationship
of the spatial variances of the superpixels belonging to it. The compactness Q_j of the jth
cluster is computed as:
\[
Q_j = \left( \sum_{i} \frac{\sigma_{ij}^{2}}{A_{ij}} \right)^{-1} , \quad \forall i \in [1, N] \ \text{such that} \ A_{ij} \neq 0 . \tag{5.12}
\]
The cluster j for which Q_j is the maximum is chosen as the appropriate cluster C
for further processing to obtain the fine-level co-segmentation. For an image with
a segmentable object, the superpixels should be spatially close, in which case the measure
Q will be high. Figure 5.5 shows the clustering result of a set of 10 images into 10
clusters where background superpixels have been removed. Here, we show only the
top four densest clusters for simplicity. The coarse-level co-segmentation output is
presented in Block 2, which shows the partially recovered common objects constituted
by the superpixels belonging to the cluster of interest (C = C3).
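A direct transcription of Eqs. (5.11)-(5.12) is shown below as a sketch; it assumes per-image lists of centroid coordinates and areas of the superpixels that fall in cluster j.

import numpy as np

def cluster_compactness(centroids_per_image, areas_per_image):
    """Compactness Q_j of one cluster, Eq. (5.12).

    centroids_per_image : list over images; each entry is an (n_ij, 2) array of
                          centroids of that image's superpixels in the cluster
    areas_per_image     : list over images; each entry is an (n_ij,) array of areas
    """
    total = 0.0
    for cents, areas in zip(centroids_per_image, areas_per_image):
        if len(cents) == 0:
            continue                                  # image contributes nothing (A_ij = 0)
        a_ij = float(np.sum(areas))                   # Eq. (5.11)
        var_ij = float(np.var(cents, axis=0).sum())   # spatial variance of the centroids
        total += var_ij / a_ij
    return float("inf") if total == 0 else 1.0 / total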

5.3.3 Hole Filling

Cluster C may not contain the complete common object (as shown in Block 2,
Fig. 5.5) because we are working with natural images containing objects of varying
size and pose, and variations in the feature space. Moreover, spatial coordinates of
the superpixels have not been considered during clustering. Hence, superpixels in
an image belonging to the cluster C need not be spatially contiguous and the partial
object may have an intervening region (hole) within this segmented cluster, with
the missing superpixels being part of another cluster. Since an image has no holes
spatially, this hole can be safely assigned to the same cluster. These segmentation
holes in cluster C are filled as explained next.
Let ViC ⊆ Vi be the set of superpixels of image Ii , belonging to cluster C . For
every image, we represent ViC as an RAG Hi with |Hi | = |ViC |. Then all cycles
present in every graph Hi are identified. A cycle in an undirected graph is a closed
path of edges and nodes, such that any node in the path can be reached from itself
by traversing along this path. As Hi is a subgraph of Gi (i.e., |Hi | ≤ |Gi |), every
cycle in Hi is also present in Gi . For every cycle, the nodes (of Gi ) interior to it are
identified using the superpixel coordinates. Then every graph Hi (and corresponding
ViC ) is updated by adding these interior nodes to the cycle, thus filling the holes. This
is explained in Fig. 5.6. The superpixels corresponding to the interior nodes (v5 ) that
fill the holes belong to clusters other than cluster C . So, superpixel transfer across
clusters occurs in this stage. Cluster C is updated accordingly by including those
missing superpixels. This is illustrated in Block 3 of Fig. 5.5.

Fig. 5.6 Hole filling. a Graph G of image I . b Subgraph H of G constituted by the superpixels
in cluster C . c Cycle v1 − v2 − v4 − v3 − v1 of length 4. d Find node v5 ∈ G interior to the cycle,
and add it and its edges to H

Fig. 5.7 Hole filling in a cycle of length 3. a Graph G of image I . b Subgraph H of G constituted
by the superpixels in cluster C and the cycle v1 − v2 − v3 − v1 of length 3. c Find node v4 ∈ G
interior to the cycle, and add its edges to H

It is interesting to note that even for a cycle of three nodes, though the corre-
sponding three superpixels are neighbors to each other in image space, it may yet
contain a hole because superpixels can have any shape. In Fig. 5.7, nodes v1 , v2 and
v3 (belonging to the cluster of interest) are neighbors to each other (in image space)
and form a cycle. Node v4 is also another neighbor of each v1 , v2 and v3 , but it
belongs to some other cluster. Hence, it creates a hole in the cycle formed by v1 , v2
and v3 . This illustrates that even a cycle of length three can contain a hole. So, we
need to consider all cycles of length three or more for hole filling.
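The sketch below illustrates the hole-filling idea with NetworkX: for every cycle of the cluster subgraph H, any node of the full RAG G whose centroid falls inside the polygon traced by the cycle's centroids is added to H together with its edges. Using nx.cycle_basis as a stand-in for enumerating cycles and a centroid-polygon point-in-polygon test (via matplotlib.path) are simplifying assumptions of this sketch, not the book's exact procedure.

import networkx as nx
from matplotlib.path import Path

def fill_holes(g_full, h_cluster):
    """Pull into the cluster subgraph every node of the full RAG whose centroid
    lies inside a cycle of the subgraph (a segmentation hole), with its edges.
    Assumes each node carries a 'centroid' attribute."""
    h = h_cluster.copy()
    for cycle in nx.cycle_basis(h):                   # cycles of length >= 3
        polygon = Path([g_full.nodes[v]["centroid"] for v in cycle])
        for u in g_full.nodes:
            if u in h:
                continue
            if polygon.contains_point(g_full.nodes[u]["centroid"]):
                h.add_node(u, **g_full.nodes[u])      # transfer the missing superpixel
                for w in g_full.neighbors(u):
                    if w in h:                        # restore its adjacency inside H
                        h.add_edge(u, w)
    return h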

5.4 Common Object Detection

Even after hole filling, the superpixels belonging to the updated cluster C may not
constitute the complete object (due to coarse-level co-segmentation and possible
inter-image variations in the feature) in the respective images they belong to. So,
we need to perform region growing on them. In every image I_i, we can append
certain neighboring superpixels of the updated V_i^C (after hole filling) to it based on
inter-image feature similarity. This poses two challenges. First, we need to match
each image with all possible combinations of the rest of the images, thus requiring
O(N 2^(N−1)) matching operations, which is large when a large number of images is being
co-segmented. Second, if each image is matched independently, global consistency
in superpixel matching across all the images is not ensured, as the transitivity property
is violated while matching noisy attributes. In this section, we describe a technique
to tackle both these challenges by representing V_i^C in every image as an RAG (the updated
H_i) and combining them to build a much larger graph that contains information
from all the constituent images. As the processing is restricted to superpixels in one
specific cluster (cluster C), we call this combined graph the latent class graph (H_L) for
that cluster. We will show that this requires only O(N) graph matching steps. Then
pairwise matching and region growing between H_L and every constituent graph H_i
are performed independently. So, the computational complexity gets reduced from
O(N 2^(N−1)) to O(N) computations (more specifically, O(M) if M ≠ N, where the
common object is present in M out of the N input images). However, it
may be noted (and seen later in Table 5.4) that during region growing, since
|H_L| ≥ |H_i|, ∀i, matching H_i with H_L involves somewhat more
computation than matching H_i with H_j. The details are explained next.

5.4.1 Latent Class Graph

Building of the latent class graph is a commutative process and the order in which the
constituent graphs (Hi ’s) are merged does not matter. To build H L for cluster C , we
can start with the two largest graphs in that cluster and find correspondences among
their nodes. It is possible that some nodes in one graph may not have any match
in the other graph due to feature dissimilarity. The unmatched nodes in the second
largest graph are appended to the largest graph based on attribute similarity and graph
structure, resulting in a larger intermediate graph. Then this process is repeated using
the updated intermediate graph and the third largest graph in cluster C and so on.
After processing all the input graphs of cluster C , H L is obtained as the final updated
intermediate graph. This is explained in Fig. 5.8. H L describes the feature similarity
and spatial relationship among the superpixels in cluster C . Being a combined graph,
H L need no longer be a planar graph and it may not be physically realizable in
the image space. It is just a notional representation for computational benefit. This
algorithm is explained next.
Let the images I1, I2, . . . , IM be ordered using the size of their respective RAGs
in cluster C (H_i's as defined in Sect. 5.3.3) such that |H1| ≥ |H2| ≥ · · · ≥ |HM|, and
let H_L^(1) = H1. First, we need to find node correspondences between the graphs H_L^(1)
and H2. As object sizes and shapes of co-occurring objects differ across images, we
must allow many-to-many matches among nodes. One can use any many-to-many
graph matching technique for this, including the maximum common subgraph (MCS)
matching algorithm of Chap. 4. This method finds the MCS between two graphs by
building a vertex product graph from the input graph pair and finding the maximal
clique in its complement. The resulting subgraphs provide the required inter-graph
node correspondences. Depending on node attributes and the number of nodes in the
input graph pair, there may be some nodes in H2 not having any match in H1 and
vice-versa. We describe the latent class graph generation steps using the graphs
shown in Fig. 5.8. Let v1 ∈ H2 have a match with u1 ∈ H1. However, v4 ∈ H2 does

Fig. 5.8 Latent class graph generation. Five nodes in graphs H1 and H2 matched with each other
(circumscribed by dashed contour). The unmatched nodes v3, v4 and v5 in H2 are duplicated in
H1 as u3, u4 and u5 to obtain the intermediate latent class graph H_L^(2). Applying this method to
H3, H4, . . . , HM, we obtain the latent class graph H_L = H_L^(M). See color image for node attributes

not have any match in H1, and there is an edge between v1 and v4. In such a scenario,
a node u4 is added in H1 by assigning it the same attributes as v4 and connecting it
to u1 using an edge. If v1 matches with more than one node in H1, the node u_m in
H1 having the highest attribute similarity with v1 is chosen and u4 is connected to
that node u_m. Adding a new node in H1 for every unmatched node in H2 results in
an updated intermediate graph, denoted as H_L^(2). Then the above mentioned process
is repeated between H_L^(2) and H3 to obtain H_L^(3), and so on. Finally the latent class
graph H_L = H_L^(M) is obtained. Thus the latent class graph generation requires O(M)
matching steps. Here, the H_L^(t)'s are non-decreasing in size, with |H_L^(t)| ≥ |H_L^(t−1)|.
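One merging step of Eq. (5.7) can be sketched as follows with NetworkX graphs. The matcher find_matches(h_L, h_i), returning node correspondences as (node in H_i, node in H_L) pairs (e.g., produced by the MCS procedure of Chap. 4), and the attribute-similarity function similarity(·,·) are assumed inputs of this sketch rather than parts of the book's implementation.

import itertools

def merge_into_latent_graph(h_L, h_i, find_matches, similarity):
    """Append the unmatched nodes of H_i to the intermediate latent class
    graph H_L (the role of F_A in Eq. (5.13)) and return the updated graph."""
    h_L = h_L.copy()
    matches = find_matches(h_L, h_i)                  # list of (v in h_i, u in h_L)
    matched_v = {v for v, _ in matches}
    counter = itertools.count(h_L.number_of_nodes())
    added = {}                                        # unmatched v -> new node id in h_L
    for v in h_i.nodes:
        if v not in matched_v:
            new_u = ("lcg", next(counter))            # fresh node, same attributes as v
            h_L.add_node(new_u, **h_i.nodes[v])
            added[v] = new_u
    for v, new_u in added.items():                    # re-create the edges of h_i
        for w in h_i.neighbors(v):
            if w in added:                            # both endpoints newly added
                h_L.add_edge(new_u, added[w])
            else:                                     # w is matched: connect to its best match
                candidates = [u for (vv, u) in matches if vv == w]
                if candidates:
                    best = max(candidates,
                               key=lambda u: similarity(h_L.nodes[u], h_i.nodes[w]))
                    h_L.add_edge(new_u, best)
    return h_L

Folding this function over H2, . . . , HM starting from H_L^(1) = H1 yields H_L = H_L^(M), mirroring Eq. (5.7).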
To prevent a possible exponential blow-up of the latent class graph, in every iteration
t exactly one node (u4) is added to the intermediate graph H_L^(t) for every unmatched
node (v4) in the other graph H_i. Hence, the cardinality of H_L is equal to the sum of |H1|
and the total number of non-matching nodes (unique superpixels in cluster C) contributed by each H_i
in each iteration. A sketch of the proof is as follows.
The latent class graph generation function F(·) in (5.7) can be defined as:
\[
\mathcal{F}(\mathcal{H}_L^{(i-1)}, \mathcal{H}_i) = \mathcal{F}_A\big(\{\mathcal{H}_i \setminus \mathrm{MCS}(\mathcal{H}_L^{(i-1)}, \mathcal{H}_i)\}, \mathcal{H}_L^{(i-1)}\big) , \tag{5.13}
\]

where F_A(·) is a function that appends the unmatched nodes in H_i to H_L^(i−1) as
explained above. For i = 2,
\[
\mathcal{F}(\mathcal{H}_L^{(1)}, \mathcal{H}_2) = \mathcal{F}_A\big(\{\mathcal{H}_2 \setminus \mathrm{MCS}(\mathcal{H}_L^{(1)}, \mathcal{H}_2)\}, \mathcal{H}_L^{(1)}\big) . \tag{5.14}
\]
Hence, the cardinality of H_L^(2) is equal to the sum of the cardinality of H_L^(1) = H1 and
the number of non-matching nodes (unique superpixels in cluster C) in H2, given by
\[
\big|\mathcal{H}_L^{(2)}\big| = |\mathcal{H}_1| + \big(|\mathcal{H}_2| - |\mathrm{MCS}(\mathcal{H}_1, \mathcal{H}_2)|\big) . \tag{5.15}
\]
Similarly, the cardinality of H_L^(3) is given by
\[
\big|\mathcal{H}_L^{(3)}\big| = \big|\mathcal{H}_L^{(2)}\big| + |\mathcal{H}_3| - \big|\mathrm{MCS}(\mathcal{H}_L^{(2)}, \mathcal{H}_3)\big| . \tag{5.16}
\]
Now MCS(H_L^(2), H3) includes nodes of both MCS(H1, H3) and MCS(H2, H3). As
these two sets of nodes also overlap and contain the nodes of MCS(H1, H2, H3),
\[
\big|\mathrm{MCS}(\mathcal{H}_L^{(2)}, \mathcal{H}_3)\big| = |\mathrm{MCS}(\mathcal{H}_1, \mathcal{H}_3)| + |\mathrm{MCS}(\mathcal{H}_2, \mathcal{H}_3)| - |\mathrm{MCS}(\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3)| . \tag{5.17}
\]
Combining (5.15), (5.16) and (5.17), we obtain
\[
\big|\mathcal{H}_L^{(3)}\big| = |\mathcal{H}_1| + |\mathcal{H}_2| + |\mathcal{H}_3| - |\mathrm{MCS}(\mathcal{H}_1, \mathcal{H}_2)| - |\mathrm{MCS}(\mathcal{H}_1, \mathcal{H}_3)| - |\mathrm{MCS}(\mathcal{H}_2, \mathcal{H}_3)| + |\mathrm{MCS}(\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3)| . \tag{5.18}
\]
Similarly, the cardinality of H_L = H_L^(M) is given by
\[
\big|\mathcal{H}_L^{(M)}\big| = \sum_{i} |\mathcal{H}_i| - \sum_{i}\sum_{j>i} |\mathrm{MCS}(\mathcal{H}_i, \mathcal{H}_j)| + \sum_{i}\sum_{j>i}\sum_{k>j} |\mathrm{MCS}(\mathcal{H}_i, \mathcal{H}_j, \mathcal{H}_k)| - \cdots \tag{5.19}
\]
One can obtain a numerical estimate of the size of the latent class graph as follows.
Consider a simple case where every sub-image in the cluster of interest (C) has n_H
superpixels and a fraction of these (say, β) constitutes the common object partially in
that sub-image, obtained at the output of the coarse-level co-segmentation. Thus the
number of unmatched nodes is (1 − β)n_H, which gets appended to the intermediate
latent class graph at every iteration. Hence, the cardinality of the final latent class
graph is
\[
\big|\mathcal{H}_L^{(M)}\big| = n_H + \sum_{i=1}^{M-1} (1 - \beta)\, n_H \;\approx\; n_H\, M (1 - \beta) . \tag{5.20}
\]
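As an illustrative numerical example with assumed values, if M = 8 sub-images each contain n_H = 20 superpixels and a fraction β = 0.4 of them initially belongs to the common object, Eq. (5.20) gives |H_L^(M)| = 20 + 7 × 0.6 × 20 = 104 nodes, close to the approximation n_H M(1 − β) = 20 × 8 × 0.6 = 96; the latent class graph is thus roughly M(1 − β) times the size of a single H_i.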

A step-wise demonstration of latent class graph generation is shown in Fig. 5.9.
Here, we have used four images (from Block 3 of Fig. 5.5) for simplicity. Region
adjacency graphs (RAG) H1, H2, H3 and H4, corresponding to the sub-images (I1^(s),
I2^(s), I3^(s), I4^(s)) present in the computed cluster of interest, are used to generate the
latent class graph.

5.4.2 Region Growing

As every graph H_i in cluster C represents partial objects in the respective image I_i,
region growing (RG) needs to be performed on it for object completion. But we should
not grow them independently. As we are doing co-segmentation, we should grow the
graphs with respect to a reference graph that contains information of all the graphs. We
can follow the method described in Chap. 4 that performs region growing on a pair
of graphs jointly. Here, every graph H_i is grown with respect to H_L using the region
growing function F_R(·) in (5.8) to obtain H′_i. This method uses the node-to-node
correspondence information obtained during the latent class graph generation stage to
find the matched subgraph of H_L, and it jointly grows this subgraph of H_L and H_i by
appending similar (in attribute) neighboring nodes to them until convergence. These
neighboring nodes (superpixels) belong to G_i\H_i. Upon convergence, H_i grows to
H′_i, which represents the complete object in image I_i. The set {H′_i, ∀i = 1, 2, . . . , M}
represents the co-segmented objects in the image set I and the solution to the MOCS
problem in Eq. (5.4), as explained in Sect. 5.2.3. This is explained in Fig. 5.10. As the
same H_L, which contains information of all the constituent graphs, is used for growing
every graph H_i, consistent matching in the detected common objects is ensured. The
results of region growing are shown in Block 5 of Fig. 5.5.
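A simplified sketch of the growth loop behind F_R(·) is given below: at every iteration, neighbouring superpixels in G_i \ H_i whose attributes are sufficiently similar to the H_L counterparts of their already-grown neighbours are appended, until no candidate qualifies. The similarity function, threshold and correspondence map are assumptions of this sketch; the book's procedure co-grows against the matched subgraph of H_L exactly as in Chap. 4.

def grow_region(g_i, h_i_nodes, h_L, correspondence, similarity, threshold=0.8):
    """Grow the node set of H_i inside the full RAG G_i with respect to the
    latent class graph H_L (a simplified stand-in for F_R in Eq. (5.8)).

    g_i            : full RAG of image I_i (NetworkX graph)
    h_i_nodes      : nodes currently in H_i (seeds after hole filling)
    h_L            : latent class graph
    correspondence : dict mapping nodes of H_i to nodes of H_L
    similarity     : callable(attrs_a, attrs_b) -> value in [0, 1]
    """
    grown = set(h_i_nodes)
    correspondence = dict(correspondence)      # local copy; grown nodes become anchors
    changed = True
    while changed:
        changed = False
        frontier = {w for v in grown for w in g_i.neighbors(v)} - grown
        for w in frontier:
            # H_L counterparts of the already-grown neighbours of candidate w
            refs = [correspondence[v] for v in g_i.neighbors(w)
                    if v in grown and v in correspondence]
            if not refs:
                continue
            best = max(refs, key=lambda r: similarity(g_i.nodes[w], h_L.nodes[r]))
            if similarity(g_i.nodes[w], h_L.nodes[best]) >= threshold:
                grown.add(w)
                correspondence[w] = best       # let w anchor further growth
                changed = True
    return grown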
It may be noted that in Chap. 4, we compute MCS of graph representations (Gi ’s)
of the input image pair in their entirety. Unlike in Chap. 4, here MCS computation
during latent class graph generation stage in Sect. 5.4.1 involves graphs Hi ’s that
are much smaller than the corresponding Gi ’s (note Hi ⊆ Gi ). Hence, it is not at
all computationally expensive. Moreover, in Chap. 4, the common subgraph pair
obtained using MCS co-grows to the common objects. Unlike in Chap. 4, here we
consider only the growth of every graph Hi with respect to the fixed H L . In Fig. 5.11,
a step-wise demonstration of region growing on the sub-image I4^(s) and its RAG H4
is shown. We obtain the complete object (shown in Fig. 5.11k) in image I4, using the
resulting latent class graph of Fig. 5.9n, as H′_4 (shown in Fig. 5.11l):
\[
\mathcal{H}'_4 = \mathcal{F}_R(\mathcal{H}_L, \mathcal{H}_4, I_4) . \tag{5.21}
\]

Fig. 5.9 Steps for latent class graph generation described in Fig. 5.8. a–d Sub-images present (from
Block 3, Fig. 5.5) in the computed cluster of interest. e–h The corresponding RAGs H1 , H2 , H3
and H4 . Values on axes indicate image dimension. See next page for continuation

Fig. 5.9 (Continued): Steps for latent class graph generation described in Fig. 5.8. i New nodes
(shown in red) have been added to H_L^(1) = H1 after matching it with H2. j Intermediate latent
class graph H_L^(2) = F(H_L^(1), H2) (note the new edges). k New nodes (shown in red) have been
added to H_L^(2) after matching it with H3. l Intermediate latent class graph H_L^(3). m New nodes
(shown in red) have been added to H_L^(3) after matching it with H4. n Final latent class graph
H_L = H_L^(4) = F(H_L^(3), H4)

Fig. 5.10 Region growing (RG). Here H′1, H′2, …, H′M are the graphical representations of the
corresponding co-segmented objects

Algorithm 1 Co-segmentation algorithm

Input: Set of images I1, I2, . . . , IN
Output: Common objects Fi1, Fi2, . . . , FiM present in M images (M ≤ N)
1: for i = 1 to N do
2:   Superpixel segmentation of every image Ii
3:   Obtain region adjacency graph (RAG) representation Gi of every image Ii
4:   Compute background probability P(s) of every superpixel s ∈ Ii
5:   Compute feature h(s) of every superpixel s ∈ Ii
6: end for
7: Cluster all superpixels (from all images) together into K clusters
8: Remove every superpixel s from sub-images if P(s) > threshold tB
9: for j = 1 to K do
10:   Compute compactness Qj of every cluster j
11: end for
12: Select the cluster of interest as C = arg max_j Qj
13: Find RAG Hi (⊆ Gi) in every non-empty sub-image Ii belonging to cluster C, where i = 1, 2, . . . , M and M ≤ N
14: Order Hi's such that |H1| ≥ |H2| ≥ . . . ≥ |HM|
15: // Latent class graph generation
16: H_L^(1) ← H1
17: for i = 2 to M do
18:   Find matches between H_L^(i−1) and Hi
19:   Append non-matched nodes in Hi to H_L^(i−1) and obtain H_L^(i)
20: end for
21: Latent class graph H_L = H_L^(M)
22: // Region growing
23: for i = 1 to M do
24:   Perform region growing on Hi with respect to H_L using Ii and obtain H′_i
25: end for
26: H′_i1, H′_i2, . . . , H′_iM are the graphical representations of the common objects Fi1, Fi2, . . . , FiM

Newly added nodes in every iteration of region growing are highlighted in different
colors. Similarly, region growing is also performed on H1 , H2 and H3 to obtain the
complete co-segmented object in the corresponding images. The overall algorithm
for the method is presented in a complete block diagram in Fig. 5.12 and the complete
algorithmic description as a pseudo-code is given in Algorithm 1.

Fig. 5.11 Steps of region growing on sub-image I4^(s) and its RAG H4 to obtain the complete object.
a Sub-image I4^(s) and b its corresponding RAG H4. Note that this RAG has two components. c, e, g,
i Partially grown object after iterations 1, 2, 3, 4, respectively, and d, f, h, j the corresponding
intermediate graphs, respectively. k Completely grown object after iteration 5 and l the corresponding
graph

Fig. 5.12 The co-segmentation method using a block diagram



Fig. 5.12 (Continued): The co-segmentation method using a block diagram



5.5 Experimental Results

In this section, we analyze the performance of the co-segmentation methods
DCC [56], DSAD [60], MC [57], MFC [18], JLH [78], MRW [64], UJD [105],
RSP [68], GMR [99], OC [131], CMP [36], EVK [24] and the method described in
this chapter (denoted as PM). Experiments are performed on images selected from
the following datasets: the MSRC dataset [105], the flower dataset [92], the Weiz-
mann horse dataset [13], the Internet dataset [105], the 38-class iCoseg dataset [8]
without any outliers and the 603-set iCoseg dataset containing outliers.
We begin by discussing the choice of different parameters in the PM method.
The number of superpixels in every image is set to 200. The RGB color histogram,
mean color and HOG feature vectors are of lengths 36, 3 and 9, respectively. These
three features are concatenated to generate f(s) of length 48. Hence, the length of the
combined feature h(s) is 3 × 48 = 144. In Sect. 5.3.2, the background probability
threshold tB is chosen as 0.75. Further, α = 3 is chosen to prevent spurious points from
growing during the region growing stage, and it helps to discard outlier images. Experiments
are also performed by varying the number of clusters K ∈ [7, 10], and they yield quite
comparable results.

5.5.1 Quantitative and Qualitative Analysis

We first discuss quantitative evaluation, and then visually analyze the results. We
have used Jaccard similarity (J ) and accuracy (A) as the metrics to quantitatively
evaluate [105] the performance of the methods. Jaccard similarity is defined as the
intersection over union between the ground-truth and the binary mask of the co-
segmentation output. Accuracy is defined as the percentage of the number of correctly
labeled pixels (in both common object and background) with respect to the total
number of pixels in the image. For small-sized objects, the measure of accuracy is
heavily biased towards a higher value, and hence Jaccard similarity is commonly the
preferred measure.
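Both metrics can be computed from binary masks as in the following illustrative snippet (not the evaluation code used for the tables).

import numpy as np

def jaccard_and_accuracy(pred_mask, gt_mask):
    """Jaccard similarity (intersection over union) and pixel accuracy (%)
    between a binary co-segmentation mask and the ground-truth mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    jaccard = inter / union if union > 0 else 1.0   # both masks empty -> identical
    accuracy = (pred == gt).mean() * 100.0          # correctly labelled pixels, in percent
    return jaccard, accuracy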
Quantitative result: The values of A and J obtained using different methods on
the images from the iCoseg dataset are provided in Table 5.1. The methods DCC,
DSAD, MC require the output segmentation class to be manually chosen for each
dataset before computing metrics. The poor performance of the method UJD is due
to the use of saliency as a cue as discussed in Chap. 4. It is often argued that one
achieves robustness by significantly sacrificing the accuracy [104]. Hence, the per-
formance of the methods on datasets having no outliers at all is also provided in
Table 5.1. We observe that the method JLH, being a GrabCut based method, per-
forms marginally better when there is no outlier image in the dataset, whereas the
method PM is more accurate than most of the other methods in both presence and
absence of outliers. Table 5.3 provides results of the methods DCC, DSAD, MC,
MFC, MRW, UJD, GMR, EVK, PM on the Internet dataset. Since objects in this

Table 5.1 Accuracy (A) and Jaccard similarity (J ) of the methods PM, DCC, DSAD, MC, MFC,
JLH, MRW, UJD, RSP, GMR, OC, CMP on the dataset created using the iCoseg dataset and the
38-class iCoseg dataset without outliers
Metric                              PM     MRW    CMP    MFC    MC     UJD    DCC    DSAD   JLH    GMR    RSP    OC
A(%) (iCoseg 603) with outliers     90.46  88.67  89.95  88.72  82.42  73.69  75.38  83.94  x      x      x      x
A(%) (iCoseg 38) without outliers   93.40  91.14  92.80  90.00  70.50  89.60  80.00  76.00  x      93.30  x      85.34
J (iCoseg 603) with outliers        0.71   0.65   0.63   0.62   0.41   0.38   0.36   0.26   x      x      x      x
J (iCoseg 38) without outliers      0.76   0.70   0.73   0.64   0.59   0.68   0.42   0.42   0.79   0.76   0.66   0.62
‘x’ : code not available to run on outlier data

dataset have more variations, the h(s) feature in Eq. (5.10) is not sufficient. In par-
ticular, the airplane class contains images of airplanes with highly varying pose,
making it a very difficult dataset for applying unsupervised methods. Hence, some
semi-supervised methods use saliency or region proposal (measure of objectness)
for initialization, whereas some unsupervised methods perform post-processing. For
example, the methods JLH, CMP use GrabCut [102], and EVK uses graph-cut. The
method PM uses additional hand-crafted features such as bag-of-words of SIFT [138]
and histogram features [18]. Similarly, the method GMR uses learnt CNN features to
tackle variations in images. It performs marginally better for the horse class, whereas
the method PM achieves significantly higher Jaccard similarities on car and airplane
classes, achieving the highest average Jaccard similarity. Table 5.2 shows quantita-
tive comparison on the images from the Weizmann horse dataset [13] and the flower
dataset [92], respectively. The quantitative analysis shows the method PM performs
well in all datasets, whereas performance of other methods vary across datasets.
Qualitative result: We show visual comparison of the co-segmentation outputs
obtained using the methods PM, DCC, DSAD, MC, MFC, MRW, UJD on images
from the iCoseg dataset [8] in Fig. 5.13. The method PM correctly co-segments the
soccer players in red, whereas other methods wrongly detect additional objects (other
players, referees and signboards). Moreover, they cannot handle the presence of out-

Table 5.2 Accuracy (A) and Jaccard similarity (J ) of the methods PM, DCC, DSAD, MC, MFC,
MRW, UJD, CMP on images selected from the Weizmann horse dataset [13] and the flower
dataset [92]
Horse data     PM     MFC    CMP    MRW    DSAD   MC     DCC    UJD
A(%)           95.61  91.18  91.37  93.45  89.76  83.30  84.82  63.74
J              0.85   0.76   0.76   0.75   0.69   0.61   0.58   0.39
Flower data    PM     MFC    CMP    MRW    DSAD   MC     DCC    UJD
A(%)           94.50  94.78  92.82  89.61  80.24  79.36  78.70  53.84
J              0.85   0.82   0.73   0.70   0.71   0.56   0.52   0.45

Table 5.3 Jaccard similarity (J ) of the methods PM, DCC, DSAD, MC, MFC, MRW, UJD, GMR,
EVK on the Internet dataset [105]
Metric              PM     GMR    UJD    MFC    CMP    EVK    MRW    DCC    MC     DSAD
J (Car class)       0.703  0.668  0.644  0.523  0.495  0.648  0.525  0.371  0.352  0.040
J (Horse class)     0.556  0.581  0.516  0.423  0.477  0.333  0.402  0.301  0.295  0.064
J (Airplane class)  0.625  0.563  0.558  0.491  0.423  0.403  0.367  0.153  0.117  0.079
J (Average)         0.628  0.604  0.573  0.479  0.470  0.461  0.431  0.275  0.255  0.061

lier images (containing baseball players in white) unlike PM. Figure 5.14 shows the
co-segmentation results of the methods PM, MFC, MRW on images from the flower
dataset [92]. The method MRW incorrectly co-segments the horse (in the outlier
image) with the flowers. Figure 5.15 shows the co-segmentation outputs obtained
using the methods PM, MFC, MRW on images from the MSRC dataset [105]. Both
MFC and MRW fail to discard the outlier image containing sheep. These results
show the robustness of PM in presence of outlier images in the image sets to be
co-segmented.

Fig. 5.13 Co-segmentation results on an image set from the iCoseg dataset. For the input image set (includes two outlier images) shown in Row A, the
co-segmented objects obtained using MRW, MFC, UJD, MC are shown in Rows B-E, respectively

Fig. 5.13 (Continued): Co-segmentation results on an image set from the iCoseg dataset. The co-segmented objects obtained using DCC, DSAD and PM are
shown in Rows F-H, respectively. Row I shows the ground-truth

Fig. 5.14 Co-segmentation results on an image set from the flower dataset. For the input image
set (that includes one outlier image of horse) shown in Row A, the co-segmented objects obtained
using MRW, MFC and PM are shown in Rows B–D, respectively

5.5.2 Multiple Class Co-segmentation

Images analyzed in Sect. 5.5 consisted of objects primarily belonging to a single
class only. In real life, there could be multiple classes of common objects (e.g., a
helicopter in M1 images, a cheetah in M2 images with M1 + M2 ≤ N). We now
helicopter in M1 images, a cheetah in M2 images with M1 + M2 ≤ N ). We now
demonstrate how the method described in this chapter can be adapted to handle
co-segmentation of multiple class common objects. In Fig. 5.17, we show the co-
segmentation results of a set of 22 images containing two different common objects.
The intermediate clustering result for this set is given in Fig. 5.16. First, the cluster
(C1 = 7) having the largest compactness (Q j in Eq. (5.12)) is selected. Then latent
class graph generation and region growing are performed on that cluster to extract
the first common object (cheetah). Then the cluster (C2 = 10) having the second
largest compactness is selected, and the same procedure is repeated on the left-over
data to extract the second common object (helicopter). It is quite clear from Fig. 5.17
that this method is able to co-segment objects belonging to both classes quite
accurately. It may be noted that this method, being unsupervised, does not use any
class information while co-segmenting an image set. Here, we can only identify two
subsets of images belonging to different classes, without specifying the class, and provide
the segmented objects in them.

Fig. 5.15 Co-segmentation results on the cow image set from the MSRC dataset. For the input image set (that includes one outlier image of sheep) shown in
Row A, the co-segmented objects obtained using PM, MRW, MFC are shown in Rows B-D, respectively. Unlike the methods MFC and MRW, the method PM
(Row B) is able to reject the sheep as co-segmentable objects (shown in the final column)

Fig. 5.16 Clustering a set of 22 images (from the iCoseg dataset) containing two different classes
(helicopter and cheetah) into ten clusters. The input images are shown in Row 1. Rows 2 to 11 show
sub-images in clusters 1 to 10, respectively. Out of 22 images, 8 are shown here, and remaining are
shown in next two pages. Reader may view all images simultaneously for better understanding

Fig. 5.16 (Continued): Clustering a set of 22 images (from the iCoseg dataset) containing two
different classes (helicopter and cheetah) into ten clusters. The input images are shown in Row 1.
Rows 2 to 11 show sub-images in clusters 1 to 10, respectively. Out of 22 images, 7 are shown here,
and remaining are shown in previous page and next page
5.5 Experimental Results 119

Fig. 5.16 (Continued): Clustering a set of 22 images (from the iCoseg dataset) containing two
different classes (helicopter and cheetah) into ten clusters. The input images are shown in Row 1.
Rows 2 to 11 show sub-images in clusters 1 to 10, respectively. Out of 22 images, 8 are shown here,
and remaining are shown in previous two pages
120

Fig. 5.17 Co-segmentation when the image set contains two different classes of common objects. Rows A, B together show the input set of 22 images (from the
iCoseg dataset) containing two objects helicopter and cheetah. Rows C,D show the extracted common objects. The numbers (1) and (2) indicate that cheetahs
and helicopters have been obtained by processing clusters C1 and C2 , respectively
5 Maximally Occurring Common Subgraph Matching
5.5 Experimental Results 121

Table 5.4 Computation time (in seconds) required by the methods PM, DCC, DSAD, MC, MFC, MRW, UJD for image sets of different cardinalities on an i7 3.5 GHz PC with 16 GB RAM

No. of images   PM    DSAD   MFC    DCC    MRW    UJD    MC
8               12    41     69     91     256    266    213
16              33    81     141    180    502    672    554
24              64    121    205    366    818    1013   1202
32              111   162    273    452    1168   1342   1556
40              215   203    367    1112   2130   1681   3052
60              350   313    936    1264   5666   2534   4064
80              532   411    1106   2911   X      3353   5534
PE              M     M+C    M+C    M+C    M+C    M+C    M+C

Here, PE stands for programming environment, M for Matlab and C for C language, and X stands for the case when Matlab ran out of memory

5.5.3 Computation Time

In Table 5.4, we show the computation time (from reading images to that of obtaining
the co-segmented objects) taken by the methods PM, DCC, DSAD, MC, MFC, MRW,
UJD for sets of 8, 16, 24, 32, 40, 60 and 80 images. In the case of the method PM,
the supra-linear growth in computation with respect to N is due to the increased
cardinality of the latent class graph as explained in Sect. 5.4. It is evident from the
results that the method PM, despite being run in Matlab, is computationally very fast
compared to all other methods. Method DSAD is actually faster but the performance,
as given in Table 5.1, is very poor in comparison.
The method CMP uses region proposals, which are computationally very expensive,
taking on average 32 min to co-segment a set of 10 images. It is worth mentioning
that the method PM processes all the images in the set simultaneously, unlike some
methods. For example, the method MRW cannot handle large image sets. It
generates multiple random subsets of the image set, performs co-segmentation on them
and computes the average accuracy over all the subsets. The method JLH requires O(n_R)
operations in every round (10 rounds are needed) of the foreground model update alone,
where n_R is the number of generated mid-level regions. The method UJD performs
SIFT matching with only 16 most similar images in the set, thus reducing the number
of matching operations to O (16N ) in each round with the number of rounds ranging
from 5 to 10. In addition to this, it requires optimizing a cost function in every round
to find pixel labels. The method MFC also performs multiple rounds of foreground
label refinement by considering all the images in every round. In contrast, the method
PM processes all the images in one round to obtain the latent class graph using only
O(N) matching operations. The methods DCC, MC subsample the images to reduce
computation, with the order of computation being O(n_P K) and O(n_S^2), respectively,
where n_P is the number of pixels in every image, n_S is the total number of superpixels
in the image set and K is the number of classes being considered. Since the method
PM uses only the superpixels from the most compact cluster to build the latent class
graph, the cardinality of every graph (Hi ⊆ Gi ) is very small compared to the total
number of superpixels in that image Ii . Hence, it is computationally very efficient.
In Sect. 5.3.3, cycle bases in a graph G = (V, E) can be computed using depth-first
search with complexity O(|V| + |E|). Hole filling for a planar graph in Sect. 5.3.3
can be alternatively implemented using morphological operations. More specifically,
the interior nodes of a cycle can be found using morphological reconstruction by
erosion [118] on the binary image formed by the superpixels that constitute the
cycle.
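To make the morphological alternative concrete, the following is a minimal Python sketch (assuming NumPy and scikit-image are available) of hole filling on the binary image formed by the superpixels of a cycle; the function name and the use of scikit-image's reconstruction routine are illustrative choices, not part of the method described above.

```python
import numpy as np
from skimage.morphology import reconstruction

def fill_holes(binary_mask):
    """Fill interior holes of a binary region using morphological
    reconstruction by erosion: the seed equals the mask on the image
    border and is maximal everywhere else, so the reconstruction can
    only shrink down onto the region and its enclosed holes."""
    seed = np.ones_like(binary_mask, dtype=float)
    seed[0, :] = binary_mask[0, :]
    seed[-1, :] = binary_mask[-1, :]
    seed[:, 0] = binary_mask[:, 0]
    seed[:, -1] = binary_mask[:, -1]
    filled = reconstruction(seed, binary_mask.astype(float), method='erosion')
    return filled > 0.5
```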
In this chapter, we have described a fast method for image co-segmentation in
an unsupervised framework applicable to a large set of images in which an unknown
number of them (the majority class) contain the common object. The
method is shown to be robust against the presence of outlier images in the dataset. We
have discussed the concept of a latent class graph to define a combined representation
of all unique superpixels within a class, and this graph is used to detect the com-
mon object in the images using O(N) operations of subgraph matching. The use of
the latent class graph also helps to maintain global consistency in matches. It may
be noted that the accuracy of the boundary detection in common objects depends
on the accuracy of the superpixel generation. This can be further improved using
post-processing by GrabCut [102] or image matting [65] and is not pursued in this
chapter.
In this MOCS based co-segmentation algorithm, latent class graph generation and
region growing are performed on the sub-images in the cluster of interest. However, the choice
of this cluster based on the compactness measure may not always be accurate. Some
superpixels from the outlier images may also belong to this cluster due to their feature
similarity with some regions in the images from the major class. This may lead to
poor results, and the outlier images may not get excluded. In the next chapter, we
describe another co-segmentation method to address this problem.
Chapter 6
Co-segmentation Using a Classification
Framework

6.1 Introduction

Image co-segmentation, as explained in the earlier chapter, is useful for finding objects of
common interest from a large set of crowd-sourced images. These images
are generally captured by different people using different cameras under different
illuminations and compositional contexts. These variations make it very difficult to
extract the common object(s). Further, as discussed in Chap. 5, the presence of outlier
data, i.e., totally irrelevant images (see Fig. 6.1 for illustration), in the database makes
the co-segmentation problem even more difficult.
Since we are required to co-segment natural images, the common objects may
not be homogeneous. Hence, feature selection is difficult and use of low-level and
mid-level features, whose computation does not require any supervision, may not
yield good results. In order to obtain high-level features, semi-supervised methods
in [36, 71, 74, 85, 86, 131] compute region proposals from images using pretrained
networks, whereas Quan et al. [99] use CNN features. However in this chapter,
we do not involve any supervised learning [50, 72, 106, 123, 146] and describe a
method for robust co-segmentation of an image set containing outlier images in a
fully automated unsupervised framework, yet yielding excellent accuracy.

6.1.1 Problem Definition

In a set of images to be co-segmented, typically (i) the common object regions in
different images are concentrated in the feature space since they are similar in fea-
ture, (ii) the background varies across images and (iii) the presence of unrelated
images (outliers) would produce features of low concentration away from the com-
mon object feature points. Hence in the space containing features of all regions from
the entire image set, the statistical mode and a certain neighborhood around it (in
the feature space) corresponds to the features of the common object regions. So,

finding the mode in the multidimensional feature space can be a good starting point
for performing co-segmentation. This problem, however, is difficult for two reasons.
Firstly, computation of mode in higher-dimensional feature space is inherently very
difficult. Secondly, variations in ambient imaging conditions across different images
make the population density less concentrated around the mode.

Fig. 6.1 Image co-segmentation. Top row: an input image set with four of them having a common object and the fifth one without any commonality is the outlier image. Bottom row: the common object (cheetah). Image Courtesy: Source images from the iCoseg dataset [8]

Method overview. First, the images to be co-segmented (Fig. 6.2a) are tessellated
into superpixels using the simple linear iterative clustering (SLIC) method [1]. Then
low-level and mid-level features for every superpixel (Fig. 6.2b) are computed. Next,
mode detection is performed in these features in order to classify a subset of all
superpixels into background (Fig. 6.2e) and common foreground classes (Fig. 6.2c),
in an unsupervised manner and mark them as seeds. Although these features are
used to compute the mode, they are not sufficient for finding the complete objects
since objects may not be homogeneously textured within an image and across the
set of images. So, an appropriate distance measure between the common foreground
class and the background class(es) is defined and a discriminative space is obtained
(using the seed samples) that maximizes the distance measure. In this discriminative
space, samples from the same class come closer and samples from different classes
go far apart (Fig. 6.2h). This discriminative space appropriately performs the task of
maximally separating the common foreground from the background regions. Thus,
it better caters to robust co-segmentation than the initial low-level and mid-level
feature space. Next, to get the complete labeled objects, the seed region is grown
in this discriminative space using a label propagation algorithm that assigns labels
to the unlabeled regions as well as updates the labels of the already labeled regions
(Fig. 6.2i). The discriminative space computation stage and the label propagation
stage are iterated till convergence to obtain the common objects (Fig. 6.2j–l).

Fig. 6.2 Co-segmentation algorithm: a Input images for co-segmentation. b Features extracted for each superpixel. c Mode detection performed in the feature space to obtain initial foreground seed samples. e Background seed samples computed using a background probability measure. d The remaining samples that do not belong to the foreground/background seeds. f Clustering performed on background seed samples to obtain K background clusters. g The labeled seed samples from 1 + K classes, and unlabeled samples shown together (compare g with b). All samples are fed as input for cyclic discriminative subspace projection and label propagation. h A discriminative subspace is learned using the labeled samples, such that same class samples come closer and dissimilar class samples get well-separated. i Label propagation assigns new labels to unlabeled samples as well as updates previously assigned labels (both h and i are repeated alternately until convergence). j The final labeled and few unlabeled samples. Foreground labeled samples (in green) are used to obtain k the co-segmentation mask and l the common object. Image Courtesy: Source images from the iCoseg dataset [8]

The
salient aspects of this chapter are
• We describe a multi-image co-segmentation method that can handle outliers.
• We discuss a method for statistical mode detection in a high-dimensional feature
space. This is a difficult problem and has never been attempted in computer vision
applications.
• We explain a foreground–background distance measure designed for the problem
of co-segmentation and compute a discriminative space based on this distance
measure, in an unsupervised manner, to maximally separate the background and
the common foreground regions.
• We show that discriminative feature computation alleviates the limitations of low-
level and mid-level features in co-segmentation.
• We describe a region growing technique, using label propagation with spatial
constraints, that achieves robust co-segmentation.
In Sect. 6.2, we describe the co-segmentation algorithm through mode detection,
discriminative feature computation and label propagation. We report experimental
results in Sect. 6.3.

6.2 Co-segmentation Algorithm

For every superpixel extracted from a given set of input images, first a feature
representation based on low-level (Lab color, SIFT [127]) and mid-level (locality-
constrained linear coding-based bag-of-words [138]) features is obtained. Then in
that feature space, the mode of all the superpixel samples is detected. This mode is
used to label two subsets of samples that partially constitute the common foreground
and the background, respectively, with high confidence. Next using them as seeds,
the remaining superpixels are iteratively labeled as foreground or background using
label propagation. Simultaneously, the labels of some of the incorrectly labeled seed
superpixels get updated, where appropriate. In order to increase the accuracy of label
propagation, instead of using the input feature space directly, a discriminative space
is obtained where the foreground and background class samples are well separated,
aiding more robust co-segmentation. All these stages are explained in the following
sections.
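As a concrete illustration of the first step, the sketch below (assuming scikit-image is available; the value n_segments = 200 is an arbitrary choice, not one prescribed by the method) tessellates an image with SLIC and computes the mean L*a*b* color of each superpixel, i.e., only the low-level part of the feature vector; the SIFT/LLC mid-level features would be concatenated to it.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import slic

def superpixel_lab_features(image_rgb, n_segments=200):
    """Tessellate an RGB image into SLIC superpixels and return, for each
    superpixel, the mean L*a*b* color (the low-level part of the feature
    vector; SIFT/LLC mid-level features would be concatenated to it)."""
    labels = slic(image_rgb, n_segments=n_segments, compactness=10, start_label=0)
    lab = rgb2lab(image_rgb)
    feats = np.array([lab[labels == k].mean(axis=0) for k in range(labels.max() + 1)])
    return labels, feats  # labels: per-pixel superpixel index, feats: (n_sp, 3)
```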

6.2.1 Mode Estimation in a Multidimensional Distribution

We expect that the superpixels constituting the common object in multiple images
have similar features and they are closer to each other in the feature space compared
to the background superpixels. Furthermore, the superpixels from the outlier images
have features quite distinct from the common object. Under this assumption, the
seeds for the common foreground are obtained as the set of superpixels belonging to
the densest region in the feature space. To find the seeds, we first introduce the notion
of dominant mode region in the multidimensional feature space by representing every
superpixel as a data point in the feature space.
Definition of mode: Let p(x) denote the probability density function (pdf) of the
samples x ∈ R^D, which is the D-dimensional feature of a superpixel. Then x_0 is the
mode [110] if for each ε > 0 there exists δ > 0 such that the distance d(x, x_0) > ε
implies
$$p(x_0) > p(x) + \delta. \quad (6.1)$$

Here, a nonzero δ is chosen to eliminate the case where a sequence {x_i} is bounded away
from x_0, but p(x_i) → p(x_0). To prove the existence of the mode in a distribution,
Sager [110] showed that given a sequence of integers {ℓ(j)} such that
$$\ell(j)/j = o(1), \quad (6.2)$$
$$j/(\ell(j) \log j) = o(1), \quad (6.3)$$
with $S_j^{\ell(j)}$ being the smallest volume set containing at least ℓ(j) samples, if $x_0^{(j)} \in S_j^{\ell(j)}$ for
each j, then $x_0^{(j)} \to x_0$ almost surely.
Since the superpixel feature space is not densely populated by sample points, the
previous relationship may be modified as:
$$\int_{Cir(\nu)} p(x_0 + g)\,dg > \int_{Cir(\nu)} p(x + g)\,dg + \delta \quad (6.4)$$

with g ∈ R D and Cir(ν) denoting the integral over a ball of radius ν. One can find
the dominant mode by fitting a Gaussian kernel with an appropriate bandwidth on
the data points when the feature space is densely populated. But in this setup, the
data points are sparse, i.e., the total number of samples (n_A) is small with respect
to the feature dimension, and hence Gaussian kernel fitting is not useful. So, the
mode for multidimensional data points is very difficult to compute, unlike in the
unidimensional case [9, 27, 129]. Though multidimensional mode for very low
dimensions can be computed using Hough transform or EM algorithm, it is not
applicable at high dimension.
Let the entire collection of superpixels in the feature space be modeled as a mixture
of samples from the common foreground (F) and the background (B):

p(x) = ξ F p F (x) + (1 − ξ F ) p B (x), (6.5)

where ξ F is the mixing proportion, p F (x) and p B (x) are the pdfs of foreground
and background samples, respectively. The superpixels belonging to common fore-
ground have similar visual characteristics. Hence, their corresponding data points
are expected to be in close proximity in the feature space. Therefore, without loss of
generality, we can assume that the data samples of F are more concentrated around
a mode in the feature space. On the other hand, the superpixels belonging to the
background come from different images. Hence, they will have a much diverse set of
features that are widely spread in the feature space. Thus p F (x) should have a much
lower variance than p B (x). Let us define

$$E(x, \nu) \triangleq \int_{Cir(\nu)} p(x + g)\,dg. \quad (6.6)$$

For the case of ξ_F = 1 (i.e., not a mixture distribution) and assuming p_F(·) to be
spherically symmetric, Sager [110] computed the mode (x_0) by finding the radius
(ν_0) of the smallest volume containing a certain number of points (n_0), such that
E(x_0, ν_0) = n_0/n_A. Mathematically,
$$x_0 = \operatorname{Solution}_{x} \Big\{ \inf_{\nu} \int_{Cir(\nu)} p(x + g)\,dg = \frac{n_0}{n_A} \Big\}. \quad (6.7)$$

Although the estimator has been shown to be consistent, it is not known how one
can select n 0 in Eq. (6.7). Thus, we need to handle two specific difficulties: (i)
how to extend the method to deal with mixture densities and (ii) how to choose n 0
appropriately. From Eq. (6.5), integrating both sides over a ball of radius ν, we obtain

E(x, ν) = ξ F E F (x, ν) + (1 − ξ F ) E B (x, ν). (6.8)

For the data points belonging to the background class, p B (x) can be safely assumed
to be uniformly distributed. We also observe that
$$E_B(x, \nu) \propto \Big( \frac{\nu}{d_{max}} \Big)^{D}, \ \text{for all } x, \quad (6.9)$$

where dmax is the largest pairwise distance among all feature points, and hence
E B (x, ν) is very small. However, due to centrality in the concentration of p F (x)
(e.g., say Gaussian distributed), E F (x, ν) is very much location dependent and is
high when x = x0 (i.e., mode). Hence, Eq. (6.8) may be written as:

$$E(x, \nu) = \xi_F E_F(x, \nu) + (1 - \xi_F) E_B(\nu) \triangleq \kappa_F + \kappa_B. \quad (6.10)$$

Thus, a proper mode estimation requires that we select κ_F > κ_B, and although
E_F(x, ν) and E_B(ν) are both monotonically increasing functions of ν, dE_F/dν becomes
smaller than dE_B/dν due to the centrality in concentration of the foreground class beyond
some value ν = ν_m. Ideally, one should select 0 < ν_0 ≤ ν_m while extending Sager's
method [110] to mixture distributions. Hence, we need to (i) ensure an upper bound
ν_m on ν and (ii) constrain κ_F + κ_B to a small value (say κ_m) while searching for
the mode x_0. For example, we may set κ_m = n_0/n_A = 0.2 and ν_m = 0.6 d_max to decide
on the value of ν0 . Further assumptions can be made on the maximum background
(superpixels from p B (x)) contamination (say α%) within the neighborhood of the
mode x0 . Thus at the end of co-segmentation if one finds more than α% foreground
labeled data points being changed to background labels, it can be concluded that
mode estimation was unsuccessful. In order to speed up the computation, the mode
can be approximated as one of the given data points only, as is commonly done for
computing the median.
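A minimal sketch of this data-point-restricted mode search is given below, assuming each superpixel is represented by a row of a feature matrix; the example values κ_m = 0.2 and ν_m = 0.6 d_max follow the text, while the function and variable names are only illustrative.

```python
import numpy as np

def approximate_mode(X, kappa_m=0.2, nu_frac=0.6):
    """Approximate the dominant mode by the data point whose smallest ball
    containing a kappa_m-fraction of all samples has the minimum radius,
    subject to the upper bound nu_m = nu_frac * d_max on that radius."""
    n_A = X.shape[0]
    n0 = max(1, int(np.ceil(kappa_m * n_A)))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nu_m = nu_frac * dist.max()
    # radius of the smallest ball centred at each sample containing n0 samples
    radii = np.sort(dist, axis=1)[:, n0 - 1]
    radii = np.where(radii <= nu_m, radii, np.inf)   # enforce 0 < nu_0 <= nu_m
    mode_idx = int(np.argmin(radii))
    seeds = np.where(dist[mode_idx] <= radii[mode_idx])[0]  # C_F seed candidates
    return mode_idx, seeds
```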
Seed labeling. In the feature space, the mode and only a certain number of data
points in close proximity to the mode are chosen as seeds for the common fore-
ground C F (Fig. 6.2c). A restrictive choice of seeds is ensured by the bounds on ν0
and κm . As this is an approximate solution, the mode region may yet contain a few
background samples also. In Sect. 6.2.3, we will show how this can be corrected.
In a set of natural images, it is quite usual to have common background, e.g., sky,
field, etc. In such cases, superpixels belonging to these common background seg-
ments may also get detected as the mode region in the feature space. To avoid such
situations, the background probability measure of Zhu et al. [151] can be used to
compute the probability P (s) that a superpixel s belongs to the background. Only
the superpixels having background probability P (s) < t1 can be considered during
mode detection, where t1 is an appropriately chosen threshold, thus eliminating some
of the background superpixels. A small value of t1 helps reduce false positives. A
superpixel is marked as a background seed if P (s) > t2 . A high value of threshold
t2 ensures a high confidence during initial labeling. Figure 6.3a shows seed regions
in the common foreground (C F ) and background (C B ) for an image set.

6.2.2 Discriminative Space for Co-segmentation

To obtain the common foreground object, we need to assign labels to all superpixels.
Hence, the set of common foreground and background superpixels obtained as seeds
are used to label the remaining unlabeled superpixels as well as to update the already
labeled superpixels using label propagation. To achieve a more accurate labeling and
better co-segmentation, it is beneficial to have the following conditions:
R1: Maximally separate the means of foreground and background classes.
R2: Minimize the within-class variance of the foreground F to form a cluster.
R3: Minimize the within-class variance of the background B to form another separate
cluster.
With this motivation, a discriminative space is learned using the labeled samples. In
this space, dissimilar class samples get well separated and same class samples come
closer, thus satisfying the above conditions. Since labeled and unlabeled samples
come from the same distribution, the unlabeled samples come closer to the correctly
labeled samples of the corresponding ground-truth class. This better facilitates the
subsequent label propagation stage, yielding more accurate co-segmentation.
As the background superpixels belong to different images and there is usually
large diversity in these backgrounds, this superpixel set is heterogeneous, having
largely varying features. This makes the background distribution multimodal.

Fig. 6.3 a Seed labeling. Row 1 shows a set of input images for co-segmentation. Row 2 shows the regions corresponding to the common foreground (C_F) seed superpixels. Row 3 shows regions corresponding to the background seed superpixels (C_B). b The heterogeneous background C_B seed is grouped into K = 3 clusters. The three rows show the regions corresponding to the background clusters C_B1, C_B2 and C_B3, respectively

Fig. 6.4 R3 (Sect. 6.2.2): Satisfying R3 is equivalent to meeting requirements R3a and R3b sequentially

heterogeneous nature of the background can be observed even in the background
seeds as shown in Fig. 6.3a (Row 3). Hence, R3, i.e., enforcing the multimodal back-
ground class to form a single cluster, becomes an extremely difficult requirement
for the space to satisfy. However for co-segmentation, we are interested in accurate
labeling of the foreground (rather than the background), for which R1 and R2 are
more important than R3. Hence, R3 can be relaxed to a simpler condition R3a by
allowing background samples to group into some K clusters (with some minimum
within-cluster variance).
Justification: As illustrated in Fig. 6.4, meeting R3 is equivalent to meeting two
requirements R3a and R3b sequentially.
• R3a: transforming the multimodal background class to form multiple (some K )
clusters, with some minimum within-cluster variance (without any constraint on
the separation of cluster means).
• R3b: transforming all the above clusters to have a minimum between-cluster vari-
ance and finally form a single cluster.
The difficulty in meeting R3 is mostly due to R3b which is necessarily seeking a
space where the multiple clusters have to come closer and form a single cluster.
Clearly, achieving R3a alone is a much simpler task than achieving both R3a and
R3b. So, the final requirements for the discriminative space to satisfy are R1, R2 and
R3a.

In order to facilitate R3a, the given background seeds are grouped into clusters.
Figure 6.3b illustrates a case where the clustering algorithm forms K = 3 background
clusters (C B1 , C B2 , C B3 ). Next, using the foreground (C F ) and K background clusters,
a discriminative space is learned that satisfies R1, R2 and R3a. As mentioned ear-
lier, in R3a, we do not need to enforce any separation of the K background clusters
within themselves. This is intuitive, as for co-segmentation we are only interested in
separating the foreground class from the background, and not necessarily in the sep-
aration among all 1 + K classes. The significance of this multiple cluster modeling
of the background will be illustrated in Sect. 6.3.1, in terms of improved accuracy.
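Since the text leaves the clustering algorithm open, the sketch below uses k-means (via scikit-learn) as one possible choice for grouping the background seed features into K clusters; K = 3 mirrors the example of Fig. 6.3b and is not a prescribed value.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_background_seeds(X_B, K=3, random_state=0):
    """Group the heterogeneous background seed features (n_B x D) into K
    clusters C_B1, ..., C_BK so that requirement R3a can be targeted
    instead of the much harder single-cluster requirement R3."""
    km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(X_B)
    return [X_B[km.labels_ == k] for k in range(K)]
```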
Learning the discriminative space. There exist various works on finding a dis-
criminative space for person re-identification [143], action classification [98], face
recognition [95, 141], human gait recognition [62], object and image classifica-
tion [34, 144], and character recognition [148]. However, these methods give equal pri-
ority to discriminating all classes. Hence, they are not appropriate for meeting the
above-mentioned specific requirements for co-segmentation. In order to find the
appropriate discriminative space, first a measure of separation between the fore-
ground class (C F ) and the background classes (C B1 , C B2 , . . ., C BK ) is defined based
on the requirements mentioned previously. Then the optimal discriminative space
that maximizes this separation is computed.
Let X ∈ R^{D×n_T} contain the feature vectors of all (say n_T) labeled superpixels with
n_T = n_F + n_B, where n_F is the number of labeled superpixels in class C_F and
n_B = Σ_{k=1}^{K} n_{B_k} is the total number of samples from all the background classes. Here, the
goal is to compute discriminant vectors that maximize the foreground–background
separation E, defined to be of the form
$$\mathcal{E} = \frac{d(C_F, \cup_{k=1}^{K} C_{B_k})}{v(C_F, \cup_{k=1}^{K} C_{B_k})}, \quad (6.11)$$
where d(·) is a measure of foreground–background feature distance (achieves R1
and R3a) and v(·) is the measure of variance in all classes (achieves R2 and R3a),
as defined next.
$$d(C_F, \cup_{k=1}^{K} C_{B_k}) = \sum_{k=1}^{K} \frac{n_{B_k}}{n_B} d(C_F, C_{B_k}), \quad (6.12)$$
where d(C_F, C_{B_k}) is the inter-class distance between the common foreground class
C_F and any background class C_{B_k}, defined as
$$d(C_F, C_{B_k}) = (m_{B_k} - m_F)^T (m_{B_k} - m_F) = \operatorname{tr}\big\{ (m_{B_k} - m_F)(m_{B_k} - m_F)^T \big\}, \quad (6.13)$$

where m F is the mean of feature vectors in the foreground class C F and m Bk is the
mean of background class C Bk . Here, d (·) is formulated using only the distances
d(C F , C Bk ) to achieve large discrimination between C F and C Bk , ∀k.

It is quite possible that two classes Ci , C j , with large intra-class variances overlap,
and they still have large inter-class distance d(Ci , C j ). Hence in Eq. (6.11), d (·) is
normalized using v (·), which is given by

$$v(C_F, \cup_{k=1}^{K} C_{B_k}) = \omega_F \frac{n_F}{n_T} S(C_F) + \sum_{k=1}^{K} \omega_{B_k} \frac{n_{B_k}}{n_T} S(C_{B_k}), \quad (6.14)$$

where S(C) denotes a measure of scatter of class C and ω is the corresponding weight
of the class. Typically, the characterization S(C) ≜ tr(V) is used to represent the
scatter where V is the covariance matrix of the class C . As we are interested in
separating the foreground, we require the reduction in the scatter of foreground C F
to be more than that of the background classes. Hence, to give more weight to V F ,
we should select ω F > ω Bk . A good choice can be ω F = 1, ω Bk = 1/K .
Using definitions of d (·) and v (·) in Eqs. (6.12)–(6.14), we rewrite the
foreground–background separation in Eq. (6.11) as:
$$\mathcal{E} = \frac{\sum_{k=1}^{K} \frac{n_{B_k}}{n_B} \operatorname{tr}\big\{ (m_{B_k} - m_F)(m_{B_k} - m_F)^T \big\}}{\frac{n_F}{n_T} \operatorname{tr}(V_F) + \frac{1}{K} \sum_{k=1}^{K} \frac{n_{B_k}}{n_T} \operatorname{tr}(V_{B_k})} = \frac{\operatorname{tr}(Q_{fb})}{\operatorname{tr}(Q_w)}, \quad (6.15)$$

where the foreground–background inter-class scatter matrix Q_fb ∈ R^{D×D} and the
intra-class (or within-class) scatter matrix Q_w ∈ R^{D×D} are given as
$$Q_{fb} = \frac{1}{n_B} \sum_{k=1}^{K} n_{B_k} (m_{B_k} - m_F)(m_{B_k} - m_F)^T \quad (6.16)$$
$$Q_w = \frac{1}{n_T} \Big\{ n_F V_F + \frac{1}{K} \sum_{k=1}^{K} n_{B_k} V_{B_k} \Big\}, \quad (6.17)$$

and they represent variances among the superpixel features in X. A high value of E
implies that the foreground is well-separated from the background classes and the
above formulations of Q f b and Qw ensure this.
Next, we seek a discriminative space that maximizes the above-defined foreground–
background separation E . Let W = [w1 w2 . . . w Dr ] ∈ R D×Dr be the projection
matrix for mapping each data sample x ∈ R D to the discriminative space to obtain
z ∈ R Dr , where Dr < D.
$$z = W^T x \quad (6.18)$$

Similar to Eqs. (6.16) and (6.17), the foreground–background inter-class and intra-
class scatter matrices $Q^W_{fb}$ and $Q^W_w$ are derived using all the projected data {z} in the
discriminative space as follows:
$$Q^W_w = W^T Q_w W \quad (6.19)$$
$$Q^W_{fb} = W^T Q_{fb} W. \quad (6.20)$$

A large value of tr{Q^W_{fb}} ensures good separation between the common foreground
class and the background classes. A small value of tr{Q^W_w} ensures less overlap
among the classes, in the projected domain. Hence, the optimal W is obtained by
maximizing the foreground–background separation in the feature space.
$$\max_{W} \operatorname{tr}(Q^W_{fb}) / \operatorname{tr}(Q^W_w) = \max_{W} \frac{\operatorname{tr}(W^T Q_{fb} W)}{\operatorname{tr}(W^T Q_w W)} \quad (6.21)$$

This is a trace ratio problem, for which an approximate closed-form solution can be
obtained by transforming it to a ratio trace problem [32]:

$$\max_{W} \operatorname{tr}\big\{ (W^T Q_w W)^{-1} (W^T Q_{fb} W) \big\}. \quad (6.22)$$

This can be efficiently solved as a generalized eigenvalue problem:

$$Q_{fb} w_i = \lambda Q_w w_i, \ \text{i.e.,} \ Q_w^{-1} Q_{fb} w_i = \lambda w_i. \quad (6.23)$$

Thus, the solution W, which contains the discriminants w_i ∈ R^D, is determined by the
eigenvectors corresponding to the D_r largest eigenvalues of Q_w^{-1} Q_{fb}. After solv-
ing for W, all the superpixel samples in the low-level and mid-level feature space
are projected onto the discriminative space using Eq. (6.18). These samples in the
discriminative space are used for label propagation as described in Sect. 6.2.3.
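The scatter-matrix construction and the generalized eigenvalue solution can be sketched as follows; this is a minimal NumPy/SciPy illustration, not the authors' implementation, and the small ridge term eps added to Q_w is an assumption made only to keep the matrix invertible when the seed set is small.

```python
import numpy as np
from scipy.linalg import eigh

def discriminative_projection(X_F, X_B_list, D_r, eps=1e-6):
    """Build the scatter matrices of Eqs. (6.16) and (6.17) from foreground
    seed features X_F (n_F x D) and a list of background-cluster features
    X_B_list, then solve the generalized eigenvalue problem of Eq. (6.23)."""
    D = X_F.shape[1]
    n_F = X_F.shape[0]
    m_F = X_F.mean(axis=0)
    V_F = np.cov(X_F, rowvar=False)
    n_Bk = np.array([len(Xb) for Xb in X_B_list])
    n_B, K = n_Bk.sum(), len(X_B_list)
    n_T = n_F + n_B

    Q_fb = np.zeros((D, D))
    Q_w = (n_F / n_T) * V_F
    for Xb, n_k in zip(X_B_list, n_Bk):
        diff = (Xb.mean(axis=0) - m_F)[:, None]
        Q_fb += (n_k / n_B) * (diff @ diff.T)                 # Eq. (6.16)
        Q_w += (n_k / (K * n_T)) * np.cov(Xb, rowvar=False)   # Eq. (6.17)

    # Generalized eigenvalue problem Q_fb w = lambda * Q_w w, Eq. (6.23)
    evals, evecs = eigh(Q_fb, Q_w + eps * np.eye(D))
    W = evecs[:, np.argsort(evals)[::-1][:D_r]]               # top D_r discriminants
    return W  # project a sample with z = W.T @ x, Eq. (6.18)
```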
Discussion. (1) The learned discriminative space has two advantages. Firstly,
there occurs a high separation between the foreground and the background class
means. Secondly, the same class samples come closer in the discriminative space.
This phenomenon occurs not only for the labeled samples but also for the unlabeled
samples. This facilitates more accurate assignment of labels in the subsequent label
propagation stage of the co-segmentation algorithm.
(2) Although the solution of W has a form similar to that of linear discriminant
analysis (LDA), the formulations of scatter matrices Q f b and Qw differ significantly.
In LDA [10], the scatter matrices are designed to ensure good separation among all
classes after projection, and hence the inter-class and intra-class scatter matrices are
accordingly defined as:

$$Q_b = \sum_{k=1}^{K} \frac{n_k}{n_T} (m_k - \bar{m})(m_k - \bar{m})^T \quad \text{and} \quad (6.24)$$
$$Q_w = \sum_{k=1}^{K} \frac{n_k}{n_T} V_k, \quad (6.25)$$
respectively, where m̄ is the mean of all feature vectors in all classes. Since for co-
segmentation, we require only the common foreground class to be well-separated
from the background classes, here the condition of discrimination among the back-
ground classes is relaxed and the discrimination of the foreground class from the rest
is prioritized. The scatter matrices Q f b and Qw are consistent with these require-
ments. As seen in Eq. (6.16), the foreground–background inter-class scatter matrix
Q f b measures how well the foreground mean is separated from each of the back-
ground class means, without accounting how well the background class means are
separated from each other. Similarly, the foreground class scatter V F is given more
weight in the within-class scatter matrix Qw , thus prioritizing more the foreground
samples to populate together, as compared to each of the background classes.

6.2.3 Spatially Constrained Label Propagation

In Sect. 6.2.1, we have seen how a set of seed superpixels with different class labels
can be initialized. Now to find the common object, the regions constituted by seed
need to be grown by assigning labels to the remaining superpixels. This region
growing is done in two stages. First, label propagation is performed considering all
superpixels (n A ) from all images simultaneously using the discriminative features
(z) described in Sect. 6.2.2. Then the updated superpixel labels in every image are
pruned independently using spatial constraints as described next.
First, every seed superpixel si is assigned a label L ∈ {1, 2, . . . , K + 1}. Specif-
ically, superpixels belonging to C F have label L = 1 and superpixels belong-
ing to C Bk have label L = k + 1 for k = 1, 2, . . . , K . Then a binary seed label
matrix Y_b ∈ {0, 1}^{n_A × (K+1)} is defined as
$$Y_b(i, l) = 1, \ \text{if} \ L(s_i) = l \quad (6.26a)$$
$$Y_b(i, l) = 0, \ \text{if} \ (L(s_i) \neq l) \vee (s_i \ \text{is unlabeled}) \quad (6.26b)$$

which carries class information of only the seed superpixels. Here, the unlabeled
superpixels are the remaining superpixels other than the seed superpixels. The aim
is to obtain an optimal label matrix Y ∈ Rn A ×K +1 with class information of all the
superpixels from all images.
Let S0 ∈ Rn A ×n A be the feature similarity matrix where S0 (i, j) is a similarity
measure between si and s j in the discriminative space. One can use any appropri-
ate measure to compute S0 (i, j) such as the additive inverse of Euclidean distance
between zi and z j that represent si and s j , respectively. This similarity matrix can
be normalized as:
$$S = D^{-1/2} S_0 D^{-1/2}, \quad (6.27)$$

where D is a diagonal matrix with D(i, i) = Σ_j S_0(i, j). To obtain the optimal Y,
the following equation is iterated.
$$Y^{(t+1)} = \omega_l S Y^{(t)} + (1 - \omega_l) Y_b, \quad (6.28)$$

where ωl is an appropriately chosen weight in (0, 1). The first term updates Y using
the similarity matrix S. Thus, labels are assigned to unlabeled superpixels through
label propagation from the labeled superpixels. The second term minimizes the dif-
ference between Y and Yb . It has been shown in [150], that Y(t) converges to

$$Y^{*} = \lim_{t \to \infty} Y^{(t)} = (I - \omega_l S)^{-1} Y_b. \quad (6.29)$$

The label of superpixel s_i is obtained as:
$$L = \arg\max_{j} Y^{*}(i, j), \ \text{under constraints C1, C2.} \quad (6.30)$$

Every row and column of Yb correspond to a superpixel and a class, respectively.
If the number of superpixels in one class C_j is significantly larger than in the
remaining classes (C_k, k ≠ j), the columns of Y_b corresponding to the C_k's will be sparse.
In such a scenario, the solution to Eq. (6.30) will be biased toward C_j. Hence, every
column of Y_b is normalized by its L_0-norm. Next, we need to add two constraints to
this solution and update it.
C1: Y∗ (i, j) is a measure of similarity of superpixel si to the set of superpix-
els with label L = j. If Y∗ (i, j) is small for all j = 1, 2, . . . , K + 1 (i.e.,
max_j Y*(i, j) < t_l), these similarity values may lead to a wrong label assignment;
so they are discarded and the corresponding superpixel s_i remains unlabeled. A good
choice of the threshold tl can be median(Y∗ ).
C2: The label update formulation in Eq. (6.28) does not use any spatial information
of superpixels. Thus any unlabeled superpixel in an image can get assigned to
one of the classes based only on feature similarity. Hence, every newly labeled
superpixel may not be a neighbor of the seed regions in that subimage belonging
to a certain class and that subimage may contain many discontiguous regions.
But typically, objects (in C F ) and background regions (in C Bk ), e.g., sky, field
and water body, are contiguous regions. Hence, a spatial constraint is added
to Eq. (6.28) that an unlabeled superpixel si will be considered for assignment
of label L = j using Eq. (6.30) only if it belongs to the first-order spatial
neighborhood of an already labeled region (with label L = j) in that subimage.
Result of label propagation with the seed regions of Fig. 6.3a at convergence of
Eq. (6.28) is shown in Fig. 6.5a. Due to the above two constraints, not all unlabeled
superpixels are assigned labels. Only a limited number of superpixels in the spatial
neighborhood of already labeled superpixels are assigned labels.
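A compact sketch of this propagation step is shown below (NumPy only); the particular similarity measure and ω_l = 0.9 are assumptions for illustration, and the spatial constraint C2 as well as the subsequent per-image pruning are omitted.

```python
import numpy as np

def propagate_labels(Z, seed_labels, K, omega_l=0.9):
    """One pass of the closed-form label propagation of Eqs. (6.27)-(6.30).
    Z: (n_A x D_r) superpixel features in the discriminative space;
    seed_labels: length-n_A array with 1..K+1 for labeled superpixels and
    0 for unlabeled ones. Returns updated labels (0 = still unlabeled)."""
    n_A = Z.shape[0]
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    S0 = dist.max() - dist                      # an additive-inverse similarity
    d = S0.sum(axis=1)
    S = S0 / np.sqrt(np.outer(d, d))            # Eq. (6.27)

    Yb = np.zeros((n_A, K + 1))
    for l in range(1, K + 2):
        idx = seed_labels == l
        if idx.any():
            Yb[idx, l - 1] = 1.0 / idx.sum()    # columns normalized by L0-norm
    Ystar = np.linalg.solve(np.eye(n_A) - omega_l * S, Yb)   # Eq. (6.29)

    tl = np.median(Ystar)                       # threshold of constraint C1
    labels = np.where(Ystar.max(axis=1) > tl, Ystar.argmax(axis=1) + 1, 0)
    return labels, Ystar
```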
After this label updating, all labeled superpixels are used to again compute a
discriminative space using original input feature vectors (low-level and mid-level
features) following the method in Sect. 6.2.2 and label propagation is performed
again in that newly computed discriminative space. These two stages are iterated
alternately until convergence as shown in the block diagram in Fig. 6.2. The iteration
converges if
• either there is no more unlabeled superpixel left, or
• labels no longer get updated.

Fig. 6.5 Label propagation. a Label propagation assigns new labels to a subset of previously unlabeled samples, as well as updates previously labeled samples. Superpixel labels of the foreground and the three background classes in Fig. 6.3 have been updated, and some of the unlabeled superpixels have been assigned to one of the four classes after discriminative space projection and label propagation. b Final co-segmentation result (Row 2) after multiple iterations of successive discriminative subspace projection and label propagation. Rows 3–5 show the background labeled superpixels
It may be noted that some superpixels may yet remain unlabeled after convergence
due to the spatial neighborhood constraint. However, it does not pose any problem
as we are interested in labels of co-segments only. Figure 6.5b shows the final co-
segmentation result after convergence. As ωl in Eq. (6.28) is nonzero, initial labels of
the labeled superpixels also get updated. This is evident from the fact that the green
regions in subimages 5,6 in C F of Fig. 6.3a are not present after label propagation
and are assigned to background classes as shown in Fig. 6.5b. The strength of this
method is further proved by the result that the missing balloon in the subimage 4 in
C F of Fig. 6.3a gets recovered.
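Putting the two stages together, a high-level driver might look like the sketch below; it reuses the discriminative_projection and propagate_labels sketches given earlier (and therefore inherits their assumptions), while D_r = 50 and max_iter = 15 are illustrative values (the text later reports about 15 iterations on average).

```python
def cosegment(features_lmf, seed_labels, K, D_r=50, max_iter=15):
    """High-level driver: alternate discriminative-space projection and
    label propagation until labels stop changing or no superpixel is left
    unlabeled (the convergence criteria listed above)."""
    labels = seed_labels.copy()
    for _ in range(max_iter):
        X_F = features_lmf[labels == 1]
        X_B_list = [features_lmf[labels == k + 2] for k in range(K)]
        W = discriminative_projection(X_F, X_B_list, D_r)   # Sect. 6.2.2
        Z = features_lmf @ W                                 # Eq. (6.18)
        new_labels, _ = propagate_labels(Z, labels, K)       # Sect. 6.2.3
        converged = (new_labels == labels).all() or (new_labels != 0).all()
        labels = new_labels
        if converged:
            break
    return labels == 1   # superpixels constituting the common foreground C_F
```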
Label refinement as preprocessing. Every iteration of discriminative space com-
putation (Sect. 6.2.2) and label propagation begins with a set of labeled superpixels
and ends with updated labels where some unlabeled superpixels are assigned labels.
To achieve better results, the input labels can be refined before every iteration as
a preprocessing step. The motivation behind this label refinement and procedure is
described next. As an illustration, Fig. 6.7 shows the updated common foreground
class C F after performing label refinement on the seed labels shown in Fig. 6.3a
(Row 2).
• In Fig. 6.3a, we observe that some connected regions (group of superpixels) in the
common foreground class (C F ) spread from image left boundary to right boundary.
These regions are most likely to be part of the background. Hence, they are removed
from C_F, thus pruning the set (see the sketch after this list). This is illustrated in subimages 3, 4 of C_F in Figs. 6.3a
and 6.7.
• In Fig. 6.3, we also observe that there are ‘holes’ inside the connected regions in
some subimages. These missing superpixels either belong to some other class or are
unlabeled (not part of the set of already labeled superpixels). Such holes in C F are
filled by assigning the missing superpixels to it, thus enlarging the set. In the case
of any background class C Bk , such holes are filled only if the missing superpixels
are unlabeled. Hole filling can be performed using the cycle detection method
described in Chap. 5. Alternatively, morphological operations can also be applied.
The missing superpixels in every image, if any, are found using morphological
reconstruction by erosion [118] on the binary image formed by the already labeled
superpixels belonging to that image. This is illustrated in Fig. 6.6. Result of hole
filling in C F (of Fig. 6.3a) is illustrated in subimage 1 in Fig. 6.7.
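A minimal per-image sketch of these two refinement steps on a binary foreground mask is given below (using SciPy); binary_fill_holes is a simpler stand-in for the morphological-reconstruction route described above, and the function name is illustrative.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes, label

def refine_foreground_mask(fg_mask):
    """Per-image label refinement on a binary foreground mask:
    (i) drop connected regions spanning from the left to the right image
    boundary (most likely background), (ii) fill interior holes."""
    comps, n = label(fg_mask)
    refined = fg_mask.copy()
    for c in range(1, n + 1):
        region = comps == c
        if region[:, 0].any() and region[:, -1].any():   # touches both sides
            refined[region] = False                       # prune from C_F
    return binary_fill_holes(refined)                     # enlarge: fill holes
```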
Further, the spatial constraint of the first-order neighborhood can be relaxed for fresh
label assignment of superpixels in case of subimages that have no labeled segment
yet. This allowed label assignment to superpixels of subimage 4 in C F of Fig. 6.7,
thus providing much-needed seeds for subsequent label propagation. The entire co-
segmentation method is given as a pseudocode in Algorithm 1.

Fig. 6.6 Hole filling. Subimage 1 of C F in Row 2 of Fig. 6.3a shows that the foreground seed region
contains five holes (indicated by arrows) which can be filled by morphological operations

Fig. 6.7 Removal of connected regions in C F (in Row 2 of Fig. 6.3a) that spread from left to right
image boundary. Note that subimage 4 does not have a foreground seed to begin with

It may be noted that the use of mode detection ensures that superpixels from the
outlier images are not part of the computed common foreground seeds due to their
distinct features. This in turn helps during the label propagation stage so that outlier
superpixels are not added to C F . Thus, the method is able to discard outliers in the final
co-segmentation results, and hence the resulting C F , at convergence, only constitutes
the common object. This method’s robustness to outlier images is demonstrated in
Fig. 6.8, which is discussed in detail in Sect. 6.3.1.

Algorithm 1 Co-segmentation algorithm

Input: Set of images I_1, I_2, ..., I_N
Output: Set of superpixels in all images belonging to the common objects
1:  for i = 1 to N do
2:      {s} ← Superpixel (SP) segmentation of every image I_i
3:      Compute background probability P(s) of every s ∈ I_i
4:  end for
5:  Compute feature x for every SP s using LLC from SIFT, CSIFT and L*a*b* mean color
6:  // Initial labeling
7:  x̄ (and corresponding SP s̄) ← mode({x ∈ ∪_{i=1}^{N} I_i})
8:  N_f(x̄) ← {s : d(x̄, x) < ν_0}
9:  Foreground cluster C_F ← s̄ ∪ N_f(x̄)
10: Divide the set {s : P(s) > 0.99} into K clusters and find background clusters C_B1, C_B2, ..., C_BK
11: // Initial cluster update
12: Fill holes and update C_F, C_B1, C_B2, ..., C_BK
13: For each I_i, find R_i ⊆ C_F ∩ I_i that constitutes a contiguous region spreading from the image left to right boundary
14: Update C_F ← C_F \ {∪_{i=1}^{N} R_i}
15: Assign label L_s^(0) ← 1, ∀s ∈ C_F
16: For each k = 1 to K, assign label L_s^(0) ← k + 1, ∀s ∈ C_Bk
17: Initialize t ← 0, C_F^(0) ← C_F, C_B1^(0) ← C_B1, C_B2^(0) ← C_B2, ...
18: // Iterative discriminative subspace projection and label propagation
19: while no convergence do
20:     Find discriminant vectors w using the input features x of C_F^(t), C_B1^(t), C_B2^(t), ..., C_BK^(t) using (6.16), (6.17) and (6.23)
21:     W ← [w_1 w_2 ... w_Dr]
22:     Project every feature vector x as z ← W^T x
23:     Compute similarity matrix S_0 where S_0(i, j) ≜ 1 − d(z_i, z_j)
24:     Compute diagonal matrix D where D(i, i) ≜ Σ_j S_0(i, j)
25:     Compute normalized similarity matrix S ← D^{-1/2} S_0 D^{-1/2}
26:     for s_i ∈ C_F^(t) do
27:         Y_b(i, 1) ← 1 / |C_F^(t)|            // Initialize foreground
28:     end for
29:     for k = 1 to K do
30:         for s_i ∈ C_Bk^(t) do
31:             Y_b(i, k + 1) ← 1 / |C_Bk^(t)|   // Initialize background
32:         end for
33:     end for
34:     Y* ← (I − ω_l S)^{-1} Y_b                // regularizer ω_l
35:     t_l ← median(Y*)
36:     // Label update
37:     for all s_i ∈ N_s(C_F^(t) ∪ (∪_{k=1}^{K} C_Bk^(t))) do
38:         if max_k Y*(i, k) > t_l then
39:             L_si^(t+1) ← arg max_k Y*(i, k)
40:         end if
41:     end for
42:     C_F^(t+1) ← {s : L_s^(t+1) = 1}
43:     For each k = 1 to K, C_Bk^(t+1) ← {s : L_s^(t+1) = k + 1}
44:     t ← t + 1
45: end while
46: C_F at convergence is the set of superpixels in all images constituting the common objects
Table 6.1 Comparison of Jaccard similarity (J) of the methods PM, DCC, DSAD, MC, MFC, JLH, MRW, UJD, RSP, CMP, GMR, OC on the dataset created using the iCoseg dataset and the 38-class iCoseg dataset without outliers

Methods                        PM     MRW    CMP    MFC    MC     UJD
J (iCoseg 626) with outliers   0.73   0.65   0.62   0.61   0.41   0.38
J (iCoseg 38) no outlier       0.76   0.70   0.73   0.64   0.59   0.68

Methods                        DCC    DSAD   JLH    GMR    RSP    OC
J (iCoseg 626) with outliers   0.36   0.26   x      x      x      x
J (iCoseg 38) no outlier       0.42   0.42   0.79   0.76   0.66   0.62

'x': code not available to run on outlier data

6.3 Experimental Results

In this section, we analyze the results obtained by the co-segmentation method
described in this chapter, denoted as PM, on the same datasets considered in Chap. 5:
the MSRC dataset [105], the flower dataset [92], the Weizmann horse dataset [13],
the Internet dataset [105], the 38-class iCoseg dataset [8] without any outliers and
the 603-set iCoseg dataset containing outliers.
We begin by discussing the choice of features in the PM method. Dense SIFT
and CSIFT features [127] are computed from all images, and they are encoded
using locality-constrained linear coding [138], with the codebook size being 100,
to obtain mid-level features. The L∗ a∗ b∗ mean color feature (length 3) and color
histogram [18] (length 81) have been used as low-level features. Hence, the feature
dimension D = 100 + 100 + 3 + 81 = 284. Unlike semi-supervised methods, we
do not use saliency or CNN features or region proposal (measure of objectness) for
initialization here.

6.3.1 Quantitative and Qualitative Analyses

For each image, a binary mask is obtained by assigning the value 1 to all the pixels
within every superpixel belonging to C_F and the value 0 to the remaining pixels. This
mask is used to extract the common object from that image. If an image does not
contain any superpixel labeled C F , it is classified as an outlier image. Jaccard sim-
ilarity (J ) [105] is used as the metric to quantitatively evaluate the performance of
the methods PM, DCC [56], DSAD [60], MC [57], MFC [18], JLH [78], MRW [64],
UJD [105], RSP [68], CMP [36], GMR [99], OC [131], EVK [24]. Table 6.1 and
Table 6.2 provide results on the iCoseg dataset [8] and the Internet dataset [105],
respectively. We also show results on images from the Weizmann horse dataset [13],
the flower dataset [92] and the MSRC dataset [105] in Table 6.3, considering outliers.
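For reference, the evaluation metric can be computed as below from the predicted binary mask and the ground-truth mask of an image; this is a straightforward implementation, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
import numpy as np

def jaccard_similarity(pred_mask, gt_mask):
    """Jaccard similarity J = |P ∩ G| / |P ∪ G| between the predicted
    binary co-segmentation mask and the ground-truth mask."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union else 1.0
```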
Table 6.2 Comparison of Jaccard similarity (J) on the Internet dataset

Methods              PM      GMR     UJD     MFC     EVK     MRW     DCC     DSAD
J (Car class)        0.653   0.668   0.644   0.523   0.648   0.525   0.371   0.040
J (Horse class)      0.519   0.581   0.516   0.423   0.333   0.402   0.301   0.064
J (Airplane class)   0.583   0.563   0.558   0.491   0.403   0.367   0.153   0.079
J (Average)          0.585   0.604   0.573   0.479   0.461   0.431   0.275   0.061

Table 6.3 Comparison of Jaccard similarity (J) on images selected from the Weizmann horse dataset [13], the flower dataset [92] and the MSRC dataset [105], considering outliers

Methods              PM     MFC    CMP    MRW    DSAD   MC     DCC    UJD
J (horse dataset)    0.81   0.76   0.76   0.75   0.69   0.61   0.58   0.39
J (flower dataset)   0.85   0.82   0.73   0.70   0.71   0.56   0.52   0.45
J (MSRC dataset)     0.73   0.60   0.46   0.62   0.46   0.60   0.61   0.66

Figure 6.8 shows the co-segmentation outputs obtained using the methods PM,
DCC, DSAD, MC, MFC, MRW, UJD on the challenging ‘panda’ images from the
iCoseg dataset [8] that also includes two outlier images (Image 5 and Image 9) from
the ‘stonehenge’ subset. The method PM correctly co-segments the pandas, whereas
other methods partially detect the common objects and also detect some background
regions. The methods MC, MRW (in Rows 3, 6) and DSAD (in Row 5) detect either the
white or the black segments of the panda. The method UJD (in Row 2) fails to co-segment
the images with large-sized pandas because it is a saliency-based method and a
large panda is less salient than a small one. Moreover, methods other than PM cannot
handle the presence of outlier images and wrongly co-segment regions of significant
size from them.

6.3.2 Ablation Study

Number of background clusters: In Sect. 6.2.2, we motivated that grouping the
background seed superpixels into multiple clusters improves superpixel labeling
accuracy because the variation of superpixel features in each background cluster
is less than the variation of features in the entire background-labeled superpixel
set. In Fig. 6.9 (blue curve), this is validated through results of the method ‘with’
(i.e., K > 1) and ‘without’ (i.e., K = 1) background superpixel clustering. Further,
Jaccard similarity values are also provided by setting the number of background
clusters K ∈ {2, 3, 4, 5, 6, 7, 8} on the 626 image sets as mentioned above. It is
evident that higher J is achieved when K > 1.
Comparison with baselines: In this co-segmentation method, low-level and mid-
level features (LMF) are used for obtaining the seeds for background regions and
the common foreground region using mode detection as described in Sect. 6.2.1.
In contrast, in every iteration of label propagation, first discriminative features (DF)
are computed using LMF of the labeled superpixels (Sect. 6.2.2) and then the com-
puted DF are used to perform label assignment (Sect. 6.2.3). In Sect. 6.2.2, it is
explained that DF helps to discriminate different classes better, thus achieving better
co-segmentation results. Figure 6.9 shows the values of J obtained by the method
on 626 sets of images using LMF (black curve) and DF (blue curve) for performing
label assignment. This validates the choice of DF over LMF in the label propagation
stage. Further, this figure shows the robustness of discriminative space projection
based on the measure of foreground–background separation in Eq. (6.15), which
performs much better compared to LDA. It provides the values of J obtained using
(i) Q f b and Qw as the scatter matrices (blue curve) and (ii) Qb and Qw instead of Q f b
and Q_w (red curve) for discriminative space projection. It is evident that Q_fb and
Q_w (in Eqs. 6.16 and 6.17) outperform LDA, thus validating the efficiency of the
foreground–background separation measure in Eq. (6.15), which has been designed
specifically for solving the co-segmentation problem. Hence, Q f b and Qw have been
used for all quantitative analyses. Here, we comment on two aspects of this study. For
K = 1, the co-segmentation-oriented DF formulation reduces to LDA due to their
design in Eqs. (6.16) and (6.17). Hence, they have the same J in the plot. Further,
we observe that for any choice of K, the PM curve is always above the LDA and LMF
curves.

Fig. 6.8 Co-segmentation results on images from the iCoseg dataset. For the input image set of 12 images shown in Row A, which includes two outlier images, Image 5 and Image 9, the co-segmented objects obtained using the methods UJD, MC, DCC, DSAD, MRW, MFC, PM are shown in Rows B–H, respectively. Row I shows the ground-truth (GT)

Fig. 6.9 Ablation study on the outlier dataset created using the iCoseg dataset by varying the number of background clusters: K = 1 (i.e., no clustering) and K = 2, 3, 4, 5, 6, 7, 8. Jaccard similarity (J) values are provided while performing label propagation (i) with discriminative features (DF) obtained using the formulations of scatter matrices Q_fb and Q_w, (ii) with DF obtained using the formulations in LDA and (iii) with low-level and mid-level features (LMF) alone, i.e., without using DF

6.3.3 Analysis of Discriminative Space

In the co-segmentation algorithm, we begin with a set of seed superpixels for the com-
mon foreground class and multiple background classes (Fig. 6.3). This seed selection
is done in the space of low-level and mid-level features. In Sect. 6.1, it is motivated
that these features are not sufficient for co-segmentation because they do not dis-
criminate the classes well. Figure 6.10a demonstrates this by showing all image
superpixels (spatial regions) at their respective locations in the feature space. It is
evident that the superpixels belonging to different classes are not well-separated.
To attain a better separation among the classes, in Sect. 6.2.2, a discriminative
space is obtained where the common foreground superpixels are well-separated from
the background superpixels. Figure 6.10c shows all image superpixels (same set of
superpixels as in (a)) in this discriminative space. Superpixels of different classes
form clusters, and there exists better discrimination among classes compared to the
input feature space shown in (a). The cluster constituted by the common foreground
class superpixels (balloon in red and blue) is well-separated from the remaining clus-
ters constituted by the background class superpixels. This validates the formulations
of scatter matrices in Eq. (6.16), Eq. (6.17). Figure 6.10b shows the superpixels in the
discriminative space obtained using LDA, where the foreground class superpixels
got clustered into multiple groups (3 in this case). These visualizations concur with
the ablation study of Fig. 6.9.

Fig. 6.10 Effectiveness of discriminative space projection. a All superpixels from the input image set of Fig. 6.3 in their low-level and mid-level feature space. b Superpixels in the discriminative space after label propagation using LDA, i.e., using scatter matrices Q_b and Q_w, and c using scatter matrices Q_fb and Q_w. It can be seen that the foreground and background superpixels are better clustered, and the foreground cluster (indicated using bounding box(es)) is well-separated from the background clusters in (c) as compared to (a) and (b). Here, tSNE plots are used to visualize multidimensional feature vectors in two dimensions

Table 6.4 Computation time (in seconds) required by the methods PM, DCC, DSAD, MC, MFC, MRW, UJD for different cardinalities of the image sets on an i7, 3.5 GHz PC with 16 GB RAM

No. of images   PM    DSAD   MFC    DCC    MRW    UJD    MC
8               9     41     69     91     256    266    213
16              30    81     141    180    502    672    554
24              79    121    205    366    818    1013   1202
32              150   162    273    452    1168   1342   1556
40              285   203    367    1112   2130   1681   3052
PE              M     M+C    M+C    M+C    M+C    M+C    M+C

Here, PE stands for programming environment, M for MATLAB and C for C language

6.3.4 Computation Time

In Table 6.4, we provide the computation time taken by the methods PM, DCC,
DSAD, MC, MFC, MRW, UJD for processing sets of 8, 16, 24, 32 and 40 images.
For the method PM, on average, 15 iterations are required for convergence of label
propagation. Despite being run in MATLAB, it is computationally faster than all
other methods. The method DSAD is quite competitive but the performance is com-
paratively poor as given in Table 6.1. The method PM processes all the images in the
set simultaneously unlike some methods. For example, UJD performs SIFT match-
ing with only 16 most similar images in the set, whereas MRW generates multiple
random subsets of the image set and computes average accuracy.
In this chapter, we have explained a co-segmentation algorithm that considers the
challenging scenario where the image set contains outlier images. First, a discrimina-
tive space is obtained and then label assignment (background or common foreground)
is performed for image superpixels in that space. Thus, we obtain the common object
that is constituted by the set of superpixels having been assigned the common fore-
ground label. Label propagation starts with a set of seed superpixels for different
classes. It has been shown that statistical mode detection helps in automatically find-
ing the seed from the images without any supervision. The choice of using mode
was driven by the objective of generating seed superpixels for foreground robustly,
efficiently and in an unsupervised manner. Further, it has been shown that multiple
class modeling of background is more effective to capture its large heterogeneity.
The measure of foreground–background separation with multiple background classes
helps to find a more discriminative space that efficiently separates foreground from
the rest, thus yielding robust co-segmentation. Further, multiple iterations of the dis-
criminative space projection in conjunction with label propagation result in a more
accurate labeling. Spatial cohesiveness of the detected superpixels constituting the
co-segmented objects is achieved using a spatial constraint at the label propagation
stage.

Acknowledgements The contributions of Dr. Feroz Ali are gratefully acknowledged.


Chapter 7
Co-segmentation Using Graph
Convolutional Network

7.1 Introduction

Extracting the common object from a collection of images captured in a completely
uncontrolled environment is certainly a challenging task, since the object of interest
can vary significantly in appearance, pose and illumination. The unsupervised
approaches discussed in the previous chapters have limitations in handling such
scenarios because they require appropriate features of the common object to be
identified beforehand. In this chapter, we discuss an end-to-end learning method
for co-segmentation based on graph convolutional neural networks (graph CNN)
that formulates the problem as a classification task of superpixels into the common
foreground or background class (similar to the method discussed in Chap. 6). Instead
of predefining the choice of features, here superpixel features are learned with the
help of a dedicated loss function. Since the overall network is trained in an end-to-end
manner, the learned model is able to perform both feature extraction and superpixel
classification simultaneously, and hence, these two components assist each other in
achieving their individual objectives.
We begin with an overview of the method described in this chapter. Similar to the
previous chapters, each individual image is oversegmented into superpixels, and a
graph is computed by exploiting the spatial adjacency relationship of the extracted
superpixels. Then for each image pair, using their individual spatial adjacency graphs,
a global graph is obtained by connecting each node in one graph to a group of very
similar nodes in the other graph, based on a node feature similarity measure. Thus,
the resulting global graph is a joint representation of the image pair. The graph
CNN model then classifies each node (superpixel) in this global graph using the
learnt features into the common foreground or the background class. The rationale
behind choosing graph CNN is that it explicitly uses neighborhood relationships
through the graph structure to compute superpixel features, which is not achievable
with a regular CNN without considerable preprocessing and approximations. As a
result, the learnt superpixel features become more robust to appearance and pose
variations of the object of interest and carry a greater amount of context. In order

to reduce the time required for the learning to converge, the model uses additional
supervision in the form of semantic labels of the object of interest for some of the data
points. The overall network, therefore, comprises two subnetworks with shared layers,
where one of them predicts the common foreground and background labels, while the
other extracts semantic characteristics from the associated class labels.

7.2 Co-segmentation Framework

Given a collection of images I , each image Ii ∈ I is oversegmented using the simple
linear iterative clustering (SLIC) algorithm [1] into a non-overlapping superpixel set
Si . The goal is to obtain the label of each superpixel as either common foreground
or background.

7.2.1 Global Graph Computation

In the graph convolutional neural network (graph CNN)-based co-segmentation of
an image pair, the network requires a joint representation of the corresponding graph
pair as a single entity. So, the first step in this algorithm is to combine the graph
pair, and obtain a global graph containing feature representation of superpixels from
both images. Then it can be processed through the graph CNN for co-segmentation.
However, the number of superpixels (|Si | = n i ) computed by the SLIC algorithm
typically varies for each image. Since image superpixels represent nodes in the cor-
responding region adjacency graphs, the resulting graphs for the image set will be
of different sizes. To avoid this non-uniformity, it is ideal to determine the minimum
number n = min_i n_i , and perform superpixel merging in each image such that the
cardinality of the superpixel set Si corresponding to every image Ii becomes n. For
this purpose, given an image’s superpixel set, the superpixel pair with the highest
feature similarity can be merged, and this process can be repeated until there are n
superpixels left. To avoid the possibility of some superpixels blowing up in size, a
merged superpixel should not be allowed to be merged again in subsequent iterations.
A pseudocode for this procedure is provided in Algorithm 1.
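In addition to the pseudocode, a minimal NumPy sketch of this merging loop is given below. It is only an illustration under simplifying assumptions (the feature of a merged superpixel is obtained by averaging, and the function and variable names are hypothetical), not the actual implementation; the similarity measure is the Chi-square kernel used in line 12 of Algorithm 1 (Eq. (7.2)).

import numpy as np

def chi2_similarity(fs, fr, lam=1.0):
    # Chi-square kernel of Eq. (7.2); a small epsilon avoids division by zero
    return np.exp(-lam * np.sum((fs - fr) ** 2 / (fs + fr + 1e-12)))

def merge_superpixels(features, n_target, lam=1.0):
    # features: dict {superpixel_id: D-dimensional feature vector}
    # Greedily merge the most similar (not previously merged) superpixel pair
    # until n_target superpixels remain, mirroring lines 10-20 of Algorithm 1.
    feats = dict(features)
    merged_ids = set()                  # the set S_phi of already-merged superpixels
    next_id = max(feats) + 1
    while len(feats) > n_target:
        ids = [i for i in feats if i not in merged_ids]
        if len(ids) < 2:                # safeguard, not part of the original algorithm
            ids = list(feats)
        best, best_pair = -np.inf, None
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                w = chi2_similarity(feats[ids[a]], feats[ids[b]], lam)
                if w > best:
                    best, best_pair = w, (ids[a], ids[b])
        s, r = best_pair
        feats[next_id] = 0.5 * (feats[s] + feats[r])   # simplistic feature merge
        merged_ids.add(next_id)
        del feats[s], feats[r]
        next_id += 1
    return feats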
Let I1 , I2 be the image pair to be co-segmented, and S1 , S2 be the associated
superpixel sets. In order to combine the corresponding graph pair to obtain a global
graph, an initial feature fs ∈ R D for each superpixel s ∈ Si , ∀i is required. One may
consider any appropriate feature such as RGB color histogram, dense SIFT features,
etc. Each image Ii is represented as an undirected, sparse graph Gi = (Si , Ai ), in
which each superpixel represents a node, and superpixels that are spatially adjacent
are connected by an edge. Different from previous chapters, this graph is required to
be weighted. Thus, the adjacency matrix Ai ∈ Rn×n is defined as:

Algorithm 1 Superpixel merging algorithm


Input: Set of images I = {I1 , I2 , . . . , Im }
Output: Set of superpixels S1 , S2 , . . . , Sm such that |Si | = n, ∀i
1: for i = 1 to m do
2: Superpixel segmentation of Ii
3: Si  Set of superpixels {s} in Ii
4: n i = |Si |
5: end for
6: n = min_i n_i
7: // Superpixel merging
8: for i = 1 to m do
9: Sφ ← φ
10: while n i > n do
11: for s ∈ Si , r ∈ Si do
12:     W(s, r) = exp( −λ Σ_{l=1}^{D} (f_s(l) − f_r(l))^2 / (f_s(l) + f_r(l)) )
13: end for
14: (s_m, r_m) = arg max_{(s,r)} { W(s, r) : ∀ (s, r) ∈ Si \ Sφ }
15: // Obtain a larger superpixel by grouping pixels of two superpixels
16: s̄ = Merge(sm , rm )
17: Si ← {Si ∪ s̄}\{sm , rm }
18: n i = | Si |
19: Sφ ← Sφ ∪ s̄
20: end while
21: end for
22: S1 , S2 , . . . , Sm are the superpixel sets used for graph construction and remaining stages


A_i(s, r) = \begin{cases} W(s, r), & \forall\, r \in \mathcal{N}^1(s),\ \forall\, s \in I_i \\ 0, & \text{otherwise} \end{cases} \qquad (7.1)

where W (s, r) signifies the feature similarity between superpixels s and r, and N 1
denotes a first-order neighborhood defined by spatial adjacency. Thus, edge weights
measure the feature similarity of linked nodes, i.e., spatially adjacent superpixel
pairs. The Chi-square kernel to determine W (s, r) for a superpixel pair (s, r) with
feature vectors fs and fr is defined as:
 

W(s, r) = \exp\left( -\lambda \sum_{l=1}^{D} \frac{\big(f_s(l) - f_r(l)\big)^2}{f_s(l) + f_r(l)} \right), \qquad (7.2)

where λ is a parameter. To construct a global graph from the individual graphs
G1 , G2 , we need to consider the affinity among nodes across the graph pair. Thus,
the nodes in an inter-graph node pair possessing high feature similarity value are
considered to be neighbors, and they can be connected by an edge. This leads to a
global graph Gglobal = (Sglobal , Aglobal ) with node set Sglobal = {S1 ∪ S2 }, |Sglobal | =
2n and adjacency matrix Aglobal ∈ R2n×2n , defined as:

A_{\text{global}}(s, r) = \begin{cases} A_i(s, r), & \text{if } (s, r) \in \mathcal{G}_i,\ i = 1, 2 \\ W(s, r)\, \mathbb{1}\big(W(s, r) > t\big), & \text{if } s \in \mathcal{G}_i,\ r \in \mathcal{G}_j,\ i \neq j, \end{cases} \qquad (7.3)

where 1(·) is an indicator function, and Aglobal (s, r) = 0 indicates that s ∈ Gi , r ∈ G j
are not connected. The threshold t controls the number of inter-image connections.
Thus, a high threshold reduces the number of inter-graph edges, and vice-versa. It is
evident that Aglobal retains information from A1 and A2 , and attaches the inter-image
superpixel affinity information to them as:

A_{\text{global}} = \begin{bmatrix} A_1 & O \\ O & A_2 \end{bmatrix} + A_{\text{inter}}, \qquad (7.4)

where Ainter is the inter-image adjacency matrix representing node (superpixel) con-
nections across the image pair while forming Gglobal . A short illustrative sketch of this
construction is given below; co-segmentation utilizing the global graph by learning a
graph convolution model is then described in the next section.
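The sketch below assumes that the per-image superpixel features and boolean spatial-adjacency masks are already available; the function names are placeholders, and the code is only an illustration of Eqs. (7.1)–(7.4), not the authors' implementation.

import numpy as np

def chi2(fs, fr, lam=1.0):
    # Chi-square kernel of Eq. (7.2)
    return np.exp(-lam * np.sum((fs - fr) ** 2 / (fs + fr + 1e-12)))

def global_adjacency(F1, F2, adj1, adj2, t=0.65, lam=1.0):
    # F1, F2    : (n, D) superpixel features of images I1 and I2
    # adj1, adj2: (n, n) boolean spatial-adjacency masks of G1 and G2
    # Returns the (2n, 2n) weighted adjacency matrix A_global of Eq. (7.4).
    n = F1.shape[0]
    A = np.zeros((2 * n, 2 * n))
    # Intra-image blocks A1, A2 (Eq. 7.1): weights only for spatially adjacent pairs
    for F, adj, off in ((F1, adj1, 0), (F2, adj2, n)):
        for s in range(n):
            for r in range(n):
                if adj[s, r]:
                    A[off + s, off + r] = chi2(F[s], F[r], lam)
    # Inter-image part A_inter (Eq. 7.3): connect node pairs whose similarity exceeds t
    for s in range(n):
        for r in range(n):
            w = chi2(F1[s], F2[r], lam)
            if w > t:
                A[s, n + r] = A[n + r, s] = w
    return A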

7.3 Graph Convolution-Based Feature Computation

Given the global graph Gglobal containing 2n nodes, let

F_{\text{in}} \triangleq \big[\, f_1 \ f_2 \ \cdots \ f_{2n} \,\big]^T \in \mathbb{R}^{2n \times D}, \qquad (7.5)

where each row represents the D-dimensional feature f_i^T of a node. This feature
matrix is input to a graph CNN, and let Fout be the output feature matrix after graph
convolution. In graph signal processing, these feature matrices are frequently referred
to as graph signals. Rewriting the input feature matrix Fin = [F1 F2 · · · FD ], the
input signal is 2n in length and has D channels ({F_i}_{i=1}^{D}). In traditional CNNs, each
convolution kernel is typically an array that is suitable for convolving with an image
since pixels are on a rectangular grid. However, graphs do not have such a regular
structure. Hence, a graph convolution filter is built from the global graph’s adjacency
matrix Aglobal so that the spatial connectivity and intra-image as well as inter-image
superpixel feature similarities can be exploited in order to obtain the output graph
signal.
Specifically, a convolution filter is designed as an affine function of the graph
adjacency matrix, where the coefficients represent the filter taps. Since the filter
is coupled with a single filter tap, it functions efficiently as a low pass
filter. This in turn facilitates the decomposition of the adjacency matrix into several
components as:

A_{\text{global}} = I + A_{\text{inter}} + \sum_{i=1}^{T} A_{d_i} \qquad (7.6)

Fig. 7.1 At each node, eight angular bins are considered. Subsequently, depending upon the rela-
tive orientation with respect to the center node (i.e., node at which convolution is getting computed,
shown in blue circle), each neighboring node falls into one of the bins. Consequently, the corre-
sponding directional adjacency matrix (Adi ) dedicated to the bin will be activated. For example,
node 3 falls in bin 2, and Ad2 gets activated. Thus, each neighboring node is associated with either
of eight adjacency matrices (for eight different directions). Figure courtesy: [6]

where I is the 2n × 2n identity matrix, Ainter is the inter-image adjacency matrix
introduced earlier, and Adi ’s are the directional adjacency matrices representing
superpixel adjacency in both images. We now explain their role.
Since Gglobal is composed of individual spatial adjacency graphs (G1 , G2 ), the
adjacency matrix Ainter is introduced in Eq. (7.6) to represent inter-image superpixel
feature similarities. It is crucial for co-segmentation since it contains information
about the common object in the form of feature similarities between various superpixel
pairs across images. We will see in the next section that it receives distinct attention
during convolution.
The set of all potential directions around every superpixel (node) in the respective
image is quantized here into T bins or directions, and each matrix Adi contains
information about the feature similarity of only the adjacent nodes along direction
di . The direction of a node r in relation to a reference node s is calculated by the angle
θ between the line segment connecting the centroids of two nodes (superpixels) and
the horizontal axis, as can be seen in Fig. 7.1. Thus, each row of Adi corresponds to
a specific node of the global graph, and it contains feature similarities between that
node and its first-order neighbors in the respective graph along the direction di .

A_{d_i}(s, r) = \begin{cases} A_{\text{global}}(s, r), & \text{if } (s, r) \in \mathcal{G}_1 \text{ or } \mathcal{G}_2, \text{ and } \theta(s, r) \in \text{bin}_i \\ 0, & \text{otherwise} \end{cases} \qquad (7.7)

Multiple adjacency matrices have been proven to have a significant effect on
graph convolution in [121]. Splitting an image’s pixel-based adjacency matrix into
directional matrices is trivial because each pixel’s local neighborhood structure is
identical, that is, each pixel has eight neighboring pixels on a regular grid. However,

performing the same operation on any random graph, as considered here, is not
straightforward because region adjacency graphs {Gi } constructed from different
images exhibit an inconsistent pattern of connectivity among nodes. As a result,
each node has a unique local neighborhood structure, including the number and
orientation of its neighbors. To address this scenario, uniformly partitioning the
surrounding 360◦ of each node of individual graphs into a set of angular bins is
beneficial. This helps to encode the orientation information of its various first-order
neighbors. Using a set of eight angular bins implicitly maintains the filter size 3 × 3,
motivated by the benefits and success of VGG nets. Thus, the number of distinct
directions T is chosen as 8, where each Adi corresponds to an angular bin.
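A small sketch of this angular binning is given below; it assumes that superpixel centroids are available as (row, column) coordinates and that A is the weighted intra-image adjacency matrix of Eq. (7.1). It is meant only to illustrate Eq. (7.7), not to reproduce the exact implementation.

import numpy as np

def directional_adjacencies(A, centroids, T=8):
    # Split a weighted intra-image adjacency matrix A (n x n) into T directional
    # matrices A_d (Eq. 7.7), using the angle between superpixel centroids.
    n = A.shape[0]
    Ad = np.zeros((T, n, n))
    for s in range(n):
        for r in range(n):
            if r == s or A[s, r] == 0:
                continue
            dy = centroids[r, 0] - centroids[s, 0]
            dx = centroids[r, 1] - centroids[s, 1]
            theta = np.arctan2(dy, dx) % (2 * np.pi)   # orientation in [0, 2*pi)
            k = min(int(theta // (2 * np.pi / T)), T - 1)
            Ad[k, s, r] = A[s, r]                      # activate the k-th angular bin
    return Ad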

7.3.1 Graph Convolution Filters

Given an input graph signal Fin consisting of D input features {F_i}_{i=1}^{D}, and adjacency
matrices {A_{d_k}}_{k=1}^{T} and Ainter representing the global graph, the graph convolution
filters are designed as:

H_i = h_{i,0}\, I + \sum_{k=1}^{T} h_{i,k}\, A_{d_k} + h_{i,T+1}\, A_{\text{inter}}, \quad \forall\, i = 1, 2, \ldots, D, \qquad (7.8)

where each graph convolution filter is the set {H_i}_{i=1}^{D}, with {h} being the filter taps.
It produces one new feature \tilde{F}_j from the D input features {F_i}_{i=1}^{D} through the graph
convolution operation as:

\tilde{F}_j = \sum_{i=1}^{D} H_i F_i. \qquad (7.9)

Therefore, in a particular graph convolutional layer, depending upon the number of
graph convolution filters p, a series of new features {\tilde{F}_j}_{j=1}^{p} is produced, result-
ing in a new graph signal F^{(1)} = [\tilde{F}_1, \tilde{F}_2, . . . , \tilde{F}_p] ∈ R^{2n×p}. Then, using the features in
F^{(1)} as the input signal, a second graph convolutional layer outputs F^{(2)}. Similarly,
F^{(3)}, F^{(4)}, . . . , F^{(L)} are obtained after L layers of graph convolutions, with Fout = F^{(L)}
being the final output graph signal. Next, we show that each F(l) can be represented
in a recursive framework, and this helps to analyze the node information propagation
beyond the first-order neighborhood.
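Before turning to the recursive formulation, the per-layer operation of Eqs. (7.8)–(7.9) can be written as a small NumPy sketch; the tensor shapes and the name H_taps are assumptions made purely for illustration.

import numpy as np

def graph_conv(F_in, Ad, A_inter, H_taps):
    # F_in   : (2n, D) input graph signal
    # Ad     : (T, 2n, 2n) directional adjacency matrices
    # A_inter: (2n, 2n) inter-image adjacency matrix
    # H_taps : (p, D, T + 2) filter taps h_{i,0}, ..., h_{i,T+1} for each of the
    #          p output channels and D input channels
    # Returns F_out : (2n, p), one column per new feature of Eq. (7.9).
    two_n, D = F_in.shape
    p = H_taps.shape[0]
    T = Ad.shape[0]
    I = np.eye(two_n)
    F_out = np.zeros((two_n, p))
    for j in range(p):
        for i in range(D):
            h = H_taps[j, i]
            # H_i = h_0 I + sum_k h_k A_dk + h_{T+1} A_inter   (Eq. 7.8)
            Hi = h[0] * I + np.tensordot(h[1:T + 1], Ad, axes=1) + h[T + 1] * A_inter
            F_out[:, j] += Hi @ F_in[:, i]                      # Eq. (7.9)
    return F_out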
Similar to CNNs, here a convolution kernel Hi is used for every input channel i,
where D is the number of input channels, and this set of D filters together is consid-
ered as one convolution filter. For every output channel j, let us denote Hi as H j,i
and define:

\tilde{H}_j \triangleq \big[\, H_{j,1} \ H_{j,2} \ \ldots \ H_{j,D} \,\big] \in \mathbb{R}^{2n \times 2nD}. \qquad (7.10)

A block diagonal matrix R^{(1)} is defined using all p convolution filters {\tilde{H}_j}_{j=1}^{p} as:

R^{(1)} = \begin{bmatrix} \tilde{H}_1 & O & \ldots & O \\ O & \tilde{H}_2 & \ldots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & \ldots & O & \tilde{H}_p \end{bmatrix} \qquad (7.11)

Let F^{(0)} = Fin be the input graph signal for layer 1, and define v^{(0)} ∈ R^{2nD} and
another block diagonal matrix B^{(0)} as:

v^{(0)} = \begin{bmatrix} F_1^{(0)} \\ F_2^{(0)} \\ \vdots \\ F_D^{(0)} \end{bmatrix} \quad \text{and} \quad B^{(0)} = \begin{bmatrix} v^{(0)} & o & \ldots & o \\ o & v^{(0)} & \ldots & o \\ \vdots & \vdots & \ddots & \vdots \\ o & \ldots & o & v^{(0)} \end{bmatrix} \qquad (7.12)

where the vector v(l) at any particular layer l contains output features of all nodes.
Now for layer 1, the output features can be calculated as:
R^{(1)} B^{(0)} \mathbf{1}^{(1)} = \begin{bmatrix} F_1^{(1)} \\ F_2^{(1)} \\ \vdots \\ F_p^{(1)} \end{bmatrix} = v^{(1)} \qquad (7.13)

It may be noted that F_j^{(1)} was denoted as \tilde{F}_j in Eq. (7.9) for simplicity of notation.
The output features at any layer l can be computed using a recursive relationship as:

v^{(l)} = R^{(l)} B^{(l-1)} \mathbf{1}^{(l)} \quad \text{for } l \geq 1 \qquad (7.14)

where \mathbf{1}^{(l)} is a vector containing p ones, with p being the number of channels at the output
of the l-th convolutional layer. Assuming L layers and p convolution filters in the L-th layer,
F_1^{(L)}, F_2^{(L)}, . . ., F_p^{(L)} are obtained from v^{(L)}, and they constitute Fout.

7.3.2 Analysis of Filter Outputs

Each graph convolution layer has its own collection of filters, each with a unique set
of learnable filter taps {h}, and as a result, creates a new set of node features, i.e.,
graph signal. Further, Eq. (7.14) shows that the graph signal at various layers can be
recursively computed from v(l) . Now v(1) is a function of the adjacency matrix (A) of
the graph (see Eqs. 7.8–7.14). Therefore, v(2) becomes a function of A2 , and similarly,
v(l) becomes a function of Al . Thus, convolution of graphs using a series of filters over

multiple layers gradually enhances the receptive field of the filter, by involving higher
order neighbors for computation of individual node features. For instance, at layer l,
l-th order neighbors are considered when computing the characteristics of individual
nodes, which increases the model’s nonlinearity. This substantially enhances the
amount of contextual information used in feature computation, and hence the model’s
performance. Representation of the graph convolution filter as a polynomial of the
adjacency matrix serves two purposes: (1) it makes the filter linear and shift invariant
by sharing filter parameters across nodes, and (2) the varying powers of the adjacency
matrix ensure that higher order neighbors are involved in feature computation, which
infuses the derived features with additional contextual information.
It can be observed from the convolution equations (7.8)–(7.14) that the model is capable
of handling any heterogeneous graph. In a CNN, the convolution operation itself does
not place any constraint on the image size or pixel connectivity, since it does not
depend on the input size. However, such a constraint is imposed by the
fully connected layers. Such et al. [121] addressed this issue by utilizing graph
embed pooling, which converts heterogeneous graphs to a fixed-size regular graph.
To accomplish node classification with the suggested method, we can connect a set
of fully connected layers and a softmax classifier to each node. However, it is not
possible to employ any type of node pooling approach in this formulation, as this
would result in the loss of genuine node information. As a result, it is required to
maintain a consistent number of superpixels for each image, ensuring an identical
number of nodes for both training and test images.

7.4 Network Architecture

The flow of the graph CNN-based co-segmentation method is shown in Fig. 7.2. The
network takes the global graph generated from an image pair as input, and performs
co-segmentation by classifying the nodes with appropriate labels. Specifically, the co-
segmentation branch assigns a binary label to each node: foreground or background,
and the classification branch predicts the class label of the entire global graph.
Co-segmentation branch: Given a global graph, this branch (highlighted in red
dotted line) first extracts superpixel features through a series of graph convolutional
layers, which make up the graph CNN. In this architecture, L = 8 convolutional
layers are used which contain 32, 64, 64, 256, 256, 512, 512, 512 filters, respectively.
Thus, the output feature matrix Fout ∈ R2n×512 obtained at the final convolutional
layer contains the 512-dimensional feature vector of every superpixel (node). Next,
to classify each node as either the common foreground or the background, the node
features are passed through four 1 × 1 convolution layers with 256, 128, 64, 2 filters,
respectively. All layers except the final 1 × 1 convolutional layer are associated with
a ReLU nonlinearity. The final 1 × 1 convolutional layer uses a softmax layer for
classifying each superpixel. Thus, this classification is based on the learned superpixel
features and does not use any semantic label information.

Fig. 7.2 Complete architecture of the co-segmentation model described in this chapter, where GCN stands for graph convolutional network and FC stands
for fully connected network. GCN:32F + 64F + 64F + 256F + 256F + 512F + 512F + 512F, 1 × 1 convolution:256F + 128F + 64F (for co-segmentation block),
FC:128F + 64F (for semantic segmentation block). The GCN is shared between two subnetworks. Each of the convolution and fully connected layers are
associated with ReLU nonlinearity and batch normalization. During training, both the co-segmentation and semantic classification blocks are present. During
testing, only the co-segmentation block (as highlighted in red dotted line) is present. Figure courtesy: [6]

Classification branch: This branch is responsible for learning the commonality
information in the image pair through classification of the global graph into one of
the K semantic classes assuming the images contain objects from a common class.
It shares the eight convolutional layers of the graph CNN in the co-segmentation
branch to extract the output node features Fout ∈ R2n×512 . Then this feature is passed
through a fully connected network of three layers with 128, 64 and K filters, respec-
tively, to classify the entire global graph into the common object class that the image
pair belongs to. The first two layers are associated with ReLU nonlinearity, whereas
the final layer is associated with a softmax layer to predict the class label. Unlike in
the co-segmentation branch where the 512-dimensional feature vector of every node
is classified into two classes, here the entire feature map Fout is flattened and input
to the fully connected network. Training this classification branch infuses semantic
information into the entire network while obtaining the graph convolution features.
It may be noted that since the task at hand is co-segmentation, the classifica-
tion branch is considered only during the training phase, where it affects the learned
weights of the graph convolutional layers. However, during test time, only the co-
segmentation branch is required for obtaining the label (foreground or background)
of each node.

7.4.1 Network Training and Testing Strategy

For the task of segmentation, the ground-truth data of an image is typically provided
as a mask where each pixel is labeled (foreground and background here). However,
for superpixel-based segmentation, as considered in this chapter, we first need to
compute ground-truth labels for each superpixel. To obtain the superpixel labels, the
oversegmentation map obtained using the SLIC algorithm is overlaid on the binary
ground-truth mask, and then each superpixel's area of overlap with the foreground
region is noted. If a superpixel s has an overlap of more than fifty percent of its total
area with the foreground region, it is assigned a foreground ( j = 1) label as:

x1,s = 1 and x2,s = 0.

Otherwise, it is assigned a background ( j = 2) label as:

x1,s = 0 and x2,s = 1.
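This labeling step can be sketched as follows; the sketch assumes a binary ground-truth mask with values in {0, 1} and a recent scikit-image version for the SLIC call, with the 50% overlap threshold stated above.

import numpy as np
from skimage.segmentation import slic

def superpixel_labels(image, gt_mask, n_segments=200):
    # Overlay the SLIC oversegmentation on the binary ground-truth mask and mark
    # a superpixel as foreground when more than half of its pixels overlap with
    # the foreground region.
    sp = slic(image, n_segments=n_segments, start_label=0)
    labels = {}
    for s in np.unique(sp):
        overlap = gt_mask[sp == s].mean()        # fraction of pixels inside the mask
        labels[s] = 1 if overlap > 0.5 else 0    # 1: foreground (x1=1), 0: background (x2=1)
    return sp, labels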

With this ground-truth information, the co-segmentation branch of the network is
trained using a binary cross-entropy loss Lbin that classifies each superpixel into
foreground–background. It is defined as:

L_{\text{bin}} = -\frac{1}{n_{\text{samples}}} \sum_{k=1}^{n_{\text{samples}}} \sum_{i=1}^{2} \sum_{s=1}^{n} \sum_{j=1}^{2} \frac{1}{\mu_j}\, x_{j,s}^{i,k} \log\big(\hat{x}_{j,s}^{i,k}\big) \qquad (7.15)

where n_samples denotes the number of image pairs (samples) in a mini-batch dur-
ing training, and n is the number of superpixels in every image. For a superpixel s
in image Ii of image pair k, the variables x_{j,s}^{i,k} ∈ {0, 1} and \hat{x}_{j,s}^{i,k} ∈ [0, 1] denote the
ground-truth label and the predicted probability, respectively, of superpixel s belong-
ing to the foreground or background. Further, it is quite common for the number of
foreground-labeled superpixels to be smaller than that of background-labeled superpix-
els, since the background typically covers more space in natural images. This creates data
imbalance during training. Hence, the foreground and background class superpixel
frequencies over the training dataset, μ1 and μ2 , have been incorporated in the loss
function of Eq. (7.15) to balance out the contribution of both classes.
The classification branch of the network is trained using a categorical cross-
entropy loss Lmulti that classifies the global graph into one of the K semantic classes.
It is defined as:

L_{\text{multi}} = -\frac{1}{n_{\text{samples}}} \sum_{k=1}^{n_{\text{samples}}} \sum_{i=1}^{2} \sum_{s=1}^{n} \sum_{c=1}^{K} \frac{1}{\phi_c}\, y_{c,s}^{i,k} \log\big(\hat{y}_{c,s}^{i,k}\big) \qquad (7.16)

where y_{c,s}^{i,k} ∈ {0, 1} and \hat{y}_{c,s}^{i,k} ∈ [0, 1] denote the true semantic class label and the pre-
dicted probability of superpixel s belonging to class c. Since the number of superpix-
els belonging to different semantic classes varies over the training dataset, similar to
the formulation of Lbin , the semantic class frequencies over the training dataset,
φc , are utilized to eliminate the problem of semantic class imbalance.
The entire network is trained using a weighted combination of the losses Lbin and
Lmulti as:
L = \omega_1 L_{\text{bin}} + (1 - \omega_1) L_{\text{multi}}, \qquad (7.17)

where ω1 is a weight. Thus, the parameters in the graph CNN layers are influenced by
both the losses, whereas the 1 × 1 convolutional layers of the co-segmentation branch
and the fully connected layers of the classification branch are influenced by Lbin and
Lmulti , respectively. This loss L can be minimized using a mini-batch stochastic
gradient descent optimizer. Backpropagating Lmulti ensures that the learned model
can differentiate different classes, and the computed features become more discrim-
inative and exclusive for distinct classes. Minimization of the loss Lbin computed
using the ground-truth binary masks of image pairs ensures that the learned model
can distinguish the foreground and background superpixels well resulting in a good
segmentation performance.
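A hedged PyTorch-style sketch of the combined loss of Eqs. (7.15)–(7.17) is given below. The tensor shapes are assumptions; the binary term is written per superpixel, while the semantic term is written here for a graph-level class prediction, and both use inverse class frequencies as weights, as described above.

import torch
import torch.nn.functional as F

def combined_loss(logits_bin, x_true, logits_cls, y_true, mu, phi, w1=0.5):
    # logits_bin: (B, 2n, 2) per-superpixel foreground/background scores
    # x_true    : (B, 2n)    ground-truth superpixel labels in {0, 1} (long dtype)
    # logits_cls: (B, K)     class scores for the global graph
    # y_true    : (B,)       semantic class indices (long dtype)
    # mu, phi   : class frequencies over the training set (sizes 2 and K)
    l_bin = F.cross_entropy(logits_bin.reshape(-1, 2), x_true.reshape(-1),
                            weight=1.0 / mu)              # Eq. (7.15)
    l_multi = F.cross_entropy(logits_cls, y_true,
                              weight=1.0 / phi)           # Eq. (7.16)
    return w1 * l_bin + (1.0 - w1) * l_multi              # Eq. (7.17)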
Once the model is learned, to co-segment a test image pair, the corresponding
global graph is passed through the co-segmentation branch, and the final softmax
layer classifies each superpixel as either the common foreground or background
class. However, the classification branch is not required during testing.

7.5 Experimental Results

In this section, we analyze the results obtained by the end-to-end graph CNN-based
co-segmentation method described in this chapter, denoted as PMG, on images from
the Internet dataset [105] and the PASCAL-VOC dataset [35]. We begin by listing
the choice of parameters in the PMG method for both datasets. The feature similarity
threshold t in Eq. (7.3) used for inter-graph node connections is chosen as t = 0.65.
The weight ω1 used to combine the losses Lbin and Lmulti in Eq. (7.17) is set to 0.5.
The network is initialized with Xavier’s initialization [41]. For stochastic gradient
descent, the learning rate and momentum are empirically fixed at 0.00001 and 0.9,
respectively, and the mini-batch size n samples is kept at 8 samples. Weight decay is
set to 0.0004 for the Internet dataset, and 0.0005 for the PASCAL-VOC dataset.
Jaccard similarity index (J ) is used as the metric to quantitatively evaluate
the performance of the methods. To evaluate a method for a set of m input images
I1 , I2 , . . . , Im , first all image pairs are co-segmented, and then the average Jaccard
index over all such pairs is calculated as the final co-segmentation accuracy for the
given set. Next, we describe the results on the two datasets.
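As an aside, this pairwise evaluation protocol can be sketched in a few lines; the cosegment callable below is a placeholder for any of the compared methods, and the masks are assumed to be binary NumPy arrays.

import numpy as np
from itertools import combinations

def jaccard(pred, gt):
    # Intersection over union of two binary masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def evaluate_set(images, gt_masks, cosegment):
    # cosegment(Ii, Ij) is assumed to return the predicted mask pair (Mi, Mj)
    scores = []
    for i, j in combinations(range(len(images)), 2):
        mi, mj = cosegment(images[i], images[j])
        scores.append(jaccard(mi, gt_masks[i]))
        scores.append(jaccard(mj, gt_masks[j]))
    return float(np.mean(scores))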

7.5.1 Internet Dataset

The Internet dataset [105] has three classes: airplane, car and horse, with a subset
of 100 images per class used for experiments as considered in relevant works [99].
However, there is no standard training-test split available for this dataset. Hence,
the dataset is randomly split in 3:1:1 ratio for training, validation and testing sets,
and image pairs from them are input to the network to compute the co-segmentation
performance. This process is repeated 100 times, and the average accuracy computed
over the 100 different test sets is reported.
The comparative results of the methods PMG, DCC [56], UJD [105], EVK [24],
GMR [99], DD [146] are shown in Table 7.1. The method PMG performs better than
other methods except the method DD. However, DD is constrained in two aspects.
First, it requires precomputed object proposals from the images. Then the propos-
als containing common objects are identified, and those proposals are segmented
independently. Thus, co-segmentation is not performed by directly leveraging the
inter-image feature similarities. Second, to compute correct object proposals, sepa-
rate fine tuning is required to obtain an optimum model. Therefore, this method may
fail to co-segment images containing complex object structures with occlusion. On
the contrary, PMG uses a global graph which is constructed using superpixels from
an image pair. Superpixel computation is easier than object proposal computation,
and it does not require learning any separate model. Further, PMG has the flexi-
bility of using any oversegmentation algorithm for computing superpixels. Unlike
DD, usage of the global graph and graph convolution helps PMG model perform
co-segmentation even if the common object in one image is occluded by some other

Table 7.1 Comparison of Jaccard index (J) of the methods PMG, DCC, UJD, EVK, GMR, DD on
the Internet dataset
Method Car (J) Horse (J) Airplane (J)
DCC 37.1 30.1 15.3
UJD 64.4 31.6 55.8
EVK 64.8 33.3 40.3
GMR 66.8 58.1 56.3
DD 72.0 65.0 66.0
PMG 70.8 63.3 62.2

Table 7.2 Comparison of Jaccard index (J) of the methods CMP, GMR, PMG evaluated on the
PASCAL-VOC dataset
Method Jaccard index
CMP 0.46
GMR 0.52
PMG 0.56

object since the model can use information from the common object in the other
image of the input pair. Figure 7.3 visually demonstrates co-segmentation results
on six image pairs. The first and third columns show three easy and three difficult
image pairs, respectively, and the second and fourth columns show the corresponding
co-segmentation results.

7.5.2 PASCAL-VOC Dataset

The PASCAL-VOC dataset [35] has 20 classes, and is more challenging due to
significant intra-class variations and presence of background clutter. Here, the same
experimental protocol as in the experiments with the Internet dataset is used. The
accuracy over the test set of all 20 classes are calculated and the average is reported
here. Table 7.2 shows comparative results of different methods. The method PMG
performs well because it involves semantic information, which infuses a high degree
of context into the computed features, thus making it robust to pose and appearance
changes. Common objects obtained using the method PMG on four image pairs are
shown in Fig. 7.4.
A visual comparison of co-segmentation results obtained using the methods PMG
and CMP [36] on four image pairs is shown in Fig. 7.5. In the TV image pair, the
appearance of some part of the background is similar to the common foreground
(TV). Hence, the method CMP incorrectly segments some part of the background
as foreground. In case of the other three image pairs, the similarity in shape of the

Fig. 7.3 Co-segmentation results of the method PMG described in this chapter on images from
airplane, horse and car classes of the Internet dataset. Columns 1, 3 show input image pairs and
Columns 2, 4 show the corresponding co-segmented objects. Figure courtesy: [6]

common foreground with other objects and background results in incorrect segmen-
tation. However, the method PMG learns features with strong semantic information
and context, hence performs well. This is demonstrated using the dog image pair,
where the three dogs in image-2 are not homogeneous in color. Specifically, their
body color is a mixture of both white and its complement. Yet, PMG co-segments
them correctly.

Fig. 7.4 Co-segmentation results of the method PMG on images from the PASCAL-VOC dataset.
Columns 1, 3, 5, 7 show input image pairs and Columns 2, 4, 6, 8 show the corresponding co-
segmented objects. Figure courtesy: [6]

Fig. 7.5 Visual comparison of co-segmentation results on images from the PASCAL-VOC dataset.
Columns 1, 4: input image pairs (TV, boat, dog, sofa classes). Columns 2, 5: results of CMP [36]
and Columns 3, 6: results of PMG. The detected common objects are shown using red contours.
Figure courtesy: [6]
Chapter 8
Conditional Siamese Convolutional
Network

8.1 Introduction

In the previous chapter, we have discussed and shown through different experiments
the advantage of employing a graph neural network for performing co-segmentation
across input images. The experimental results demonstrate the utility of a deep learn-
ing model to find out critical features for the co-segmentation task, and then use the
features to extract common objects efficiently in an end-to-end manner. However, per-
forming co-segmentation across a set of superpixels predetermined from the input
image set, instead of on the images directly, may be bottlenecked by the accuracy of
the superpixel extraction method. Furthermore, the inconsistent performance of a GCN
across graphs with a variable number of nodes constrains the method to maintain the
same number of superpixels across the input images, which may hinder the network's
performance in some cases. In addition, the model is not able to handle outliers
in the input set. In this chapter, we discuss a CNN-based end-to-end architecture for
co-segmentation that operates on the input images directly and can handle outliers as
well.
Specifically, we consider co-segmentation of image pairs in the challenging setting
where the images do not always contain common objects, similar to the setup in
Chaps. 5 and 6. Further, the shape, pose and appearance of the common objects in the
images may vary significantly. Moreover, the objects should be distinguishable from
the background. Humans can easily identify such variations. To capture these aspects
in the co-segmentation algorithm, a metric learning approach is used to learn a latent
feature space, which ensures that objects from the same class get projected very close
to each other and objects from different classes get projected apart, preferably at least
by a certain margin. The objective of the co-segmentation problem is: (i) Given an
input image pair I1 , I2 , and the corresponding ground-truth binary masks (M1 , M2 )
indicating common object pixels, learn a model through end-to-end training of the co-
segmentation network, and (ii) given a test image pair, detect if they share a common
object or not, and, if present, estimate the masks M̂1 , M̂2 using the learned model. The
co-segmentation network architecture is shown in Fig. 8.1, and we describe it next.


Fig. 8.1 The deep convolution neural network architecture for co-segmentation. An input image pair (I1 , I2 ) is input to a pair of encoder-decoder networks,
with shared weights (indicated by vertical dotted lines). Output feature maps of the ninth decoder layer pair (cT9 ) are vectorized ( f u1 , f u2 ) by global average
pooling (GAP) and input to a siamese metric learning network. Its output vectors f s1 and f s2 are then concatenated and input to the decision network that
predicts ( ŷr ) the presence or absence of any common object in the input image pair. Red dotted line arrows show backpropagation for L1 , and green dotted line
arrows show backpropagation for L2 and L3 . The complete network is trained for positive samples
Fig. 8.1 (Continued): The deep convolution neural network architecture for co-segmentation of an image pair without any common object. Green dotted line
arrows show backpropagation for L2 and L3 . For negative samples, the decoder network after cT9 is not trained

Fig. 8.1 (Continued): Details of the encoder-decoder network. 64-conv indicates convolution using 64 filters followed by ReLU. MP stands for max-pooling
with a kernel of size 2 × 2. NNI stands for nearest neighbor interpolation. Deconvolution is performed using convolution-transpose operation (convT). Figure
courtesy: [7]

8.2 Co-segmentation Framework

The co-segmentation model consists of a siamese convolution–deconvolution net-
work, or encoder-decoder synonymously (Sect. 8.2.1), a siamese metric learning
network (Sect. 8.2.2) and a decision network (Sect. 8.2.3). We first highlight the key
aspects of these modules, and then explain them in detail in subsequent sections.
• The siamese encoder network takes an image pair as input, and produces interme-
diate convolutional features.
• These features are passed to the metric learning network that learns an optimal
latent feature space where objects belonging to the same class are closer and objects
from different classes are well separated. This enables the convolutional features
learned by the encoder to segment the common objects accurately.
• To make the model class-agnostic, no semantic class label is used for metric
learning.
• The decision network uses the features learned by the metric learning network,
and predicts the presence or absence of a common object in the image pair.
• The siamese decoder network conditionally maps the encoder features into cor-
responding co-segmentation masks. These masks are used to extract the common
objects.
• The metric learning network and the decision network together condition the
encoder-decoder network to perform an accurate co-segmentation depending on
the presence of a common object in the input image pair.

8.2.1 Conditional Siamese Encoder-Decoder Network

The encoder-decoder network has two parts: a feature encoder and a co-segmentation
mask decoder. The siamese encoder consists of two identical feature extraction CNNs
with shared parameters and is built upon the VGG-16 architecture. The feature extrac-
tor network is composed of 5 encoder blocks containing 2, 2, 3, 3, 3 convolutional
layers (conv), respectively. Each block also has one max-pooling (MP) layer, which
makes the extracted features spatially invariant and contextual. Provided with an
image pair {I_i}_{i=1}^{2} ∈ R^{N×N} (with N = 224 as required by VGG-16) as input, this
network outputs high level semantic feature maps f 1 , f 2 ∈ R512×7×7 .
The siamese decoder block, which follows the encoder, takes the semantic feature
map pair f 1 , f 2 produced by the encoder as input, and performs the task of produc-
ing foreground masks of the common objects through two identical deconvolution
networks. The deconvolution network is formed by five spatial interpolation layers
with 13 transposed convolutional layers (convT). The max-pooling operation in the
encoder reduces the spatial resolution of the convolutional feature maps. Hence, the
use of 5 MP layers makes the encoder features very coarse. The decoder network
transforms these low resolution feature maps to the co-segmentation masks. The fea-
ture maps are upsampled using nearest neighbor interpolation (NNI). Although NNI

Fig. 8.2 Testing the co-segmentation model



is fast compared to bilinear or bicubic interpolation, it introduces blurring and spa-
tial artifacts. Therefore, each NNI is followed by a transposed convolution operation
in order to reduce these artifacts. Every deconvolution or transposed convolutional
layer except the final layer is followed by a ReLU layer. The final deconvolution
layer produces two single channel maps with size 224 × 224, which are converted
to co-segmentation masks M̂1 , M̂2 by sigmoid function. During test time, the output
layer of this network is gated by the binary output of the decision network to make
the conditional siamese convolutional network as shown in Fig. 8.2.
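A hedged PyTorch sketch of one such decoder stage (nearest neighbor interpolation followed by a transposed convolution and ReLU) is shown below; the channel sizes are placeholders, and the class is only an illustration of the upsampling pattern described above, not the exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    # One upsampling stage of the decoder: NNI followed by a transposed
    # convolution that suppresses the interpolation artifacts.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode='nearest')   # NNI upsampling
        return F.relu(self.deconv(x))

# e.g. taking the 512 x 7 x 7 encoder features one step towards the 224 x 224 masks
stage = DecoderStage(512, 512)
out = stage(torch.randn(1, 512, 7, 7))    # -> (1, 512, 14, 14)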

8.2.2 Siamese Metric Learning Network

This network helps to learn features that better discriminate objects from different
classes. It may be noted that in Chap. 6, we discussed an LDA based approach for
the same goal in an unsupervised setting. Here in the fully supervised framework,
this is achieved through metric learning where the images are projected to a feature
space such that the distance between two images containing the same class objects
gets reduced and that for an image pair without any commonality increases. This
projection is performed using a series of two fully connected layers with dimensions
128 and 64, respectively. The first layer has ReLU as nonlinearity, and the second
layer does not have any nonlinearity. These layers together constitute the metric
learning network. It takes input f u1 , f u2 ∈ R256 from the siamese encoder-decoder
network, and outputs a pair of feature vectors f s1 , f s2 ∈ R64 that represents the
objects in the learned latent space. One can use the final convolutional layer or
subsequent deconvolution layers as the source. We will analyze the choice of the
source in Sect. 8.3.4, and we will show through experiments that the deconvolution
layers at the middle of the decoder network are ideal. The reason is that they capture
sufficient object information. Hence, the output of the ninth deconvolution
layer (256 × 56 × 56) is used to find f u1 and f u2 . Specifically, global average pooling
(GAP) is performed over each channel of the deconvolution layer output to get the
vectors f u1 , f u2 . The network is trained using the standard triplet loss (Sect. 8.2.4).
Thus during backpropagation, the first nine deconvolution layers of the decoder
and all thirteen convolutional layers of the encoder also get updated. This infuses
the commonality of the image pair in the encoder features, which leads to better
masks. It is worth mentioning that f s1 , f s2 are not used for mask generation because
the GAP operation destroys the spatial information completely while computing
f u1 , f u2 . Hence, encoder features, i.e., the output of the final convolutional layer is
passed onto the decoder for obtaining the common object masks. However, f s1 , f s2
are utilized in the decision network as described next.

8.2.3 Decision Network

In this chapter, the co-segmentation task is not limited to extracting common objects,
if present. We are also required to detect cases when there is no common object present
in the image pair, and this is achieved using a decision network. This network uses the
output of the metric learning network f s1 , f s2 to predict the common object occur-
rence. It takes the 128-dimensional vector obtained by concatenating f s1 and f s2 as
the input, and passes the same to a series of two fully connected layers with dimen-
sions 32 and 1, respectively. The first layer is associated with a ReLU nonlinearity.
The second layer is associated with a sigmoid function that gives a probability of the
presence of common objects in the image pair. During the test stage, this probability
is thresholded to obtain a binary label. If the decision network predicts a binary
label 1, the output of the siamese decoder network gives us the co-segmentation
masks.

8.2.4 Loss Function

To train the co-segmentation network, we need to minimize the losses incurred in
the siamese encoder-decoder network, the siamese metric learning network and the
decision network. We define these losses next.
Let (I_r^a, I_r^p) and (I_r^a, I_r^n) denote a positive pair and a negative pair of images,
respectively, where I_r^a is an anchor, I_r^p belongs to the same class as the anchor, and
I_r^n belongs to a different class. For the positive sample, let (M_r^a, M_r^p) and (\hat{M}_r^a, \hat{M}_r^p)
be the corresponding pair of ground-truth masks and predicted masks obtained
from the decoder, respectively. We do not require the same for negative samples
because of the absence of any common object in them. The pixelwise binary cross-
entropy loss for training the encoder-decoder network is given as:
L_1 = -\sum_{r=1}^{n_{\text{samples}}} \sum_{i \in \{a,\,p\}} \sum_{j=1}^{N} \sum_{k=1}^{N} \Big\{ M_r^i(j,k) \log\big(\hat{M}_r^i(j,k)\big) + \big(1 - M_r^i(j,k)\big) \log\big(1 - \hat{M}_r^i(j,k)\big) \Big\}, \qquad (8.1)

where n_samples is the total number of samples (positive and negative image pairs) in
a mini-batch, and M̂( j, k) and M( j, k) denote the values of the ( j, k)-th pixel of the
predicted and true masks, respectively. It may be noted that this loss does not involve
the negative samples since the final part of the decoder is not trained for them, and this
will be elaborated in Sect. 8.2.5. The triplet loss is used to train the metric learning
network, and it is given as:


L_2 = \sum_{r=1}^{n_{\text{samples}}} \max\Big( 0,\; \big\| f_s(I_r^a) - f_s(I_r^p) \big\|^2 - \big\| f_s(I_r^a) - f_s(I_r^n) \big\|^2 + \alpha \Big), \qquad (8.2)

where α is a scalar-valued margin. Therefore, by minimizing this loss, the network
implicitly learns a feature space where the distance between I_r^a and I_r^p reduces, and
that between I_r^a and I_r^n increases at least by the margin α. The decision network
is trained using a binary cross-entropy loss as:


L_3 = -\sum_{r=1}^{n_{\text{samples}}} \big\{ y_r \log \hat{y}_r + (1 - y_r) \log(1 - \hat{y}_r) \big\}, \qquad (8.3)

where y_r = 1 and 0 for a positive pair (I_r^a, I_r^p) and a negative pair (I_r^a, I_r^n), respectively,
and ŷ_r ∈ [0, 1] is the predicted label obtained from its sigmoid layer. The overall loss
is computed as:

L_{\text{final}} = \omega_1 L_1 + \omega_2 L_2 + \omega_3 L_3 \qquad (8.4)

where ω1 , ω2 , ω3 are the corresponding weights with ω1 + ω2 + ω3 = 1.
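A hedged PyTorch sketch that combines the three losses of Eqs. (8.1)–(8.4) for a single training sample is given below; the variable names and shapes are assumptions, the mask term is mean-reduced for simplicity, and for a negative sample the mask loss is simply dropped (ω1 = 0), following the training strategy described next and the weight values reported later in Sect. 8.3.

import torch
import torch.nn.functional as F

def coseg_loss(pred_masks, gt_masks, fa, fp, fn_, y_hat, y, positive, alpha=1.0):
    # pred_masks, gt_masks: (2, 1, 224, 224) predicted / ground-truth masks of the pair
    # fa, fp, fn_         : 64-d embeddings of the anchor, positive and negative images
    # y_hat, y            : predicted probability and true label (float tensors)
    # positive            : True for a positive sample, False for a negative one
    l2 = F.relu((fa - fp).pow(2).sum() - (fa - fn_).pow(2).sum() + alpha)   # Eq. (8.2)
    l3 = F.binary_cross_entropy(y_hat, y)                                   # Eq. (8.3)
    if positive:
        l1 = F.binary_cross_entropy(pred_masks, gt_masks)                   # Eq. (8.1), mean-reduced
        return (l1 + l2 + l3) / 3.0     # w1 = w2 = w3 = 1/3
    return 0.5 * (l2 + l3)              # w1 = 0, w2 = w3 = 0.5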

8.2.5 Training Strategy

As mentioned earlier, at the time of training, the metric learning part guides the
encoder-decoder network to reduce the intra-class object distance and increase the
inter-class object distance. The binary ground-truth masks guide the encoder-decoder
network to differentiate common objects from the background based on the learned
features. Thus, a positive sample (an image pair with a common object) will predict
the common foreground at the output, whereas a negative sample (an image pair with
no common object) should predict the absence of any common object producing null
masks. However, forcing the same network to produce an object mask and a null
mask from a particular image depending on that image being part of a positive and
negative sample, respectively, hinders learning. Hence, the learning strategy should
be different in the two cases. In the case of positive samples, the whole network is
trained by backpropagating all three losses. However, for negative samples, only the
metric learning network, the decision network, the encoder and a part of the decoder
are trained, i.e., the part of the decoder responsible for mask generation is not trained.
This is achieved by backpropagating only the losses L2 and L3 . Thus, the predicted
masks are not utilized in training, and they are ignored. This is motivated by the fact
that for negative examples, the decoder network is not required to produce any mask
at all since the decision network notifies the absence of any common object. Hence,
the role of the decision network is to reduce the overall difficulty level of training
the deconvolution layers by making them produce object masks only for positive
samples. It helps to train the network properly, and also improves the performance as
shown in Fig. 8.6. To summarize, the entire network is trained for positive samples
(yr = 1), and a part of the network is trained for negative samples (yr = 0), thus
making the mask estimation of the siamese network a conditioned one.

During testing, the decision network predicts the presence or absence of a common
object in the input image pair. If the prediction is ŷr = 1, the output of the siamese decoder network
provides the corresponding co-segmentation masks. If the prediction is ŷr = 0, the
decoder output is not considered, and it is concluded that the image pair does not have
any common object. This is implemented in the architecture by gating the decoder
outputs through ŷr as shown in Fig. 8.2. Thus, prediction ŷr = 0 will yield null masks.
We show the experimental results for negative samples in Fig. 8.6.
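The conditional gating at test time can be expressed in a couple of lines; this is only a sketch of the idea in Fig. 8.2, with assumed tensor shapes.

import torch

def gated_masks(decoder_masks, y_hat, threshold=0.5):
    # decoder_masks: (2, 1, H, W) sigmoid outputs of the siamese decoder
    # y_hat        : scalar probability from the decision network
    gate = (y_hat > threshold).float()    # binary prediction \hat{y}_r
    return decoder_masks * gate           # null masks when no common object is predicted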

8.3 Experimental Results

In this section, we analyze the results obtained by the end-to-end siamese encoder-
decoder-based co-segmentation method described in this chapter, denoted as PMS.
The co-segmentation performance is quantitatively evaluated using precision and
Jaccard index. Precision is the percentage of correctly segmented pixels of both the
foreground and the background. It should be noted that this precision measure is
different from the one discussed in Chap. 4, which does not consider background.
Jaccard index is the intersection over union of the resulting co-segmentation masks
and the ground-truth common foreground masks. As in Chap. 7, to evaluate the
performance on a set of m input images I1 , I2 , . . . , Im , co-segmentation is performed
on all such pairs, and the average precision and Jaccard index are computed for the
given set. Next, we describe the results on the PASCAL-VOC dataset, the Internet
dataset and the MSRC dataset.
First, we specify various parameters used in obtaining the co-segmentation results
reported in this chapter using the PMS network. It is initialized with the weights of
the VGG-16 network trained on the Imagenet dataset for the image classification
task. Stochastic gradient descent is used as the optimizer. The learning rate and
momentum are fixed at 0.00001 and 0.9, respectively, for all three datasets. For
the PASCAL-VOC and the MSRC datasets, the weight decay is set to 0.0004, and
for the Internet dataset, it is set to 0.0005. At the time of training, the strategy of
Schroff et al. [111] has been followed for generating samples. For the case of positive
samples, the weights in Eq. (8.4) are set as ω1 = ω2 = ω3 = 1/3 to give them equal
importance. As explained in Sect. 8.2.5, the mask loss L1 is not backpropagated
for negative samples. Hence, the weights are set as ω1 = 0, ω2 = ω3 = 0.5. Due to
memory constraints, batch size is limited to 3 in the experiments, where each input
sample in a batch is a pair of input images, either positive or negative. All the input
images are resized to 224 × 224, and the margin α in the triplet loss L2 is set to 1.

8.3.1 PASCAL-VOC Dataset

This dataset has 20 classes. To prepare the dataset for training, it is randomly split
in the ratio of 3:1:1 for training, validation and testing sets. Since there is no stan-

Table 8.1 Comparison of precision (P) and Jaccard index (J) of the methods CMP [36], GMR [99],
ANP [134], CAT [50], PMS on the PASCAL-VOC dataset
Method Precision (P) Jaccard index (J)
CMP 84.0 0.46
ANP 84.3 0.52
a GMR 89.0 0.52
a CAT 91.0 0.60
a PMS 95.4 0.68
a Denotes deep learning-based methods

dard split available, this splitting process is repeated 100 times, and the average
performance is reported here. Table 8.1 shows comparative results of the methods
PMS, CMP [36], GMR [99], ANP [134], CAT [50]. The PMS method performs very
well because it involves convolution–deconvolution with pooling operations, which
incorporate a high degree of context into the feature computation. Furthermore, the metric
learning network learns a latent feature space where common objects come closer
irrespective of their pose and appearance variations, making the method robust to
such changes. This can also be observed in Fig. 8.3 where PMS segments common
objects even when they have significant pose and appearance variations (Rows 3, 4).

Fig. 8.3 Co-segmentation results on the PASCAL-VOC dataset. In each row, Columns 1, 3 show
an input image pair, and Columns 2, 4 show the corresponding co-segmented objects obtained using
PMS. Figure courtesy: [7]

Table 8.2 Comparison of precision (P) and Jaccard index (J) of the methods DCC [56], UJD [105],
EVK [24], GMR [99], CAT [50], DD [146], DOC [72], CSA [20], PMS on the Internet dataset
Method C (P) C (J) H (P) H (J) A (P) A (J) M (P) M (J)
DCC 59.2 0.37 64.2 0.30 47.5 0.15 57.0 0.28
UJD 85.4 0.64 82.8 0.32 88.0 0.56 82.7 0.51
EVK 87.6 0.65 89.3 0.33 90.0 0.40 89.0 0.46
*GMR 88.5 0.67 85.3 0.58 91.0 0.56 89.6 0.60
*CAT 93.0 0.82 89.7 0.67 94.2 0.61 92.3 0.70
*DD 90.4 0.72 90.2 0.65 92.6 0.66 91.0 0.68
*DOC 94.0 0.83 91.4 0.65 94.6 0.64 93.3 0.70
*CSA – 0.80 – 0.71 – 0.71 – 0.73
*PMS 95.2 0.87 96.2 0.72 96.7 0.71 96.1 0.77
C, H and A stand for car, horse and airplane classes. M denotes the mean value. *Denotes deep
learning-based methods

Table 8.3 Precision (P) and Jaccard index (J) of PMS trained with the PASCAL-VOC dataset and
evaluated on the Internet dataset
C (P) C (J) H (P) H (J) A (P) A (J) M (P) M (J)
94.6 0.85 93.0 0.67 94.3 0.65 94.0 0.72

8.3.2 Internet Dataset

This dataset has three classes: airplane, horse and car. Comparative results of different
methods are shown in Table 8.2. The PMS method described in this chapter has
some similarities with the method DOC [72], but the use of metric learning with a
decision network in PMS, as opposed to the use of a mutual correlator in DOC, makes
PMS 6 times faster per epoch. This, along with the conditional siamese encoder-
decoder significantly improves co-segmentation performance over DOC. We show
visual results of PMS in Fig. 8.4. Further, it is also evaluated by training the network
using the PASCAL-VOC dataset and testing on the Internet dataset. The results are
shown in Table 8.3.

8.3.3 MSRC Dataset

To evaluate the PMS method on the MSRC dataset, a subset containing the classes
cow, plane, car, sheep, bird, cat and dog is chosen as has been widely used in relevant
works to evaluate co-segmentation performance [105, 130, 135]. Each class has 10
images, and there is a single object in each image. The objects from each class have
color, pose and scale variations in different images. The experimental protocol is the
same as that of the PASCAL-VOC dataset. Comparative results of different methods

Fig. 8.4 Co-segmentation results on Internet dataset. In each row, Columns 1, 3 and 5, 7 show two input image pairs, and Columns 2, 4 and 6, 8 show the
corresponding co-segmented objects obtained using PMS. Figure courtesy: [7]

Fig. 8.4 (Continued): Co-segmentation results on Internet dataset. In each row, Columns 1, 3 and 5, 7 show two input image pairs, and Columns 2, 4 and 6, 8
show the corresponding co-segmented objects obtained using PMS

Table 8.4 Comparison of precision and Jaccard index of the methods in [130], UJD [105], CMP [36,
135], DOC [72], CSA [20], PMS on the MSRC dataset
Method Precision Jaccard index
[130] 90.0 0.71
UJD 92.2 0.75
CMP 92.0 0.77
[135] 92.2 –
*DOC 94.4 0.80
*CSA 95.3 0.77
*PMS 96.3 0.85
*Denotes deep learning-based methods

are shown in Table 8.4. The PMS model was trained on the PASCAL-VOC dataset
since the MSRC dataset does not have a sufficient number of samples for training. Yet,
it outperforms the other methods. Figure 8.5 shows visual results obtained using PMS.

8.3.4 Ablation Study

In this section, we analyze (i) the role of the siamese metric learning network and
the decision network in PMS, and (ii) the choice of the layer in the encoder-decoder
network that acts as the input to the metric learning network. To analyze the first,
a baseline model PMS-base is created that only has the siamese encoder-decoder
network in the architecture. Further, to fuse information from both the images, the
encoder features f 1 , f 2 ∈ R512×7×7 are concatenated along the channel dimension to
obtain feature maps [ f 1 ; f 2 ], [ f 2 ; f 1 ] ∈ R1024×7×7 for I1 and I2 , respectively, and they
are passed to the corresponding decoder network for mask generation. In the absence
of the decision network in PMS-base, the entire siamese encoder-decoder network is
trained for negative samples as well using null masks. The performance of PMS-base
is compared with the complete model PMS on different datasets in Table 8.5. The
improved performance of PMS over PMS-base is also visually illustrated in Fig. 8.6.
Different class objects in the image pairs (Rows 1–3) are incorrectly detected as
common objects by PMS-base, whereas PMS correctly infers that there is no common
object in the image pairs by predicting null masks. The image pair in Row 4 contains
objects from the same class; however, the objects have different pose and size. In the
absence of the metric learning network, PMS-base performs poorly. This is due to the
absence of any explicit object similarity learning module, which is essential for co-
segmentation. Hence, the inclusion of both the subnetworks in the PMS architecture
along with the novel training strategy of partially training the decoder for negative
samples is justified.
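To make the fusion step in PMS-base concrete, the following sketch shows the channel-wise concatenation of the two encoder feature maps described above; the tensor shapes follow the text, but the decoder call is only a placeholder and not the authors' implementation.

```python
import torch

# Hypothetical encoder outputs for images I1 and I2 with shape (B, 512, 7, 7).
f1 = torch.randn(4, 512, 7, 7)
f2 = torch.randn(4, 512, 7, 7)

# PMS-base fuses the pair by channel-wise concatenation:
# [f1; f2] is decoded into the mask of I1, and [f2; f1] into the mask of I2.
f12 = torch.cat([f1, f2], dim=1)  # (B, 1024, 7, 7)
f21 = torch.cat([f2, f1], dim=1)  # (B, 1024, 7, 7)

# mask1 = decoder(f12)  # placeholder decoder producing the mask of I1
# mask2 = decoder(f21)  # placeholder decoder producing the mask of I2
```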

Fig. 8.5 Co-segmentation results on MSRC dataset. In each row, Columns 1, 3 and 5, 7 show two input image pairs, and Columns 2, 4 and 6, 8 show the
corresponding co-segmented objects obtained using PMS. Figure courtesy: [7]

Fig. 8.6 Ablation study using images from the Internet dataset. Columns 1, 2 show input image pairs, Columns 3, 4 show the objects obtained (incorrectly)
using PMS-base. Columns 5, 6 show that PMS correctly identifies the absence of common objects (Rows 1–3), indicated by empty boxes. Figure courtesy: [7]

Table 8.5 Comparison of Jaccard index (J) of the PMS model with the baseline model on different datasets

Dataset        PMS-base    PMS
PASCAL-VOC     0.47        0.68
Internet       0.61        0.77
MSRC           0.63        0.85

Table 8.6 Comparison of Jaccard index (J) of the PMS model while connecting the input of the metric learning network to different layers of the decoder network

Dataset \ Layer    c13     cT3     cT6     cT9     cT11
PASCAL-VOC         0.63    0.65    0.66    0.68    0.64
Internet           0.67    0.68    0.68    0.72    0.65

Next we analyze the input to the metric learning network. Table 8.6 shows the
performance of the PMS model by choosing the output of different layers of the
siamese encoder-decoder network:
• c13 : the final layer of the siamese encoder network,
• cT3 , cT6 , cT9 and cT11 : the third, sixth, ninth and eleventh deconvolution layers,
respectively.
The model performs the best for cT9 . The reasons, we believe, are that (a) sufficient
object information is available at cT9 , which may not be available in lower layers c13 ,
cT3, cT6, and (b) the higher deconvolutional layers (cT11 and cT13) are dedicated to
producing co-segmentation masks. Hence, cT9 is the optimal layer. To summarize,
the co-segmentation method described in this chapter can extract common objects
from an image pair, if present, even in the presence of variations in datasets. In the next
chapter, we will discuss a method that co-segments multiple images simultaneously.
Chapter 9
Few-shot Learning for Co-segmentation

9.1 Introduction

Convolutional neural network (CNN) based models automatically compute suitable


features for co-segmentation with varying levels of supervision [20, 67, 72, 146].
To extract the common foreground, all images in the group are used to leverage
shared features to recognize commonality across the group as shown in Fig. 9.1a,
b. However, all these methods require a large number of training samples for better
feature computation and mask generation of the common foreground. In many real
scenarios, we are presented with datasets containing a few labeled samples (Fig. 9.2a).
Annotating input images in the form of a mask for the common foreground is also
a very tedious task. Hence in this chapter, image co-segmentation is investigated in
a few-shot setting. This implies performing the co-segmentation task over a set of
a variable number of input images (called co-seg set) by relying on the guidance
provided by a set of images (called guide set) to learn the commonality of features as
shown in Fig. 9.2c. We describe a method that learns commonality corresponding to
the foreground of interest without any semantic information. For example as shown
in Fig. 9.3, commonality corresponding to the foreground horse is learned from the
guide set, which is exploited to segment the common foreground of interest from the
co-seg set images.
To solve co-segmentation in a few-shot setting, a meta learning-based train-
ing method can be adopted where an end-to-end model learns the concept of co-
segmentation using a set of episodes sampled from a larger dataset, and subsequently
adapts its knowledge to a smaller target dataset of new classes. Each episode consists
of a guide set and a co-seg set that together mimic the few-shot scenario encountered
in the smaller dataset. The guide set learns commonality using a feature integration
technique and associates it with the co-seg set individuals with the help of a vari-
ational encoder and an attention mechanism to segment the common foreground.
Thus, this encoder along with the attention mechanism helps to model the common
foreground, where the intelligent feature integration method boosts the quality of its


Fig. 9.1 a, b Traditional supervised co-segmentation methods use a large training set to learn to
extract common objects. c The few-shot setting requires only a smaller training set (guide set) to
perform co-segmentation of the test image set (co-seg set)

feature. To improve the generalization capacity of the model, it is trained only using
the co-segmentation loss computed over the co-seg set.

9.2 Co-segmentation Framework

Given a dataset $\mathcal{D}_{target} = \{(x_i^t, y_i^t)\}_{i=1}^{n} \cup \{x_j^u\}_{j=1}^{r}$ containing a small set $\{(x_i^t, y_i^t)\}_{i=1}^{n}$ of annotated training images and corresponding ground-truth masks, the objective of co-segmentation is to estimate common object masks $\{y_1^u, y_2^u, \ldots, y_r^u\}$ for the unlabeled target samples or the test set $\{x_1^u, x_2^u, \ldots, x_r^u\}$. A meta-learning approach for solving this problem is explained next.



Fig. 9.2 The figure illustrates traditional multi-image co-segmentation and the few-shot multi-image co-segmentation described in this chapter. a Due to fewer training samples, the traditional method fails to obtain accurate co-segmentation results. b Few-shot training using a cross-encoder (DVICE) performs co-segmentation more accurately. Figure courtesy: [5]

9.2.1 Class Agnostic Meta-Learning

Inspired by [117], few-shot learning for co-segmentation is defined as follows.
Let us consider two datasets: a base set Dbase with a large number of annotated
samples and a target set Dtarget with a small number of annotated samples for co-
segmentation. The model is iteratively trained over Dbase using a series of episodes
consisting of a guide set and a co-seg set. Each guide set and co-seg set are designed
such that they mimic the characteristics of the small training and test set of Dtarget
as shown in Fig. 9.4. The role of the guide set and the co-seg set is similar to the

support set and query set, respectively, typically encountered in the contemporary
few-shot learning [91, 115, 117]. However unlike the support set, the guide set here
does not rely on semantic class labels, and it is even tolerant to the presence of
outliers while guiding the network to learn and perform foreground extraction over
the co-seg set as shown in Fig. 9.3. The guide set discussed in this chapter includes
samples (images) that contain a dominant class and samples (outlier images) that
contain other non-dominant classes, which we call positive and negative samples,
respectively. The positive samples share a common foreground which is same as the
foreground of interest that is to be extracted from the co-seg set. Due to the lack of
sufficient training samples in the target dataset Dtarget , meta-learning is used to learn
and extract transferable embedding, thus facilitating better foreground extraction on
the target dataset. An episodic training scheme for this purpose is described next.
A few-shot learning strategy is adopted to improve co-segmentation performance
on the smaller target dataset Dtarget , on which standard training leads to over-fitting.
Hence, a larger dataset Dbase is developed for the co-segmentation task such that
Dtarget ∩ Dbase = φ. In order to simulate the training scenario of Dtarget , multiple
episodes over Dbase are created, and an episodic training scheme is developed so
that the model learns to handle the co-segmentation task with few training samples
without overfitting.
Each episode consists of a guide set (G ) and a co-seg set (C ) such that any
operation over set C is directed by set G , which provides the information of the
common object to C over which co-segmentation is performed. To accommodate a practical scenario, G is allowed to contain negative samples (outliers). Thus in each episode, the guide set is designed as $\mathcal{G} = \{P^g \cup N^g\} = \{(x_1^g, y_1^g), \ldots, (x_k^g, y_k^g)\}$, consisting of n randomly selected positive samples $\{P^g\}$ and $k-n$ randomly selected negative samples $\{N^g\}$. The co-seg set is designed as $\mathcal{C} = \{(x_1^c, y_1^c), \ldots, (x_m^c, y_m^c)\}$. Here, the cardinality of $P^g$ is chosen as n because the number of annotated samples available in Dtarget is also n. Next, a prototype Og for the common object present in
the guide set is obtained from the encoder features of images in it as:

$$O^g = \frac{1}{|\mathcal{G}|} \sum_{i=1}^{k} \mathrm{ChAM}\big(E(x_i^g)\big) , \tag{9.1}$$

where the encoder E is a part of the directed variational inference cross-encoder, to be explained in detail in Sect. 9.2.2, and ChAM is a channel attention mechanism, to be described in Sects. 9.3.1 and 9.3.2. The averaging operation removes the influence of outliers and makes $O^g$ robust. The ChAM module is used to focus on the semantically meaningful part of the image by exploiting the inter-channel relationship of features. An embedding of each co-seg set sample $x_j^c$ is also computed as $z_j^c = E(x_j^c)$. Then its attention $\mathrm{ChAM}(z_j^c)$ is concatenated channel-wise with $O^g$ and passed to the decoder for obtaining the co-segmentation mask $\hat{y}_j^c$. Thus, the decoder implicitly checks the similarity between the common object prototype and objects present in that image ($x_j^c$), and estimates $\hat{y}_j^c$ accordingly. The spatial importance of each pixel at different layers of the encoder is captured by the spatial attention module (SpAM),
to be described in Sect. 9.3.3, and it is used for aiding the decoder to focus on the localization of the common foreground.

Fig. 9.3 Illustration of the co-segmentation architecture described in this chapter. [Top-right] The spatial attention module (SpAM). [Bottom-right] The channel attention module (ChAM). [Bottom-left] A guide set containing horse as the majority class. [Top-left] The architecture of the cross-encoder consisting of the SpAM and ChAM modules, which learns the commonality from the guide set and learns to segment the common objects. During test time, the method extracts the common objects from an outlier contaminated co-seg set. Here for simplicity, only two images from the co-seg set are shown. Figure courtesy: [5]

Fig. 9.4 Sample image sets used for training and testing. a The base dataset Dbase used during training is a large set of labeled images. b The target dataset Dtarget used during testing has fewer labeled images. The method is evaluated on the unlabeled sets in Dtarget. c Guide sets and co-seg sets are sampled from the base set Dbase for training the cross-encoder (DVICE). d During the evaluation phase, guide sets and co-seg sets are sampled from the labeled and unlabeled sets of the target set Dtarget, respectively, and the common objects in the co-seg sets are extracted. Figure courtesy: [5]
While training for common foreground extraction, this framework relies only on
the assumption that there exists some degree of similarity between the guide set and
co-seg set. Thus any semantic class information is not used during training as can
be seen in Fig. 9.3, and hence, this few-shot co-segmentation strategy is completely
class agnostic.
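As a concrete illustration of the episodic scheme and the guide-set prototype of Eq. (9.1), the sketch below samples one episode and averages the channel-attended encoder features over the guide set. It is a minimal PyTorch sketch: `encoder` and `cham` are stand-ins for the modules E and ChAM of Sect. 9.3, and the sampling logic is only indicative of the description above, not the authors' code.

```python
import random
import torch

def sample_episode(positive_ids, negative_ids, n, k, m):
    """Build one episode from lists of sample identifiers (e.g. file paths):
    a guide set with n positive and k-n negative (outlier) samples, and a
    co-seg set with m further positive samples."""
    guide = random.sample(positive_ids, n) + random.sample(negative_ids, k - n)
    remaining = [p for p in positive_ids if p not in guide]
    coseg = random.sample(remaining, m)
    return guide, coseg

def guide_prototype(guide_images, encoder, cham):
    """Prototype O^g of Eq. (9.1): average the channel-attended embeddings of
    the guide-set images (a (k, 3, H, W) tensor). Averaging suppresses the
    influence of the outlier samples."""
    z = encoder(guide_images)   # (k, C, h, w) latent embeddings
    return cham(z).mean(dim=0)  # mean over the guide set
```

In an actual training loop, one such episode would be drawn from Dbase at every iteration, mirroring the episodic scheme described above.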

9.2.2 Directed Variational Inference Cross-Encoder

In this section, we describe the encoder-decoder model used for co-segmentation. It is built on the theory of variational inference to learn a continuous feature space over input images for better generalization. However, unlike the traditional variational auto-encoder setup, a cross-encoder is employed here for mapping an input image x to the corresponding mask y based on a directive Og obtained from the guide set G. Thus for any image $(x^c, y^c) \in \mathcal{C}$, randomly sampled from an underlying unknown joint distribution $p(y^c, x^c; \theta)$, the purpose of the encoder-decoder model is to estimate the parameters $\theta$ of the distribution from its likelihood, given G. The joint probability needs to be maximized as:

$$\max_{\theta}\, p(y^c, x^c; \theta) = \max_{\theta} \int_{z^c}\!\int_{O^g} p(x^c, y^c, O^g, z^c)\, dO^g\, dz^c . \tag{9.2}$$

For simplicity, we drop $\theta$ from $p(y^c, x^c)$ in subsequent analysis. The process of finding the distribution $p(y^c, x^c)$ implicitly depends upon the latent embedding of the sample $x^c$, which is $z^c$, and the common class prototype $O^g$ computed over G. The crux of the variational approach here is to learn the conditional distribution $p(z^c|x^c)$ that can produce the output mask $y^c$, and thus maximize $p(y^c, x^c)$. Here, $O^g$ and $x^c$ are independent of each other as the sets G and C are generated randomly. Thus, Eq. (9.2) can be written as:

$$
\begin{aligned}
p(y^c, x^c) &= \int_{z^c}\!\int_{O^g} p(y^c|O^g, z^c)\, p(O^g|z^c, x^c)\, p(z^c|x^c)\, p(x^c)\, dO^g\, dz^c \\
&= \int_{z^c}\!\int_{O^g} p(y^c|O^g, z^c)\, p(O^g)\, p(z^c|x^c)\, p(x^c)\, dO^g\, dz^c
\end{aligned}
\tag{9.3}
$$

Since $z^c$ is the latent embedding corresponding to $x^c$, they provide redundant information, and hence we refrain from using them together inside the joint probability. The main idea behind the variational method used here is to learn the distributions $q(O^g)$ and $q(z^c|x^c)$ that can approximate the distributions $p(O^g)$ and $p(z^c|x^c)$ over the latent variables, respectively. Therefore, Eq. (9.3) can be written as:

$$
\begin{aligned}
p(y^c, x^c) &= \int_{z^c}\!\int_{O^g} \frac{p(y^c, O^g, z^c)}{q(O^g, z^c)}\, q(O^g)\, q(z^c|x^c)\, p(x^c)\, dO^g\, dz^c \\
&= p(x^c)\, \mathbb{E}_{(O^g, z^c) \sim q(O^g, z^c)}\!\left[\frac{p(y^c, O^g, z^c)}{q(O^g, z^c)}\right]
\end{aligned}
\tag{9.4}
$$

Taking the logarithm of Eq. (9.4) followed by Jensen's inequality, we have

$$
\begin{aligned}
\log p(y^c, x^c) &\geq \mathbb{E}_{(O^g, z^c) \sim q(O^g, z^c)}\!\left[\log \frac{p(y^c, O^g, z^c)}{q(O^g, z^c)}\right] \\
&\geq \mathbb{E}_{(O^g, z^c) \sim q(O^g, z^c)}\!\left[\log p(y^c|O^g, z^c)\right] \\
&\quad - KL\!\left[q(O^g|\mathcal{G})\,\|\,p(O^g|\mathcal{G})\right] - KL\!\left[q(z^c|x^c)\,\|\,p(z^c|x^c)\right]
\end{aligned}
\tag{9.5}
$$

From the evidence lower bound [11] obtained in Eq. (9.5), we observe that maximizing it will in turn maximize the target log-likelihood of generating a mask $y^c$ for a given input image $x^c$. Thus, unlike the traditional variational auto-encoders, here a continuous embedding Q is learned, which guides mask generation. The terms $q(z^c|x^c)$ and $q(O^g)$ denote the mapping operation of encoders with shared weights, and $p(y^c|O^g, z^c)$ denotes the decoder part of the network that is responsible for generating the co-segmentation mask given the common object prototype $O^g$ and the latent embedding $z^c$.

From Eq. (9.5), the loss (L) to train the network is formulated over the co-seg set as:

$$
\mathcal{L} = -\sum_{j=1}^{m} \sum_{(a,b)} \log p\big(y_j^c(a,b)\,|\,O^g, z_j^c\big) + KL\!\left[q(O^g|\mathcal{G})\,\|\,p(O^g|\mathcal{G})\right] + KL\!\left[q(z^c|x^c)\,\|\,p(z^c|x^c)\right] ,
\tag{9.6}
$$

where $y_j^c(a,b)$ is the prediction at pixel location $(a,b)$. The network is trained over the larger dataset Dbase using multiple episodes until convergence. During test time, the labeled set $\{(x_i^t, y_i^t)\}_{i=1}^{n}$ of Dtarget is used as the guide set, and the unlabeled set $\{x_1^u, x_2^u, \ldots, x_r^u\}$ is used as the co-seg set. Hence, the final co-segmentation accuracy is examined over the corresponding co-seg set of Dtarget.
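For illustration, a minimal sketch of the episode loss of Eq. (9.6) is given below, assuming diagonal Gaussian approximate posteriors with standard normal priors (a common choice that the text does not explicitly commit to) and a per-pixel binary cross-entropy as the negative log-likelihood of the masks.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # KL[q || p] in closed form for a diagonal Gaussian q against N(0, I).
    # The Gaussian parameterization is an assumption made for this sketch.
    return -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())

def episode_loss(pred_masks, gt_masks, mu_og, logvar_og, mu_z, logvar_z):
    """Sketch of Eq. (9.6) over one co-seg set.

    pred_masks, gt_masks: (m, 1, H, W) sigmoid outputs and ground-truth masks.
    (mu_og, logvar_og):   parameters of q(O^g | G).
    (mu_z, logvar_z):     parameters of q(z^c | x^c) for the co-seg samples.
    """
    # Negative log-likelihood of the masks, summed over pixels and samples.
    nll = F.binary_cross_entropy(pred_masks, gt_masks, reduction='sum')
    # The two KL regularizers of Eq. (9.6).
    return (nll
            + kl_to_standard_normal(mu_og, logvar_og)
            + kl_to_standard_normal(mu_z, logvar_z))
```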

9.3 Network Architecture

The overall network architecture is shown in Fig. 9.3. ResNet-50 forms the backbone
of the encoder-decoder framework, which in combination with the channel and spatial
attention modules forms the complete pipeline. The individual modules, as shown in
Fig. 9.3, are explained briefly next.

9.3.1 Encoder-Decoder

The variational encoder-decoder is implemented using the ResNet-50 architecture as its backbone. The encoder (E) is just the ResNet-50 network with a final additional
1 × 1 convolutional layer. The decoder has five stages of upsampling and convolu-
tional layers with skip connections through a spatial attention module as shown in
Fig. 9.3. The encoder and decoder are connected through a channel attention module.
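A minimal sketch of such an encoder is given below, using torchvision's ResNet-50 with its classification head removed and an extra 1 × 1 convolution; the output channel count is an illustrative choice, not a value taken from the text.

```python
import torch.nn as nn
from torchvision.models import resnet50

class Encoder(nn.Module):
    """Sketch of E: ResNet-50 backbone (avgpool and fc removed) followed by an
    additional 1x1 convolution; 512 output channels is an assumed value."""
    def __init__(self, out_channels=512):
        super().__init__()
        backbone = resnet50()  # pre-trained weights would be loaded in practice
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, out_channels, kernel_size=1)

    def forward(self, x):          # x: (B, 3, H, W)
        return self.proj(self.features(x))
```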

9.3.2 Channel Attention Module (ChAM)

Channel attention of an image x is computed from its embedding $z = E(x)$ obtained from the encoder. First, z is compressed through pooling. To boost the representational power of the embedding, both global average-pooling and max-pooling are performed simultaneously. The output vectors from these operations, $z_{avg}$ and $z_{max}$, respectively, are then fed to a multi-layer perceptron (MLP) to produce the channel attention as:

$$\mathrm{ChAM}(z) = \sigma\big(\mathrm{MLP}(z_{avg}) + \mathrm{MLP}(z_{max})\big) , \tag{9.7}$$

where $\sigma$ is the sigmoid function.
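A possible PyTorch realization of Eq. (9.7) is sketched below; the hidden width of the perceptron (a reduction ratio of 16) is an assumption made for the sketch.

```python
import torch
import torch.nn as nn

class ChAM(nn.Module):
    """Channel attention module of Eq. (9.7): pooled embeddings pass through a
    perceptron and a sigmoid to give one attention weight per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, z):                     # z: (B, C, H, W)
        z_avg = torch.mean(z, dim=(2, 3))     # global average-pooling -> (B, C)
        z_max = torch.amax(z, dim=(2, 3))     # global max-pooling     -> (B, C)
        return torch.sigmoid(self.mlp(z_avg) + self.mlp(z_max))
```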

9.3.3 Spatial Attention Module (SpAM)

The inter-spatial relationship among features is utilized to generate the spatial attention map. To generate the attention map for a given feature $F \in \mathbb{R}^{C \times H \times W}$, both average-pooling and max-pooling are applied across the channels, resulting in $F_{avg} \in \mathbb{R}^{H \times W}$ and $F_{max} \in \mathbb{R}^{H \times W}$, respectively. Here, C, H and W denote the channels, height and width of the feature map. These are then concatenated channel-wise to form $[F_{avg}; F_{max}]$. A convolution operation followed by a sigmoid function is performed over the concatenated features to get the spatial attention map $\mathrm{SpAM}(F) \in \mathbb{R}^{H \times W}$ as:

$$\mathrm{SpAM}(F) = \sigma\big(\mathrm{Conv}([F_{avg}; F_{max}])\big) . \tag{9.8}$$
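A corresponding sketch of Eq. (9.8) follows; the 7 × 7 kernel of the convolution is an illustrative choice not specified in the text.

```python
import torch
import torch.nn as nn

class SpAM(nn.Module):
    """Spatial attention module of Eq. (9.8): channel-wise average- and
    max-pooled maps are concatenated, convolved and passed through a sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):                              # feat: (B, C, H, W)
        f_avg = torch.mean(feat, dim=1, keepdim=True)     # (B, 1, H, W)
        f_max = torch.amax(feat, dim=1, keepdim=True)     # (B, 1, H, W)
        att = self.conv(torch.cat([f_avg, f_max], dim=1)) # conv over [F_avg; F_max]
        return torch.sigmoid(att)                         # (B, 1, H, W) attention map
```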

9.4 Experimental Results

In this section, we analyze the results obtained by the co-segmentation method


described in this chapter, denoted as PMF. The PASCAL-VOC dataset is used as
the base set Dbase over which the class-agnostic episodic training is performed as
discussed in Sect. 9.2.1. It consists of 20 different classes with 50 samples per class
[36] where samples within a class have significant appearance and pose variations.
Following this, three different datasets have been considered as the target set Dtarget :
the iCoseg dataset, the MSRC dataset, and the Internet dataset, over which the model
is fine-tuned. The iCoseg and the MSRC datasets are challenging due to the limited
number of samples per class, hence they are not ideal for supervised learning. The
PMF approach overcomes this small sample problem by using a few-shot learning
method for training.

As in previous chapters, we use precision (P) and Jaccard index (J ) as metrics


for evaluating different methods. Further, we experiment over the Internet dataset
with a variable number of co-segmentable images along with outliers.

9.4.1 PMF Implementation Details

To build the encoder part of the PMF architecture, the pre-trained ResNet-50 is
used. For the rest of the network, the strategy of Glorot et al. [41] has been adopted
for initializing the weights. For the optimization, stochastic gradient descent is used
with a learning rate of $10^{-5}$ and momentum of 0.9 for all datasets. Each
input image and the corresponding mask are resized to 224 × 224 pixels. Further,
data augmentation is performed by random rotation and horizontal flipping of the
images to increase the number of training samples. For all datasets, the guide set G
and co-seg set C are randomly created such that there are no common images, and
the episodic training scheme described in Sect. 9.2.1 is used.
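The training setup quoted above can be written down as the following sketch; the rotation range and the placeholder network are assumptions, while the optimizer settings and input size follow the text.

```python
import torch
import torch.nn as nn
from torchvision import transforms

model = nn.Conv2d(3, 1, kernel_size=1)  # placeholder for the full PMF network

def init_glorot(m):
    # Glorot (Xavier) initialization for the layers that are not pre-trained.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_glorot)

# SGD with learning rate 1e-5 and momentum 0.9, as stated in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

# Images (and, with the same geometric transforms, their masks) are resized to
# 224 x 224 and augmented by rotation and horizontal flipping.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=15),  # rotation range assumed
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```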

9.4.2 Performance Analysis

As mentioned earlier, with the PASCAL-VOC dataset as Dbase , we consider the


iCoseg [8] dataset as a Dtarget set because the number of labeled samples present in
it is small. It has 38 classes with 643 images, with some classes having less than 5
samples. Since this dataset is very small, and furthermore to examine the few-shot learning-based PMF method, it is split into training and testing sets in the ratio of 1:1, and as
a result the guide set to the co-seg set ratio is also 1:1. We compare performance of
different methods in Table 9.1. The methods DOC [46, 72, 100], CSA [20], CAT [50,
126] are, by design, not equipped to handle the scenario of the small number of
labeled samples in the iCoseg dataset, whereas the PMF method’s few-shot learning
scheme can fine-tune the model over the small set of available samples without any
overfitting, which inherently boosts its performance. The method in [67] created
additional annotated data to tackle the small sample size problem, which essentially
requires extra human supervision. Visual results in Fig. 9.5 show that PMF performs
well even for the most difficult class (panda).
Next, we analyze the use of the MSRC [131] dataset as Dtarget . It consists of the
following classes: cow, plane, car, sheep, cat, dog and bird, and each class has 10
images. Hence, the aforementioned 7 classes are removed from Dbase (PASCAL-
VOC) to preserve the few-shot setting in the experiments of PMF. The training and
testing split is set to 2:3. The quantitative and visual results are shown in Table 9.2
and Fig. 9.6. It may be noted that the methods DOC, CSA and PMS (the method
described in Chap. 8) perform co-segmentation over only image pairs, and use a
train to test split ratio of 3:2.

Fig. 9.5 Co-segmentation results obtained using the PMF method on the iCoseg dataset. (Top row) Images in a co-seg set and corresponding common objects
(statue) obtained, shown pairwise. (Middle row) Images in a co-seg set and corresponding common objects (panda) obtained, shown pairwise. (Bottom row,
left) Guide set for the statue class. (Bottom row, right) Guide set for the panda class that includes a negative sample. Figure courtesy: [5]

Fig. 9.6 Co-segmentation results obtained using the PMF method on the MSRC dataset. (Top row) Images in a co-seg set and corresponding common objects
(dog) obtained, shown pairwise. (Middle row) Images in a co-seg set and corresponding common objects (cat) obtained, shown pairwise. (Bottom row, left)
Guide set for the dog class that includes a negative sample. (Bottom row, right) Guide set for the cat class. Figure courtesy: [5]

Fig. 9.7 Co-segmentation results obtained using the PMF method on the Internet dataset. (Top row) Images in a co-seg set and corresponding common objects
(horse) obtained, shown pairwise. The co-seg set also contains one outlier image. (Middle row) Images in a co-seg set and corresponding common object
(airplane) obtained, shown pairwise. The co-seg set also contains two outlier images. (Bottom row, left) Guide set for the horse class that includes a negative
sample. (Bottom row, right) Guide set for the airplane class that includes a negative sample. The co-segmentation results show that the model is robust to outliers
in the co-seg set. Figure courtesy: [5]

Table 9.1 Comparison of precision (P) and Jaccard index (J ) of different methods evaluated using
the iCoseg dataset as the target set
Method Precision (P) Jaccard index (J )
DOC [72] – 0.84
[100] – 0.73
[46] 94.4 0.78
CSA [20] – 0.87
CAT [50] 96.5 0.77
[126] 90.8 0.72
[67] 97.9 0.89
PMF 99.1 0.94

Table 9.2 Comparison of precision (P) and Jaccard index (J ) of different methods evaluated using
the MSRC dataset as the target set
Method Precision P Jaccard index J
CMP [36] 92.0 0.77
DOC [72] 94.4 0.80
CSA [20] 95.3 0.77
PMS 96.3 0.85
PMF 98.7 0.88

Finally, we consider the Internet [105] dataset as Dtarget , which has three classes:
car, aeroplane and horse with 100 samples per class. Though the number of classes is
small, this dataset has high intra-class variation and is relatively large. But to examine
the performance in the few-shot setting, it is split in 1:9 ratio into training and testing
sets. As the PASCAL-VOC dataset is considered as Dbase , the above three classes
are removed from it. In the experiments, the cardinality of the co-seg set is varied
(randomly selected 40, 60, or 80 images from the Internet dataset). Similarly, the
number of outliers is also varied from 10 to 50% of the total samples of the set in
steps of 10%. We report the average accuracy computed over all of these sets for the
PMF method as well as accuracies of other methods in Table 9.3. The results indicate
that the PMF method can handle a large number of input images and also a large number
of outliers. The visual results are shown in Fig. 9.7.

9.4.3 Ablation Study

Encoder: The task of image co-segmentation can be divided into two sub-tasks in
cascade. The first task is to identify similar objects without exploiting any semantic
information, or more formally, clustering similar objects together. The second task

Fig. 9.8 tSNE plots of the embeddings obtained, for five classes, using a ResNet-50 encoder (left)
and the PMF encoder (right) described in this chapter. It can be observed that the PMF encoder,
when compared to ResNet-50, has smaller intra-class distance and larger inter-class distance. Figure
courtesy: [5]

Fig. 9.9 Illustration of the need for the channel attention and spatial attention modules (ChAM
and SpAM). (Top row) A sample guide set and co-seg set used for co-segmentation. a Output
co-segmented images from the model without using ChAM and SpAM. b Output co-segmented
images from the model when only ChAM is used. c Output co-segmented images from the model
when both ChAM and SpAM are used. When both attention modules are used, the model is able
to correctly segment the common foreground (pyramid) from the co-seg set based on the dominant
class present in the guide set. Figure courtesy: [5]

Table 9.3 Comparison of precision (P) and Jaccard index (J ) of different methods evaluated using
the Internet dataset as the target set
Method Precision (P) Jaccard index (J )
DOC [72] 93.3 0.70
[100] 85.0 0.53
CSA [20] – 0.74
CAT [50] 92.2 0.69
PMS 96.1 0.77
[67] 97.1 0.84
PMF 99.0 0.87

Fig. 9.10 Performance comparison using Jaccard index (J ) for varying number of positive samples
in the guide set. It can be seen that the performance when the variational inference and attention are
used is better than the case when they are not used. The performance is good even when the guide
set has only a small number of positive samples. Figure courtesy: [5]

is to jointly segment similar objects, or perform foreground segmentation over each cluster. In this context, to show the role of the directed variational inference
cross-encoder described in this chapter for clustering, the encoder is replaced with
the ResNet50 (sans the final two layers). The resulting embedding spaces obtained
with both encoders are shown using tSNE plots in Fig. 9.8. This experiment is con-
ducted on the MSRC dataset where five classes are randomly chosen to examine
the corresponding class embedding. The PMF encoder, with the help of variational
inference, reduces intra-class distances and increases inter-class distances implicitly,
which in turn boosts the co-segmentation performance significantly.
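The embedding comparison of Fig. 9.8 can be reproduced with a standard tSNE projection of the encoder features; the sketch below uses scikit-learn and random placeholder data in place of the actual embeddings and class labels.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholders: (N, D) encoder embeddings of the sampled images and their
# class indices; in practice these come from the ResNet-50 or PMF encoder.
embeddings = np.random.randn(250, 256)
labels = np.random.randint(0, 5, size=250)

# Project to 2-D with tSNE; a scatter plot of `proj` colored by `labels`
# visualizes intra-class and inter-class distances as in Fig. 9.8.
proj = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(embeddings)
```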
Attention mechanism: The channel attention module (ChAM) and spatial attention
module (SpAM) also play a significant role in obtaining the common object in the input image set correctly. It is evident from Fig. 9.9a, c that ChAM and SpAM help to
identify common objects in a very cluttered background and objects with different
scales. However, the role of the ChAM is more crucial for identifying common objects

Fig. 9.11 Illustration shows the role of the guide set in obtaining the co-segmented objects for a
given co-seg set (Rows 2 and 5). The guide set 1 (with pyramid as the majority class) leads to the
output shown in Row 3 with pyramid as the co-segmented object. The guide set 2 (with horse as the
majority class) leads to the output shown in Row 6 with horse as the co-segmented object. Figure
courtesy: [5]

whereas the SpAM is responsible for better mask production. This can be observed in Fig. 9.9b, where without the SpAM, the generated output masks are incomplete.
Guide set: The common object prototype Og is calculated from the set G by feature
averaging. The method of determining Og is similar to noise cancellation where the
motivation is to reduce the impact of outliers and to increase the influence of the
positive samples (samples containing the common object). We experiment on the
iCoseg dataset by varying the number of positive samples in the guide set to be 2, 4,
6, 8. The size of the guide set is fixed at 8. The performance of the PMF method with
and without variational inference and the attention modules is shown in Fig. 9.10. It
can also be observed that the method is robust against outliers and can work with a
small number of positive guide samples.
Multiple common objects: The fine control of the PMF approach over the fore-
ground extraction process is demonstrated in Fig. 9.11. For a given co-seg set with
multiple potential common foregrounds, i.e., pyramid and horse, the network can be
guided to perform common foreground extraction on the co-seg set for each of these
foregrounds just by varying the composition of the guide set. We can also relate this
result to the multiple common object segmentation discussed in Sect. 5.5.2. There,
the seeds for each common object class are obtained from the clusters with large
compactness factor, and the number of different classes present in the image needs
to be provided by the user. However, here guidance comes from the guide set, that
is, the majority class in the guide set is responsible for detecting common objects of
that class from the co-seg set.
In this chapter, we have described a framework to perform multiple image co-
segmentation, which is capable of overcoming the small-sample problem by integrat-
ing few-shot learning and variational inference. The approach can learn a continuous
embedding to extract consistent foreground from multiple images of a given set. Fur-
ther, it is capable of performing consistently, even in the presence of a large number
of outlier samples in the co-seg set.

Acknowledgements The contributions of S Divakar Bhatt are gratefully acknowledged.


Chapter 10
Conclusions

The objective of this monograph has been to detect common objects, i.e., objects with
similar features present in a set of images. To achieve this goal, we have discussed
three unsupervised and three supervised methods. In the graph-based approaches,
each image is first segmented into superpixels, and a region adjacency graph is con-
structed whose nodes represent the image superpixels. This graph representation of
images allows us to exploit the property that an object is typically a set of contiguous
superpixels, and the neighborhood relationship among superpixels is embedded in
the corresponding region adjacency graph. Since the common objects across differ-
ent images contain superpixels with similar features, graph matching techniques have
been used for finding inter-graph correspondences among nodes in the graphs.
In the introduction of this monograph, we had briefly introduced the concept of
image co-saliency. It utilizes the fact that the salient objects in an image capture
human attention. But, we found that the common object in an image set may not be
salient in all the images. This is due to the fact that interesting patches occur rarely
in an image set. Hence, it is difficult to find many image sets with salient common
objects in practice. So, we have restricted this monograph to image co-segmentation.
However for completeness, here we compare image co-segmentation and co-saliency
in detail using several example images.
In the following example, we try to find the common object with similar features
in an image set and saliency can be used as a feature. Figure 10.1 shows a set of four
images with the ‘cow’ being the common object. Since ‘cow’ is the salient object in
every image, we can successfully co-segment the image set using saliency. But most
image sets do not show these characteristics. Next, we show two examples to explain
this. In Fig. 10.2, the common object ‘dog’ is salient in Image 1 and Image 2. But
in Image 3, people are salient. Next, the image set in Fig. 10.3 includes an outlier
image that contains an object which is different from the common object. Here, all
the objects (the common object ‘kite’ and the ‘panda’ object in the outlier image) are
salient. Thus, the image set to be co-segmented may contain (i) common objects that
are not salient in all the images (see Fig. 10.2) and (ii) salient objects in the outlier

image (see Fig. 10.3). Hence, co-segmentation of these image sets using saliency
results in false negatives (the algorithm may miss out on the common object in some
of the images) and false positives (objects in the outlier images may get incorrectly
detected as the common objects).
In the literature, co-saliency methods [16, 19, 21, 39, 69, 73, 77, 124] have been
shown to detect common, salient objects by combining (i) individual image saliency
outputs and (ii) pixel or superpixel feature distances among the images. Objects with
high saliency value may not necessarily have common features while considering a
set of images. Hence, these saliency guided methods do not always detect similar
objects across images correctly. In Fig. 10.4, we show co-segmentation of two images
without any common object present in them. Co-segmentation without using saliency
yields correct result (Fig. 10.4c) as it does not detect any meaningful common object
(although small dark segments have been detected due to their feature similarity).
But if saliency (Fig. 10.4b) is used, ‘red flower’ and ‘black bird’ are wrongly co-
segmented (Fig. 10.4d) since they are highly salient in their respective images.
It has been observed that the salient patches of an image have the least probability
of being sampled from a corpus of similar images [116]. Hence, we are unlikely
to obtain a large volume of data with image sets containing many salient common
objects. Image segmentation using co-segmentation is, in principle, different from
object segmentation using co-saliency as the segmented common object need not be
the salient object in both images. In this monograph, we described co-segmentation
methods that are independent of saliency or any prior knowledge or preprocess-
ing. Notwithstanding the above-mentioned disadvantages of using saliency in co-
segmentation, it may be noted that if the common object present in the image set is
indeed salient in the respective images, saliency will certainly aid in co-segmentation.
Then the problem becomes co-saliency. Next, we explain this using four examples
of synthetic image pairs.
The image pair in Fig. 10.5a contains two common objects (‘maroon’ and ‘green’)
and both objects have high saliency (Fig. 10.5b) in respective images. Hence, both
the common objects can be co-segmented (i) using similarity between saliency val-
ues alone (Fig. 10.5c), (ii) using only feature similarities without using saliency at
all (Fig. 10.5d) and (iii) using both feature similarity and saliency similarity together
(Fig. 10.5e). In Fig. 10.6, the image pair contains only one salient common object
(‘maroon’) and two dissimilar salient objects (‘green’ and ‘blue’). (i) The dissimilar
objects are wrongly detected as common objects if only saliency similarity is con-
sidered (Fig. 10.6c), (ii) but they are correctly discarded if only feature similarity is
considered (Fig. 10.6d). (iii) If both saliency and feature similarities are used, there
is a possibility of false positives (Fig. 10.6e). Here, the dissimilar objects may be
wrongly co-segmented if the saliency similarity far outweighs the feature similarity.
For the image pair in Fig. 10.7, we obtain the same result although the common object
(‘dark yellow’) is less salient because only the similarity between saliency values
has been used for the results in Fig. 10.7c,e instead of directly using saliency values
(saliency similarity is also considered in Fig. 10.5c,e, Fig. 10.6c,e, Fig. 10.8c,e). In
Fig. 10.8, we consider the dissimilar objects to be highly salient, but the common
object in one image to be less salient than the other (Fig. 10.8b). This reduces saliency

Fig. 10.1 Illustration of saliency detection on an image set (Column 1) with the common foreground
(‘cow’, shown in Column 2) being salient in all the images. Image courtesy: Source images from
the MSRC dataset [105]

Fig. 10.2 Illustration of saliency detection on an image set (shown in top row) where the common
foreground (‘dog’, shown in bottom row) is not salient in all the images. Image courtesy: Source
images from the MSRC dataset [105]

similarity between the common object present in the two images resulting in false
negative when only saliency similarity is used (Fig. 10.8c). But, the common object
is correctly co-segmented by using feature similarity alone (Fig. 10.8d). With these
observations, we solved the co-segmentation problem without using saliency.
In Chap. 4, we have described a method for co-segmenting two images. Images
are successively resized and oversegmented into multiple levels, and graphs are
obtained. Given graph representations of the image pair in the coarsest level, we find
the maximum common subgraph (MCS). As MCS computation is an NP-complete
problem, an approximate method using minimum vertex cover algorithm has been
used. Since the common object in both the images may have different sizes, the MCS
represents it partially. Then the nodes in the resulting subgraphs are mapped to graphs
in the finer segmentation levels. Next using them as seeds, these subgraphs are grown
in order to obtain the complete objects. Instead of individually growing them, region
co-growing (RCG) is performed where feature similarity among nodes (superpixels)
is computed across images as well as within images. Since the algorithm has two
components (MCS and RCG), progressive co-segmentation is possible which in turn
results in fast computation. We have shown that this method can be extended for
co-segmenting more than two images. We can co-segment every (non-overlapping)
pair of images and use the outputs for a second round of co-segmentation and so
on. For a set of N images, O (N ) MCS matching steps are required to obtain the

Fig. 10.3 Illustration of saliency detection on an image set (shown in Column 1) that includes an
outlier image (last image) where both the common foreground (‘kite’, shown in Column 2) and the
object (‘panda’) in the outlier image are salient. Image courtesy: Source images from the iCoseg
dataset [8]


Fig. 10.4 Illustration of co-segmentation using saliency. a Input images and b corresponding
saliency outputs. Co-segmentation result c without and d with saliency. Image courtesy: Source
images from the MSRA dataset [26]


Fig. 10.5 Illustration of co-segmentation when saliency is redundant. a Input image pair and b
saliency output. Co-segmentation c using saliency alone, d using feature similarity and e using both
saliency and feature similarity correctly co-segments both common objects

final output. If at least one outlier image (that does not contain the common object)
is present in the image set, the corresponding MCS involving that image will result
in an empty set, and this result propagates to the final level co-segmentation which
will also yield an empty set. Hence, this extension of two-image co-segmentation to
multi-image co-segmentation will fail unless the common object is present in all the
images.
Next in Chap. 5, a multi-image co-segmentation algorithm has been demonstrated
to solve the problem associated with the presence of outlier images. First, the image
superpixels are clustered based on features, and seed superpixels are identified from
the spatially most compact cluster. In Chap. 4, we could co-grow seed superpixels
since we had two images. In the case of more than two images (say N ), the number
of possible matches is very high ($O(N\,2^{N-1})$). Hence, we need to combine all the


Fig. 10.6 Illustration of co-segmentation when high saliency of dissimilar objects introduces false
positives. a Input image pair and b saliency output. Co-segmentation using c saliency alone and e
both saliency and feature similarity introduces false positives by wrongly detecting dissimilar objects
as common (‘?’ indicates the object may or may not be detected), whereas d co-segmentation using
feature similarity alone correctly detects only the common object


Fig. 10.7 Illustration of co-segmentation when saliency introduces false positives. a Input image
pair and b saliency output. Co-segmentation using c saliency alone and e both saliency and feature
similarity introduce false positives, whereas d co-segmentation using feature similarity alone cor-
rectly detects only the common object even though the common object is less salient in both the
images

seed graphs based on feature similarity and neighborhood relationships of nodes and
build a combined graph which is called latent class graph (LCG). Region growing
has been performed on each of the seed graphs independently by using this LCG as
a reference graph. Thus, we have achieved consistent matching among superpixels
within the common object across images and reduced the number of required graph
matching steps to O(N).
In Chap. 6, we have discussed the formulation of co-segmentation as a classifica-
tion problem where image superpixels are labeled as either the common foreground
or the background. Since we usually get a variety of different background regions in


Fig. 10.8 Illustration of co-segmentation when saliency difference of the common object introduces
false negatives. a Input image pair and b saliency output. Co-segmentation using c saliency alone
and e both saliency and feature similarity introduce false positives as well as false negatives since
the common object in one image is less salient than in the other image, whereas d co-segmentation
using feature similarity alone correctly detects the common object

a set of images, more than one background class have been used. The training super-
pixels (seeds) have been obtained in a completely unsupervised manner, and a mode
detection method in multidimensional feature space has been used to find the seeds.
Optimal discriminants have been learned using a modified LDA in order to compute
discriminative features by projecting the input features to a space that increases sep-
aration between the common foreground class and each of the background classes.
Then using the projected features, a spatially constrained label propagation algorithm
assigns labels to the unlabeled superpixels in an iterative manner in order to obtain
the complete objects while ensuring their cohesiveness.
Images acquired in an uncontrolled environment present a number of challenges
for segmenting the common foreground from them, including differences in the
appearance and pose of common foregrounds, as well as foregrounds that are strik-
ingly similar to the background. The approach described in Chap. 6 addresses these
issues as an unsupervised foreground–background classification problem, in which
superpixels belonging to the same foreground are detected using their corresponding
handcrafted features. The method’s effectiveness is heavily reliant on the superpixels’
computed features, and manually obtaining the appropriate one may be extremely
difficult depending on the degree of variation of the common foreground across the
input images. In Chap. 7, we discussed an end-to-end foreground–background clas-
sification framework where features of each superpixel is computed automatically
using a graph convolution neural network. In Chap. 8, we discussed a CNN-based
architecture for solving image co-segmentation. Based on a conditional siamese
encoder–decoder architecture, combined with a siamese metric learning and a deci-
sion network, the model demonstrates good generalization performance on seg-
menting objects of the same classes across different datasets, and robustness to
outlier images. In Chap. 9, we described a framework to perform multiple image

co-segmentation, which is capable of overcoming the small-sample problem by inte-


grating few-shot learning and variational inference. We have shown that this frame-
work is capable of learning a continuous embedding to extract consistent foreground
from multiple images of a given set. The approach is capable of performing con-
sistently: (i) over small datasets, and (ii) even in the presence of a large number of
outlier samples in the co-seg set.
To demonstrate the robustness and superior performance of the discussed co-
segmentation methods, we have experimented on standard datasets: the image pair
dataset, the MSRC dataset, the iCoseg dataset, the Weizmann horse dataset, the
flower dataset and our own outlier contaminated dataset by mixing images of different
classes. They show better or comparable results to other unsupervised methods in
literature in terms of both accuracy and computation time.

10.1 Future Work

This monograph focused on foreground co-segmentation. It is to be noted that back-


ground co-segmentation is also worth studying as it has direct application in annota-
tion of semantic segments, which include both foreground and background in images.
Assuming we have training data in the form of labeled superpixels of different back-
ground classes, the work in Chap. 6 can be extended to background co-segmentation
with slight modifications. In the label propagation stage of Chap. 6, we have used
spatial constraints in individual images. But, we do not have constraints on spatial
properties of the common object across images. Hence, in this regard, this can be
extended as a future work by incorporating a shape similarity measure among the
partial objects obtained after every iteration of label propagation. The multidimen-
sional mode detection method of Chap. 6 aids in discriminative feature computation
if there is only one type of common object present in the image set. It should be noted
that mode detection in a high-dimensional setting is a challenging research problem
to be considered, even in a generic setting. If there are more than one common object
classes, as in the case of multiple co-segmentation discussed in Sect. 5.5.2, we are
required to compute multiple modes and consider multiple foreground classes instead
of one. Hence, study of multiple mode detection-based co-segmentation can also be
considered as a challenging future work.
Regarding the machine learning-based approaches, having sufficient labeled
training data for co-segmentation can be challenging. Hence, approaches that use
less labeled data as in Chap. 9, such as few-shot learning can be further explored.
Specifically the approach considered in Chap. 9 is class agnostic. But considering
class-aware methods for fine-grained co-segmentation is an area for future research.
Along similar direction, incremental learning for co-segmentation will also be useful
in certain settings and can be considered.
The co-segmentation problem can be extended to perform self co-segmentation, i.e., segmenting similar objects or classes within an image. One approach to this
is by considering this problem as self-similarity of subgraphs in an image. Doing

this in a learning-based setup may need a different approach. Another extension


of the image co-segmentation could be to do co-segmentation across videos. Here,
similar segments of videos across different videos can be extracted. This will have
the challenge of spatial and temporal similarity that will need a different approach.
References

1. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels com-
pared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11),
2274–2282 (2012)
2. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Trans. Graph.
26(3) (2007)
3. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder-decoder
architecture for image segmentation. IEEE Trans. Pattern Anal. Machine Intell. 39(12), 2481–
2495 (2017)
4. Baldi, P., Sadowski, P.J.: Understanding dropout. In: Advances in Neural Information Pro-
cessing Systems, vol. 26, pp. 2814–2822 (2013)
5. Banerjee, S., Bhat, S.D., Chaudhuri, S., Velmurugan, R.: Directed variational cross-encoder
network for few-shot multi-image co-segmentation. In: Proceedings of ICPR, pp. 8431–8438
(2021)
6. Banerjee, S., Hati, A., Chaudhuri, S., Velmurugan, R.: Image co-segmentation using graph
convolution neural network. In: Proceedings of Indian Conference on Computer Vision,
Graphics and Image Processing (ICVGIP), pp. 57:1–57:9 (2018)
7. Banerjee, S., Hati, A., Chaudhuri, S., Velmurugan, R.: Cosegnet: image co-segmentation using
a conditional siamese convolutional network. In: Proceedings of IJCAI, pp. 673–679 (2019)
8. Batra, D., Kowdle, A., Parikh, D., Luo, J., Chen, T.: iCoseg: interactive co-segmentation with
intelligent scribble guidance. In: Proceedings of CVPR, pp. 3169–3176 (2010)
9. Bickel, D.R., Frühwirth, R.: On a fast, robust estimator of the mode: comparisons to other
robust estimators with applications. Comput. Stat. Data Anal. 50(12), 3500–3530 (2006)
10. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
11. Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians.
J. Am. Stat. Assoc. 112(518), 859–877 (2017)
12. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph mincuts, pp.
19–26 (2001)
13. Borenstein, E., Ullman, S.: Combined top-down/bottom-up segmentation. IEEE Trans. Pattern
Anal. Mach. Intell. 30(12), 2109–2125 (2008)
14. Borji, A., Cheng, M.M., Jiang, H., Li, J.: Salient object detection: a benchmark. IEEE Trans.
Image Process. 24(12), 5706–5722 (2015)
15. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation
of objects in ND images. In: Proceedings of ICCV, vol. 1, pp. 105–112 (2001)


16. Cao, X., Tao, Z., Zhang, B., Fu, H., Feng, W.: Self-adaptively weighted co-saliency detection
via rank constraint. IEEE Trans. Image Process. 23(9), 4175–4186 (2014)
17. Chandran, S., Kiran, N.: Image retrieval with embedded region relationships. In: Proceedings
of ACM Symposium on Applied Computing, pp. 760–764 (2003)
18. Chang, H.S., Wang, Y.C.F.: Optimizing the decomposition for multiple foreground coseg-
mentation. Elsevier Comput. Vis. Image Understand. 141, 18–27 (2015)
19. Chang, K.Y., Liu, T.L., Lai, S.H.: From co-saliency to co-segmentation: an efficient and fully
unsupervised energy minimization model. In: Proceedings of CVPR, pp. 2129–2136 (2011)
20. Chen, H., Huang, Y., Nakayama, H.: Semantic aware attention based deep object co-
segmentation. In: Proceedings of ACCV, pp. 435–450 (2018)
21. Chen, H.T.: Preattentive co-saliency detection. In: Proceedings of ICIP, pp. 1117–1120 (2010)
22. Chen, M., Velasco-Forero, S., Tsang, I., Cham, T.J.: Objects co-segmentation: propagated
from simpler images. In: Proceedings of ICASSP, pp. 1682–1686 (2015)
23. Chen, T., Cheng, M.M., Tan, P., Shamir, A., Hu, S.M.: Sketch2photo: internet image montage.
ACM Trans. Graph. 28(5), 124 (2009)
24. Chen, X., Shrivastava, A., Gupta, A.: Enriching visual knowledge bases via object discovery
and segmentation. In: Proceedings of CVPR, pp. 2035–2042 (2014)
25. Chen, Y.C., Lin, Y.Y., Yang, M.H., Huang, J.B.: Show, match and segment: joint weakly
supervised learning of semantic matching and object co-segmentation. IEEE Trans. PAMI
43(10), 3632–3647 (2021)
26. Cheng, M.M., Zhang, G.X., Mitra, N., Huang, X., Hu, S.M.: Global contrast based salient
region detection. In: Proceedings of CVPR, pp. 409–416 (2011)
27. Chernoff, H.: Estimation of the mode. Ann. Inst. Stat. Math. 16(1), 31–41 (1964)
28. Colannino, J., Damian, M., Hurtado, F., Langerman, S., Meijer, H., Ramaswami, S., Souvaine,
D., Toussaint, G.: Efficient many-to-many point matching in one dimension. Graph. Comb.
23(1), 169–178 (2007)
29. Collins, M.D., Xu, J., Grady, L., Singh, V.: Random walks based multi-image segmentation:
quasiconvexity results and GPU-based solutions. In: Proceedings of CVPR, pp. 1656–1663
(2012)
30. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE
Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
31. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms, 2nd edn.
McGraw-Hill Higher Education (2001)
32. Ding, Z., Shao, M., Hwang, W., Suh, S., Han, J.J., Choi, C., Fu, Y.: Robust discriminative
metric learning for image representation. IEEE Trans. Circuits Syst. Video Technol. (2019)
33. Dong, X., Shen, J., Shao, L., Yang, M.H.: Interactive cosegmentation using global and local
energy optimization. IEEE Trans. Image Process. 24(11), 3966–3977 (2015)
34. Dornaika, F., El Traboulsi, Y.: Matrix exponential based semi-supervised discriminant embed-
ding for image classification. Pattern Recogn. 61, 92–103 (2017)
35. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual
object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
36. Faktor, A., Irani, M.: Co-segmentation by composition. In: Proceedings of ICCV, pp. 1297–
1304 (2013)
37. Fang, Y., Chen, Z., Lin, W., Lin, C.W.: Saliency detection in the compressed domain for
adaptive image retargeting. IEEE Trans. Image Process. 21(9), 3888–3901 (2012)
38. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Com-
put. Vis. 59(2), 167–181 (2004)
39. Fu, H., Cao, X., Tu, Z.: Cluster-based co-saliency detection. IEEE Trans. Image Process.
22(10), 3766–3778 (2013)
40. Fulkerson, B., Vedaldi, A., Soatto, S.: Class segmentation and object localization with super-
pixel neighborhoods. In: Proceedings of CVPR, pp. 670–677 (2009)
41. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural net-
works. In: Proceedings of International Conference on Artificial Intelligence and Statistics,
pp. 249–256 (2010)

42. Goferman, S., Tal, A., Zelnik-Manor, L.: Puzzle-like collage. In: Computer Graphics Forum,
vol. 29, pp. 459–468. Wiley Online Library (2010)
43. Goferman, S., Tal, A., Zelnik-Manor, L.: Puzzle-like collage. In: Computer Graphics Forum,
vol. 29, pp. 459–468. Wiley Online Library (2010)
44. Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically
consistent regions. In: Proceedings of ICCV, pp. 1–8 (2009)
45. Han, J., Ngan, K.N., Li, M., Zhang, H.J.: Unsupervised extraction of visual attention objects
in color images. IEEE Trans. Circuits Syst. Video Technol. 16(1), 141–145 (2006)
46. Han, J., Quan, R., Zhang, D., Nie, F.: Robust object co-segmentation using background prior.
IEEE Trans. Image Process. 27(4), 1639–1651 (2018)
47. Hati, A., Chaudhuri, S., Velmurugan, R.: Salient object carving. In: Proceedings of ICIP, pp.
1767–1771 (2015)
48. Hati, A., Chaudhuri, S., Velmurugan, R.: Image co-segmentation using maximum common
subgraph matching and region co-growing. In: Proceedings of ECCV, pp. 736–752 (2016)
49. Hochbaum, D.S., Singh, V.: An efficient algorithm for co-segmentation. In: Proceedings of
ICCV, pp. 269–276 (2009)
50. Hsu, K.J., Lin, Y.Y., Chuang, Y.Y.: Co-attention CNNs for unsupervised object co-
segmentation. In: Proceedings of IJCAI, pp. 748–756 (2018)
51. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing
internal covariate shift. In: Proceedings of ICML, pp. 448–456 (2015)
52. Itti, L.: Automatic foveation for video compression using a neurobiological model of visual
attention. IEEE Trans. Image Process. 13(10), 1304–1318 (2004)
53. Szummer, M., Jaakkola, T.: Partially labeled classification with Markov random walks.
In: Advances in Neural Information Processing Systems, vol. 14, pp. 945–952 (2002)
54. Jerripothula, K.R., Cai, J., Yuan, J.: Image co-segmentation via saliency co-fusion. IEEE
Trans. Multimedia 18(9), 1896–1909 (2016)
55. Joachims, T.: Transductive learning via spectral graph partitioning. In: Proceedings of ICML,
pp. 290–297 (2003)
56. Joulin, A., Bach, F., Ponce, J.: Discriminative clustering for image co-segmentation. In: Pro-
ceedings of CVPR, pp. 1943–1950 (2010)
57. Joulin, A., Bach, F., Ponce, J.: Multi-class cosegmentation. In: Proceedings of CVPR, pp.
542–549 (2012)
58. Kanan, C., Cottrell, G.: Robust classification of objects, faces, and flowers using natural image
statistics. In: Proceedings of CVPR, pp. 2472–2479 (2010)
59. Kim, G., Xing, E.P.: On multiple foreground cosegmentation. In: Proceedings of CVPR, pp.
837–844 (2012)
60. Kim, G., Xing, E.P., Fei-Fei, L., Kanade, T.: Distributed cosegmentation via submodular
optimization on anisotropic diffusion. In: Proceedings of ICCV, pp. 169–176 (2011)
61. Koch, I.: Enumerating all connected maximal common subgraphs in two graphs. Theor. Com-
put. Sci. 250(1), 1–30 (2001)
62. Lai, Z., Xu, Y., Jin, Z., Zhang, D.: Human gait recognition via sparse discriminant projection
learning. IEEE Trans. Circuits Syst. Video Technol. 24(10), 1651–1662 (2014)
63. Lattari, L., Montenegro, A., Vasconcelos, C.: Unsupervised cosegmentation based on global
clustering and saliency. In: Proceedings of ICIP, pp. 2890–2894 (2015)
64. Lee, C., Jang, W.D., Sim, J.Y., Kim, C.S.: Multiple random walkers and their application to
image cosegmentation. In: Proceedings of CVPR, pp. 3837–3845 (2015)
65. Levin, A., Lischinski, D., Weiss, Y.: A closed-form solution to natural image matting. IEEE
Trans. Pattern Anal. Mach. Intell. 30(2), 228–242 (2008)
66. Levinshtein, A., Stere, A., Kutulakos, K.N., Fleet, D.J., Dickinson, S.J., Siddiqi, K.: Turbopix-
els: fast superpixels using geometric flows. IEEE Trans. Pattern Anal. Mach. Intell. 31(12),
2290–2297 (2009)
67. Li, B., Sun, Z., Li, Q., Wu, Y., Hu, A.: Group-wise deep object co-segmentation with co-
attention recurrent neural network. In: Proceedings of ICCV, pp. 8519–8528 (2019)
68. Li, H., Meng, F., Luo, B., Zhu, S.: Repairing bad co-segmentation using its quality evaluation
and segment propagation. IEEE Trans. Image Process. 23(8), 3545–3559 (2014)
69. Li, H., Ngan, K.N.: A co-saliency model of image pairs. IEEE Trans. Image Process. 20(12),
3365–3375 (2011)
70. Li, J., Levine, M.D., An, X., Xu, X., He, H.: Visual saliency based on scale-space analysis in
the frequency domain. IEEE Trans. Pattern Anal. Mach. Intell. 35(4), 996–1010 (2013)
71. Li, K., Zhang, J., Tao, W.: Unsupervised co-segmentation for indefinite number of common
foreground objects. IEEE Trans. Image Process. 25(4), 1898–1909 (2016)
72. Li, W., Jafari, O.H., Rother, C.: Deep object co-segmentation. In: Proceedings of ACCV, pp.
638–653 (2018)
73. Li, Y., Fu, K., Liu, Z., Yang, J.: Efficient saliency-model-guided visual co-saliency detection.
IEEE Sig. Process. Lett. 22(5), 588–592 (2015)
74. Li, Y., Liu, L., Shen, C., van den Hengel, A.: Image co-localization by mimicking a good
detector’s confidence score distribution. In: Proceedings of ECCV, pp. 19–34 (2016)
75. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. Graph. 23(3), 303–308
(2004)
76. Liu, H., Xie, X., Tang, X., Li, Z.W., Ma, W.Y.: Effective browsing of web image search
results. In: Proceedings of ACM SIGMM International Workshop on Multimedia Information
Retrieval, pp. 84–90 (2004)
77. Liu, Z., Zou, W., Li, L., Shen, L., Le Meur, O.: Co-saliency detection based on hierarchical
segmentation. IEEE Sig. Process. Lett. 21(1), 88–92 (2014)
78. Ma, J., Li, S., Qin, H., Hao, A.: Unsupervised multi-class co-segmentation via joint-cut over
l1-manifold hyper-graph of discriminative image regions. IEEE Trans. Image Process. 26(3),
1216–1230 (2017)
79. Ma, Y.F., Hua, X.S., Lu, L., Zhang, H.J.: A generic framework of user attention model and
its application in video summarization. IEEE Trans. Multimedia 7(5), 907–919 (2005)
80. Madry, A.: Navigating central path with electrical flows: from flows to matchings, and back.
In: IEEE Annual Symposium on FOCS, pp. 253–262 (2013)
81. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge
University Press, New York, NY, USA (2008)
82. Marchesotti, L., Cifarelli, C., Csurka, G.: A framework for visual saliency detection with
applications to image thumbnailing. In: Proceedings of ICCV, pp. 2232–2239 (2009)
83. Mei, T., Hua, X.S., Li, S.: Contextual in-image advertising. In: Proceedings of ACM Multi-
media, pp. 439–448 (2008)
84. Meng, F., Cai, J., Li, H.: Cosegmentation of multiple image groups. Comput. Vis. Image
Underst. 146, 67–76 (2016)
85. Meng, F., Li, H., Liu, G., Ngan, K.N.: Object co-segmentation based on shortest path algorithm
and saliency model. IEEE Trans. Multimedia 14(5), 1429–1441 (2012)
86. Meng, F., Li, H., Ngan, K.N., Zeng, L., Wu, Q.: Feature adaptive co-segmentation by com-
plexity awareness. IEEE Trans. Image Process. 22(12), 4809–4824 (2013)
87. Moore, A.P., Prince, S.J., Warrell, J., Mohammed, U., Jones, G.: Superpixel lattices. In:
Proceedings of CVPR, pp. 1–8. IEEE (2008)
88. Mukherjee, L., Singh, V., Dyer, C.R.: Half-integrality based algorithms for cosegmentation
of images. In: Proceedings of CVPR, pp. 2028–2035 (2009)
89. Mukherjee, L., Singh, V., Peng, J.: Scale invariant cosegmentation for image groups. In:
Proceedings of CVPR, pp. 1881–1888 (2011)
90. Mukherjee, P., Lall, B., Shah, A.: Saliency map based improved segmentation. In: Proceedings
of ICIP, pp. 1290–1294 (2015)
91. Nguyen, K., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In:
Proceedings of ICCV, pp. 622–631 (2019)
92. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: Proceedings
of CVPR, vol. 2, pp. 1447–1454 (2006)
93. Oliva, A., Torralba, A., Castelhano, M.S., Henderson, J.M.: Top-down control of visual atten-
tion in object detection. In: Proceedings of ICIP, vol. 1, pp. I–253 (2003)
94. Pal, R., Mitra, P., Mukherjee, J.: Visual saliency-based theme and aspect ratio preserving
image cropping for small displays. In: Proceedings of National Conference on Computer
Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 89–92 (2008)
95. Pang, Y., Yuan, Y., Li, X.: Gabor-based region covariance matrices for face recognition. IEEE
Trans. Circuits Syst. Video Technol. 18(7), 989–993 (2008)
96. Patel, D., Raman, S.: Saliency and memorability driven retargeting. In: Proceedings of IEEE
International Conference SPCOM, pp. 1–5 (2016)
97. Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A.: Saliency filters: contrast based filtering
for salient region detection. In: Proceedings of CVPR, pp. 733–740 (2012)
98. Presti, L.L., La Cascia, M.: 3D skeleton-based human action classification: a survey. Pattern
Recogn. 53, 130–147 (2016)
99. Quan, R., Han, J., Zhang, D., Nie, F.: Object co-segmentation via graph optimized-flexible
manifold ranking. In: Proceedings of CVPR, pp. 687–695 (2016)
100. Ren, Y., Jiao, L., Yang, S., Wang, S.: Mutual learning between saliency and similarity: image
cosegmentation via tree structured sparsity and tree graph matching. IEEE Trans. Image
Process. 27(9), 4690–4704 (2018)
101. Rosenholtz, R., Dorai, A., Freeman, R.: Do predictions of visual perception aid design? ACM
Trans. Appl. Percept. 8(2), 12 (2011)
102. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: interactive foreground extraction using
iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)
103. Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram
matching—incorporating a global constraint into MRFs. In: Proceedings of CVPR, vol. 1,
pp. 993–1000 (2006)
104. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection, vol. 589. Wiley
(2005)
105. Rubinstein, M., Joulin, A., Kopf, J., Liu, C.: Unsupervised joint object discovery and seg-
mentation in internet images. In: Proceedings of CVPR, pp. 1939–1946 (2013)
106. Rubio, J.C., Serrat, J., López, A., Paragios, N.: Unsupervised co-segmentation through region
matching. In: Proceedings of CVPR, pp. 749–756 (2012)
107. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive
content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 8(5), 644–655 (1998)
108. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. J.
Comput. Vis. 115(3), 211–252 (2015)
109. Rutishauser, U., Walther, D., Koch, C., Perona, P.: Is bottom-up attention useful for object
recognition? In: Proceedings of CVPR, vol. 2, pp. II–II (2004)
110. Sager, T.W.: Estimation of a multivariate mode. Ann. Stat. 802–812 (1978)
111. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition
and clustering. In: Proceedings of CVPR, pp. 815–823 (2015)
112. Sharma, G., Jurie, F., Schmid, C.: Discriminative spatial saliency for image classification. In:
Proceedings of CVPR, pp. 3506–3513 (2012)
113. Shen, X., Wu, Y.: A unified approach to salient object detection via low rank matrix recovery.
In: Proceedings of CVPR, pp. 853–860 (2012)
114. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 22(8), 888–905 (2000)
115. Siam, M., Oreshkin, B.N., Jagersand, M.: AMP: adaptive masked proxies for few-shot seg-
mentation. In: Proceedings of ICCV, pp. 5249–5258 (2019)
116. Siva, P., Russell, C., Xiang, T., Agapito, L.: Looking beyond the image: unsupervised learning
for object saliency and detection. In: Proceedings of CVPR, pp. 3238–3245 (2013)
117. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances
in Neural Information Processing Systems, pp. 4077–4087 (2017)
118. Soille, P.: Morphological Image Analysis: Principles and Applications, 2nd edn. Springer-Verlag
New York Inc., Secaucus, NJ, USA (2003)
119. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
120. Srivatsa, R.S., Babu, R.V.: Salient object detection via objectness measure. In: Proceedings
of ICIP, pp. 4481–4485 (2015)
121. Such, F.P., Sah, S., Dominguez, M.A., Pillai, S., Zhang, C., Michael, A., Cahill, N.D., Ptucha,
R.: Robust spatial filtering with graph convolutional neural networks. IEEE J. Sel. Topics Sig.
Process. 11(6), 884–896 (2017)
122. Sun, J., Ling, H.: Scale and object aware image retargeting for thumbnail browsing. In:
Proceedings of ICCV, pp. 1511–1518 (2011)
123. Sun, J., Ponce, J.: Learning dictionary of discriminative part detectors for image categorization
and cosegmentation. Int. J. Comput. Vis. 120(2), 111–133 (2016)
124. Tan, Z., Wan, L., Feng, W., Pun, C.M.: Image co-saliency detection by propagating superpixel
affinities. In: Proceedings of ICASSP, pp. 2114–2118 (2013)
125. Tao, W., Li, K., Sun, K.: Sacoseg: object cosegmentation by shape conformability. IEEE
Trans. Image Process. 24(3), 943–955 (2015)
126. Tsai, C.C., Li, W., Hsu, K.J., Qian, X., Lin, Y.Y.: Image co-saliency detection and co-
segmentation via progressive joint optimization. IEEE Trans. Image Process. 28(1), 56–71
(2019)
127. Van De Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–1596 (2010)
128. Veksler, O., Boykov, Y., Mehrani, P.: Superpixels and supervoxels in an energy optimization
framework. In: Proceedings of ECCV, pp. 211–224. Springer (2010)
129. Venter, J.: On estimation of the mode. Ann. Math. Stat. 1446–1455 (1967)
130. Vicente, S., Kolmogorov, V., Rother, C.: Cosegmentation revisited: models and optimization.
In: Proceedings of ECCV, pp. 465–479 (2010)
131. Vicente, S., Rother, C., Kolmogorov, V.: Object cosegmentation. In: Proceedings of CVPR,
pp. 2217–2224 (2011)
132. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion
simulations. IEEE Trans. Pattern Anal. Mach. Intell. 13(6), 583–598 (1991)
133. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Netw. 19(9), 1395–
1407 (2006)
134. Wang, C., Zhang, H., Yang, L., Cao, X., Xiong, H.: Multiple semantic matching on augmented
n-partite graph for object co-segmentation. IEEE Trans. Image Process. 26(12), 5825–5839
(2017)
135. Wang, F., Huang, Q., Guibas, L.J.: Image co-segmentation via consistent functional maps. In:
Proceedings of ICCV, pp. 849–856 (2013)
136. Wang, F., Huang, Q., Ovsjanikov, M., Guibas, L.J.: Unsupervised multi-class joint image
segmentation. In: Proceedings of CVPR, pp. 3142–3149 (2014)
137. Wang, J., Quan, L., Sun, J., Tang, X., Shum, H.Y.: Picture collage. In: Proceedings of CVPR,
vol. 1, pp. 347–354 (2006)
138. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for
image classification. In: Proceedings of CVPR, pp. 3360–3367 (2010)
139. Wang, L., Hua, G., Xue, J., Gao, Z., Zheng, N.: Joint segmentation and recognition of catego-
rized objects from noisy web image collection. IEEE Trans. Image Process. 23(9), 4070–4086
(2014)
140. Wang, P., Zhang, D., Wang, J., Wu, Z., Hua, X.S., Li, S.: Color filter for image search. In:
Proceedings of ACM Multimedia, pp. 1327–1328 (2012)
141. Wang, S., Lu, J., Gu, X., Du, H., Yang, J.: Semi-supervised linear discriminant analysis for
dimension reduction and classification. Pattern Recogn. 57, 179–189 (2016)
142. Wang, W., Shen, J.: Higher-order image co-segmentation. IEEE Trans. Multimedia 18(6),
1011–1021 (2016)
143. Wang, X., Zheng, W.S., Li, X., Zhang, J.: Cross-scenario transfer person reidentification.
IEEE Trans. Circuits Syst. Video Technol. 26(8), 1447–1460 (2016)
144. Xiao, C., Chaovalitwongse, W.A.: Optimization models for feature selection of decomposed
nearest neighbor. IEEE Trans. Syst. Man Cybern. Syst. 46(2), 177–184 (2016)
145. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object detection for multi-class
segmentation. In: Proceedings of CVPR, pp. 3113–3120 (2010)
146. Yuan, Z., Lu, T., Wu, Y.: Deep-dense conditional random fields for object co-segmentation.
In: Proceedings of IJCAI, pp. 3371–3377 (2017)
147. Zhang, K., Chen, J., Liu, B., Liu, Q.: Deep object co-segmentation via spatial-semantic net-
work modulation. In: Proceedings of AAAI Conference on Artificial Intelligence, vol. 34, pp.
12813–12820 (2020)
148. Zhang, X.Y., Bengio, Y., Liu, C.L.: Online and offline handwritten Chinese character recog-
nition: a comprehensive study and new benchmark. Pattern Recogn. 61, 348–360 (2017)
149. Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification.
In: Proceedings of CVPR, pp. 3586–3593 (2013)
150. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global
consistency. In: Advances in Neural Information Processing Systems, pp. 321–328 (2004)
151. Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection.
In: Proceedings of CVPR, pp. 2814–2821 (2014)
152. Zitnick, C.L., Kang, S.B.: Stereo for image-based rendering using image over-segmentation.
Int. J. Comput. Vis. 75(1), 49–65 (2007)