Authorized licensed use limited to: SUNY AT STONY BROOK. Downloaded on May 26,2022 at 01:59:31 UTC from IEEE Xplore. Restrictions apply.
2474 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 28, NO. 10, OCTOBER 2018
In recent years, many metric learning algorithms have been developed to learn a desired distance metric from the given training samples [28]–[42], under which samples from the same class are as close as possible while samples from different classes are as far apart as possible. Inspired by the core idea of metric learning, we propose a unified metric learning-based framework that jointly learns a discriminative feature representation and a co-salient object detector, which can address the above-mentioned challenges and hence further improve the state-of-the-art performance. Specifically, this is achieved by optimizing a new objective function that explicitly embeds a metric learning regularization term into support vector machine (SVM) training. Here, the metric learning regularization term is used to learn a powerful feature representation that has small intra-COs scatter but large separation between background and COs, and the learnt SVM classifier is used for subsequent co-saliency detection.

To summarize, our main contributions are as follows. First, to the best of our knowledge, we make the earliest effort to introduce metric learning to co-saliency detection. To be specific, we propose a unified framework that learns discriminative features and trains a co-salient object detector simultaneously. By using the metric learning regularization, we transform the input data into a new feature space. In this transformed space, the pixels from co-salient objects are pulled closer together while the different-class pixels (from the foreground class and the background class) are pushed farther apart, so that the input images can be easily classified into co-salient objects and background by the learnt detector. Second, different from most existing co-saliency approaches, which are based on weakly supervised learning strategies with only image-level labels indicating whether an image contains the to-be-detected co-salient objects, in this work we attempt to perform co-saliency detection through a supervised learning scheme in which the accurate locations of the co-salient objects are supposed to be known in advance. By using supervised learning, we can adequately exploit discriminative information from the available training images to learn a feature representation that generalizes well to the various scenarios encountered in practice. Third, by using the proposed method, we obtain state-of-the-art results on two commonly used benchmark datasets, compared with a supervised learning-based baseline and all existing co-saliency detection methods.

This paper is organized as follows. Section II gives a brief review of related work. Section III describes the proposed metric learning-based co-saliency detection method in detail. Section IV presents comprehensive experimental results on two widely used benchmark datasets. Section V concludes the paper and discusses future work.

II. RELATED WORK

A. Co-Saliency Detection

As a newly emerging research topic, co-saliency detection has attracted significant research efforts [1]–[10]. Based on the strategy used for co-saliency detection, the existing methods can be roughly grouped into three main categories: bottom-up methods [1], [5], [43], fusion-based methods [9], [10], and learning-based methods [2], [4], [6], [8].

The bottom-up methods [1], [5], [43] are among the earliest and simplest methods for co-saliency detection; they score each pixel/region in the image group through manually designed co-saliency cues. Generally, bottom-up methods consist of four main components: pre-processing, feature extraction, exploring bottom-up cues, and weighted combination. In brief, in the pre-processing step the input images are first divided into a number of computational units. Afterwards, feature extraction is performed to represent the property of each computational unit based on bottom-up cues. Finally, the results obtained from each bottom-up cue are integrated to generate the co-saliency maps for the input images. Bottom-up methods have developed considerably during the past few years. However, since this kind of approach relies heavily on handcrafted cues, it is often too subjective and hence cannot generalize well to the various scenarios encountered in practice.

More recently, researchers have developed several fusion-based methods for co-saliency detection [9], [10]. Rather than trying to discover informative cues from the image group to represent co-salient objects, the fusion-based methods mainly focus on mining useful knowledge from the predictions of several off-the-shelf saliency or co-saliency algorithms and then fusing them to obtain the final co-saliency maps. The fusion-based methods can often achieve better results because they build on the existing (co-)saliency detection approaches. However, since they rely heavily on those existing methods, when most of the adopted (co-)saliency techniques lose their power, the final results of the fusion-based methods may be hurt significantly.

With the development of machine learning and data mining, the learning-based methods [2], [4], [6], [8] have been attracting more and more research attention. This kind of method usually casts co-saliency detection as a classification problem for each image pixel/region. In such a learning-based framework, most of the knowledge about the co-salient object regions is inferred automatically by the designed learner rather than relying heavily on handcrafted cues as in the other categories of co-saliency detection methods. By taking advantage of machine learning and data mining techniques, the learning-based methods usually achieve promising results. Besides, co-saliency detection is also related to the multi-task learning problem [44] by taking each input image as a task. Our work also belongs to the learning-based methods because of their superiority over the other two kinds of co-saliency detection methods.

B. Metric Learning

In recent years, researchers have proposed many approaches to learn appropriate metrics from the given training data in many visual classification tasks [28]–[41], [45], [46]. In brief, metric learning aims to learn a desired distance metric under which the same-class samples are closer while the different-class samples are as far apart as possible. According to
HAN et al.: UNIFIED METRIC LEARNING-BASED FRAMEWORK FOR CO-SALIENCY DETECTION 2475
Fig. 2. Illustration of the basic framework of our proposed metric learning-based co-saliency detection method. It consists of three main steps: super-pixel
segmentation and feature extraction, joint feature learning and co-saliency detector training, and co-saliency detection. Given an image group, the first step
partitions the input images into a set of super-pixels and extracts their features. The second step jointly learns discriminative feature representations and
trains co-salient object detector simultaneously. Finally, the third step infers the co-saliency of each super-pixel region with the learnt co-saliency detector and
generates co-saliency maps by using a spatial map recovery technique.
the availability of the class labels of training data, metric learning methods can generally be categorized into unsupervised ones [33], semi-supervised ones [34], and supervised ones [28], [30]–[32]. Some of them are based on pair-wise constraints [28], [30], in which the distance metric is supposed to keep the instances in similar constraints close and, at the same time, the instances in dissimilar constraints separated. Besides, there are also methods for learning distance metrics with triplet constraints [32], [35]. In this paper, the pair-wise constraint is also adopted in our co-saliency detection framework. However, different from most of the existing metric learning approaches, the metric learning regularization is not only used to learn discriminative feature representations but also exploited to train an effective co-saliency detector.

The SVM classifier is used for subsequent co-saliency detection. By transforming the input features into a new feature space through metric learning, the super-pixels from co-salient objects are pulled closer while the different-class super-pixels from the foreground class and the background class are pushed farther apart in the transformed feature space, which makes the input images easy to classify into co-salient objects and background by using the learnt SVM classifier. Finally, the third step infers the co-saliency of each super-pixel region with the learnt co-saliency detector and generates co-saliency maps by using a simple spatial map recovery technique [21].

The first step is actually a pre-processing step and its detailed implementation will be described in Section IV. Next we will describe the second and the third steps in detail.
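The link between a learnt linear transformation and a distance metric can be made concrete with a small numerical check. The minimal Python sketch below verifies that the squared Euclidean distance between two samples after projection by a matrix A equals a Mahalanobis-type distance with M = AᵀA in the original space. The matrix, the two vectors, and the helper names (`matvec`, `gram`, `transformed_sq_dist`, `mahalanobis_sq_dist`) are illustrative assumptions and are not taken from the proposed method.

```python
def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def gram(A):
    """Return M = A^T A for a matrix A given as a list of rows."""
    d = len(A[0])
    return [[sum(row[p] * row[q] for row in A) for q in range(d)] for p in range(d)]


def transformed_sq_dist(A, xi, xj):
    """Squared Euclidean distance between A xi and A xj."""
    zi, zj = matvec(A, xi), matvec(A, xj)
    return sum((p - q) ** 2 for p, q in zip(zi, zj))


def mahalanobis_sq_dist(M, xi, xj):
    """Squared distance (xi - xj)^T M (xi - xj)."""
    diff = [p - q for p, q in zip(xi, xj)]
    return sum(dp * mq for dp, mq in zip(diff, matvec(M, diff)))


# Arbitrary illustrative values (assumptions, not taken from the paper).
A = [[1.0, 0.5, 0.0],
     [0.0, 2.0, -1.0]]
x_a = [1.0, 2.0, 3.0]
x_b = [0.0, 1.0, 5.0]

d_projected = transformed_sq_dist(A, x_a, x_b)
d_metric = mahalanobis_sq_dist(gram(A), x_a, x_b)
```

Both quantities agree, which is why learning a transformation matrix and learning a distance metric of this form can be treated as the same problem.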
samples and the super-pixels with ≤0.2 overlap with the ground-truth masks as negatives.

For a training sample $x_i \in \mathbb{R}^d$ from the training set $X$, let $A \in \mathbb{R}^{k \times d}$ be the to-be-learned transformation matrix used for linearly projecting the data from the original feature space to a latent feature space ($k < d$ means dimensionality reduction and vice versa; we set $k = d$ in our implementation). Thus, the new feature representation of $x_i$ can be computed by $x_i' = A x_i$. Apart from requiring that the objective function minimize the classification error on the training set, we also require that the feature representation obtained from the transformation matrix have powerful discrimination capability. Thus, in the new feature space, the super-pixels of co-salient objects should be mapped close to each other while positive-negative super-pixel pairs should be mapped farther apart. To this end, we propose a new objective function to jointly learn the transformation matrix $A$ and the co-salient object detector (i.e., a linear SVM parameterized with $w$ and $b$) by the following formula

$$ \min_{w,b,A} J = J_1(X, Y) + \lambda J_2(X, L) \tag{1} $$

where $\lambda$ is a trade-off parameter that controls the relative importance of these two terms. We set $\lambda = 1$ in our work.

The first term $J_1(X, Y)$ in Eq. (1) is the loss function of the traditional soft-margin SVM classifier. It seeks to minimize the classification error for the given input-target pairs $(X, Y)$ by

$$ J_1(X, Y) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t. } y_i\left(w^T A x_i + b\right) \ge 1 - \xi_i;\ \xi_i \ge 0,\ \forall i = 1,\cdots,N \tag{2} $$

where $\xi_i$ are slack variables and $C$ is the only free parameter in the linear SVM, controlling the trade-off between the slack variable penalty and the maximization of the margin. By using a grid search over the set {0.001, 0.01, 0.1, 0.5, 1, 10, 100}, we empirically set $C = 0.5$ in our implementation.

The second term $J_2(X, L)$ in Eq. (1) is a metric learning regularization term, which is imposed on the transformation matrix $A$ to enforce the learnt feature representation to have small intra-COs scatter but large separation between COs and background. Referring to the work of [28], we define the regularization term as

$$ J_2(X, L) = \frac{1}{2}\sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right) \tag{3} $$

where $L = \{\ell_{ij}\}$ with $i$ and $j$ denoting the selected training samples, $g(z) = \frac{1}{\gamma}\log\left(1 + \exp(\gamma z)\right)$ is the generalized logistic loss function, which is a smoothed approximation of the hinge loss function $h(z) = \max(0, z)$, and $\gamma$ is a sharpness parameter that is set to 1 in our implementation. $\ell_{ij}$ is a label indicator for the paired data $(x_i, x_j)$: if $x_i$ and $x_j$ are both from the co-salient objects, $\ell_{ij} = +1$; otherwise, if only one of $x_i$ and $x_j$ is from the co-salient objects, $\ell_{ij} = -1$. In practice, the number of negative data pairs is much larger than that of positive data pairs. To prevent data imbalance, in our implementation the negative data pairs were randomly selected so that their number equals the number of positive data pairs. $\tau$ is an adaptive threshold that defines the margin between similar pairs and dissimilar pairs. It is selected as follows: assuming that the distances of positive data pairs and of negative data pairs each obey a normal distribution, we first draw the probability density curves of positive data pairs and negative data pairs, respectively, and then obtain the junction $(x_0, y_0)$ of the two curves. Here, $x_0$ is the to-be-selected adaptive threshold.

As can be seen, Eq. (3) enforces the similar pairs from the co-salient objects to be mapped close to each other and the dissimilar pairs to be mapped apart. A small value of this term indicates that the learnt feature representation is discriminative.

By incorporating Eqs. (2) and (3) into Eq. (1), we obtain the following discriminative objective function

$$ \min_{w,b,A} J = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i + \frac{\lambda}{2}\sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right) \quad \text{s.t. } y_i\left(w^T A x_i + b\right) \ge 1 - \xi_i;\ \xi_i \ge 0,\ \forall i = 1,\cdots,N \tag{4} $$

As can be seen from Eq. (4), the proposed discriminative objective function not only minimizes the classification loss, but also enforces the learnt features to be more discriminative for distinguishing co-salient objects from background.

C. Optimization Strategy

To solve this optimization problem of Eq. (4), we present an effective EM-like iterative minimization algorithm by updating

Algorithm 1 Joint Feature Learning and Co-Saliency Detector Training via Metric Learning
Input: the training set $X = [x_1, x_2, \cdots, x_N] \in \mathbb{R}^{d \times N}$ that contains all super-pixels of the training images from a specific image group; the label set $Y = [y_1, y_2, \cdots, y_N] \in \mathbb{R}^N$ of $X$, where $x_i \in \mathbb{R}^d$ is the $i$-th training sample (super-pixel) represented by a $d$-dimensional feature vector and $y_i \in \{+1, -1\}$ indicates the class label of sample $x_i$; the learning rate $\mu$
Output: the parameters of the metric learning transformation matrix and the co-salient object detector, denoted by $A$ and $(w, b)$
1: begin
2:   Obtain the paired data $\{(x_i, x_j, \ell_{ij})\}$
3:   Initialize $A$ and $(w, b)$
4:   while stopping criterion has not been met do
5:     update $(w, b)$ using Eq. (5) with $A$ being fixed
6:     update $A$ using Eq. (13) with $(w, b)$ being fixed
7:   end while
8:   return $A$ and $(w, b)$
9: end begin
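The selection of the adaptive threshold τ described with Eq. (3) above, namely the junction of two fitted normal density curves, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the source does not say which junction is used when the two curves cross twice, so the sketch assumes the one lying between the two means, and the function names (`normal_pdf`, `gaussian_junction`, `adaptive_threshold`) are hypothetical.

```python
import math
from statistics import mean, stdev


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_junction(mu1, s1, mu2, s2):
    """Abscissa x0 where the two normal densities are equal.

    Equating the log-densities gives a quadratic a*x^2 + b*x + c = 0.
    Assumption: if there are two junctions, return the one between the means.
    """
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2) + math.log(s2 / s1)
    if abs(a) < 1e-12:  # (nearly) equal variances: a single junction
        if abs(b) < 1e-12:
            raise ValueError("identical distributions have no unique junction")
        return -c / b
    root = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    candidates = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    inside = [x for x in candidates if lo <= x <= hi]
    if inside:
        return inside[0]
    midpoint = 0.5 * (mu1 + mu2)
    return min(candidates, key=lambda x: abs(x - midpoint))


def adaptive_threshold(pos_dists, neg_dists):
    """Fit one normal density per pair type and return their junction x0."""
    return gaussian_junction(mean(pos_dists), stdev(pos_dists),
                             mean(neg_dists), stdev(neg_dists))
```

With equal spreads the junction reduces to the midpoint of the two means, for example `gaussian_junction(1.0, 0.5, 3.0, 0.5)` gives 2.0.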
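Line 6 of Algorithm 1, the update of A with (w, b) fixed, amounts to a gradient step on the objective of Eq. (4) once the slack variables are written as hinge losses. The following pure-Python sketch evaluates that objective, its gradient with respect to A, and one descent step. It is a hedged illustration rather than the authors' implementation: the toy data, the step size used in the example, and all function names are assumptions, and the hinge term uses the usual subgradient (1 where the hinge is active, 0 elsewhere).

```python
import math


def g(z, gamma=1.0):
    """Generalized logistic loss g(z) = log(1 + exp(gamma * z)) / gamma (stable form)."""
    t = gamma * z
    return (max(t, 0.0) + math.log1p(math.exp(-abs(t)))) / gamma


def g_prime(z, gamma=1.0):
    """Derivative of g: the logistic sigmoid of gamma * z."""
    t = gamma * z
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def objective(A, w, b, X, y, pairs, tau, C=0.5, lam=1.0, gamma=1.0):
    """Objective of Eq. (4) with the slack variables written as hinge losses."""
    Z = [matvec(A, x) for x in X]
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, z) + b)) for z, yi in zip(Z, y))
    metric = 0.0
    for i, j, ell in pairs:
        d2 = sum((p - q) ** 2 for p, q in zip(Z[i], Z[j]))
        metric += g(1.0 - ell * (tau - d2), gamma)
    return 0.5 * dot(w, w) + C * hinge + 0.5 * lam * metric


def grad_A(A, w, b, X, y, pairs, tau, C=0.5, lam=1.0, gamma=1.0):
    """Gradient of the objective with respect to A (hinge term plus pair term)."""
    k, d = len(A), len(A[0])
    G = [[0.0] * d for _ in range(k)]
    Z = [matvec(A, x) for x in X]
    for x, z, yi in zip(X, Z, y):
        if 1.0 - yi * (dot(w, z) + b) > 0.0:  # hinge active: subgradient is 1
            for r in range(k):
                for c in range(d):
                    G[r][c] -= C * yi * w[r] * x[c]
    for i, j, ell in pairs:
        diff = [p - q for p, q in zip(X[i], X[j])]    # x_i - x_j
        a_diff = [p - q for p, q in zip(Z[i], Z[j])]  # A (x_i - x_j)
        z2 = 1.0 - ell * (tau - dot(a_diff, a_diff))
        coeff = lam * g_prime(z2, gamma) * ell
        for r in range(k):
            for c in range(d):
                G[r][c] += coeff * a_diff[r] * diff[c]
    return G


def gradient_step(A, G, mu=0.05):
    """One descent step A <- A - mu * dJ/dA."""
    return [[a - mu * gv for a, gv in zip(row_a, row_g)] for row_a, row_g in zip(A, G)]


# Toy data (illustrative assumptions): four 2-D "super-pixels", labels, and pairs.
X = [[1.0, 0.0], [0.8, 0.3], [-1.0, 0.2], [-0.7, -0.4]]
y = [1, 1, -1, -1]
pairs = [(0, 1, 1), (0, 2, -1), (1, 3, -1)]  # (i, j, l_ij)
A0 = [[1.0, 0.2], [-0.1, 0.9]]
w0, b0, tau0 = [0.5, -0.3], 0.1, 1.0

J_before = objective(A0, w0, b0, X, y, pairs, tau0)
A1 = gradient_step(A0, grad_A(A0, w0, b0, X, y, pairs, tau0), mu=0.01)
J_after = objective(A1, w0, b0, X, y, pairs, tau0)
```

The example uses a smaller step (0.01) than the default so that a single step on the toy data reliably lowers the objective; in a full alternating loop this step would be interleaved with re-training the SVM on the projected features.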
(w, b) and A alternately. Algorithm 1 gives the pseudo-code of the optimization of Eq. (4).

1) Updating (w, b): With $A$ fixed, $A x_i$ is explicit, and thus Eq. (4) can be reformulated as

$$ \min_{w,b} J = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t. } y_i\left(w^T A x_i + b\right) \ge 1 - \xi_i;\ \xi_i \ge 0,\ \forall i = 1,\cdots,N \tag{5} $$

This is exactly the primal form of the soft-margin SVM, which can be solved by using off-the-shelf SVM solvers.

2) Updating A: With $(w, b)$ fixed, Eq. (4) can be reformulated as

$$ \min_{A} J = C\sum_{i=1}^{N}\xi_i + \frac{\lambda}{2}\sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right) \quad \text{s.t. } y_i\left(w^T A x_i + b\right) \ge 1 - \xi_i;\ \xi_i \ge 0,\ \forall i = 1,\cdots,N \tag{6} $$

By introducing the hinge loss function $h(z) = \max(0, z)$ to eliminate the slack variables $\xi_i$, we can reformulate Eq. (6) as

$$ \min_{A} J = C\sum_{i=1}^{N} h\left(1 - y_i\left(w^T A x_i + b\right)\right) + \frac{\lambda}{2}\sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right) \tag{7} $$

The gradient of the objective function $J$ with respect to the transformation matrix $A$ can be computed as follows

$$ \frac{\partial J}{\partial A} = -C\sum_{i=1}^{N} h'(z_1)\, y_i\, w\, x_i^T + \lambda\sum_{i,j} g'(z_2)\, \ell_{ij}\, A (x_i - x_j)(x_i - x_j)^T \tag{8} $$

where $h'$ and $g'$ denote the derivatives of $h$ and $g$, and $h(z)$, $g(z)$, $z_1$, and $z_2$ are defined as follows

$$ h(z) = \max(0, z) \tag{9} $$

$$ g(z) = \frac{1}{\gamma}\log\left(1 + \exp(\gamma z)\right) \tag{10} $$

$$ z_1 = 1 - y_i\left(w^T A x_i + b\right) \tag{11} $$

$$ z_2 = 1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right) \tag{12} $$

so that $g'(z) = 1 / \left(1 + \exp(-\gamma z)\right)$, and $h'(z)$ equals 1 for $z > 0$ and 0 otherwise. Then, the transformation matrix $A$ can be updated by using the following gradient descent rule until convergence

$$ A \leftarrow A - \mu \frac{\partial J}{\partial A} \tag{13} $$

where $\mu$ is the learning rate that controls the updating speed. We set the learning rate to 0.05 in this paper.

D. Co-Saliency Detection

Given a test image group containing multiple related images and a common object category, we first partition each input image into a number of super-pixels by the method of [47] and extract the initial features to represent the property of each super-pixel. Next, we transform the initial features to a new feature space by using the transformation matrix $A$ (learnt from the training image group with the corresponding object category) and predict the co-saliency score of each super-pixel using the following formula

$$ \mathrm{cosal}(x_i) = w^T A x_i + b \tag{14} $$

where $x_i \in \mathbb{R}^d$ is the initial feature of the $i$-th super-pixel represented by a $d$-dimensional feature vector, and $(w, b)$ are the learnt parameters of the co-salient object detector.

In order to obtain co-saliency maps with satisfactory spatial recovery, as in [4], we adopt a graph-based manifold ranking model [21] to smooth the co-saliency values of each super-pixel by exploring the spatial relationship of the adjacent super-pixels in each image. To be specific, the graph is established by connecting the super-pixels adjacent to each other as well as the super-pixels at the four image boundaries. Then, we obtain the foreground (salient) super-pixels as those whose co-saliency values are larger than an adaptive threshold. In our work, following [8], the adaptive threshold is set to the average of the co-saliency values over all super-pixels in one image. Finally, the smoothed co-saliency values of the super-pixels in each image are calculated via a ranking function [21] as follows

$$ \left[\mathrm{cosal}_{smoothed}(x_i)\right]_{i=1}^{n} = (D - \alpha W)^{-1} q \tag{15} $$

where $W = [w_{ij}]_{n \times n}$ and $D = \mathrm{diag}\{d_{11}, \cdots, d_{nn}\}$ are defined as

$$ w_{ij} = \exp\left(-\frac{\left(\mathrm{cosal}(x_i) - \mathrm{cosal}(x_j)\right)^2}{\sigma^2}\right) \tag{16} $$

$$ d_{ii} = \sum_{j} w_{ij} \tag{17} $$

Here, $q$ is a binary vector indicating which super-pixels are foreground in an image. $\alpha$ and $\sigma$ are two free parameters, which are set to 0.99 and 10, respectively, according to [21]. $\left[\mathrm{cosal}_{smoothed}(x_i)\right]_{i=1}^{n}$ denotes the smoothed co-saliency values of the super-pixels in the image, and $n$ is the total number of super-pixels in the image. $W$ and $D$ are the affinity matrix and the degree matrix, respectively.

IV. EXPERIMENTS

A. Datasets

In the experiments we evaluate the proposed metric learning-based co-saliency detection method on two publicly available benchmark datasets: the MSRC dataset [48] and the Cosal2015 dataset [8]. These two datasets have been widely used by the works of [2], [4], [5], and [7]–[10] for the task of co-saliency detection.

The MSRC dataset [48] consists of seven image groups with a total of 240 images that are labeled with pixel-wise ground truth masks. Each image group contains about 30 images. The complex background of the images makes the MSRC dataset more challenging for co-saliency detection. The Cosal2015 dataset is a new benchmark dataset established by Zhang et al. [8]. In this dataset, 50 image groups containing a total of 2015 images were collected. The image
TABLE I
Comparison of AP Scores Between the Baseline Method and Our Proposed Co-Saliency Detection Method for Each Image Group in the MSRC Dataset. The Entries With the Best APs for Each Image Group Are Bold-Faced

TABLE II
Comparison of F-Measure Between the Baseline Method and Our Proposed Co-Saliency Detection Method for Each Image Group in the MSRC Dataset. The Entries With the Best F-Measure Scores for Each Image Group Are Bold-Faced
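Returning to the detection stage of Section III-D, the scoring rule of Eq. (14) and the graph-based smoothing of Eqs. (15)-(17) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the graph is passed in as a hypothetical list of undirected `edges` between super-pixel indices, affinities are set only on those edges, every super-pixel is assumed to have at least one neighbour, and a small Gaussian-elimination routine stands in for a proper linear algebra library. The toy scores are invented for illustration.

```python
import math


def cosal_score(w, b, A, x):
    """Eq. (14): co-saliency score w^T A x + b of one super-pixel feature x."""
    z = [sum(a * v for a, v in zip(row, x)) for row in A]
    return sum(wk * zk for wk, zk in zip(w, z)) + b


def solve_linear(M, rhs):
    """Solve M f = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    f = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * f[c] for c in range(r + 1, n))
        f[r] = (aug[r][n] - tail) / aug[r][r]
    return f


def smooth_scores(scores, edges, alpha=0.99, sigma=10.0):
    """Eqs. (15)-(17): solve (D - alpha * W) f = q on a super-pixel graph."""
    n = len(scores)
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:  # affinity only between connected super-pixels
        W[i][j] = W[j][i] = math.exp(-((scores[i] - scores[j]) ** 2) / sigma ** 2)
    degree = [sum(row) for row in W]                     # Eq. (17)
    threshold = sum(scores) / n                          # mean score as adaptive threshold
    q = [1.0 if s > threshold else 0.0 for s in scores]  # foreground indicator vector
    M = [[(degree[i] if i == j else 0.0) - alpha * W[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(M, q), q


# Toy example: a chain of four super-pixels with hypothetical raw scores.
raw = [2.0, 1.5, -1.0, -2.0]
chain = [(0, 1), (1, 2), (2, 3)]
smoothed, query = smooth_scores(raw, chain)
```

On this chain the two super-pixels above the mean score become the queries, and the smoothed values decay with graph distance from them, which is the intended effect of the ranking step.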
number in each group ranges from 26 to 52. To construct ground truth masks, 20 subjects were asked to view these image groups and provide the pixel-level annotation manually. To the best of our knowledge, the Cosal2015 dataset is the largest and the most challenging dataset so far used for evaluating different co-saliency detection algorithms. Figure 3 and Figure 4 show some example images from the MSRC dataset and the Cosal2015 dataset, respectively.

Fig. 4. Some example images from the Cosal2015 dataset.

B. Evaluation Metrics

To evaluate the performance of the proposed method, we adopt three standard and commonly used measures, namely the precision-recall (PR) curve, the average precision (AP) score, and the F-measure. The PR curve is based on the overlapping area between detections and ground truth masks. It is obtained by

C. Implementation Details

We used an off-the-shelf convolutional neural network (CNN) for initial feature extraction owing to its significant success in the computer vision community [50]–[53]. Specifically, we used all the convolutional layers in the CNN to establish the hyper-column feature representation for each super-pixel. To this end, we first resized each image to a fixed size of 224×224 pixels and then extracted the convolutional feature maps with a CNN model pre-trained on the ImageNet dataset. Having the same architecture as the “CNN-S” model proposed in [54], the CNN used in this work contains 13 convolutional layers, five pooling layers, and one fully connected layer, where the five convolutional layers before each pooling layer were regarded as the feature maps to represent the image contents. As the convolution and pooling operations in the CNN resulted in feature maps with different scales, we up-sampled each feature map to the scale of the original input image. Thus, the obtained feature maps can represent each pixel of the
TABLE III
Comparison of AP Scores Between the Baseline Method and Our Proposed Co-Saliency Detection Method for Each Image Group in the Cosal2015 Dataset. The Entries With the Best APs for Each Image Group Are Bold-Faced

TABLE IV
Comparison of F-Measure Scores Between the Baseline Method and Our Proposed Co-Saliency Detection Method for Each Image Group in the Cosal2015 Dataset. The Entries With the Best F-Measure Scores for Each Image Group Are Bold-Faced
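For readers who want to reproduce scores of the kind reported in these tables, the sketch below shows one common way to compute precision, recall, F-measure, and average precision from a predicted co-saliency map and a binary ground-truth mask, both flattened to lists. The exact protocol used in this paper is not recoverable from the text here, so everything in the sketch is an assumption based on common practice: thresholding the map to obtain a PR curve, weighting precision with β² = 0.3 in the F-measure, and computing AP from the ranking of pixels by score. The function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the binary map `scores >= threshold`."""
    tp = sum(1 for s, lab in zip(scores, labels) if s >= threshold and lab)
    predicted = sum(1 for s in scores if s >= threshold)
    positives = sum(1 for lab in labels if lab)
    precision = tp / predicted if predicted else 1.0  # convention for an empty detection
    recall = tp / positives if positives else 0.0
    return precision, recall


def pr_curve(scores, labels, thresholds):
    """(precision, recall) points obtained by sweeping the threshold."""
    return [precision_recall(scores, labels, t) for t in thresholds]


def f_measure(precision, recall, beta2=0.3):
    """Weighted harmonic mean of precision and recall (beta2 = 0.3 is an assumption)."""
    denom = beta2 * precision + recall
    return (1.0 + beta2) * precision * recall / denom if denom else 0.0


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive pixel."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy map with four pixels (illustrative values only).
pred = [0.9, 0.8, 0.3, 0.1]
mask = [1, 1, 0, 0]
p, r = precision_recall(pred, mask, 0.2)
f = f_measure(p, r)
ap = average_precision(pred, mask)
```

In practice the threshold sweep would run over the full range of map values (for example 0 to 255 for 8-bit maps), and the per-image results would be averaged over each image group.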
input image. Then, we max-pooled the feature vectors located within each super-pixel region to obtain a set of 1888-dimensional feature vectors, one for each super-pixel. Finally, these 1888-dimensional features were normalized and used to learn more powerful and discriminative feature representations by using Eq. (4).

In our implementation, each image group was randomly split into 50% for feature learning and co-salient object detector training and the remaining 50% for co-saliency detection. In addition, to obtain reliable results, we repeated the experiment five times with random 50%-50% training-test splits and reported the mean result of the five runs.

D. Comparison With Baseline Method

In our work, the linear SVM classifier trained with the original 1888-dimensional hyper-column feature representation serves as the baseline method, called “baseline” in this paper for short. For fair comparison, we (i) used the same training images and test images for both the baseline and our proposed method; and (ii) adopted the same scheme for both, as described in Section III-D, to perform co-saliency detection. The only difference between the baseline and our proposed method is that the baseline SVM classifier was trained without using the metric learning regularization term (the second term in Eq. (1)), i.e., without considering feature learning.

Tables I-II and Tables III-IV report the comparison results of the baseline method and our proposed method for each image group in the MSRC dataset and the Cosal2015 dataset, measured in terms of AP score and F-measure score, respectively. The results show that: (i) The proposed metric learning-based method outperforms the baseline method for all seven image groups from the MSRC dataset and for 48 out of 50 image groups from the Cosal2015 dataset; (ii) On the challenging Cosal2015 dataset, our method improves upon the baseline method significantly (by at least two percentage points) for 39 out of 50 and 40 out of 50 image groups measured in terms of AP score and F-measure, respectively. In particular, for challenging image groups such as “boat” and “snail”, we obtained 17.7% and 10.6% AP improvements and 10.3% and 12.3% F-measure improvements; (iii) From the perspective of overall performance measured in terms of AP score and F-measure score, our proposed method improves the baseline method by 1.4% and 2.4% for the relatively simple MSRC dataset, and boosts the baseline method by 4% and 4.6% for the challenging Cosal2015 dataset, which demonstrates that our method is effective for learning discriminative feature representations. It is worth noting that, for some image groups such as “apple” and “motorbike”, the baseline method without
Fig. 5. Overall performance comparison between our proposed method and eight state-of-the-art comparison methods, including LDW [8], CBCS [5], CSHS [7], SACS [9], ESMG [10], CBCS-S [5], BLSM [26], and LR [27], on the MSRC dataset, measured in terms of (a) PR curves, (b) AP score, and (c) F-measure score, respectively.

Fig. 6. Overall performance comparison between our proposed method and eight state-of-the-art comparison methods, including LDW [8], CBCS [5], CSHS [7], SACS [9], ESMG [10], CBCS-S [5], BLSM [26], and LR [27], on the Cosal2015 dataset, measured in terms of (a) PR curves, (b) AP score, and (c) F-measure score, respectively.
using feature learning also achieves comparable performance to our method because these groups have relatively simple backgrounds.

E. Comparison With State-of-the-Art Methods

To comprehensively evaluate our co-saliency detection method, we compared it with eight state-of-the-art methods, including LDW [8], CBCS [5], CSHS [7], SACS [9], ESMG [10], CBCS-S [5], BLSM [26], and LR [27], where the first five are state-of-the-art co-saliency detection methods and the last three are state-of-the-art single-image saliency detection methods.

In brief, LDW [8] is a novel co-saliency detection method in which wide and deep information is explored for the object proposal windows extracted in each image, and the co-saliency scores are calculated by integrating the intra-image contrast and intra-group consistency via a principled Bayesian formulation. CBCS [5] is a cluster-based method that measures the cluster-level co-saliency using three bottom-up information cues: the contrast cue, the corresponding cue, and the spatial cue. CSHS [7] is a hierarchical segmentation-based method which integrates the regional similarity on the basis of the fine segmentation and the object prior on the basis of the coarse segmentation, respectively. SACS [9] is a fusion-based method that integrates the result maps of multiple existing (co-)saliency methods with self-adaptive weights obtained via low-rank decomposition. ESMG [10] is a visual saliency guided model in which the queries are first selected based on the saliency maps of an existing method; an efficient manifold ranking is then used to refine the detection results at the group level. CBCS-S [5] is a cluster-based single-image saliency detection method which measures the cluster-level saliency by using two bottom-up information cues: the contrast cue and the spatial cue. BLSM [26] is a bottom-up single-image saliency model which examines both the low-level cue and the mid-level cue. LR [27] is also a single-image saliency detection framework; it integrates low-level features with higher-level guidance (top-down priors) to detect salient regions through low-rank matrix recovery.

Figure 5 and Figure 6 present the overall performance comparison between our proposed method and the eight state-of-the-art comparison methods on the MSRC dataset and the Cosal2015 dataset, measured in terms of PR curves, AP score, and F-measure score, respectively. Here the results
Fig. 7. Co-saliency detection examples obtained with our proposed method and three representative methods on two benchmark datasets. For each image group, the first row is the input image group, the second row is the ground truth masks, and rows 3 to 6 are the co-saliency detection results obtained using the proposed approach, ESMG [10], SACS [9], and LDW [8], respectively.
[41] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, “A multimedia retrieval framework based on semi-supervised ranking and relevance feedback,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 723–742, Apr. 2012.
[42] G. Ding, Y. Guo, J. Zhou, and Y. Gao, “Large-scale cross-modality search via collective matrix factorization hashing,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5427–5440, Nov. 2016.
[43] C. Ge, K. Fu, F. Liu, L. Bai, and J. Yang, “Co-saliency detection via inter and intra saliency propagation,” Signal Process., Image Commun., vol. 44, pp. 69–83, May 2016.
[44] Y. Yang, Z. Ma, A. G. Hauptmann, and N. Sebe, “Feature selection for multimedia analysis by sharing information among multiple tasks,” IEEE Trans. Multimedia, vol. 15, no. 3, pp. 661–669, Apr. 2013.
[45] Y. Guo, G. Ding, L. Liu, J. Han, and L. Shao, “Learning to hash with optimized anchor embedding for scalable retrieval,” IEEE Trans. Image Process., vol. 26, no. 3, pp. 1344–1354, Mar. 2017.
[46] X. Yao, J. Han, G. Cheng, X. Qian, and L. Guo, “Semantic annotation of high-resolution satellite images via weakly supervised learning,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3660–3671, Jun. 2016.
[47] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[48] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2005, pp. 1800–1807.
[49] Y. Jia and M. Han, “Category-independent object-level saliency detection,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1761–1768.
[50] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[51] P. Zhou, G. Cheng, Z. Liu, S. Bu, and X. Hu, “Weakly supervised target detection in remote sensing images based on transferred deep features and negative bootstrapping,” Multidimension. Syst. Signal Process., vol. 27, no. 4, pp. 925–944, Oct. 2016.
[52] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proc. IEEE, to be published, doi: 10.1109/JPROC.2017.2675998.
[53] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.
[54] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. (2014). “Return of the devil in the details: Delving deep into convolutional nets.” [Online]. Available: https://arxiv.org/abs/1405.3531
[55] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, “Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4238–4249, Aug. 2015.
[56] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325–3337, Jun. 2015.
[57] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.

Junwei Han received the B.S. and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, China, in 1999 and 2003, respectively. He is currently a Professor with Northwestern Polytechnical University. His research interests include computer vision and multimedia processing.

Gong Cheng received the B.S. degree from Xidian University, Xi’an, China, in 2007, and the M.S. and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, in 2010 and 2013, respectively. He is currently an Associate Professor with Northwestern Polytechnical University. His main research interests are computer vision and pattern recognition.

Zhenpeng Li received the B.S. degree from Northwestern Polytechnical University, Xi’an, China, in 2016, where he is currently pursuing the M.S. degree. His research interests are computer vision and pattern recognition.

Dingwen Zhang received the B.S. degree from Northwestern Polytechnical University, Xi’an, China, in 2012, where he is currently pursuing the Ph.D. degree. His research interests include computer vision and multimedia processing, especially on saliency detection, co-saliency detection, and weakly supervised learning.