

A Unified Metric Learning-Based Framework


for Co-Saliency Detection
Junwei Han, Senior Member, IEEE, Gong Cheng, Zhenpeng Li, and Dingwen Zhang

Abstract—Co-saliency detection, which focuses on extracting commonly salient objects in a group of relevant images, has been attracting research interest because of its broad applications. In practice, the relevant images in a group may have a wide range of variations, and the salient objects may also have large appearance changes. Such wide variations usually bring about large intra-co-salient objects (intra-COs) diversity and high similarity between COs and background, which makes the co-saliency detection task more difficult. To address these problems, we make the earliest effort to introduce metric learning to co-saliency detection. Specifically, we propose a unified metric learning-based framework to jointly learn the discriminative feature representation and the co-salient object detector. This is achieved by optimizing a new objective function that explicitly embeds a metric learning regularization term into support vector machine (SVM) training. Here, the metric learning regularization term is used to learn a powerful feature representation that has small intra-COs scatter but big separation between background and COs, and the SVM classifier is used for subsequent co-saliency detection. In the experiments, we comprehensively evaluate the proposed method on two commonly used benchmark data sets. State-of-the-art results are achieved in comparison with the existing co-saliency detection methods.

Index Terms—Co-saliency detection, metric learning, feature learning.

Fig. 1. Motivation of this paper. The relevant images in a group generally have a wide range of variations and the co-salient objects may also have large appearance changes. Such wide variations often bring about large intra-co-salient objects (intra-COs for short) diversity (as shown in Figure 1 (a)) and high similarity between COs and background (as shown in Figure 1 (b)), which makes the discovery of common and attractive objects more challenging. This motivates us to learn a more powerful feature representation that has small intra-COs scatter but big separation between COs and background.

I. INTRODUCTION

WITH the rapid development of imaging equipment and the growing popularity of photo-sharing websites (e.g., Flickr and Facebook), it is much easier to acquire a large collection of images or video data. Typically, such image collections are big in size and generally share common objects or events. Thus, it is interesting and meaningful to identify the common and attractive objects from all images in such data collections. Co-saliency detection, which focuses on extracting commonly salient objects simultaneously in a group of relevant images, has been proposed in recent years. As a newly emerging and rapidly growing research topic, co-saliency detection has attracted a large amount of research interest [1]–[11] because of its broad applications, such as image and video co-segmentation [12]–[17] and object co-localization [18], [19].

Compared with traditional single-image saliency detection [20]–[27], co-saliency detection from one image group, which consists of two or more relevant images, is potentially more promising because the multiple relevant images within an image group contain much richer and more useful information. However, this problem tends to be much more challenging because, in practice, the relevant images in a group may have a wide range of variations caused by varying illumination conditions, diverse backgrounds, and changing viewpoints, and the commonly salient objects may also have large appearance changes caused by non-rigid deformations, occlusions, and disguise. Such wide variations usually bring about large intra-co-salient objects (intra-COs for short) diversity (as shown in Figure 1 (a)) and high similarity between COs and background (as shown in Figure 1 (b)), which makes the discovery of common and attractive objects more difficult and challenging.

During the past few years, significant efforts have been made in this research field to develop a series of works from different motivations and solutions for the task of co-saliency detection [1]–[11]. Despite the remarkable success made so far, the problems of intra-COs diversity and inter-similarity between COs and background still remain two challenges that degenerate the performance of co-saliency detection. In this situation, how to exploit discriminative information from the available image group to learn a more powerful feature representation that has small intra-COs scatter but big separation between COs and background is highly appealing.

Manuscript received December 31, 2016; revised April 9, 2017; accepted May 11, 2017. Date of publication May 19, 2017; date of current version October 24, 2018. This work was supported in part by the National Science Foundation of China under Grant 61473231, Grant 61522207, and Grant 61401357, and in part by the Fundamental Research Funds for the Central Universities under Grant 3102016ZY023. This paper was recommended by Associate Editor W. Zuo. (Corresponding authors: Gong Cheng; Dingwen Zhang.)

The authors are with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: gcheng@nwpu.edu.cn; zdw2006yyy@mail.nwpu.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2017.2706264


In recent years, many metric learning algorithms have been developed to learn a desired distance metric from the given training samples [28]–[42], measured by which the samples from the same class are as close as possible, while the samples from different classes are as far as possible. Inspired by the core idea of the metric learning technique, we propose a unified metric learning based framework to jointly learn the discriminative feature representation and the co-salient object detector, which can address the above-mentioned challenges and hence further improve the state-of-the-art performance. Specifically, this is achieved via optimizing a new objective function that explicitly embeds a metric learning regularization term into support vector machine (SVM) training. Here, the metric learning regularization term is used to learn a powerful feature representation that has small intra-COs scatter but big separation between background and COs, and the learnt SVM classifier is used for subsequent co-saliency detection.

To summarize, our main contributions are as follows. First, to our best knowledge, we make the earliest effort to introduce metric learning to co-saliency detection. To be specific, we propose a unified framework for jointly learning discriminative features and training the co-salient object detector simultaneously. By using the metric learning regularization, we transform the input data into a new feature space. Thus, we can pull the pixels from co-salient objects closer while pushing the different-class pixels (from the foreground class and the background class) farther away in the transformed new feature space, and make the input images easy to classify into co-salient objects and background by using the learnt detector. Second, different from most existing co-saliency approaches that are based on weakly supervised learning strategies, in which there are only image-level labels indicating whether an image contains the to-be-detected co-salient objects or not, in this work we attempt to perform co-saliency detection through a supervised learning scheme in which the accurate locations of the co-salient objects are supposed to be known in advance. By using supervised learning, we can adequately exploit discriminative information from the available training images to learn a feature representation that generalizes well to the various scenarios encountered in practice. Third, by using the proposed method, we obtain state-of-the-art results on two commonly used benchmark datasets, compared with a supervised learning based baseline and all existing co-saliency detection methods.

This paper is organized as follows. Section II gives a brief review of related work. Section III describes the proposed metric learning based co-saliency detection method in detail. Section IV presents comprehensive experimental results on two widely used benchmark datasets. Section V concludes the paper and discusses future work.

II. RELATED WORK

A. Co-Saliency Detection

As a newly emerging research topic, co-saliency detection has attracted significant research efforts [1]–[10]. Based on the strategy used for co-saliency detection, the existing methods can be roughly categorized into three main categories: bottom-up methods [1], [5], [43], fusion-based methods [9], [10], and learning-based methods [2], [4], [6], [8].

The bottom-up methods [1], [5], [43] are almost the earliest and the simplest methods for co-saliency detection; they score each pixel/region in the image group through manually designed co-saliency cues. Generally, bottom-up methods consist of four main components: pre-processing, feature extraction, exploring bottom-up cues, and weighted combination. In brief, in the pre-processing step the input images are first divided into a number of computational units. Afterwards, feature extraction is performed to represent the property of each computational unit based on bottom-up cues. Finally, the results obtained from each bottom-up cue are integrated together to generate the co-saliency maps for the input images. Bottom-up methods have achieved considerable progress during the past few years. However, since this kind of approach relies heavily on handcrafted cues, they are often too subjective and hence cannot generalize well to the various scenarios encountered in practice.

More recently, researchers have developed several fusion-based methods for co-saliency detection [9], [10]. Rather than devoting themselves to discovering informative cues from the image group to represent co-salient objects, the fusion-based methods mainly focus on mining useful knowledge from the predicted results obtained with several off-the-shelf saliency or co-saliency algorithms and then fusing them to obtain the final co-saliency maps. The fusion-based methods can often achieve better results as they can make further improvement based on the existing (co-)saliency detection approaches. However, since they severely rely on the existing (co-)saliency detection methods, when most of the adopted (co-)saliency techniques lose their power, the final results of the fusion-based methods may be hurt significantly.

With the development of machine learning and data mining, the learning-based methods [2], [4], [6], [8] have been attracting more and more research attention. This kind of method usually casts co-saliency detection as a classification problem for each image pixel/region. In such a learning-based framework, most of the knowledge about the co-salient object regions is inferred by the designed learner automatically rather than relying heavily on handcrafted cues as in the other categories of co-saliency detection methods. By taking advantage of machine learning and data mining techniques, the learning-based methods usually achieve promising results. Besides, co-saliency detection is also related to the multi-task learning problem [44] by taking each task as an input image. Our work also belongs to the learning-based category due to its superiority in comparison with the other two kinds of co-saliency detection methods.

B. Metric Learning

In recent years, researchers have proposed many approaches to learn appropriate metrics from the given training data in many visual classification tasks [28]–[41], [45], [46]. In brief, metric learning aims to learn a desired distance metric, measured by which the same-class samples are closer, while the different-class samples are as far as possible.
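As a toy illustration of this goal (not part of any surveyed method), the following sketch contrasts plain Euclidean distances with distances measured after a linear transformation A, the same kind of transform that is learned later in Section III; the particular matrix and points used here are made up purely for illustration.

```python
import numpy as np

# Two same-class samples and one sample from a different class (toy 2-D data).
x1, x2 = np.array([1.0, 0.2]), np.array([1.1, 0.9])   # same class
x3 = np.array([0.2, 0.8])                              # different class

# A hypothetical "learned" linear transformation A (illustrative values only).
# A good metric shrinks the within-class distance and stretches the
# between-class one, i.e. ||A(x1-x2)|| < ||x1-x2|| and ||A(x1-x3)|| > ||x1-x3||.
A = np.array([[1.5, 0.0],
              [0.0, 0.3]])

def dist(a, b, M=None):
    """Euclidean distance, optionally measured after the linear map M."""
    d = a - b
    if M is not None:
        d = M @ d
    return np.linalg.norm(d)

print("same class :", dist(x1, x2), "->", dist(x1, x2, A))   # shrinks
print("diff class :", dist(x1, x3), "->", dist(x1, x3, A))   # grows
```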


Fig. 2. Illustration of the basic framework of our proposed metric learning-based co-saliency detection method. It consists of three main steps: super-pixel
segmentation and feature extraction, joint feature learning and co-saliency detector training, and co-saliency detection. Given an image group, the first step
partitions the input images into a set of super-pixels and extracts their features. The second step jointly learns discriminative feature representations and
trains co-salient object detector simultaneously. Finally, the third step infers the co-saliency of each super-pixel region with the learnt co-saliency detector and
generates co-saliency maps by using a spatial map recovery technique.

According to the availability of the class labels of the training data, metric learning methods can be generally categorized into unsupervised ones [33], semi-supervised ones [34], and supervised ones [28], [30]–[32]. Some of them are based on pair-wise constraints [28], [30], in which the distance metric is supposed to keep the instances in similar constraints close and, at the same time, the instances in dissimilar constraints separated. Besides, there are also methods for learning distance metrics with triplet constraints [32], [35]. In this paper, the pair-wise constraint is also adopted in our co-saliency detection framework. However, different from most of the existing metric learning approaches, the metric learning regularization is not only used to learn discriminative feature representations but also well explored to train an effective co-saliency detector.

III. THE PROPOSED METHOD

A. Overview of the Proposed Method

Figure 2 illustrates the basic framework of our proposed metric learning-based co-saliency detection method. As can be seen from Figure 2, the proposed method consists of three main steps: super-pixel segmentation and feature extraction, joint feature learning and co-saliency detector training, and co-saliency detection. Given an image group containing multiple related images, the first step uses the super-pixel segmentation method presented by [47] to partition the input images into a number of computational units and then extracts their features to represent the property of each computational unit. In the second step, we propose a unified framework for jointly learning discriminative feature representations and training the co-salient object detector simultaneously. This is achieved by optimizing a new objective function that explicitly embeds a metric learning regularization term into SVM training. Here, the metric learning regularization term is used to learn a powerful feature representation that has small intra-COs scatter but big separation between background and COs. The SVM classifier is used for subsequent co-saliency detection. By transforming the input features into a new feature space through metric learning, the super-pixels from co-salient objects are pulled closer while the different-class super-pixels from the foreground class and the background class are pushed farther away in the transformed new feature space, hence making the input images easy to classify into co-salient objects and background by using the learnt SVM classifier. Finally, the third step infers the co-saliency of each super-pixel region with the learnt co-saliency detector and generates co-saliency maps by using a simple spatial map recovery technique [21].

The first step is actually a pre-processing step and its detailed implementation will be described in Section IV. Next we will describe the second and the third steps in detail.
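As a minimal sketch of this pre-processing step, the snippet below segments one image into super-pixels with the SLIC algorithm of [47] (here via scikit-image, which is an assumption, since the paper does not name a particular implementation) and pools a simple per-super-pixel feature; the features actually used in this work are the CNN hyper-column features described in Section IV-C. Each row of the returned matrix then plays the role of one training sample x_i in Section III-B.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.util import img_as_float

def superpixel_features(image, n_segments=200):
    """Partition an RGB image into SLIC super-pixels [47] and return a simple
    mean-color feature per super-pixel (a placeholder for the CNN hyper-column
    features described in Section IV-C)."""
    image = img_as_float(image)
    labels = slic(image, n_segments=n_segments, compactness=10)
    feats = [image[labels == sp].mean(axis=0) for sp in np.unique(labels)]
    return labels, np.stack(feats)   # label map and an (n_superpixels, 3) matrix
```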


B. Joint Feature Learning and Co-Saliency Detector Training via Metric Learning

Different from most of the existing co-saliency detection methods, in which the feature representation and the detector are trained separately, our proposed co-saliency detection method combines discriminative feature learning and co-salient object detector training into a unified metric learning-based framework in which they can benefit each other.

Given the training images from a specific image group, we suppose that X = [x_1, x_2, ..., x_N] ∈ R^{d×N} is the training set containing all super-pixels from the training images and Y = [y_1, y_2, ..., y_N] ∈ R^N is the label set of X, where N is the total number of training samples, x_i ∈ R^d is the i-th training sample (super-pixel) represented by a d-dimensional feature vector, and y_i ∈ {+1, −1} indicates the class label of sample x_i, with y_i = +1 denoting a positive training sample (i.e., a super-pixel coming from the co-salient objects) and y_i = −1 denoting a negative training sample. In our implementation, we treat all super-pixels with ≥0.5 overlap with the ground-truth masks as positive training samples and the super-pixels with ≤0.2 overlap with the ground-truth masks as negatives.

For a training sample x_i ∈ R^d from the training set X, let A ∈ R^{k×d} (k < d means dimensionality reduction and vice versa; we set k = d in our implementation) be the to-be-learned transformation matrix used for linearly projecting the data from the original feature space to a latent feature space. Thus, the new feature representation of x_i can be computed by x̃_i = A x_i. Apart from requiring that the objective function should minimize the classification error on the training set, we also require that the feature representation obtained from the transformation matrix should have powerful discrimination capability. Thus, in the new feature space, the super-pixels of co-salient objects should be mapped closely to each other while positive-negative super-pixel pairs should be mapped farther apart. To this end, we propose a new objective function to jointly learn the transformation matrix A and the co-salient object detector (i.e., a linear SVM parameterized with w and b) by the following formula

\min_{w,b,A} J = J_1(X, Y) + \lambda J_2(X, L) \qquad (1)

where λ is a trade-off parameter that controls the relative importance of these two terms. We set λ = 1 in our work.

The first term J_1(X, Y) in Eq. (1) is the loss function of the traditional soft margin SVM classifier. It seeks to minimize the classification error for the given input-target pairs (X, Y) by

J_1(X, Y) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \quad \text{s.t. } y_i\left(w^{T} A x_i + b\right) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \forall i = 1, \cdots, N \qquad (2)

where ξ_i are slack variables and C is the only free parameter in the linear SVM, controlling the trade-off between the slack variable penalty and the maximization of the margin. By using a grid search scheme over the set {0.001, 0.01, 0.1, 0.5, 1, 10, 100}, we empirically set C = 0.5 in our implementation.

The second term J_2(X, L) in Eq. (1) is a metric learning regularization term, which is imposed on the transformation matrix A to enforce the learnt feature representation to have small intra-COs scatter but big separation between COs and background. Referring to the work of [28], we define the regularization term as

J_2(X, L) = \frac{1}{2} \sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right) \qquad (3)

where L = {ℓ_ij} with i and j denoting the selected training samples, g(z) = \frac{1}{\gamma}\log(1 + \exp(\gamma z)) is the generalized logistic loss function, which is a smoothed approximation of the hinge loss function h(z) = max(0, z), and γ is a sharpness parameter set to 1 in our implementation. ℓ_ij is a label indicator for the paired data (x_i, x_j): if x_i and x_j are both from the co-salient objects, ℓ_ij = +1; otherwise (i.e., if only one of x_i and x_j is from the co-salient objects), ℓ_ij = −1. In practice, the number of negative data pairs is much bigger than that of positive data pairs. To prevent data imbalance, in our implementation the negative data pairs were randomly selected with the same number as the positive data pairs. τ is an adaptive threshold to connect the margin between similar pairs and dissimilar pairs. It is selected as follows: by assuming that the distances of positive data pairs and negative data pairs obey normal distributions, we first draw the probability density curves of the positive data pairs and the negative data pairs, respectively, and then obtain the junction (x_0, y_0) of the two curves. Here, x_0 is the selected adaptive threshold. As can be seen, Eq. (3) enforces the similar pairs from the co-salient objects to be mapped closely to each other while the dissimilar pairs are mapped apart. If this term outputs a small value, the feature representation is considered to be discriminative.

By incorporating Eqs. (2) and (3) into Eq. (1), we obtain the following discriminative objective function

\min_{w,b,A} J = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i + \frac{\lambda}{2} \sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right), \quad \text{s.t. } y_i\left(w^{T} A x_i + b\right) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \forall i = 1, \cdots, N \qquad (4)

As can be seen from Eq. (4), the proposed new discriminative objective function not only minimizes the classification loss, but also enforces the learnt features to be more discriminative for distinguishing co-salient objects and background.
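To make Eq. (4) concrete, the following sketch evaluates the two terms of the objective for a given (w, b, A), including the balanced pair sampling and the adaptive threshold τ taken at the junction of two fitted normal densities. It is an illustrative NumPy reading of the formulas above, not the authors' released code: the helper names are ours, `rng` is a `np.random.default_rng()` instance supplied by the caller, and the paper does not state whether τ is re-estimated after A is updated, so that choice is left to the caller as well.

```python
import numpy as np
from scipy.stats import norm

def generalized_logistic(z, gamma=1.0):
    # g(z) = (1/gamma) * log(1 + exp(gamma*z)), the smoothed hinge of Eq. (3).
    return np.logaddexp(0.0, gamma * z) / gamma

def sample_pairs(y, rng):
    """Balanced pair set: l_ij = +1 if both super-pixels are co-salient,
    l_ij = -1 if exactly one of them is; negative pairs are subsampled so
    that both kinds are equally frequent."""
    pos, neg = np.where(y == +1)[0], np.where(y == -1)[0]
    sim = [(i, j, +1) for a, i in enumerate(pos) for j in pos[a + 1:]]
    dis = [(i, j, -1) for i in pos for j in neg]
    keep = rng.choice(len(dis), size=min(len(sim), len(dis)), replace=False)
    return sim + [dis[k] for k in keep]

def adaptive_tau(sq_dists, labels):
    """tau = crossing point of the two normal densities fitted to the squared
    distances of the similar (+1) and dissimilar (-1) pairs."""
    d_sim, d_dis = sq_dists[labels == +1], sq_dists[labels == -1]
    grid = np.linspace(sq_dists.min(), sq_dists.max(), 1000)
    p_sim = norm.pdf(grid, d_sim.mean(), d_sim.std() + 1e-12)
    p_dis = norm.pdf(grid, d_dis.mean(), d_dis.std() + 1e-12)
    return grid[np.argmin(np.abs(p_sim - p_dis))]

def objective(w, b, A, X, y, pairs, tau, C=0.5, lam=1.0, gamma=1.0):
    """J of Eq. (4): soft-margin SVM loss (slack written as a hinge) plus
    lam times the metric-learning regularizer of Eq. (3)."""
    Z = A @ X                                        # transformed features, (k, N)
    margins = y * (w @ Z + b)
    j1 = 0.5 * np.dot(w, w) + C * np.maximum(0.0, 1.0 - margins).sum()
    i, j, l = (np.array(v) for v in zip(*pairs))
    sq_dist = np.sum((Z[:, i] - Z[:, j]) ** 2, axis=0)
    j2 = 0.5 * generalized_logistic(1.0 - l * (tau - sq_dist), gamma).sum()
    return j1 + lam * j2
```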


C. Optimization Strategy

To solve the optimization problem of Eq. (4), we present an effective EM-like iterative minimization algorithm that updates (w, b) and A alternately. Algorithm 1 gives the pseudo-code of the optimization of Eq. (4).

Algorithm 1 Joint Feature Learning and Co-Saliency Detector Training via Metric Learning
Input: the training set X = [x_1, x_2, ..., x_N] ∈ R^{d×N} that contains all super-pixels of the training images from a specific image group; the label set Y = [y_1, y_2, ..., y_N] ∈ R^N of X, where x_i ∈ R^d is the i-th training sample (super-pixel) represented by a d-dimensional feature vector and y_i ∈ {+1, −1} indicates the class label of sample x_i; the learning rate μ.
Output: the parameters of the metric learning transformation matrix and the co-salient object detector, denoted by A and (w, b).
1: begin
2:   Obtain the paired data {(x_i, x_j, ℓ_ij)}
3:   Initialize A and (w, b)
4:   while stopping criterion has not been met do
5:     update (w, b) using Eq. (5) with A being fixed
6:     update A using Eq. (13) with (w, b) being fixed
7:   end while
8:   return A and (w, b)
9: end

1) Updating (w, b): With A being fixed, Ax_i is explicit, and thus Eq. (4) can be reformulated as

\min_{w,b} J = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \quad \text{s.t. } y_i\left(w^{T} A x_i + b\right) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \forall i = 1, \cdots, N \qquad (5)

This becomes exactly the primal form of the soft margin SVM, which can be solved by using off-the-shelf SVM solvers.
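A compact sketch of this alternating scheme is given below. It assumes scikit-learn's LinearSVC as the off-the-shelf solver for the (w, b) step (the paper does not name a specific solver), and the A step uses the gradient of Eq. (4) with respect to A that is derived in Eqs. (6)–(13) next; the fixed iteration count stands in for the unstated stopping criterion.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_joint_model(X, y, pairs, tau, C=0.5, lam=1.0, gamma=1.0, mu=0.05, n_iter=20):
    """Alternating optimization in the spirit of Algorithm 1.
    X: (d, N) super-pixel features, y: (N,) labels in {+1, -1},
    pairs: list of (i, j, l_ij) tuples from the balanced pair sampling."""
    d = X.shape[0]
    A = np.eye(d)                                    # k = d as in the paper
    i, j, l = (np.array(v) for v in zip(*pairs))
    diff = X[:, i] - X[:, j]                         # (d, num_pairs)
    for _ in range(n_iter):                          # stopping criterion not specified in the paper
        # Step 1: update (w, b) with A fixed (Eq. (5)) -- an ordinary linear SVM.
        svm = LinearSVC(C=C, loss="hinge").fit((A @ X).T, y)
        w, b = svm.coef_.ravel(), svm.intercept_[0]
        # Step 2: one gradient-descent step on A with (w, b) fixed (Eqs. (6)-(13)).
        margins = y * (w @ (A @ X) + b)
        act = (1.0 - margins) > 0                    # h'(z1): subgradient of the hinge
        grad_svm = -np.outer(w, (y[act] * X[:, act]).sum(axis=1))
        Ad = A @ diff
        z2 = 1.0 - l * (tau - np.sum(Ad ** 2, axis=0))
        coeff = sigmoid(gamma * z2) * l              # g'(z2) * l_ij
        grad_metric = (coeff * Ad) @ diff.T          # sum of coeff * A(xi-xj)(xi-xj)^T
        A = A - mu * (C * grad_svm + lam * grad_metric)   # Eq. (13)
    return A, (w, b)
```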
2) Updating A: With (w, b) being fixed, Eq. (4) can be reformulated as

\min_{A} J = C \sum_{i=1}^{N} \xi_i + \frac{\lambda}{2} \sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right), \quad \text{s.t. } y_i\left(w^{T} A x_i + b\right) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \forall i = 1, \cdots, N \qquad (6)

By introducing the hinge loss function h(z) = max(0, z) to eliminate the slack variables ξ_i, we can reformulate Eq. (6) as

\min_{A} J = C \sum_{i=1}^{N} h\left(1 - y_i\left(w^{T} A x_i + b\right)\right) + \frac{\lambda}{2} \sum_{i,j} g\left(1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right)\right) \qquad (7)

The gradient of the objective function J with respect to the transformation matrix A can be computed as

\frac{\partial J}{\partial A} = -C \sum_{i=1}^{N} h'(z_1)\, y_i\, w x_i^{T} + \lambda \sum_{i,j} g'(z_2)\, \ell_{ij}\, A (x_i - x_j)(x_i - x_j)^{T} \qquad (8)

where h(z), g(z), z_1, and z_2 are defined as follows

h(z) = \max(0, z) \qquad (9)
g(z) = \frac{1}{\gamma} \log\left(1 + \exp(\gamma z)\right) \qquad (10)
z_1 \triangleq 1 - y_i\left(w^{T} A x_i + b\right) \qquad (11)
z_2 \triangleq 1 - \ell_{ij}\left(\tau - \|A x_i - A x_j\|_2^2\right) \qquad (12)

Then, the transformation matrix A can be updated by using the following gradient descent rule until convergence

A = A - \mu \frac{\partial J}{\partial A} \qquad (13)

where μ is the learning rate that controls the updating speed. We set the learning rate to 0.05 in this paper.

D. Co-Saliency Detection

Given a test image group containing multiple related images and a common object category, we first partition each input image into a number of super-pixels by the method of [47] and extract the initial features to represent the property of each super-pixel. Next, we transform the initial features to a new feature space by using the transformation matrix A (learnt from the training image group with the corresponding object category) and predict the co-saliency score of each super-pixel using the following formula

\text{cosal}(x_i) = w^{T} A x_i + b \qquad (14)

where x_i ∈ R^d is the initial feature of the i-th super-pixel represented by a d-dimensional feature vector, and (w, b) are the learnt parameters of the co-salient object detector.

In order to obtain co-saliency maps with satisfactory spatial recovery, following the work of [4], we adopt a graph based manifold ranking model [21] to smooth the co-saliency values of the super-pixels by exploring the spatial relationship of the adjacent super-pixels in each image. To be specific, the graph is established by connecting the super-pixels adjacent to each other as well as the super-pixels at the four image boundaries. Then, we obtain the foreground (salient) super-pixels as those whose co-saliency values are bigger than an adaptive threshold. In our work, the adaptive threshold is set to be the average of the co-saliency values over all super-pixels in one image, following the work of [8]. Finally, the smoothed co-saliency values of the super-pixels in each image are calculated via a ranking function [21] as follows

\left[\text{cosal}_{smoothed}(x_i)\right]_{i=1}^{n} = (D - \alpha W)^{-1} q \qquad (15)

where W = [w_{ij}]_{n×n} and D = diag{d_{11}, ..., d_{nn}} are defined as

w_{ij} = \exp\left(-\frac{\|\text{cosal}(x_i) - \text{cosal}(x_j)\|^2}{\sigma^2}\right) \qquad (16)
d_{ii} = \sum_{j} w_{ij} \qquad (17)

Here, q is a binary vector indicating which super-pixels are foreground in an image, and α and σ are two free parameters which are set to 0.99 and 10, respectively, according to [21]. [cosal_smoothed(x_i)]_{i=1}^{n} indicates the smoothed co-saliency values of the super-pixels in the image, and n is the total number of super-pixels in the image. W and D are the affinity matrix and the degree matrix, respectively.
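The following sketch mirrors Eqs. (14)–(17) for one test image: it scores each super-pixel with the learnt (w, b, A), builds the affinity and degree matrices, and solves the ranking system. The adjacency matrix and the binarization of q follow the description above; the function and argument names are ours, and the final normalization to [0, 1] is an assumption about how the values are written back into a map.

```python
import numpy as np

def co_saliency_map(X, A, w, b, adjacency, alpha=0.99, sigma=10.0):
    """Score and smooth the super-pixels of one test image.
    X: (d, n) initial super-pixel features; adjacency: (n, n) boolean matrix
    connecting spatially adjacent super-pixels (plus the image-boundary ones)."""
    scores = w @ (A @ X) + b                         # Eq. (14), one score per super-pixel
    # Foreground indicator q: super-pixels whose score exceeds the image mean.
    q = (scores > scores.mean()).astype(float)
    # Affinity restricted to the graph edges (Eq. (16)) and degree matrix (Eq. (17)).
    diff = scores[:, None] - scores[None, :]
    W = np.exp(-(diff ** 2) / sigma ** 2) * adjacency
    D = np.diag(W.sum(axis=1))
    smoothed = np.linalg.solve(D - alpha * W, q)     # Eq. (15)
    # Normalize to [0, 1] so the values can be written back as a saliency map.
    smoothed = (smoothed - smoothed.min()) / (smoothed.ptp() + 1e-12)
    return smoothed
```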


IV. EXPERIMENTS

A. Datasets

In the experiments we evaluate the proposed metric learning based co-saliency detection method on two publicly available benchmark datasets: the MSRC dataset [48] and the Cosal2015 dataset [8]. These two datasets have been widely used by the works of [2], [4], [5], and [7]–[10] for the task of co-saliency detection.

The MSRC dataset [48] consists of seven image groups with a total of 240 images that are pixel-wisely labeled for ground truth masks. Each image group contains about 30 images. The complex background of the images makes the MSRC dataset more challenging for co-saliency detection. The Cosal2015 dataset is a new benchmark dataset established by Zhang et al. [8]. In this dataset, 50 image groups containing a total of 2015 images were collected. The image number in each group changes from 26 to 52. To construct the ground truth masks, 20 subjects were asked to view these image groups and provide pixel-level annotations manually. To our best knowledge, the Cosal2015 dataset is the largest and the most challenging dataset so far used for evaluating different co-saliency detection algorithms. Figure 3 and Figure 4 show some example images from the MSRC dataset and the Cosal2015 dataset, respectively.

Fig. 3. Some example images from the MSRC dataset.

Fig. 4. Some example images from the Cosal2015 dataset.

TABLE I. Comparison of AP scores between the baseline method and our proposed co-saliency detection method for each image group in the MSRC dataset. The entries with the best APs for each image group are bold-faced.

TABLE II. Comparison of F-measure between the baseline method and our proposed co-saliency detection method for each image group in the MSRC dataset. The entries with the best F-measure scores for each image group are bold-faced.

B. Evaluation Metrics

To evaluate the performance of the proposed method, we adopt three standard and commonly used measures, namely the precision-recall (PR) curve, the average precision (AP) score, and the F-measure. The PR curve is based on the overlapping area between detections and ground truth masks. It is obtained by thresholding all the pixels in a co-saliency map into binary co-salient object masks with a series of fixed integers from 0 to 255. The resulting true positive rate versus the precision rate at each threshold value forms the PR curve. The AP score is generated by computing the area under the PR curve, so the higher the AP score is, the better the performance, and vice versa. The F-measure is obtained by using a self-adaptive threshold T = μ + ε, as suggested in [49], to segment the co-saliency maps and obtain the precision and recall, where μ and ε denote the mean value and the standard deviation of the co-saliency map, respectively. After obtaining the precision and recall via the adaptive threshold T, the F-measure is computed by

F\text{-measure} = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2\, Precision + Recall} \qquad (18)

where β² was set to 0.3 as suggested in [3] and [21].
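A small sketch of these measures is given below, assuming a co-saliency map scaled to [0, 255] and a boolean ground-truth mask; the paper does not spell out the interpolation used for the area under the PR curve, so the trapezoidal rule here is only one reasonable reading.

```python
import numpy as np

def f_measure(saliency_map, gt_mask, beta2=0.3):
    """F-measure with the self-adaptive threshold T = mean + std (Eq. (18))."""
    T = saliency_map.mean() + saliency_map.std()
    pred = saliency_map >= T
    tp = np.logical_and(pred, gt_mask).sum()
    precision = tp / (pred.sum() + 1e-12)
    recall = tp / (gt_mask.sum() + 1e-12)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12)

def average_precision(saliency_map, gt_mask):
    """Area under the precision-recall curve obtained by thresholding the map
    at the fixed integer levels 0..255."""
    precisions, recalls = [], []
    for t in range(256):
        pred = saliency_map >= t
        tp = np.logical_and(pred, gt_mask).sum()
        precisions.append(tp / (pred.sum() + 1e-12))
        recalls.append(tp / (gt_mask.sum() + 1e-12))
    order = np.argsort(recalls)
    return np.trapz(np.array(precisions)[order], np.array(recalls)[order])
```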

C. Implementation Details

We used an off-the-shelf convolutional neural network (CNN) for initial feature extraction owing to its significant success in the computer vision community [50]–[53]. Specifically, we used all the convolutional layers in the CNN to establish the hyper-column feature representation for each super-pixel. To this end, we first resized each image to a fixed 224×224 pixel size and then extracted the convolutional feature maps with a CNN model pre-trained on the ImageNet dataset. Having the same architecture as the "CNN-S" model proposed in [54], the CNN used in this work contains 13 convolutional layers, five pooling layers, and one fully connected layer, where the five convolutional layers before each pooling layer were regarded as the feature maps to represent the image contents. As the convolution and pooling operations in the CNN result in feature maps with different scales, we up-sampled each feature map to the scale of the original input image. Thus, the obtained feature maps can represent each pixel of the input image. Then, we max-pooled the feature vectors located within each super-pixel region to obtain a set of 1888-dimensional feature vectors to represent each super-pixel. Finally, these 1888-dimensional features were normalized to learn more powerful and discriminative feature representations by using Eq. (4).

In our implementation each image group was randomly split into 50% for feature learning and co-salient object detector training and the remaining 50% for co-saliency detection. In addition, to obtain reliable results, we repeated the experiment five times by randomly selecting the 50%-50% training-test images and report the mean result of the five runs.

TABLE III. Comparison of AP scores between the baseline method and our proposed co-saliency detection method for each image group in the Cosal2015 dataset. The entries with the best APs for each image group are bold-faced.

TABLE IV. Comparison of F-measure scores between the baseline method and our proposed co-saliency detection method for each image group in the Cosal2015 dataset. The entries with the best F-measure scores for each image group are bold-faced.

D. Comparison With Baseline Method

In our work, the linear SVM classifier trained with the original 1888-dimensional hyper-column feature representation serves as a baseline method, which is called "baseline" in this paper for short. For a fair comparison, we (i) used the same training images and test images for both the baseline and our proposed method; and (ii) adopted the same scheme for both the baseline and our method, as described in Section III-D, to perform co-saliency detection. The only difference between the baseline and our proposed method is that the baseline SVM classifier was trained without the metric learning regularization term (the second term in Eq. (1)), i.e., without considering feature learning.

Tables I-II and Tables III-IV report the comparison results of the baseline method and our proposed method for each image group in the MSRC dataset and the Cosal2015 dataset, measured in terms of AP score and F-measure score, respectively. The results show that: (i) The proposed metric learning-based method outperforms the baseline method for all seven image groups from the MSRC dataset and for 48 out of 50 image groups from the Cosal2015 dataset; (ii) On the challenging Cosal2015 dataset, our method improves upon the baseline method significantly (by at least two percentage points) for 39 out of 50 and 40 out of 50 image groups measured in terms of AP score and F-measure, respectively. Especially for the challenging image groups such as "boat" and "snail", we obtained 17.7% and 10.6% AP improvements, and 10.3% and 12.3% F-measure improvements; (iii) From the perspective of overall performance measured in terms of AP score and F-measure score, our proposed method improves the baseline method by 1.4% and 2.4% on the relatively simple MSRC dataset, and boosts the baseline method by 4% and 4.6% on the challenging Cosal2015 dataset, which demonstrates that our method is effective for learning discriminative feature representations.

Fig. 5. Overall performance comparison between our proposed method and eight state-of-the-art comparison methods, including LDW [8], CBCS [5], CSHS [7], SACS [9], ESMG [10], CBCS-S [5], BLSM [26], and LR [27], on the MSRC dataset, measured in terms of (a) PR curves, (b) AP score, and (c) F-measure score, respectively.

Fig. 6. Overall performance comparison between our proposed method and eight state-of-the-art comparison methods, including LDW [8], CBCS [5], CSHS [7], SACS [9], ESMG [10], CBCS-S [5], BLSM [26], and LR [27], on the Cosal2015 dataset, measured in terms of (a) PR curves, (b) AP score, and (c) F-measure score, respectively.

It is worth noting that, for some image groups such as "apple" and "motorbike", the baseline method without feature learning also achieves performance comparable to our method because they have relatively simple backgrounds.

E. Comparison With State-of-the-Art Methods

To comprehensively evaluate our co-saliency detection method, we compared it with eight state-of-the-art methods, including LDW [8], CBCS [5], CSHS [7], SACS [9], ESMG [10], CBCS-S [5], BLSM [26], and LR [27], where the first five are state-of-the-art co-saliency detection methods and the last three are state-of-the-art single-image saliency detection methods.

In brief, LDW [8] is a co-saliency detection method in which wide and deep information is explored for the object proposal windows extracted in each image, and the co-saliency scores are calculated by integrating the intra-image contrast and intra-group consistency via a principled Bayesian formulation. CBCS [5] is a cluster-based method that measures the cluster-level co-saliency using three bottom-up information cues, including the contrast cue, the corresponding cue, and the spatial cue. CSHS [7] is a hierarchical segmentation-based method which integrates the regional similarity on the basis of the fine segmentation and the object prior on the basis of the coarse segmentation, respectively. SACS [9] is a fusion-based method that integrates the result maps of multiple existing (co-)saliency methods with self-adaptive weights obtained via low-rank decomposition. ESMG [10] is a visual saliency guided model in which the queries are first selected based on the saliency maps of an existing method; then an efficient manifold ranking is used to refine the detection results at the group level. CBCS-S [5] is a cluster-based single-image saliency detection method which measures the cluster-level saliency by using two bottom-up information cues, including the contrast cue and the spatial cue. BLSM [26] is a bottom-up single-image saliency model which examines both the low-level cue and the mid-level cue. LR [27] is also a single-image saliency detection framework that integrates low-level features with higher-level guidance (top-down priors) to detect salient regions through low-rank matrix recovery.

Fig. 7. Co-saliency detection examples by using our proposed method and three representative methods on two benchmark datasets. For each image group, the first row is the input image group, the second row is the ground truth masks, and rows 3–6 are the co-saliency detection results obtained using the proposed approach, ESMG [10], SACS [9], and LDW [8], respectively.

Figure 5 and Figure 6 present the overall performance comparison between our proposed method and the eight state-of-the-art comparison methods on the MSRC dataset and the Cosal2015 dataset, measured in terms of PR curves, AP score, and F-measure score, respectively. Here the results of these eight comparison methods are directly obtained from the literature of [8]. From Figure 5 and Figure 6 we can observe that: (i) With the same Recall, the Precision of our method is higher than that of the other eight approaches, which means that the false alarm rate of our method is lower for the same number of true positives; (ii) With the same Precision, the Recall of our method is also higher than that of the other eight approaches, which means that our method can detect more real co-salient objects at the same false alarm rate; (iii) Our proposed method outperforms all other comparison methods and achieves state-of-the-art performance in terms of AP and F-measure scores on both datasets; (iv) Especially on the challenging Cosal2015 dataset, our method improves upon the second best result by about 8% in terms of AP score, which demonstrates the effectiveness and superiority of our method.

Figure 7 shows a number of qualitative co-saliency detection examples by using our proposed method and three typical methods, including ESMG [10], SACS [9], and LDW [8], on the two benchmark datasets. On the Cosal2015 dataset, four image groups including bear, butterfly, guitar, and starfish were selected, and on the MSRC dataset two image groups including cattle and car were selected. As can be seen from Figure 7, despite the large range of variations, the proposed method has successfully detected and located the co-salient objects.

F. Running Time

TABLE V. Average running time per image of nine different methods.

The experiments were run on a workstation with two 2.8 GHz 6-core CPUs and 64 GB memory. Our code was implemented in MATLAB and C without optimization, and the CNN feature extraction was implemented in Python and CUDA with a GTX Titan X GPU for acceleration. The remaining comparison experiments were implemented in MATLAB without any optimization or GPU acceleration. Table V reports the average running time per image of nine different methods, where the running times of CSHS and ESMG are obtained from [10], which ran the experiments on a laptop with an Intel 1.8 GHz CPU and 8 GB RAM. As can be seen from Table V, on the one hand, among the top five methods measured in terms of AP score and F-measure score on both the MSRC dataset and the Cosal2015 dataset, including our method, LDW [8], SACS [9], CSHS [7], and BLSM [26], our method took the least running time. On the other hand, compared with the top two fastest methods, CBCS [5] and CBCS-S [5], although our method took relatively more running time, it provides a much larger performance improvement measured in terms of AP score and F-measure score. This demonstrates that our proposed method is effective and efficient for co-saliency detection.

V. CONCLUSIONS

In this paper, we made the earliest effort to introduce metric learning to co-saliency detection. To this end, we proposed an effective metric learning based framework to simultaneously learn the discriminative feature representation and the co-salient object detector. This was achieved by optimizing a new objective function that explicitly imposes a metric learning regularization constraint on SVM training. By taking advantage of the metric learning regularization, the input data is transformed into a new feature space, in which the pixels from co-salient objects are pulled closer and the different-class pixels (from the foreground class and the background class) are pushed farther away, hence making the input images easy to classify into co-salient objects and background by using the learnt detector. In the experiments, we comprehensively evaluated the proposed method on two commonly-used benchmark datasets. State-of-the-art results were achieved in comparison with the existing co-saliency detection methods. In future work, we will try to apply our co-saliency detection method to weakly supervised learning based remote sensing image analysis [55]–[57].

REFERENCES

[1] H. Li and K. N. Ngan, "A co-saliency model of image pairs," IEEE Trans. Image Process., vol. 20, no. 12, pp. 3365–3375, Dec. 2011.
[2] D. Zhang, J. Han, C. Li, and J. Wang, "Co-saliency detection via looking deep and wide," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 2994–3002.
[3] H. Li, F. Meng, and K. N. Ngan, "Co-salient object detection from multiple images," IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1896–1909, Dec. 2013.
[4] D. Zhang, D. Meng, C. Li, L. Jiang, Q. Zhao, and J. Han, "A self-paced multiple-instance learning framework for co-saliency detection," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 594–602.
[5] H. Fu, X. Cao, and Z. Tu, "Cluster-based co-saliency detection," IEEE Trans. Image Process., vol. 22, no. 10, pp. 3766–3778, Oct. 2013.
[6] D. Zhang, J. Han, J. Han, and L. Shao, "Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1163–1176, Jun. 2016.
[7] Z. Liu, W. Zou, L. Li, L. Shen, and O. Le Meur, "Co-saliency detection based on hierarchical segmentation," IEEE Signal Process. Lett., vol. 21, no. 1, pp. 88–92, Jan. 2014.
[8] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, "Detection of co-salient objects by looking deep and wide," Int. J. Comput. Vis., vol. 120, no. 2, pp. 215–232, Nov. 2016.
[9] X. Cao, Z. Tao, B. Zhang, H. Fu, and W. Feng, "Self-adaptively weighted co-saliency detection via rank constraint," IEEE Trans. Image Process., vol. 23, no. 9, pp. 4175–4186, Sep. 2014.
[10] Y. Li, K. Fu, Z. Liu, and J. Yang, "Efficient saliency-model-guided visual co-saliency detection," IEEE Signal Process. Lett., vol. 22, no. 5, pp. 588–592, May 2014.
[11] X. Yao, J. Han, D. Zhang, and F. Nie, "Revisiting co-saliency detection: A novel approach based on two-stage multi-view spectral rotation co-clustering," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3196–3209, Jul. 2017.
[12] H. Fu, D. Xu, B. Zhang, S. Lin, and R. K. Ward, "Object-based multiple foreground video co-segmentation via multi-state selection graph," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3415–3424, Nov. 2015.
[13] W. Wang, J. Shen, and F. Porikli, "Saliency-aware geodesic video object segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3395–3402.
[14] W. Wang, J. Shen, X. Li, and F. Porikli, "Robust video object cosegmentation," IEEE Trans. Image Process., vol. 24, no. 10, pp. 3137–3148, Oct. 2015.
[15] W. Wang and J. Shen, "Higher-order image co-segmentation," IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1011–1021, Jun. 2016.
[16] X. Dong, J. Shen, L. Shao, and M.-H. Yang, "Interactive cosegmentation using global and local energy optimization," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3966–3977, Nov. 2015.
[17] W. Wang, J. Shen, R. Yang, and F. Porikli, "A unified spatiotemporal prior based on geodesic distance for video object segmentation," IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2017.2662005.
[18] K. Tang, A. Joulin, L.-J. Li, and L. Fei-Fei, "Co-localization in real-world images," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 1464–1471.
[19] A. Joulin, K. Tang, and L. Fei-Fei, "Efficient image and video co-localization with Frank-Wolfe algorithm," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 253–268.
[20] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 10, pp. 1915–1926, Oct. 2012.
[21] C. Yang, L. Zhang, H. Lu, R. Xiang, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3166–3173.
[22] X. Qian, J. Han, G. Cheng, and L. Guo, "Optimal contrast based saliency detection," Pattern Recognit. Lett., vol. 34, no. 11, pp. 1270–1278, Aug. 2013.
[23] J. Han, D. Wang, L. Shao, X. Qian, G. Cheng, and J. Han, "Image visual attention computation and application via the learning of object attributes," Mach. Vis. Appl., vol. 25, no. 7, pp. 1671–1683, Oct. 2013.
[24] J. Han, S. He, X. Qian, D. Wang, L. Guo, and T. Liu, "An object-oriented visual saliency detection framework based on sparse coding representations," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 12, pp. 2009–2021, Dec. 2013.
[25] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, "Background prior-based salient object detection via deep reconstruction residual," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 8, pp. 1309–1321, Aug. 2015.
[26] Y. Xie, H. Lu, and M.-H. Yang, "Bayesian saliency via low and mid level cues," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1689–1698, May 2013.
[27] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 853–860.
[28] J. Hu, J. Lu, and Y.-P. Tan, "Discriminative deep metric learning for face verification in the wild," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1875–1882.
[29] J. Hu, J. Lu, and Y.-P. Tan, "Deep transfer metric learning," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 325–333.
[30] Z. Huang, R. Wang, S. Shan, and X. Chen, "Projection metric learning on Grassmann manifold with application to video based face recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 140–149.
[31] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou, "Multi-manifold deep metric learning for image set classification," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1137–1145.
[32] Q. Qian, R. Jin, S. Zhu, and Y. Lin, "Fine-grained visual categorization via multi-stage metric learning," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3716–3724.
[33] R. G. Cinbis, J. Verbeek, and C. Schmid, "Unsupervised metric learning for face identification in TV video," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1559–1566.
[34] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in Proc. IEEE Int. Conf. Mach. Learn., Jul. 2004, p. 11.
[35] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207–244, Feb. 2009.
[36] P. Zhu, L. Zhang, W. Zuo, and D. Zhang, "From point to set: Extend the learning of distance metrics," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2664–2671.
[37] L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang, "Cross-domain visual matching via generalized similarity measure and feature learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1089–1102, Jun. 2017.
[38] K. Wang, L. Lin, W. Zuo, S. Gu, and L. Zhang, "Dictionary pair classifier driven convolutional neural networks for object detection," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 2138–2146.
[39] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang, "A kernel classification framework for metric learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 9, pp. 1950–1962, Sep. 2015.
[40] P. Luo, L. Lin, and X. Liu, "Learning compositional shape models of multiple distance metrics by information projection," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 7, pp. 1417–1428, Jul. 2016.


[41] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, "A multimedia retrieval framework based on semi-supervised ranking and relevance feedback," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 723–742, Apr. 2012.
[42] G. Ding, Y. Guo, J. Zhou, and Y. Gao, "Large-scale cross-modality search via collective matrix factorization hashing," IEEE Trans. Image Process., vol. 25, no. 11, pp. 5427–5440, Nov. 2016.
[43] C. Ge, K. Fu, F. Liu, L. Bai, and J. Yang, "Co-saliency detection via inter and intra saliency propagation," Signal Process., Image Commun., vol. 44, pp. 69–83, May 2016.
[44] Y. Yang, Z. Ma, A. G. Hauptmann, and N. Sebe, "Feature selection for multimedia analysis by sharing information among multiple tasks," IEEE Trans. Multimedia, vol. 15, no. 3, pp. 661–669, Apr. 2013.
[45] Y. Guo, G. Ding, L. Liu, J. Han, and L. Shao, "Learning to hash with optimized anchor embedding for scalable retrieval," IEEE Trans. Image Process., vol. 26, no. 3, pp. 1344–1354, Mar. 2017.
[46] X. Yao, J. Han, G. Cheng, X. Qian, and L. Guo, "Semantic annotation of high-resolution satellite images via weakly supervised learning," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3660–3671, Jun. 2016.
[47] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[48] J. Winn, A. Criminisi, and T. Minka, "Object categorization by learned universal visual dictionary," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2005, pp. 1800–1807.
[49] Y. Jia and M. Han, "Category-independent object-level saliency detection," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1761–1768.
[50] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[51] P. Zhou, G. Cheng, Z. Liu, S. Bu, and X. Hu, "Weakly supervised target detection in remote sensing images based on transferred deep features and negative bootstrapping," Multidimension. Syst. Signal Process., vol. 27, no. 4, pp. 925–944, Oct. 2016.
[52] G. Cheng, J. Han, and X. Lu, "Remote sensing image scene classification: Benchmark and state of the art," Proc. IEEE, to be published, doi: 10.1109/JPROC.2017.2675998.
[53] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.
[54] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," 2014. [Online]. Available: https://arxiv.org/abs/1405.3531
[55] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4238–4249, Aug. 2015.
[56] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325–3337, Jun. 2015.
[57] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.

Junwei Han received the B.S. and Ph.D. degrees from Northwestern Polytechnical University, Xi'an, China, in 1999 and 2003, respectively. He is currently a Professor with Northwestern Polytechnical University. His research interests include computer vision and multimedia processing.

Gong Cheng received the B.S. degree from Xidian University, Xi'an, China, in 2007, and the M.S. and Ph.D. degrees from Northwestern Polytechnical University, Xi'an, in 2010 and 2013, respectively. He is currently an Associate Professor with Northwestern Polytechnical University. His main research interests are computer vision and pattern recognition.

Zhenpeng Li received the B.S. degree from Northwestern Polytechnical University, Xi'an, China, in 2016, where he is currently pursuing the M.S. degree. His research interests are computer vision and pattern recognition.

Dingwen Zhang received the B.S. degree from Northwestern Polytechnical University, Xi'an, China, in 2012, where he is currently pursuing the Ph.D. degree. His research interests include computer vision and multimedia processing, especially saliency detection, co-saliency detection, and weakly supervised learning.

