IEEE TRANSACTIONS ON CYBERNETICS
arXiv:1810.09091v4 [cs.CV] 12 May 2020

lenging task of recognizing the object regions from unseen categories with only one annotated example as supervision. In this paper, we propose a Similarity Guidance network to tackle the One-shot (SG-One) segmentation problem. We aim at predicting the segmentation mask of a query image with reference to one densely labeled support image of the same category. To obtain a robust representative feature of the support image, we first adopt a masked average pooling strategy that produces the guidance features by taking into account only the pixels belonging to the support image. We then leverage the cosine similarity to build the relationship between the guidance features and the features of pixels from the query image. In this way, the possibilities embedded in the produced similarity maps can be adapted to guide the process of segmenting objects. Furthermore, SG-One is a unified framework that can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, SG-One achieves an mIoU score of 46.3%, surpassing the baseline methods.

Index Terms: Few-shot Learning, Image Segmentation, Neural Networks, Siamese Network

X. Zhang, Y. Wei, Y. Yang are with the ReLER lab, Centre for Artificial Intelligence, University of Technology Sydney, Ultimo, NSW 2007, Australia. (e-mail: Xiaolin.Zhang-3@student.uts.edu.au, Yunchao.Wei@uts.edu.au, Yi.Yang@uts.edu.au)
T. S. Huang is with the Department of Electrical and Computer Engineering and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL, 61901 USA. (e-mail: t-huang1@illinois.edu)

I. INTRODUCTION

OBJECT Semantic Segmentation (OSS) aims at predicting the class label of each pixel. Deep neural networks have achieved tremendous success on OSS tasks, such as U-net [1], FCN [2] and Mask R-CNN [3]. However, these algorithms are trained with full annotations, which require a large investment in expensive labeling. To reduce the budget, a promising alternative is to learn a decent segmentation network from weak annotations. For example, previous works have used image-level labels [4]–[6], scribbles [7]–[9], bounding boxes [10], [11] and points [12]–[14] as cheaper supervision. The main disadvantage of these weakly supervised methods, however, is that the learned models cannot generalize to unseen classes. For instance, if a network is trained to segment dogs using thousands of images containing various breeds of dogs, it will not be able to segment bikes without being retrained on many images containing bikes.

In contrast, humans are very good at recognizing things with little guidance. For instance, it is very easy for a child to recognize various breeds of dogs with reference to only one picture of a dog. Inspired by this, one-shot learning is dedicated to imitating this powerful human ability. In other words, one-shot learning aims to recognize new objects according to only one annotated example. This is a great challenge for the standard learning methodology. Instead of using a tremendous number of annotated instances to learn the characteristic patterns of a specific category, our target is to learn one-shot networks that generalize to unseen classes with only one densely annotated example.

[Figure 1: diagram with inputs sup img, sup mask and que img; two CNNs; Similarity Guidance; Similarity Map; Multiplication; Segmentation.]
Fig. 1. An overview of the proposed SG-One approach for testing a new class. Given a query image of an unseen category, e.g., cow, its semantic object is precisely segmented with reference to only one annotated example of this category.

Concretely, one-shot image segmentation is to discover the object pixels of a query image with reference to only one support image. The target objects in the support image are densely annotated. Existing methods [15], [16] are all based on the Siamese framework [17]. Briefly, a pair of parallel networks is trained to extract the features of labeled support images and query images, separately. These features are then fused to generate the probability maps of the target objects. The purpose of the network is to learn the relationship between the annotated support image and the query image within the high-level feature space. These methods have the advantage that the parameters trained on observed classes can be directly used for testing unseen classes without finetuning. Nevertheless, they have some weaknesses: 1) the parameters of the two parallel networks are redundant, which makes the model prone to overfitting and wastes computational resources; 2) combining the features of support and query images by mere multiplication is inadequate for guiding the query network to learn high-quality segmentation masks.

To overcome the above-mentioned weaknesses, we propose a Similarity Guidance Network for One-Shot Semantic Segmentation (SG-One) in this paper. The fundamental idea of SG-One is to guide the segmentation process by effectively
incorporating the pixel-wise similarities between the features of support objects and query images. In particular, we first extract the high-level feature maps of the input support and query images. High-level features are usually abstract, and the embeddings of pixels belonging to objects of the same category are close. The embeddings of background pixels are usually depressed, and these embeddings are distant from the object embeddings. Therefore, we propose to obtain the representative vectors from support images by applying the masked average pooling operation. The masked average pooling operation can extract the object-related features by excluding the influence of background noise. Then, we obtain the guidance maps by calculating cosine similarities between the representative vectors and the features of query images at each pixel. The feature vectors corresponding to object pixels in query images are close to the representative vectors extracted from support images, hence the scores in the guidance maps will be high. Conversely, the scores in the guidance maps will be low for pixels belonging to the background. The generated guidance maps supply the segmentation process with guidance information about the desired regions. In detail, the position-wise feature vectors of query images are multiplied by the corresponding similarity values. Such a strategy effectively helps to activate the target object regions of query images following the guidance of support images and their masks. Furthermore, we adopt a unified network for producing similarity guidance and predicting segmentation masks of query images. Such a network is more capable of generalizing to unseen classes than the previous methods [15], [16].

Our approach offers multiple appealing advantages over the previous state-of-the-art methods, e.g., OSLSM [15] and co-FCN [16]. First, OSLSM and co-FCN incorporate the segmentation masks of support images by changing the input structure of the network or the statistical distribution of input images. In contrast, we extract the representative vector from the intermediate feature maps with the masked average pooling operation instead of changing the inputs. Our approach harms neither the input structure of the network nor the statistics of the input data. Averaging only the object regions avoids the influence of the background. Otherwise, when the background pixels dominate, the learned features will be biased towards the background contents. Second, OSLSM and co-FCN directly multiply the representative vector with the feature maps of query images for predicting the segmentation masks. SG-One calculates the similarities between the representative vector and the features at each pixel of query images, and the similarity maps are employed to guide the segmentation branch in finding the target object regions. Our method is superior in the process of segmenting the query images. Third, both OSLSM and co-FCN adopt a pair of VGGnet-16 networks for processing support and query images, separately. We employ a unified network to process them simultaneously. The unified network uses far fewer parameters, which reduces the computational burden and increases the ability to generalize to new classes in testing.

The overview of SG-One is illustrated in Figure 1. We apply two branches, i.e., the similarity guidance branch and the segmentation branch, to produce the guidance maps and segmentation masks. We forward both the support and query images through the guidance branch to calculate the similarity maps. The features of query images are also forwarded through the segmentation branch for predicting segmentation masks. The similarity maps act as guidance attention maps in which the target regions have higher scores while the background regions have lower scores. The segmentation process is guided by the similarity maps to obtain the precise target regions. After the training phase, the SG-One network can predict the segmentation masks of a new class without changing the parameters. For example, a query image of an unseen class, e.g., cow, is processed to discover the pixels belonging to the cow with only one annotated support image provided.

To sum up, our main contributions are three-fold:
• We propose to produce robust object-related representative vectors using masked average pooling for incorporating contextual information without changing the input structure of networks.
• We produce the pixel-wise guidance using cosine similarities between representative vectors and query features for predicting the segmentation masks.
• We propose a unified network for processing support and query images. Our network achieves a cross-validated mIoU of 46.3% on the PASCAL-$5^i$ dataset in the one-shot segmentation setting, surpassing the baseline methods.

II. RELATED WORK

Object semantic segmentation (OSS) aims at classifying every pixel in a given image to distinguish different objects or contents. OSS with dense annotations as supervision has achieved great success in precisely identifying various kinds of objects [18]–[25]. Recently, most of the works with impressive performance are based on deep convolutional networks. FCN [2] and U-Net [1] abandon fully connected layers and use only convolutional layers in order to preserve the relative positions of pixels. Building on the advantages of FCN, DeepLab, proposed by Chen et al. [26], [27], is one of the best algorithms for segmentation. It employs dilated convolutions to increase the receptive field while saving parameters in comparison with large convolutional kernels. He et al. [3] propose that segmentation masks and detection bounding boxes can be predicted simultaneously using a unified network.

Weakly supervised object segmentation seeks an alternative approach to segmentation that reduces the expense of labeling segmentation masks [14], [28]–[32]. Zhou [33] and Zhang [34], [35] propose to discover precise object regions using a classification network with only image-level labels. Wei [4], [5] apply a two-stage mechanism to predict segmentation masks. Concretely, confident regions of objects and background are first extracted via the methods of Zhou et al. or Zhang et al. Then, a segmentation network, such as DeepLab, can be trained to segment the target regions. An alternative weakly supervised segmentation approach is to use scribble lines to indicate the rough positions of objects and background. Lin et al. [7] and Tang et al. [8] adopted spectral clustering methods to distinguish the
[Figure 2: network diagram. Inputs: sup img, sup mask, que img, que mask. Blocks: Stem, Conv, Conv 3x3, Conv 1x1, Masked Average Pooling, CosineSimilarity, Interp, cross-entropy. Legend: Concatenate, Multiplication, Guidance Branch, Segmentation Branch.]
Fig. 2. The network of the proposed SG-One approach. A query image and a labeled support image are fed into the network. Guidance Branch extracts the representative vector of the target object in the support image. Segmentation Branch predicts the segmentation masks of the query image. We calculate the cosine similarity between the vector and the intermediate features of the query image. The CosineSimilarity maps are then employed to guide the segmentation process. The blue arrows indicate data streams of support images, while the black arrows are for query images. Stem is the conv1 to conv3 of VGG16. Interp refers to the bilinear interpolation operation. Conv is a convolutional block. Conv k × k is a convolutional filter with a kernel size of k × k.
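The data flow summarized in the caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the (c, h, w) array layout, and the small epsilon guards against division by zero are assumptions made here, and the convolutional blocks and bilinear interpolation of the real network are omitted.

```python
import numpy as np


def masked_average_pooling(feat, mask):
    """Average each channel of `feat` over the pixels where `mask` is 1.

    feat: (c, h, w) support features, assumed already resized to the mask size.
    mask: (h, w) binary support mask.
    Returns the representative vector of shape (c,).
    """
    mask = mask.astype(feat.dtype)
    area = mask.sum()
    # Guard against an empty mask (an assumption; not discussed in the text).
    return (feat * mask[None]).sum(axis=(1, 2)) / max(area, 1e-8)


def cosine_similarity_map(vec, feat_que, eps=1e-8):
    """Cosine similarity between `vec` (c,) and every pixel of `feat_que` (c, h, w)."""
    dot = np.tensordot(vec, feat_que, axes=([0], [0]))  # shape (h, w)
    norms = np.linalg.norm(vec) * np.linalg.norm(feat_que, axis=0)
    return dot / np.maximum(norms, eps)


def guide(seg_feat, sim_map):
    """Multiply segmentation-branch features (c2, h, w) by the similarity map (h, w)."""
    return seg_feat * sim_map[None]
```

Query pixels whose features point in the same direction as the representative vector receive a similarity close to 1 and are kept by `guide`, while dissimilar pixels are scaled towards zero or below.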
object pixels according to the similarity of adjacent pixels and ground-truth scribble lines.

Video Object Segmentation is also a challenging problem: segmenting a specific object in a video clip given merely the annotation of the first frame [36]. The categories of the training and testing sets are disjoint, which makes the task similar to our one-shot image segmentation. OSVOS [37] adopts a direct approach of learning a segmentation network on the training set, and then finetunes the trained network on the augmented first frames of the testing sets. Although OSVOS achieves good segmentation performance, its main drawback is the latency of finetuning when testing a new video clip. PLM [38] applies a more sophisticated network to learn better feature embeddings by involving intermediate feature maps of both search and query frames. A simple crop method is also applied by estimating the approximate location of target objects according to the relationship between successive frames. SegFlow [39] leverages the optical flow of moving objects to assist the segmentation process. Flownet [40] is embedded in the framework of SegFlow and updated in an end-to-end way. Consequently, the segmentation network and the flow network can benefit from each other to learn better segmentation masks as well as optical flows. VideoMatch [41] learns the representations of both foreground and background regions, because successive video clips usually maintain similar or static background environments. Therefore, the learned robust representations can be easily applied to retrieve the target object regions of query images.

Few-shot learning algorithms are dedicated to distinguishing the patterns of classes or objects with only a few labeled samples [42], [43]. Networks should generalize to recognize new objects with few images based on the parameters of base models. The base models are trained using entirely different classes without any overlap with the testing classes. Finn et al. [44] try to learn some internal transferable representations, and these representations are broadly applicable to various tasks. Vinyals et al. [45] and Annadani et al. [46] propose to learn embedding vectors. The vectors of the same category are close, while the vectors of different categories are apart.

III. METHODOLOGY

A. Problem Definition

Suppose we have three datasets: a training set $L_{train} = \{(I_i, Y_i)\}_{i=1}^{N_{train}}$, a support set $L_{support} = \{(I_i, Y_i)\}_{i=1}^{N_{support}}$ and a testing set $L_{test} = \{I_i\}_{i=1}^{N_{test}}$, where $I_i$ is an image, $Y_i$ is the corresponding segmentation mask and $N$ is the number of images in each set. Both the support set and the training set have annotated segmentation masks. The support set and the testing set share the same types of objects, which are disjoint from those of the training set. We denote $l \in Y$ as the semantic class of the mask $Y$. Therefore, we have $\{l_{train}\} \cap \{l_{support}\} = \emptyset$, where $\cap$ denotes the intersection of the two sets. If there are $K$ annotated images for each class, the target few-shot problem is named K-shot. Our purpose is to train a network on the training set $L_{train}$ which can precisely predict segmentation masks $\hat{Y}_{test}$ on the testing set $L_{test}$ according to the reference of the support set $L_{support}$. Specifically, the predicted masks contain two classes, i.e., object and background. If the objects in query images share the same category label with the annotated objects in support images, the corresponding values in the predicted masks are supposed to be 1, indicating object pixels. Otherwise, the corresponding values should be 0, indicating background pixels.

In order to better learn the connection between the support and testing sets, we mimic this mechanism in the training process. For a query image $I_j$, we construct a pair $\{(I_i, Y_i), (I_j, Y_j)\}$ by randomly selecting a support image $I_i$ whose mask $Y_i$ has the same semantic class as $Y_j$. We estimate the segmentation mask $\hat{Y}_j$ with a
function $\hat{Y}_j = f_\theta((I_i, Y_i), I_j)$, where $\theta$ denotes the parameters of the function. In testing, $(I_i, Y_i)$ is picked from the support set $L_{support}$ and $I_j$ is an image from the testing set $L_{test}$.

B. Proposed Model

In this section, we first present the masked average pooling operation for extracting the object-related representative vector of annotated support images. Then, the similarity guidance method is introduced for combining the representative vectors and the features of query images. The generated similarity guidance maps supply the information for precisely predicting the segmentation masks.

Masked Average Pooling The pairs of support images and their masks are usually encoded into representative vectors. OSLSM [15] proposes to erase the background pixels from the support images by multiplying the support images with the binary masks. co-FCN [16] proposes to construct an input block of five channels by concatenating the support images with their positive and negative masks. However, these two methods have two disadvantages. First, setting the background pixels to zero changes the statistical distribution of the support image set. If we apply a unified network to process both the query images and the erased support images, the variance of the input data will greatly increase. Second, concatenating the support images with their masks [16] breaks the input structure of the network, which also prevents the implementation of a unified network.

We propose to employ masked average pooling for extracting the representative vectors of support objects. Suppose we have a support RGB image $I \in R^{3 \times w \times h}$ and its binary segmentation mask $Y \in \{0, 1\}^{w \times h}$, where $w$ and $h$ are the width and height of the image. Suppose the output feature maps of $I$ are $F' \in R^{c \times w' \times h'}$, where $c$ is the number of channels, and $w'$ and $h'$ are the width and height of the feature maps. We first resize the feature maps to the same size as the mask $Y$ via bilinear interpolation. We denote the resized feature maps as $F \in R^{c \times w \times h}$. Then, the $i$th element $v_i$ of the representative vector $v$ is computed by averaging the pixels within the object regions on the $i$th feature map:

$$v_i = \frac{\sum_{x=1,y=1}^{w,h} Y_{x,y} * F_{i,x,y}}{\sum_{x=1,y=1}^{w,h} Y_{x,y}}, \qquad (1)$$

As discussed in FCN [2], fully convolutional networks are able to preserve the relative positions of input pixels. Therefore, through masked average pooling, we expect to extract the features of object regions while disregarding the background contents. Also, we argue that the input of contextual regions in our method helps to learn better object features. This point has been discussed in DeepLab [26], which incorporates contextual information using dilated convolutions. Masked average pooling keeps the input structure of the network unchanged, which enables us to process both the support and query images within a unified network.

Similarity Guidance One-shot semantic segmentation aims to segment the target object within query images given a support image of the reference object. As we have discussed, the masked average pooling method is employed to extract the representative vector $v = (v_1, v_2, ..., v_c)$ of the reference object, where $c$ is the number of channels. Suppose the feature maps of a query image $I^{que}$ are $F^{que} \in R^{c \times w' \times h'}$. We employ the cosine similarity to measure the closeness between the representative vector $v$ and each pixel within $F^{que}$ following Eq. (2)

$$s_{x,y} = \frac{v * F^{que}_{x,y}}{\|v\|_2 * \|F^{que}_{x,y}\|_2}, \qquad (2)$$

where $s_{x,y} \in [-1, 1]$ is the similarity value at the pixel $(x, y)$, and $F^{que}_{x,y} \in R^{c \times 1}$ is the feature vector of the query image at the pixel $(x, y)$. As a result, the similarity map $S$ integrates the features of the support object and the query image. We use the map $S = \{s_{x,y}\}$ as guidance to teach the segmentation branch to discover the desired object regions. We do not explicitly optimize the cosine similarity. Instead, we multiply the feature maps of query images from the segmentation branch element-wise by the similarity guidance map. Then, we optimize the guided feature maps to fit the corresponding ground-truth masks.

C. Similarity Guidance Method

Figure 2 depicts the structure of our proposed model. SG-One includes three components, i.e., Stem, Similarity Guidance Branch, and Segmentation Branch. Different components have different structures and functionalities. Stem is a fully convolutional network for extracting intermediate features of both support and query images.

Similarity Guidance Branch is fed the extracted features of both query and support images. We apply this branch to produce the similarity guidance maps by combining the features of reference objects with the features of query images. For the features of support images, we implement three convolutional blocks to extract highly abstract and semantic features, followed by a masked average pooling layer to obtain representative vectors. The extracted representative vectors of support images are expected to contain the high-level semantic features of a specific object. For the features of query images, we reuse the three blocks and employ the cosine similarity layer to calculate the closeness between the representative vector and the features at each pixel of the query images.

Segmentation Branch discovers the target object regions of query images with the guidance of the generated similarity maps. We employ three convolutional layers with a kernel size of 3 × 3 to obtain the features for segmentation. The inputs of the last two convolutional layers are concatenated with the parallel feature maps from Similarity Guidance Branch. Through the concatenation, Segmentation Branch can borrow features from the parallel branch, and the two branches can exchange information during the forward and backward stages. We fuse the generated features with the similarity guidance maps by multiplication at each pixel. Finally, the fused features are processed by two convolutional layers with kernel sizes of 3 × 3 and 1 × 1, followed by a bilinear interpolation layer. The network finally classifies each pixel to be the same class with support images or to
be background. We employ the cross-entropy loss function to optimize the network in an end-to-end manner.

One-Shot Testing One annotated support image for each unseen category is provided as guidance to segment the target semantic objects of query images. We do not need to finetune or change any parameters of the network. We only need to forward the query and support images through the network to generate the expected segmentation masks.

K-Shot Testing Suppose there are $K$ ($K > 1$) support images $I^i_{sup}$, $i = \{1, 2, ..., K\}$, for each new category. We propose to segment the query image $I_{que}$ using two approaches. The first one is to ensemble the segmentation masks corresponding to the $K$ support images, following OSLSM [15] and co-FCN [16], based on Eq. (3)

$$\hat{Y}_{x,y} = \max(\hat{Y}^1_{x,y}, \hat{Y}^2_{x,y}, ..., \hat{Y}^K_{x,y}), \qquad (3)$$

where $\hat{Y}^i_{x,y}$, $i = \{1, 2, ..., K\}$, is the predicted semantic label of the pixel at $(x, y)$ corresponding to the support image $I^i_{sup}$. The other approach is to average the $K$ representative vectors, and then use the averaged vector to guide the segmentation process. Notably, we do not need to retrain the network using K-shot support images. We use the network trained in a one-shot manner to test the segmentation performance using K-shot support images.

IV. EXPERIMENTS

A. Dataset and Metric

Following the evaluation method of the previous methods OSLSM [15] and co-FCN [16], we create the PASCAL-$5^i$ dataset using the PASCAL VOC 2012 dataset [47] and the extended SDS dataset [48]. For the 20 object categories in PASCAL VOC, we use cross-validation to evaluate the proposed model by sampling five classes as test categories $L_{test} = \{5i+1, ..., 5i+5\}$ as in Table I, where $i$ is the fold number, while the remaining 15 classes form the training label-set $L_{train}$. We follow the same procedure as the baseline methods, e.g., OSLSM [15], to build the training and testing sets. In particular, we randomly sample image pairs from the training set. Each image pair has one common category label. One image is fed into the network as a support image accompanied by its annotated mask. The other image is treated as a query image and its mask is used to calculate the loss. For a fair comparison, we use the same test set as OSLSM [15], which has 1,000 support-query tuples for each fold.

TABLE I
TESTING CLASSES FOR 4-FOLD CROSS-VALIDATION TEST. THE TRAINING CLASSES OF PASCAL-$5^i$, $i = \{0,1,2,3\}$, ARE DISJOINT WITH THE TESTING CLASSES.

Dataset        Test classes
PASCAL-$5^0$   aeroplane, bicycle, bird, boat, bottle
PASCAL-$5^1$   bus, car, cat, chair, cow
PASCAL-$5^2$   diningtable, dog, horse, motorbike, person
PASCAL-$5^3$   potted plant, sheep, sofa, train, tv/monitor

Suppose the predicted segmentation mask is $\{\hat{M}\}_{i=1}^{N_{test}}$ and the corresponding ground-truth annotation is $\{M\}_{i=1}^{N_{test}}$, given a specific class $l$. We define the Intersection over Union ($IoU_l$) of class $l$ as $\frac{TP_l}{TP_l + FP_l + FN_l}$, where $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives of the predicted masks. The mIoU is the average of the IoUs over different classes, i.e., $(1/n_l) \sum_l IoU_l$, where $n_l$ is the number of testing classes. We report the averaged mIoU on the four cross-validation datasets.

B. Implementation details

We implement the proposed approach based on the VGG-16 network, following the previous works [15], [16]. Stem takes RGB images as input to extract middle-level features, and downsamples the images by a factor of 8. We use the first three blocks of the VGG-16 network as Stem. For the first two convolutional blocks of Similarity Guidance Branch, we adopt the structure of conv4 and conv5 of VGGnet-16 and remove the max-pooling layers to maintain the resolution of the feature maps. One conv3×3 layer of 512 channels is added on top, without a ReLU after this layer. The following module is masked average pooling, which extracts the representative vector of support images. In Segmentation Branch, all of the convolutional layers with 3 × 3 kernel size have 128 channels. The last layer, with a conv1×1 kernel, has two channels corresponding to the categories of object and background. All of the convolutional layers except for the third and the last one are followed by a ReLU layer. We will justify this choice in Section IV-E.

Following the baseline methods [15], [16], we use weights pre-trained on ILSVRC [49]. All input images keep their original size without any data augmentation. The support and query images are fed into the network simultaneously. The difference between them is that the support image only goes through the guidance branch to obtain the representative vector, whereas the query image goes through both the guidance branch, to calculate the guidance maps, and the segmentation branch, to predict the segmentation masks. We implement the network using PyTorch [50]. We train the network with a learning rate of 1e-5. The batch size is 1, and the weight decay is 0.0005. We adopt the SGD optimizer with a momentum of 0.9. All networks are trained and tested on NVIDIA TITAN X GPUs with 12 GB memory. Our source code is available at https://github.com/xiaomengyc/SG-One.

C. Comparison

One-shot Table II compares the proposed SG-One approach with baseline methods in one-shot semantic segmentation. We observe that our method outperforms all baseline models. The mIoU of our approach over the four divisions reaches 46.3%, which is significantly better than co-FCN by 5.2% and OSLSM by 5.5%. Compared to the baselines, SG-One earns the largest gain of 7.8% on PASCAL-$5^1$, where the testing classes are bus, car, cat, chair and cow. co-FCN [16] constructs the input of the support network by concatenating the support images with their positive and negative masks, and it obtains 41.1%. OSLSM [15] proposes to feed only the object pixels as input by masking out the background regions, and this method obtains 40.8%. OSVOS [37] adopts a strategy of finetuning the network using the support samples in testing, and it achieves only 32.6%.
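Since the comparison refers to individual folds such as PASCAL-$5^1$, it may help to see how a fold maps to class names. The sketch below assumes that the test classes of fold i are the five consecutive classes listed in Table I, i.e., 1-indexed labels 5i+1 to 5i+5, and that the remaining 15 classes are used for training; the names `VOC_CLASSES` and `fold_split` are hypothetical and not taken from the authors' code.

```python
# The 20 PASCAL VOC categories in the order used by Table I.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "potted plant", "sheep", "sofa", "train", "tv/monitor",
]


def fold_split(i, classes_per_fold=5):
    """Return (train_classes, test_classes) for fold i in {0, 1, 2, 3}."""
    start = classes_per_fold * i          # 0-indexed position of label 5i+1
    stop = start + classes_per_fold
    test_classes = VOC_CLASSES[start:stop]
    train_classes = VOC_CLASSES[:start] + VOC_CLASSES[stop:]
    return train_classes, test_classes


if __name__ == "__main__":
    for fold in range(4):
        train, test = fold_split(fold)
        print(f"PASCAL-5^{fold}: test={test} ({len(train)} training classes)")
```

With this split, every class is a test class in exactly one fold, which is what makes averaging the mIoU over the four folds meaningful.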
TABLE II
MEAN IOU RESULTS OF ONE-SHOT SEGMENTATION ON THE PASCAL-$5^i$ DATASET. THE BEST RESULTS ARE IN BOLD.
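The mIoU values reported here follow the per-class definition from Section IV-A: true positives, false positives and false negatives are accumulated over all test pairs of a class, and the per-class IoUs are then averaged. The following is a minimal sketch of that computation under assumed conventions: binary NumPy masks, a dictionary keyed by class name, and an IoU of 0 for a class with no positive pixels at all, none of which are specified in the paper.

```python
import numpy as np


def class_iou(preds, gts):
    """IoU_l = TP / (TP + FP + FN), accumulated over all mask pairs of one class."""
    tp = fp = fn = 0
    for pred, gt in zip(preds, gts):
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp += int(np.sum(pred & gt))
        fp += int(np.sum(pred & ~gt))
        fn += int(np.sum(~pred & gt))
    denom = tp + fp + fn
    # Assumption: a class with no predicted or true object pixels scores 0.
    return tp / denom if denom else 0.0


def mean_iou(per_class):
    """per_class maps a class name to a (preds, gts) pair of mask lists."""
    ious = [class_iou(preds, gts) for preds, gts in per_class.values()]
    return sum(ious) / len(ious)
```

Accumulating the counts before dividing means that large objects weigh more than small ones within a class, whereas every class contributes equally to the final mean.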
TABLE III
MEAN IOU RESULTS OF FIVE-SHOT SEGMENTATION ON THE PASCAL-$5^i$ DATASET. THE BEST RESULTS ARE IN BOLD.
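The two K-shot strategies behind these five-shot results can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the predicted masks are taken to be binary NumPy arrays of equal shape, so the pixel-wise maximum of Eq. (3) behaves like a logical OR, and the function names are hypothetical.

```python
import numpy as np


def fuse_masks_by_max(masks):
    """Eq. (3): pixel-wise maximum over the K masks predicted from K support images."""
    return np.max(np.stack(masks, axis=0), axis=0)


def average_representative_vectors(vectors):
    """Alternative strategy: average the K representative vectors before guidance."""
    return np.mean(np.stack(vectors, axis=0), axis=0)
```

The first strategy runs the network K times and merges the outputs, while the second merges the K support vectors first and runs the segmentation branch only once.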
Fig. 3. Segmentation results on unseen classes with the guidance of support images. For the failure pairs, the ground-truth is on the left side while the
predicted is on the right side.
TABLE IV
[Figure 5 panels (support v.s. distracting class): horse v.s. person, table v.s. person, person v.s. table, car v.s. bus, bottle v.s. cat, cat v.s. sofa, plant v.s. cat, bottle v.s. table. Rows: sup, pred, similarity.]
Fig. 5. Similarity maps of different categories. With the reference to the support objects, the objects in the query images of the same categories will be
highlighted, while the distracting objects and the background are depressed. The predicted mask can precisely segment the target objects under the guidance
of the similarity maps.
future research. We ascribe the failures to two causes: 1) the target object regions are too similar to background pixels, e.g., the side of the bus and the car; 2) the target regions share few features with the discovered discriminative regions, e.g., the vest of the dog, and may therefore be far from the representative feature of the support objects.

Figure 5 illustrates the cosine similarity maps between the support objects and the query images. We try to segment the objects of the query images in the second row corresponding to the annotated objects of the support images in the first row. Note that there exist distracting classes in the given query images. We only expect to segment the objects whose categories are consistent with the support images. With reference to the extracted features of support objects, the corresponding regions in the query images are highlighted while the distracting regions and the background are depressed. The masks can be precisely predicted with the guidance of the similarity maps.

Five-shot Table III illustrates the five-shot segmentation results on the four divisions. As we have discussed, we apply two approaches to five-shot semantic segmentation. The approach of averaging the representative vectors from the five support images achieves 47.1%, which significantly outperforms the current state-of-the-art, co-FCN, by 5.7%. This result is also better than the corresponding one-shot mIoU of 46.3%. Therefore, the averaged support vector is more expressive in guiding the segmentation process. The other approach is to fuse only the final segmentation results by combining all of the detected object pixels. We do not observe any improvement from this approach compared to the one-shot result. Notably, we do not specifically train a new network for five-shot segmentation. The network trained in a one-shot way is directly applied to predict the five-shot segmentation results. Figure 4 compares the predicted segmentation masks of one-shot and five-shot. The segmen- [...] one-shot prediction. We have also observed that five-shot testing improves the mIoU by only 0.8, which is a marginal gain. We think the reason for this phenomenon is that the high-level features of different objects sharing the same class label are very close. Hence, averaging these features from different objects improves the feature expressiveness only slightly, which limits the five-shot gain. On the other hand, the target of our similarity learning is exactly to produce aligned features for each category. So, the five-shot results can improve only a little under the current one-shot segmentation settings.

For a fairer comparison, we also evaluate the proposed model with the metric used in co-FCN [16] and PL+SEG [51]. This metric first calculates the IoU of the foreground and of the background, and then takes the mean of the two. We still report the averaged mIoU on the four cross-validation datasets. Table IV compares SG-One with the baseline methods regarding this metric in terms of one-shot and five-shot semantic segmentation. The proposed approach outperforms all previous baselines. SG-One achieves 63.1% for one-shot segmentation and 65.9% for five-shot segmentation, while the most competitive baseline method, PL+SEG, obtains only 61.2% and 62.3%. The proposed network is trained end-to-end, and our results do not require any pre-processing or post-processing steps.

D. Multi-Class Segmentation

We conduct experiments to verify the ability of SG-One to segment images with multiple classes. We randomly select 1000 entries of query and support images. Query images may contain objects of multiple classes. For each entry, we sample five annotated images from the five testing classes as support images. For every query image, we predict its segmentation masks with the images of different support classes. We fuse the segmentation masks of the five classes by comparing the classification scores. The mIoU on the four
tation masks of five-shot are slightly better than that from datasets is 29.4%. We conduct the same experiment to test
the co-FCN algorithm [16], and co-FCN only obtains the
1 The details of baseline methods e.g.1-NN, LogReg and Siamese, refer
mIoU of 11.5%. Therefore, we can tell that SG-One is much
to OSLSM [15]. The results for co-FCN [16] are for our re-implemented
version. Table IV reports the evaluation results regarding the same metric more robust in dealing with multi-cluass images, although it
adopted in [16] for a fairer comparison. is trained with images of single class.
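The two five-shot strategies discussed above (averaging the support vectors versus fusing the final masks) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the short lists stand in for the guidance vectors that the network would produce, and the mask fusion is assumed to be a pixel-wise OR of the detected object pixels.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Mask = Sequence[Sequence[int]]  # binary mask: 1 = object, 0 = background


def average_vectors(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of the K support guidance vectors."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between a guidance vector and one query pixel feature."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector is treated as having no similarity.
    return dot / norm if norm else 0.0


def union_masks(masks: Sequence[Mask]) -> List[List[int]]:
    """Pixel-wise OR of K masks predicted by K independent one-shot runs."""
    return [[int(any(pixels)) for pixels in zip(*rows)] for rows in zip(*masks)]


# Strategy 1: average the support vectors, then compare with a query pixel.
guidance = average_vectors([[1.0, 0.0], [0.0, 1.0]])
similarity = cosine_similarity(guidance, [1.0, 1.0])

# Strategy 2: keep every pixel that any of the one-shot predictions detected.
fused_mask = union_masks([[[1, 0], [0, 0]],
                          [[0, 0], [0, 1]]])
```

Averaging acts before the similarity map is computed, so it changes the guidance itself, whereas the union only merges masks that were each produced from a single support image.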
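The foreground/background metric used for Table IV can be sketched as follows. This is a minimal illustration for a single image with hypothetical helper names; the text does not state whether intersections and unions are accumulated per image or over the whole test set, so that choice is left open here.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask: 1 = foreground, 0 = background


def binary_iou(pred: Mask, gt: Mask, label: int) -> float:
    """IoU of the pixels equal to `label` in the prediction and the ground truth."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            p_hit = p == label
            g_hit = g == label
            inter += int(p_hit and g_hit)
            union += int(p_hit or g_hit)
    # Assumption: if the label appears in neither mask, count it as a perfect match.
    return inter / union if union else 1.0


def fg_bg_mean_iou(pred: Mask, gt: Mask) -> float:
    """Mean of the foreground IoU and the background IoU."""
    return 0.5 * (binary_iou(pred, gt, 1) + binary_iou(pred, gt, 0))


pred = [[1, 1, 0, 0],
        [1, 0, 0, 0]]
gt = [[1, 1, 1, 0],
      [0, 0, 0, 0]]
score = fg_bg_mean_iou(pred, gt)  # foreground IoU 2/4, background IoU 4/6
```

Because the background usually covers most of an image and is easy to predict, this metric tends to give higher numbers than the per-class foreground mIoU, which is consistent with the gap between the 46.3% and 63.1% one-shot figures reported for the two metrics.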
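The multi-class fusion step ("comparing the classification scores") is not specified in detail. One plausible reading, sketched below under stated assumptions, is that each support class yields a per-pixel foreground score for the query image and each pixel is assigned to the class with the highest score. The 0.5 background threshold, the function name and the use of label 0 for background are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

ScoreMap = Sequence[Sequence[float]]  # per-pixel foreground score for one class
BACKGROUND = 0  # assumption: class ids are positive, 0 marks background


def fuse_by_score(class_scores: Dict[int, ScoreMap],
                  threshold: float = 0.5) -> List[List[int]]:
    """Label each pixel with the best-scoring class, or background if none is confident."""
    class_ids = sorted(class_scores)
    first = class_scores[class_ids[0]]
    height, width = len(first), len(first[0])
    fused = [[BACKGROUND] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_id = max(class_ids, key=lambda c: class_scores[c][y][x])
            # Hypothetical rule: keep the winner only if its score beats the threshold.
            if class_scores[best_id][y][x] > threshold:
                fused[y][x] = best_id
    return fused


# Two support classes scored on a 1x3 query image.
scores = {
    1: [[0.9, 0.2, 0.1]],
    2: [[0.6, 0.7, 0.3]],
}
labels = fuse_by_score(scores)
```

In this toy case the first pixel goes to class 1, the second to class 2, and the third stays background because neither class is confident there.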
classes do not align, which prevents us from obtaining better features for input images. Second, the predicted masks sometimes can only cover part of the target regions and may include some background noises if the target object is too similar to the background.

ACKNOWLEDGMENT
This work is supported by ARC DECRA DE190101315 and ARC DP200100938.

REFERENCES
[1] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
[2] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[3] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV. IEEE, 2017, pp. 2980–2988.
[4] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang, “Revisiting dilated convolution: A simple approach for weakly-and semi-supervised semantic segmentation,” in CVPR, 2018, pp. 7268–7277.
[5] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classification to semantic segmentation approach,” in CVPR, 2017.
[6] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan, “Stc: A simple to complex framework for weakly-supervised semantic segmentation,” TPAMI, 2016.
[7] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” CVPR, 2016.
[8] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers, “Normalized cut loss for weakly-supervised cnn segmentation,” in CVPR, 2018.
[9] B. Wang, G. Qi, S. Tang, T. Zhang, Y. Wei, L. Li, and Y. Zhang, “Boundary perception guidance: a scribble-supervised semantic segmentation approach,” in IJCAI, 2019, pp. 3663–3669.
[10] A. Khoreva, R. Benenson, J. H. Hosang, M. Hein, and B. Schiele, “Simple does it: Weakly supervised instance and semantic segmentation.” in CVPR, vol. 1, no. 2, 2017, p. 3.
[11] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in ICCV, 2015, pp. 1635–1643.
[12] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, “Whats the point: Semantic segmentation with point supervision,” in ECCV, 2016, pp. 549–565.
[13] Y. Tang, W. Zou, Z. Jin, Y. Chen, Y. Hua, and X. Li, “Weakly supervised salient object detection with spatiotemporal cascade neural networks,” TCSVT, 2018.
[14] R. Qian, Y. Wei, H. Shi, J. Li, J. Liu, and T. Huang, “Weakly supervised scene parsing with point-based distance metric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8843–8850.
[15] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots, “One-shot learning for semantic segmentation,” BMVC, 2017.
[16] K. Rakelly, E. Shelhamer, T. Darrell, A. Efros, and S. Levine, “Conditional networks for few-shot semantic segmentation,” ICLR workshop, 2018.
[17] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2, 2015.
[18] F. Husain, B. Dellen, and C. Torras, “Consistent depth video segmentation using adaptive surface models,” TCYB, vol. 45, no. 2, pp. 266–278, 2014.
[19] H. Yang, S. Wu, C. Deng, and W. Lin, “Scale and orientation invariant text segmentation for born-digital compound images,” TCYB, vol. 45, no. 3, pp. 519–533, 2014.
[20] J. Peng, J. Shen, and X. Li, “High-order energies for stereo segmentation,” TCYB, vol. 46, no. 7, pp. 1616–1627, 2015.
[21] D. Nie, L. Wang, E. Adeli, C. Lao, W. Lin, and D. Shen, “3-d fully convolutional networks for multimodal isointense infant brain image segmentation,” TCYB, no. 99, pp. 1–14, 2018.
[22] B. Wang, X. Yuan, X. Gao, X. Li, and D. Tao, “A hybrid level set with semantic shape constraint for object segmentation,” TCYB, no. 99, pp. 1–2, 2018.
[23] X. Ben, C. Gong, P. Zhang, R. Yan, Q. Wu, and W. Meng, “Coupled bilinear discriminant projection for cross-view gait recognition,” TCSVT, 2019.
[24] B. Cheng, L.-C. Chen, Y. Wei, Y. Zhu, Z. Huang, J. Xiong, T. Huang, W.-M. Hwu, and H. Shi, “Spgnet: Semantic prediction guidance for scene parsing,” in ICCV, 2019.
[25] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in ICCV, 2019.
[26] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” preprint arXiv:1412.7062, 2014.
[27] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” arXiv preprint arXiv:1802.02611, 2018.
[28] S. Wang, G. Peng, S. Chen, and Q. Ji, “Weakly supervised facial action unit recognition with domain knowledge,” TCYB, vol. 48, no. 11, pp. 3265–3276, 2018.
[29] L. Wang, D. Meng, X. Hu, J. Lu, and J. Zhao, “Instance annotation via optimal bow for weakly supervised object localization,” TCYB, vol. 47, no. 5, pp. 1313–1324, 2017.
[30] M. Liu, J. Zhang, C. Lian, and D. Shen, “Weakly supervised deep learning for brain disease prognosis using mri and incomplete clinical scores,” TCYB, 2019.
[31] Q. Hou, P. Jiang, Y. Wei, and M.-M. Cheng, “Self-erasing network for integral object attention,” in NIPS, 2018, pp. 549–559.
[32] P.-T. Jiang, Q. Hou, Y. Cao, M.-M. Cheng, Y. Wei, and H.-K. Xiong, “Integral object mining via online attention accumulation,” in ICCV, 2019, pp. 2070–2079.
[33] B. Zhou, A. Khosla, L. A., A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization.” CVPR, 2016.
[34] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang, “Adversarial complementary learning for weakly supervised object localization,” in CVPR, 2018.
[35] X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced guidance for weakly-supervised object localization,” in ECCV. Springer, 2018.
[36] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmentation,” arXiv:1704.00675, 2017.
[37] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in CVPR, 2017.
[38] J. Shin Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. So Kweon, “Pixel-level matching for video object segmentation using convolutional neural networks,” in CVPR, 2017, pp. 2167–2176.
[39] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang, “Segflow: Joint learning for video object segmentation and optical flow,” in ICCV, 2017, pp. 686–695.
[40] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in ICCV, 2015, pp. 2758–2766.
[41] Y.-T. Hu, J.-B. Huang, and A. G. Schwing, “Videomatch: Matching based video object segmentation,” in ECCV, 2018, pp. 54–70.
[42] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Bian, and Y. Yang, “Progressive learning for person re-identification with one example,” IEEE TIP, vol. 28, no. 6, pp. 2872–2881, 2019.
[43] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang, “Asymptotic soft filter pruning for deep convolutional neural networks,” TCYB, pp. 1–11, 2019.
[44] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
[45] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in NIPS, 2016, pp. 3630–3638.
[46] Y. Annadani and S. Biswas, “Preserving semantic relations for zero-shot learning,” in CVPR, 2018, pp. 7603–7612.
[47] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” IJCV, vol. 111, no. 1, pp. 98–136, 2014.
[48] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in ICCV, 2011, pp. 991–998.
[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-