
Journal Pre-proofs

MetricMask: Single Category Instance Segmentation by Metric Learning

Yang Wang, Wanlin Zhou, Qinwei Lv, Guangle Yao

PII: S0925-2312(22)00701-9
DOI: https://doi.org/10.1016/j.neucom.2022.05.117
Reference: NEUCOM 25307

To appear in: Neurocomputing

Received Date: 10 January 2022


Accepted Date: 28 May 2022

Please cite this article as: Y. Wang, W. Zhou, Q. Lv, G. Yao, MetricMask: Single Category Instance
Segmentation by Metric Learning, Neurocomputing (2022), doi: https://doi.org/10.1016/j.neucom.2022.05.117

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover
page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version
will undergo additional copyediting, typesetting and review before it is published in its final form, but we are
providing this version to give early visibility of the article. Please note that, during the production process, errors
may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2022 Published by Elsevier B.V.



MetricMask: Single Category Instance Segmentation by Metric Learning

Yang Wang a,∗, Wanlin Zhou a, Qinwei Lv a, Guangle Yao b

a College of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
b College of Information Science and Technology, Chengdu University of Technology, Sichuan, China

Abstract

In this paper, we introduce a novel single category instance segmentation method, termed MetricMask, which can be easily embedded into most off-the-shelf detection and segmentation methods. Current instance segmentation methods usually segment each instance after completing object detection. Our method instead segments all instances at once by fusing object detection, semantic segmentation, and metric learning. The contributions are threefold. 1. Instance segmentation is converted into three parallel tasks: bounding box regression, mask regression, and pixel embedding vector regression. 2. A random sampling metric loss is proposed to optimize the embedding vectors, saving GPU memory without losing instance segmentation accuracy. 3. Based on the model outputs (bounding box, pixel embedding vector, and category-level mask), we propose a Metric Operation to segment each instance. Finally, we conduct experiments on two standard datasets, the COCO person dataset and the ISOD dataset. The experimental results show that our method is highly competitive with other methods. In particular, it performs very well on large objects and outperforms the existing state-of-the-art competitors on the ISOD dataset. We hope our proposed MetricMask will provide a new approach to instance segmentation.

∗ Corresponding author
Email address: youngnuaa@gmail.com (Yang Wang)



Keywords: Instance segmentation, Metric learning, Random sampling metric
loss, Metric Operation

[Figure 1 (diagram): (a) the Segmentation Head outputs the mask map and the embedding vector map; (b) the Detection Head outputs the center map, scale map, and offset map. Each output branch applies a 3×3 Conv+BN+ReLU layer followed by a 1×1 Conv layer to the input features.]

Figure 1: MetricMask Head, which consists of two main components: the segmentation head and the detection head. The figure shows the head network structure and outputs. To visually display the embedding vector map, we employ PCA to reduce the embedding vector dimension to 3. Observing the embedding vector map, the colors within the same instance are similar, while the colors of different instances differ greatly. Similar colors indicate similar embedding vectors, and a large color difference indicates significantly different embedding vectors. This shows that the embedding vectors of different instances in the network output are well separated.
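A minimal sketch of how such a visualization can be produced, assuming an embedding map of shape H×W×D (here D = 4) and using scikit-learn's PCA, is given below; the shapes and the random input are purely illustrative, not the exact procedure used for the figure.

import numpy as np
from sklearn.decomposition import PCA

def visualize_embedding_map(embedding_map):
    """Project an (H, W, D) embedding map to 3 channels for display."""
    h, w, d = embedding_map.shape
    flat = embedding_map.reshape(-1, d)            # one row per pixel
    rgb = PCA(n_components=3).fit_transform(flat)  # reduce D -> 3
    rgb = rgb.reshape(h, w, 3)
    # Normalize each channel to [0, 255] so it can be shown as an image.
    rgb -= rgb.min(axis=(0, 1), keepdims=True)
    rgb /= (rgb.max(axis=(0, 1), keepdims=True) + 1e-8)
    return (rgb * 255).astype(np.uint8)

# Example with a random embedding map (stand-in for the network output).
vis = visualize_embedding_map(np.random.rand(96, 160, 4))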

1. Introduction

Object detection [1] provides the classes of the objects in an image and their locations in the form of bounding boxes. Semantic segmentation [2] provides fine-grained results by predicting a label for every pixel in the input image; each pixel is labeled according to the object class that encloses it. Furthering this evolution, instance segmentation assigns different labels to separate objects belonging to the same category. Hence, instance segmentation may be defined as the technique of simultaneously solving object detection and semantic segmentation.
Intuitively, instance segmentation can be solved by bounding box detection followed by semantic segmentation within each box, as adopted by two-stage methods such as Mask R-CNN [3] and CenterMask [4]. Recently, the vision community has spent more effort designing simpler pipelines for bounding box detection [5, 6, 7, 8, 9] and semantic segmentation [2, 10, 11]. We aim to use these simple and excellent methods to complete the task of instance segmentation. The recent object detector Centernet [5] and the semantic segmentation method SFnet [10] are employed to instantiate our instance segmentation method, mainly for their simplicity. Note that other methods such as RetinaNet [12], YOLOF [7], or BiSeNetV2 [13] could be used with minimal modification to our framework. Therefore, we propose a conceptually simple instance-level mask prediction module that can be easily plugged into many off-the-shelf object detection and semantic segmentation methods.
Instance segmentation is usually solved as binary classification in a spatial layout surrounded by bounding boxes. Each object is segmented separately, which is computationally expensive. Instead, we point out that instance-level masks can be recovered successfully and efficiently once the bounding box, the category-level mask, and the pixel embedding vectors are obtained. Specifically, we propose MetricMask, which converts instance segmentation into object detection, semantic segmentation, and metric learning of pixel embedding vectors, rather than segmenting individual instances independently. The object detection and semantic segmentation methods are executed in parallel, and a new branch is added on the semantic segmentation head to predict the embedding vector of each pixel. Figure 1(a) shows the semantic segmentation head, which yields a mask map and an embedding vector map. During training, embedding vectors from the same instance in the embedding vector map are attracted, and embedding vectors from different instances are pushed apart.


Figure 2: Example of MetricMask output on the COCO person category. (a) Input images of MetricMask. (b, c, d) Outputs of MetricMask, where (b) is the category-level mask map, (c) is the visualization of the embedding vector map, and (d) is the object detection result. (e) The resulting instance-level masks.

To maximize the advantages of MetricMask, a random sampling metric loss is proposed to optimize the embedding vectors, and a Metric Operation is proposed to complete instance segmentation based on the network output. To quickly verify the algorithm's effectiveness, we select the COCO person category dataset [14] and the ISOD dataset [15] for our experiments. MetricMask shows strong competitiveness on both datasets. The main contributions of this work are three-fold:
1. We introduce a brand new framework for single category instance segmentation, termed MetricMask, which converts instance segmentation into three parallel tasks: bounding box regression, mask regression, and pixel embedding vector regression. The main desirable characteristics of MetricMask are its simplicity and effectiveness.
2. A metric loss is proposed to optimize the embedding vector distances of the pixels on each instance, attracting the embedding vectors of the same instance and penalizing the embedding vectors of different instances. However, applying the metric loss directly to all instances consumes a large amount of GPU memory. Therefore, we introduce a random sampling method to save GPU memory without losing instance segmentation accuracy.
3. Based on the model outputs (bounding box, pixel embedding vector, and category-level mask), we introduce a Metric Operation to aggregate all pixels belonging to the same instance. The input, output, and final results of MetricMask are shown in Figure 2.

2. Related Work

In this section, we review works on object detection, semantic segmentation, metric learning, and instance segmentation based on deep learning, and briefly introduce some representative algorithms in these fields.

2.1. Object detection

Object detectors can be roughly categorized into two types: anchor-based object detectors and anchor-free object detectors.
Anchor-based object detectors can be divided into two categories: one-stage detectors and two-stage detectors. Faster R-CNN [16] is one of the most popular frameworks among the two-stage detectors. The first-stage module of Faster R-CNN [16] is a region proposal network (RPN), which generates a batch of high-quality region proposals. The second stage is a region-wise prediction network that predicts the category and refines the bounding box. Anchor-based one-stage detectors, such as SSD [9], RetinaNet [12], and YOLOv3 [8], usually have faster inference than two-stage detectors due to their simpler architectures.
Anchor-free object detectors no longer need to generate anchors in advance, avoiding the complex calculations associated with anchors. Representative anchor-free object detectors include CornerNet [17], CenterNet [5], and FCOS [6]. CornerNet [17] detects the top-left and bottom-right corners of a bounding box to obtain the final bounding box. CenterNet [5] applies keypoint estimation to find the center points of objects and regresses their sizes. FCOS [6] regards all locations inside the bounding box as positives and predicts the four distances between each positive and the four sides of the bounding box. All of these detectors can be embedded in our method. We chose CenterNet [5] for object detection because of its simplicity.

2.2. Semantic segmentation

Semantic segmentation is a challenging task in computer vision that has received increasing attention in recent years. It is a pixel-level task that needs to assign a label to each pixel. Long et al. [2] first proposed the Fully Convolutional Network (FCN), which significantly improved previous semantic segmentation methods. However, FCN results appear coarse because down-sampling and max-pooling operations reduce the feature resolution and lose position information. Therefore, many subsequent FCN-based methods have been explored to solve these problems. PSPNet [11] proposes spatial pyramid pooling, which fuses four different pyramid scales to capture contextual information. SFnet [10] proposes a Flow Alignment Module (FAM) to learn semantic flow between feature maps of adjacent levels. In this work, we obtain the image's semantic information and pixel embedding vectors based on SFnet [10].

2.3. Metric learning

Metric learning, also known as similarity learning, aims to learn a similarity between samples. It has gained much attention due to its wide applications, including person re-identification, face recognition, and object retrieval. Contrastive loss [18] is the seminal work of deep metric learning (DML). The main idea of contrastive loss is to ensure a short distance between similar instances while separating different instances by a large margin. Based on the contrastive loss, triplet loss [19] considers the relative distance between similar and different instances. A triplet consists of an anchor, a positive instance, and a negative instance. The triplet loss encourages the distance between positive pairs (anchor-positive) to be smaller than the distance between negative pairs (anchor-negative) by a margin. In recent years, metric learning has been successfully applied to instance segmentation tasks [20, 21, 22, 23]. It computes the likelihood that two pixels belong to the same object instance, ensuring that pixels belonging to the same object have high similarity while pixels of different objects do not. In this paper, metric learning is also applied to optimize pixel embedding vectors. Different from the above methods, we only need to optimize the embedding vectors of pixels that come from the intersecting regions of instance bounding boxes.

2.4. Instance segmentation

Instance segmentation, building on semantic segmentation, further detects all instances of each category in an image. Instance segmentation methods can be divided into two categories: two-stage and one-stage instance segmentation.
Two-stage instance segmentation detects the bounding boxes of every instance and then performs pixel classification in each bounding box to obtain the final mask. Mask R-CNN [3] is the most representative two-stage work; it adds a mask branch to predict the object mask on top of Faster R-CNN [16]. Following Mask R-CNN [3], PANet [24] improves the network's performance with bottom-up path augmentation, fully connected fusion, and adaptive feature pooling. Mask Scoring R-CNN [25] adds a MaskIoU branch to evaluate mask quality and improve instance segmentation performance.
One-stage instance segmentation is generally faster than two-stage methods. TensorMask [26] represents masks over a spatial domain by structured 4D tensors and achieves performance similar to the two-stage Mask R-CNN [3]. PolarMask [27] introduces a novel method for instance segmentation by improving the anchor-free object detector FCOS [6], segmenting instance masks in polar coordinates. SOLOv2 [28] divides the image into a grid of S×S cells and then predicts the semantic category and instance mask in the grid cell into which the center of an object falls. Different from previous works that typically solve instance segmentation as bounding box detection followed by semantic segmentation within each box, we convert the instance segmentation task into three parallel tasks: bounding box regression, mask regression, and pixel embedding vector regression. Moreover, our method segments all instances at once.

Figure 3: The architecture of MetricMask. s3 to s6 denote the feature maps in the feature pyramid of the backbone network. The detection and segmentation branches predict the bounding boxes of the objects, the objects' masks, and the embedding vector of each pixel. Finally, the network's output is fed into the Metric Operation to obtain the instance-level masks.

3. Our Method

In this section, we first briefly introduce the overall architecture of MetricMask, including the detection branch, the segmentation branch, and the loss functions. Then, we introduce the novel concept behind MetricMask and propose the random sampling metric loss. Finally, we propose a new Metric Operation to solve the instance clustering problem.

3.1. Overall Architecture

MetricMask is a simple, unified network composed of a backbone network, two FPNs (feature pyramid networks) [1], and five task-specific heads, as shown in Figure 3. The backbone network is DLA-34 [29], pre-trained on ImageNet [30]. The detection branch is the same as Centernet [5], and the segmentation branch is the same as SFnet [10]. While there exist many stronger candidates for these components, we choose Centernet [5] and SFnet [10] to prove the effectiveness of the MetricMask method rather than to pursue state-of-the-art accuracy on the datasets.

3.2. The Branches of Detection and Segmentation

There are two branches on top of the backbone network, namely the detection and segmentation branches, each composed of an FPN and task-specific heads. The detection FPN and detection head are shown in Figure 4(a) and Figure 1(b); they are built on the Centernet [5] object detector. In short, given an input image $I \in \mathbb{R}^{w \times h \times 3}$, the detection branch produces three feature maps, namely the center map $\in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 1}$, the scale map $\in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 4}$, and the offset map $\in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$, which are used to obtain the center position, size, and offset of each object. The segmentation FPN and segmentation head are shown in Figure 4(b) and Figure 1(a); they are built on the SFnet [10] segmentation method. Compared with the original semantic segmentation method, we add an embedding vector map output to the segmentation head, which predicts the embedding vector of each pixel. In short, given an input image $I \in \mathbb{R}^{w \times h \times 3}$, the segmentation branch produces two feature maps, namely the mask map $\in \mathbb{R}^{w \times h \times 1}$ and the embedding vector map $\in \mathbb{R}^{w \times h \times 4}$, which are employed to obtain the category-level mask of objects and the embedding vector of each pixel.
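For concreteness, here is a small PyTorch sketch of heads with the structure shown in Figure 1 (a 3×3 Conv+BN+ReLU layer followed by a 1×1 Conv per output map); the channel width, feature map sizes, and the generic Head wrapper are illustrative assumptions rather than the exact configuration used in this work.

import torch
import torch.nn as nn

def branch(in_ch, out_ch):
    """3x3 Conv+BN+ReLU followed by a 1x1 Conv, as in Figure 1."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1))

class Head(nn.Module):
    """Generic multi-output head: one small branch per output map."""
    def __init__(self, in_ch, out_channels):
        super().__init__()
        self.branches = nn.ModuleDict(
            {name: branch(in_ch, c) for name, c in out_channels.items()})

    def forward(self, x):
        return {name: b(x) for name, b in self.branches.items()}

# Detection head runs on 1/4-resolution FPN features, segmentation head on
# full-resolution features (sizes here are illustrative, not 768x1280).
det_head = Head(64, {"center": 1, "scale": 4, "offset": 2})
seg_head = Head(64, {"mask": 1, "embedding": 4})
det_out = det_head(torch.randn(1, 64, 48, 80))    # w/4 x h/4 maps
seg_out = seg_head(torch.randn(1, 64, 192, 320))  # w x h maps
print({k: tuple(v.shape) for k, v in {**det_out, **seg_out}.items()})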

[Figure 4 (diagram): (a) FPN of Detection, (b) FPN of Segmentation, (c) Architecture of FAM. The diagrams show Stage, FAM, Concat, Sum, Up, Deform Conv, and PPM nodes operating on feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution.]

Figure 4: The FPN modules of MetricMask. (a) The FPN module of the detection branch, which comes from Centernet. (b) The FPN module of the segmentation branch, which comes from SFnet. (c) The architecture of the FAM, which is applied to fuse feature maps of different scales and is the core module of SFnet. The numbers inside the boxes represent the sampling rate relative to the image. The PPM module is widely applied in semantic segmentation to increase the receptive field [1].

3.3. Loss function

In this section, we describe the optimization strategy of the detection head and segmentation head, covering the center map, scale map, offset map, mask map, and embedding vector map. In addition, this section gives the definitions of the ground truth targets (center target, scale target, offset target, mask target, and embedding vector target) and the loss functions.

3.3.1. Detection branch loss function

The Centernet [5] defines three ground truth targets, namely the center target $C \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 1}$, the scale target $S \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$, and the offset target $O \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$, which are used to calculate the network loss with the center map, scale map, and offset map. Compared with the Centernet method, we make four modifications.

1. The actual location of an object's center point is $(\frac{x_k}{4}, \frac{y_k}{4})$, instead of the single position $(\lfloor \frac{x_k}{4} \rfloor, \lfloor \frac{y_k}{4} \rfloor)$ assigned as positive in Centernet [5]. We assign the four grid positions next to the actual location as positives, while all others are negatives in the center target $C \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 1}$, reducing the sample imbalance. Figure 5 illustrates our design [31, 32].


2. We choose the mass center as the object center instead of the box center, which is illustrated in detail in Section 3.4. Therefore, the 4D vector $v^{*} = (l^{*}, t^{*}, r^{*}, b^{*})$ is regressed as the object's location, where $l^{*}$, $t^{*}$, $r^{*}$, and $b^{*}$ are the distances from the object's center to the four sides of the bounding box, as shown in Figure 7. We assign $\log(v^{*})$ to the positive regions of each object on the scale target $S \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 4}$.

3. The center of each object is assigned to four positives, so we need to predict the offsets of these four points [32]. The offset values are set to $(\frac{x_k}{4} - \lfloor \frac{x_k}{4} \rfloor, \frac{y_k}{4} - \lfloor \frac{y_k}{4} \rfloor)$, $(\frac{x_k}{4} - \lceil \frac{x_k}{4} \rceil, \frac{y_k}{4} - \lfloor \frac{y_k}{4} \rfloor)$, $(\frac{x_k}{4} - \lfloor \frac{x_k}{4} \rfloor, \frac{y_k}{4} - \lceil \frac{y_k}{4} \rceil)$, and $(\frac{x_k}{4} - \lceil \frac{x_k}{4} \rceil, \frac{y_k}{4} - \lceil \frac{y_k}{4} \rceil)$ for the offset target $O \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$. Figure 5 illustrates the offset target design.

4. The GIoU loss [33] has been proven effective, so we add a GIoU loss on top of the smooth L1 loss [34] for network optimization.
The loss function of the detection branch is as follows:

$$L_{d} = \lambda_{cls} L_{cls} + \lambda_{scale} L_{scale} + \lambda_{offset} L_{offset} + \lambda_{giou} L_{giou}$$

where $\lambda_{cls}$, $\lambda_{scale}$, $\lambda_{offset}$, and $\lambda_{giou}$ are the weights of each loss. The classification loss $L_{cls}$, scale regression loss $L_{scale}$, and offset regression loss $L_{offset}$ are the same as those in Centernet [5]. $L_{giou}$ is the GIoU loss, which is introduced to jointly optimize the offset and scale:

$$L_{giou} = \mathrm{GIoU}(\hat{s}_k, \hat{o}_k, s_k, o_k)$$

where $\hat{s}_k$ comes from the scale map and represents the predicted scale, $s_k$ comes from the scale target and represents the ground-truth scale, $\hat{o}_k$ comes from the offset map and represents the predicted offset, and $o_k$ comes from the offset target and represents the ground-truth offset.
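As an illustration of how the GIoU term can be evaluated, the following sketch decodes a box from a positive location, its predicted offset, and the predicted (l, t, r, b) distances (assumed here to have already been mapped back from the log space used in the scale target), and computes a generic GIoU loss against the ground-truth box; this is not the authors' exact code.

import torch

def giou_loss(pred_box, gt_box, eps=1e-7):
    """GIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    x1 = torch.max(pred_box[:, 0], gt_box[:, 0])
    y1 = torch.max(pred_box[:, 1], gt_box[:, 1])
    x2 = torch.min(pred_box[:, 2], gt_box[:, 2])
    y2 = torch.min(pred_box[:, 3], gt_box[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_box[:, 2] - pred_box[:, 0]) * (pred_box[:, 3] - pred_box[:, 1])
    area_g = (gt_box[:, 2] - gt_box[:, 0]) * (gt_box[:, 3] - gt_box[:, 1])
    union = area_p + area_g - inter
    iou = inter / (union + eps)
    # Smallest axis-aligned box enclosing both boxes.
    ex1 = torch.min(pred_box[:, 0], gt_box[:, 0])
    ey1 = torch.min(pred_box[:, 1], gt_box[:, 1])
    ex2 = torch.max(pred_box[:, 2], gt_box[:, 2])
    ey2 = torch.max(pred_box[:, 3], gt_box[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1) + eps
    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()

def decode_box(grid_xy, offset, ltrb):
    """Turn a positive location plus predicted offset and (l, t, r, b) into a box."""
    cx = grid_xy[:, 0] + offset[:, 0]
    cy = grid_xy[:, 1] + offset[:, 1]
    l, t, r, b = ltrb.unbind(dim=1)
    return torch.stack([cx - l, cy - t, cx + r, cy + b], dim=1)

# Example: one positive location with its predicted offset and distances.
pred = decode_box(torch.tensor([[40.0, 30.0]]), torch.tensor([[0.3, -0.2]]),
                  torch.tensor([[12.0, 20.0, 14.0, 18.0]]))
loss = giou_loss(pred, torch.tensor([[28.0, 10.0, 55.0, 49.0]]))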


Figure 5: Illustration of the center and offset targets. (a) The design of Centernet. (b) The design of our modification [31, 32]. The yellow circle is the real center, while the square boxes denote the four points next to it. Blue indicates positive, while grey is negative. The dashed line denotes the offset, which points from the positives to the real center.
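The following NumPy sketch illustrates how the four positive locations around a fractional center (x_k/4, y_k/4) and their offsets could be written into the center and offset targets; it reflects the design in Figure 5 but omits details such as Gaussian weighting of negatives, so it should be read as an assumption-laden illustration rather than the authors' implementation.

import math
import numpy as np

def assign_center_and_offset(center_target, offset_target, xk, yk):
    """Mark the 4 grid cells around (xk/4, yk/4) as positives and store their offsets."""
    cx, cy = xk / 4.0, yk / 4.0
    for gx in (math.floor(cx), math.ceil(cx)):
        for gy in (math.floor(cy), math.ceil(cy)):
            if 0 <= gx < center_target.shape[1] and 0 <= gy < center_target.shape[0]:
                center_target[gy, gx] = 1.0         # positive location
                offset_target[gy, gx, 0] = cx - gx  # x offset to real center
                offset_target[gy, gx, 1] = cy - gy  # y offset to real center

# Targets at 1/4 resolution for a hypothetical 640x640 input.
center_t = np.zeros((160, 160), dtype=np.float32)
offset_t = np.zeros((160, 160, 2), dtype=np.float32)
assign_center_and_offset(center_t, offset_t, xk=101.0, yk=57.0)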

3.3.2. Segmentation branch loss function

The SFnet [10] defines one ground truth target, namely the mask target $Mask \in \mathbb{R}^{w \times h \times c}$, which is used to calculate the network loss with the mask map. Compared with the SFnet [10] method, we add a branch to predict the embedding vector of each pixel, which is optimized by metric learning.
The loss function of the segmentation branch is as follows:

$$L_{seg} = \lambda_{mask} L_{mask} + \lambda_{inst} L_{inst}$$

where $L_{seg}$ is the total loss of the segmentation branch, $\lambda_{mask}$ and $\lambda_{inst}$ are the weights of each loss, $L_{inst}$ is the loss of the embedding vectors, which is introduced in detail in Section 3.4, and $L_{mask}$ is the segmentation loss, which is the same as in SFnet [10].
The overall MetricMask loss function is as follows:

$$L_{sum} = L_{d} + L_{seg}$$

3.4. Metric Mask Segmentation

In this section, we describe in detail how instance segmentation is completed by metric learning.


Figure 6: The positional relationship between two instances, where the green boxes are the objects' bounding boxes and the red box is the intersecting region of the bounding boxes. (a) There is no intersecting region between the two bounding boxes. (b) There is no mask in the intersecting region. (c) The mask of one instance lies in the intersecting region. (d) The masks of two instances lie in the intersecting region. Although the masks in (a) and (b) are category-level, we can obtain the instance-level masks based on the bounding boxes alone. When there is a mask in the intersecting region, as in (c) and (d), we cannot distinguish the instance-level masks based on the bounding boxes.

3.4.1. MetricMask Representation

Conventional object detection and semantic segmentation can provide the bounding box and mask of an object. However, the mask obtained by semantic segmentation is category-level and cannot distinguish individual instances. We therefore add a branch on the semantic segmentation head that yields the embedding vector of each pixel in the image. During training, we regard all the pixels of each instance as the same class, so pixels from different instances belong to different classes. Then, metric learning is employed to optimize intra-class compactness and inter-class discrepancy. Finally, we propose a Metric Operation to achieve instance segmentation. The above is the core idea of our method. Applying metric learning to optimize the embedding vectors of all pixels would require a large amount of computing resources. However, from the bounding box and mask of each instance we can obtain three pieces of prior knowledge. 1. The mask of an instance lies inside its bounding box. 2. If the bounding box of an instance does not intersect the bounding boxes of other instances, the mask inside the bounding box is the mask of that instance. 3. If the bounding box of an instance intersects the bounding boxes of other instances, the mask in the disjoint region belongs to that instance, and we only need to determine which instance the mask in the intersecting region comes from. Details are shown in Figure 6. Therefore, only the pixels in the intersecting regions need metric learning, which significantly reduces the demand for computing resources.
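To make this prior knowledge concrete, a minimal sketch (boxes assumed to be in (x1, y1, x2, y2) format; the helper names are hypothetical) of collecting the intersecting regions of one anchor instance with all other instances is given below; only pixels inside these regions are later handled by the metric loss and the Metric Operation.

import numpy as np

def intersect_region(box_a, box_b):
    """Return the (x1, y1, x2, y2) intersection of two boxes, or None if disjoint."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    return (x1, y1, x2, y2) if (x2 > x1 and y2 > y1) else None

def intersecting_regions(anchor_idx, boxes):
    """All intersecting regions between the anchor instance box and every other box."""
    regions = []
    for j, box in enumerate(boxes):
        if j == anchor_idx:
            continue
        region = intersect_region(boxes[anchor_idx], box)
        if region is not None:
            regions.append((j, region))
    return regions

boxes = np.array([[10, 10, 120, 200], [90, 40, 220, 230], [300, 50, 380, 150]])
print(intersecting_regions(0, boxes))   # only instances 0 and 1 overlap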

3.4.2. Random sampling Metric Loss

According to the previous section, the embedding vector of each pixel in an intersecting region needs to be optimized. The implementation steps are as follows. There are many instances in an image, and each instance is treated as an anchor instance. If there is an intersecting region between the bounding box of the anchor instance and the bounding box of another instance, an instance pair $\{I_i, O_{ij}\}$ is formed, where $I_i$ represents the $i$-th anchor instance and $O_{ij}$ represents the $j$-th instance whose bounding box intersects that of the $i$-th anchor instance. Through the above calculation, we obtain all instance pairs in an image, expressed as $\{\{I_1, O_{11}\}, \{I_1, O_{12}\}, \ldots, \{I_2, O_{21}\}, \{I_2, O_{22}\}, \ldots, \{I_i, O_{ij}\}\}$. Take the instance pair $\{I_i, O_{ij}\}$ as an example. If pixels belong to the instance pair and lie in the intersecting region, the embedding vectors of those pixels need to be optimized. We propose the following optimization strategy: the first step defines the embedding vectors, and the second step optimizes them. 1. The embedding vectors of the center points of the instance pair $\{I_i, O_{ij}\}$ are defined as $e_I$ and $e_O$, respectively. If a pixel comes from the intersecting region of the instance pair $\{I_i, O_{ij}\}$ and belongs to instance $I_i$, its embedding vector is defined as $e_{I,i}$. If a pixel comes from the intersecting region of the instance pair $\{I_i, O_{ij}\}$ and belongs to instance $O_{ij}$, its embedding vector is defined as $e_{O,i}$. 2. The optimization of the embedding vectors is divided into two parts. In the first part, we make the distance between $e_I$ and $e_{I,i}$ smaller than the distance between $e_O$ and $e_{I,i}$. In the second part, we make the distance between $e_O$ and $e_{O,i}$ smaller than the distance between $e_I$ and $e_{O,i}$. The loss function is constructed as follows.
 ( )
 ∑
N
 1 2 2
max 0, Φ + |eI − eI,i |2 − |eO − eI,i |2 if N >0
N
L1ij = i=0

 0 else
 ( )
 ∑
M
 1 2 2
max 0, Φ + |eO − eO,i |2 − |eI − eO,i |2 if M >0
M
L2ij = i=0

 0 else

Where L1ij is the loss about the instance Ii . L2ij is the loss about the instance
Oij . N is the number of pixels that belong to the instance Ii in the intersecting
region of instance pair {Ii , Oij }. M is the number of pixels that belong to the
instance Oij in the intersecting region of instance pair {Ii , Oij }. Φ is margin,
245 which is set as 0.5.
Based on the above notation, we integrate the loss functions of the instance pair $\{I_i, O_{ij}\}$:

$$L_{p} = L^{1}_{ij} + L^{2}_{ij}$$

where $L_{p}$ is the total loss of the instance pair $\{I_i, O_{ij}\}$. There are $P$ groups of instance pairs in an image, so we need to establish the loss over all instance pairs. The simplest method is to sum the loss over all instance pairs. However, when an image contains a large number of instance pairs, many embedding vectors need to be optimized, which requires a large amount of computing resources. Therefore, we design a random sampling method that randomly selects several instance pairs from the image to calculate the loss; the unselected instance pairs do not participate in the loss calculation. This dramatically reduces the demand for computing resources. The designed loss function is as follows:

$$L_{inst} = \begin{cases} \dfrac{1}{K}\sum\limits_{i=0}^{K} \mathrm{random}_i(L_{p}) & \text{if } P > K \\ \dfrac{1}{P}\sum\limits_{p=0}^{P} L_{p} & \text{otherwise} \end{cases}$$

where $L_{inst}$ is the final instance loss, $K$ is the number of randomly sampled instance pairs, which is set to 4, and $\mathrm{random}$ denotes random sampling without replacement.
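A minimal PyTorch-style sketch of this random sampling metric loss is given below; the per-pair data layout (dictionaries holding the two center embeddings and the embeddings of each instance's pixels inside the intersecting region) is an assumption made for illustration, not the authors' exact interface.

import random
import torch

def pair_loss(e_I, e_O, e_I_pix, e_O_pix, margin=0.5):
    """Hinge losses L1_ij and L2_ij for one instance pair {Ii, Oij}."""
    loss = e_I.new_zeros(())
    if e_I_pix.numel() > 0:
        d_pos = ((e_I_pix - e_I) ** 2).sum(dim=1)   # distance to own center
        d_neg = ((e_I_pix - e_O) ** 2).sum(dim=1)   # distance to the other center
        loss = loss + torch.clamp(margin + d_pos - d_neg, min=0).mean()
    if e_O_pix.numel() > 0:
        d_pos = ((e_O_pix - e_O) ** 2).sum(dim=1)
        d_neg = ((e_O_pix - e_I) ** 2).sum(dim=1)
        loss = loss + torch.clamp(margin + d_pos - d_neg, min=0).mean()
    return loss

def random_sampling_metric_loss(pairs, K=4):
    """Average the pair losses over at most K randomly sampled instance pairs."""
    if len(pairs) == 0:
        return torch.zeros(())
    sampled = random.sample(pairs, K) if len(pairs) > K else pairs
    losses = [pair_loss(p["e_I"], p["e_O"], p["e_I_pix"], p["e_O_pix"]) for p in sampled]
    return torch.stack(losses).mean()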
[Figure 7 (diagram): two overlapping instances are drawn twice, with the regressed distances l*, t*, r*, b* measured from the box center of each instance (left) and from the mass center of each instance (right).]
Figure 7: Visualization of the object centers of an instance pair. Left: box center; right: mass center. When two instances are very close, the box center of one instance may not fall inside its own instance and may even fall on the other instance.

3.4.3. Mass Center

The center of the object is used for the detection branch and the metric loss. There are several choices for the object center, such as the box center or the mass center. Object detection tasks generally select the box center as the object center. Which center is better depends on its effect on instance prediction performance. Here we compare the box center with the mass center and conclude that the box center is more disadvantageous: it has a greater probability of not falling inside its own instance, and it may even fall on another instance. These problems do not affect the object detection task, but they are disastrous for predicting the embedding vector of the instance center. Some examples of the box center and mass center are shown in Figure 7. Nevertheless, there are some extreme cases in which the mass center is not located inside the instance. We will consider this issue in our future work.
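For reference, the mass center discussed above can be taken as the centroid of the foreground pixels of an instance mask; a minimal NumPy sketch comparing it with the box center follows (the fallback for empty masks is an assumption).

import numpy as np

def mass_center(instance_mask):
    """Centroid (x, y) of the foreground pixels of a binary instance mask (H, W)."""
    ys, xs = np.nonzero(instance_mask)
    if len(xs) == 0:
        return None  # empty mask: fall back to the box center in practice
    return float(xs.mean()), float(ys.mean())

def box_center(bbx):
    """Center (x, y) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbx
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 1:3] = 1   # a small rectangular instance
print(mass_center(mask), box_center((1, 2, 3, 6)))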

3.4.4. Metric Operation

The detection branch produces the center map, scale map, and offset map. We apply greedy-NMS [35] to obtain the bounding box and center of each object. The segmentation branch produces the mask map and embedding vector map. According to the prior knowledge described in Section 3.4.1, we only need to distinguish the pixels in the intersecting regions of instances. We first take the embedding vectors of the intersecting regions and of the instance centers from the embedding vector map and then visualize them. Figure 8 shows the spatial relationship between the embedding vectors. Figure 8(c) shows that the distance between embedding vectors from the same instance is small, while the distance between embedding vectors from different instances is large. Therefore, combining the outputs of the detection branch and the segmentation branch, we propose the Metric Operation to complete instance segmentation. Take the Metric Operation for an instance $I_i$ as an example; the steps are as follows.
1. Obtain the relevant parameters of the instance $I_i$: the bounding box of $I_i$, the intersecting regions with $I_i$, the embedding vector of the center of $I_i$, the embedding vectors of the intersecting regions, the mask map, and the embedding vectors of the centers of all instances that intersect $I_i$.
2. Apply the bounding box of the instance $I_i$ to crop the mask on the mask map. (The cropped mask may contain pixels of other instances in the intersecting regions. We remove the pixels of other instances to obtain the mask of instance $I_i$.)


Figure 8: Visualization of embedding vector distances. (a) Input images of MetricMask. (b) The label of the image: the red mask lies in the non-intersecting region, the yellow bounding box marks the intersecting region, and different colors inside the intersecting region represent the masks of different instances. (c) The distances between pixel embedding vectors. Each point represents one pixel from the yellow intersecting region in Figure 8(b). Large triangles and squares represent the embedding vectors of the object centers, and small triangles and squares represent the embedding vectors of pixels in the intersecting region. Points of the same color or the same shape come from the same instance. (d) The resulting instance-level masks. It can be seen from (c) that the distance between pixel embedding vectors of the same instance is small, while the distance between pixel embedding vectors of different instances is large. Therefore, according to the pixel embedding distances, we can obtain the instance-level masks.

import numpy as np

def getInstanceMap(bbx, iou_bbxes, center_vector, center_vectors, mask_map,
                   embedding_vector_map):
    """
    :param bbx: instance Ii bounding box. (4,)
    :param iou_bbxes: intersecting regions between instance Ii and other instances. (N, 4)
    :param center_vector: embedding vector of the instance Ii center. (1, 4)
    :param center_vectors: embedding vectors of the instance centers that intersect Ii. (N, 4)
    :param mask_map: category-level mask map. (H, W)
    :param embedding_vector_map: pixel embedding vectors. (H, W, 4)
    :return instance_map: instance-level mask of Ii. (H, W)
    """
    instance_map = np.zeros(mask_map.shape)
    # Crop the category-level mask with the bounding box of instance Ii.
    x1, y1, x2, y2 = bbx
    instance_map[y1:y2, x1:x2] = mask_map[y1:y2, x1:x2]
    # Resolve every region where Ii intersects another instance.
    for index in range(len(iou_bbxes)):
        x1, y1, x2, y2 = iou_bbxes[index]
        region = mask_map[y1:y2, x1:x2]
        if region.sum() == 0:
            continue
        other_center_vector = center_vectors[index]
        region_feature = embedding_vector_map[y1:y2, x1:x2]
        # Squared embedding distances to the two candidate instance centers.
        diff_region_bbx = np.sum(np.power(region_feature - center_vector, 2), -1)
        diff_region_iou = np.sum(np.power(region_feature - other_center_vector, 2), -1)
        # Keep only mask pixels that are closer to the center of Ii.
        min_feature_dist = diff_region_bbx < diff_region_iou
        min_feature_dist = min_feature_dist * region
        instance_map[y1:y2, x1:x2] = instance_map[y1:y2, x1:x2] * min_feature_dist
    return instance_map

Figure 9: Python code of Metric Operation

3. Classify the pixels in the intersecting region. Take a pixel in the intersecting region as an example: it falls in the bounding boxes of multiple instances. We calculate the distance between the pixel's embedding vector and the embedding vectors of the centers of these instances. If the distance to the center of instance $I_i$ is the smallest and the pixel is on the mask, then the pixel belongs to $I_i$. We process all pixels in the intersecting region in this way, and as a result the mask of instance $I_i$ is obtained. Figure 9 shows the pseudo-code of the Metric Operation in NumPy style.
Steps 1-3 are repeated until the masks of all instances are obtained. The instance segmentation results are shown in Figure 10.
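As a usage sketch of the Metric Operation, assuming the getInstanceMap function from Figure 9 is in scope and using illustrative dummy arrays in place of the real network and NMS outputs, each detected instance can be processed in turn and its instance-level mask collected:

import numpy as np

# Dummy outputs standing in for the network and NMS results; the shapes follow
# the docstring of getInstanceMap in Figure 9.
H, W = 64, 96
mask_map = np.zeros((H, W))
mask_map[10:40, 10:60] = 1                               # category-level mask
embedding_vector_map = np.random.rand(H, W, 4)           # pixel embeddings
boxes = np.array([[10, 10, 40, 40], [30, 10, 60, 40]])   # detected boxes (x1, y1, x2, y2)
center_vectors = np.random.rand(2, 4)                    # center embeddings

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return [x1, y1, x2, y2] if x2 > x1 and y2 > y1 else None

instance_maps = []
for i, bbx in enumerate(boxes):
    # Intersecting regions and center embeddings of the instances overlapping instance i.
    others = [(j, intersection(bbx, boxes[j])) for j in range(len(boxes)) if j != i]
    others = [(j, r) for j, r in others if r is not None]
    iou_bbxes = np.array([r for _, r in others]).reshape(-1, 4).astype(int)
    other_centers = center_vectors[[j for j, _ in others]].reshape(-1, 4)
    instance_maps.append(
        getInstanceMap(bbx, iou_bbxes, center_vectors[i:i + 1], other_centers,
                       mask_map, embedding_vector_map))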

Figure 10: Results of MetricMask on COCO val dataset with DLA-34-DCN, achieving 40.3%
mask AP.

4. Experiments

In this section, we first introduce the datasets and evaluation metrics used in our experiments in Section 4.1. Then, training details are described in Section 4.2. Next, we carefully examine our proposed design and conduct ablation experiments in Section 4.3. Finally, the results of our method and the comparison with previous state-of-the-art methods are provided in Section 4.4.

4.1. Dataset and Evaluation Metric

We evaluate the proposed MetricMask method on two standard datasets: the person category subset of the COCO dataset and the ISOD dataset. The COCO dataset was proposed by Lin et al. [14]. Our training set is the subset of the 2017 COCO training set images that contain persons (64115 images), and our validation set coincides with the 2017 COCO validation set (5000 images). The ISOD dataset was proposed by Li et al. [15] and is employed for salient instance segmentation. It consists of 1000 images of cluttered scenes with salient instance annotations; 600 images are used for training and the other 400 images for testing. We report the mask average precision AP (averaged over IoU thresholds), as well as APs, APm, and APl (AP at different scales).

4.2. Training details

The DLA-34 [29] is employed as our backbone network, and the same hyper-parameters as Centernet [5] and SFnet [10] are applied. Specifically, our network is trained with stochastic gradient descent (SGD) [36] for 50 epochs, with an initial learning rate of 0.004 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at the 35th and 45th epochs. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We initialize our backbone network with weights pre-trained on ImageNet [30]. The input images are resized to 768×1280.
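A minimal PyTorch sketch of this optimization schedule is shown below; the placeholder model and the omitted data loop are assumptions for illustration.

import torch

model = torch.nn.Conv2d(3, 8, 3)   # placeholder for the MetricMask network
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.9, weight_decay=0.0001)
# Divide the learning rate by 10 at the 35th and 45th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35, 45], gamma=0.1)

for epoch in range(50):
    # ... iterate over mini-batches of 16 images resized to 768x1280,
    # compute L_sum = L_d + L_seg, and call optimizer.step() ...
    scheduler.step()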

4.3. Ablation Study

We conduct an ablative analysis of the proposed method on the COCO person category dataset [14]. To this end, each component of our proposed MetricMask is evaluated, including the modification of the detection branch, the selection of different object centers, the sensitivity to the embedding dimension, the influence of the random sampling number, and different input scales in training and testing. To quickly verify the effectiveness of the proposed method, all models are trained and tested at a single scale, and only flip augmentation is applied during training.

Table 1: Comparison of the modifications of the detection branch on the COCO person category val-dev. APm is mask AP; APb is box AP.

| Backbone | Size    | GT Targets | Giou | APm  | APm50 | APm75 | APb  | APb50 | APb75 |
|----------|---------|------------|------|------|-------|-------|------|-------|-------|
| DLA-34   | 640×640 |            |      | 32.7 | 59.0  | 33.3  | 33.7 | 62.5  | 33.5  |
| DLA-34   | 640×640 | √          |      | 33.7 | 61.0  | 34.0  | 35.1 | 65.1  | 34.9  |
| DLA-34   | 640×640 |            | √    | 33.1 | 59.1  | 34.3  | 34.0 | 61.2  | 34.4  |
| DLA-34   | 640×640 | √          | √    | 34.4 | 61.0  | 35.2  | 37.2 | 65.0  | 39.1  |

4.3.1. The modification of the detection branch

The detection branch modification consists of two parts: the ground truth target design and the application of the Giou loss [33]. We compare the modified detection branch with the typical configuration of Centernet [5]. The proposed modifications bring significant improvements in both the detection and segmentation tasks. Results are reported in Table 1. Both the ground truth target design and the Giou loss application yield specific improvements in box AP and mask AP. The ground truth target design increases the box AP by 1.4% and the mask AP by 1.0%. The Giou loss application increases the box AP by 0.3% and the mask AP by 0.4%. When the two parts are applied at the same time, the improvement is more prominent: the box AP increases by 3.5%, and the mask AP increases by 1.7%. If not specified otherwise, we apply these modifications in the following experiments.

4.3.2. Dimension of embedding vector

We analyze the sensitivity of MetricMask to the dimension of the embedding vector. We conduct experiments by varying the dimension of the embedding vector over 2, 3, 4, and 8. The results are shown in Table 2. The embedding dimension does not affect the detection results, but it has a significant impact on the instance segmentation result. When the dimension of the embedding vector is lower than 4, the instance segmentation performance improves as the dimension increases. When the dimension is greater than 4, the instance segmentation performance no longer changes as the dimension increases.

Table 2: Comparison of embedding vector dimensions. The dimension does not affect the detection branch, but it has a significant effect on instance segmentation.

| Backbone | Size    | num | APm  | APm50 | APm75 | APb  | APb50 | APb75 |
|----------|---------|-----|------|-------|-------|------|-------|-------|
| DLA-34   | 640×640 | 2   | 32.0 | 58.0  | 32.0  | 37.5 | 65.0  | 39.2  |
| DLA-34   | 640×640 | 3   | 33.8 | 60.3  | 34.6  | 37.7 | 65.2  | 39.3  |
| DLA-34   | 640×640 | 4   | 34.4 | 61.0  | 35.2  | 37.2 | 65.0  | 39.1  |
| DLA-34   | 640×640 | 8   | 34.3 | 60.9  | 35.4  | 37.3 | 64.4  | 38.6  |

4.3.3. Random sample number of instance pair

The number of instance pairs is counted on the COCO person category training set. The number of instance pairs differs in each image, with up to 66 pairs of instances in a single image. We analyze the impact of the random sampling number of instance pairs on MetricMask, setting it to 2, 4, 8, and 66 in the experiments. The results are shown in Table 3. We find that a larger random sampling number does not necessarily lead to better MetricMask performance. When the random sampling number is 2, the result is the worst; when it is 4, 8, or 66, the results are similar. During training, GPU memory usage increases as the random sampling number increases.

Table 3: Comparison of random sampling numbers of instance pairs. GPU memory usage grows as the number of sampled instance pairs increases, but the instance segmentation performance does not improve further.

| Backbone | Size    | num | memory/img | AP   | AP50 | AP75 | APs  | APm  | APl  |
|----------|---------|-----|------------|------|------|------|------|------|------|
| DLA-34   | 640×640 | 2   | 1.2G       | 32.7 | 59.0 | 33.3 | 9.2  | 39.3 | 57.4 |
| DLA-34   | 640×640 | 4   | 1.4G       | 34.4 | 61.0 | 35.2 | 10.0 | 41.1 | 60.1 |
| DLA-34   | 640×640 | 8   | 1.6G       | 34.5 | 61.8 | 35.2 | 10.3 | 41.3 | 59.0 |
| DLA-34   | 640×640 | 66  | 4.5G       | 34.5 | 61.1 | 35.5 | 10.0 | 41.7 | 59.6 |

4.3.4. Box center and mass center

We analyze the influence of the object center selection on the detection and segmentation results. The comparison experiments are shown in Table 4. The box center performs better on box AP, which is 1% higher than that of the mass center, but worse on mask AP, which is 1.4% lower than that of the mass center. Regressing the object's bounding box from the box center is simpler, so the box AP is better. However, when the box center is used as the object center, many box centers do not fall on the object's mask and may even fall on other objects, so the mask AP is worse.

Table 4: Comparison of different object center options. When the mass center is employed as the object center, the instance segmentation performance is significantly improved.

| Backbone | Size    | Mass-center | Box-center | APm  | APm50 | APm75 | APb  | APb50 | APb75 |
|----------|---------|-------------|------------|------|-------|-------|------|-------|-------|
| DLA-34   | 640×640 |             | √          | 33.0 | 62.1  | 32.1  | 38.2 | 66.9  | 39.4  |
| DLA-34   | 640×640 | √           |            | 34.4 | 61.0  | 35.2  | 37.2 | 65.0  | 39.1  |

Table 5: Comparison of different training and testing input sizes. * denotes multi-scale testing, using a 512×512 input to test large objects and 768×1280 to test small and medium objects. + indicates multi-scale augmentation and longer training time (3×).

| Backbone     | Size     | AP   | AP50 | AP75 | APs  | APm  | APl  |
|--------------|----------|------|------|------|------|------|------|
| DLA-34       | 512×512  | 32.6 | 58.2 | 33.1 | 7.0  | 38.9 | 60.4 |
| DLA-34       | 640×640  | 34.4 | 61.0 | 35.2 | 10.0 | 41.1 | 60.1 |
| DLA-34       | 768×1280 | 36.9 | 65.7 | 38.0 | 17.5 | 44.1 | 55.6 |
| DLA-34*      | 768×1280 | 39.1 | 67.1 | 40.3 | 17.5 | 43.9 | 60.3 |
| DLA-34-DCN   | 768×1280 | 40.3 | 68.1 | 42.3 | 18.2 | 46.1 | 64.3 |
| DLA-34-DCN+  | 768×1280 | 43.0 | 71.9 | 46.0 | 21.5 | 48.5 | 66.1 |

4.3.5. Different input scales

We further compare the effects of different training and testing input scales on the MetricMask method. The results are shown in Table 5. With low-resolution inputs, instance segmentation performance on large objects is good, but performance on small objects is poor. With high-resolution inputs, segmentation performance on small objects is good, but performance on large objects is poor. The original input resolution of the detection branch of MetricMask is 512×512; the receptive field of the detection branch is small, which makes it more friendly to low-resolution inputs. However, with low-resolution inputs, the masks of small objects become tiny, so their performance is poor. Conversely, with high-resolution inputs, the receptive field of the detection branch is insufficient, so the performance on large objects deteriorates; details are shown in Figure 11. To address these problems, we propose two optimization methods. 1. The targets are predicted at different resolutions: small and medium objects are predicted at high resolution, and large objects are predicted at low resolution. The performance is increased by 2.2%. 2. DCN convolution [37] is employed in the detection branch to enlarge the network's receptive field, and all objects are predicted at high resolution. The performance is increased by 3.4%. The second scheme only requires a single model and a single input resolution, which is better. Finally, when more data augmentation methods are applied and the training time is extended, the performance of the model improves by 6.7%.


Figure 11: Comparison with and without DCN at the 768×1280 input scale. (a) Input image, (b) mask map, (c) bounding box and instance segmentation result. The first row shows the result without DCN, and the second row shows the result with DCN. The segmentation results of the two rows are not very different, but the detection results differ considerably: without DCN, the receptive field is insufficient and the bounding box cannot cover the object.

4.4. Comparisons with state-of-the-art Methods

To prove that MetricMask is effective, we compare it with state-of-the-art methods on two standard datasets (the COCO person category dataset and the ISOD dataset).

Table 6: Instance segmentation mask AP on the COCO val.

| Method         | Backbone   | AP   | AP50 | AP75 | APs  | APm  | APl  |
|----------------|------------|------|------|------|------|------|------|
| Mask-RCNN [3]  | ResNet-50  | 41.2 | 74.1 | 43.0 | 22.0 | 49.1 | 59.5 |
| Mask-RCNN [3]  | ResNet-101 | 45.5 | 79.8 | 47.2 | 23.9 | 51.1 | 61.1 |
| FCIS [38]      | ResNet-101 | 33.4 | 64.1 | 31.8 | 9.0  | 41.1 | 61.8 |
| BBE [39]       | ResNet-50  | 36.8 | 62.8 | 37.4 | 12.5 | 44.0 | 62.2 |
| MEinst [40]    | ResNet-50  | 42.0 | 77.5 | 42.7 | 27.1 | 48.5 | 56.1 |
| PersonLab [41] | ResNet-101 | 37.7 | 65.9 | 39.4 | 16.6 | 48.0 | 59.5 |
| MetricMask     | DLA-34     | 43.0 | 71.9 | 46.0 | 21.5 | 48.5 | 66.1 |

4.4.1. Results on the COCO person category dataset

We evaluate MetricMask on the person category of the COCO dataset and compare the results with state-of-the-art methods. Compared with the ResNet-50 network, DLA-34 has fewer parameters and faster inference. Centernet benchmarks its DLA-34 backbone results against other methods that use ResNet-50 as the backbone. Since the detection branch of MetricMask comes from Centernet, we likewise adopt DLA-34 as the backbone network when benchmarking against methods that adopt ResNet-50. The results are shown in Table 6. MetricMask outperforms Mask-RCNN with the ResNet-50 backbone by 1.8 mask AP. It is worth noting that our method performs particularly well on large objects: even when Mask-RCNN adopts ResNet-101 as the backbone, our method outperforms it by 5 mask AP on large objects. Instance segmentation methods are generally divided into two categories, bottom-up and top-down. Top-down methods resize all objects to a uniform scale before segmentation, which is more friendly to small masks. MetricMask belongs to the bottom-up category and segments objects directly without changing their scale. Therefore, small objects are challenging to segment because they are too small, while large objects retain more detail and are segmented better. In addition, our method is a one-stage, anchor-free instance segmentation method that is comparable with state-of-the-art methods.

4.4.2. Results on the ISOD dataset

Table 7: Instance segmentation mask AP on the ISOD dataset [15].

| Method          | Backbone  | AP   | AP50 | AP75 |
|-----------------|-----------|------|------|------|
| MS R-CNN [25]   | ResNet-50 | 56.2 | 84.2 | 68.8 |
| HTC [42]        | ResNet-50 | 45.4 | 81.5 | 55.9 |
| CenterMask [4]  | ResNet-50 | 54.0 | 87.2 | 68.7 |
| BlendMask [43]  | ResNet-50 | 53.6 | 88.0 | 67.4 |
| DetectoRS [44]  | ResNet-50 | 50.4 | 82.7 | 63.7 |
| SOLO [45]       | ResNet-50 | 53.5 | 84.2 | 65.3 |
| S4Net [46]      | ResNet-50 | 52.3 | 86.7 | 63.6 |
| RDPNet [47]     | ResNet-50 | 58.6 | 88.9 | 73.8 |
| MetricMask      | DLA-34    | 59.1 | 89.4 | 74.4 |

We compare our method with recent well-known instance segmentation methods, including Mask Scoring (MS) R-CNN [25], HTC [42], CenterMask [4], BlendMask [43], SOLO [45], DetectoRS [44], and RDPNet [47]. For a fair comparison, as in RDPNet [47], training and testing are conducted on the ISOD dataset [15]. The results are shown in Table 7. Our method achieves better results than these popular competitors and recent strong instance segmentation methods. Salient objects are often much larger and more distinctive than noisy backgrounds and uninteresting objects. We conclude that our method is better at segmenting large objects and therefore performs better on the salient instance segmentation task.

5. Conclusion and Future Work

In this paper, we propose a single-shot, anchor-free instance segmentation method, termed MetricMask, for single category instance segmentation. Different from previous works that typically solve mask prediction as binary classification in a spatial layout, MetricMask converts the instance segmentation task into metric learning over pixel embedding vectors. MetricMask is simple and effective and is composed of three tasks: bounding box regression, mask regression, and pixel embedding vector regression. We hope that the proposed MetricMask framework can provide a new direction for instance segmentation. For future work, we plan to extend our approach to multiple instance categories.

References

[1] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature


pyramid networks for object detection (2017). arXiv:1612.03144.

[2] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for seman-
tic segmentation (2015). arXiv:1411.4038.

[3] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn (2018). arXiv:


1703.06870.

[4] Y. Lee, J. Park, Centermask : Real-time anchor-free instance segmentation


455 (2020). arXiv:1911.06667.

[5] X. Zhou, D. Wang, P. Krähenbühl, Objects as points (2019). arXiv:1904.


07850.

[6] Z. Tian, C. Shen, H. Chen, T. He, Fcos: Fully convolutional one-stage


object detection (2019). arXiv:1904.01355.

[7] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, J. Sun, You only look
one-level feature (2021). arXiv:2103.09460.

[8] J. Redmon, A. Farhadi, Yolov3: An incremental improvement (2018).


arXiv:1804.02767.

[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg,
Ssd: Single shot multibox detector, Lecture Notes in Computer Science
(2016) 21–37doi:10.1007/978-3-319-46448-0_2.
URL http://dx.doi.org/10.1007/978-3-319-46448-0_2

[10] X. Li, A. You, Z. Zhu, H. Zhao, M. Yang, K. Yang, Y. Tong, Semantic flow
for fast and accurate scene parsing (2021). arXiv:2002.10120.

[11] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network
(2017). arXiv:1612.01105.

[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense
object detection (2018). arXiv:1708.02002.

[13] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, N. Sang, Bisenet v2: Bilat-
eral network with guided aggregation for real-time semantic segmentation
(2020). arXiv:2004.02147.

[14] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Per-


ona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft coco: Common ob-
jects in context (2015). arXiv:1405.0312.

[15] G. Li, Y. Xie, L. Lin, Y. Yu, Instance-level salient object segmentation
(2017). arXiv:1704.03604.

[16] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object
detection with region proposal networks (2016). arXiv:1506.01497.

[17] H. Law, J. Deng, Cornernet: Detecting objects as paired keypoints (2019).


arXiv:1808.01244.

[18] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning


an invariant mapping, in: 2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, 2006, pp.
1735–1742. doi:10.1109/CVPR.2006.100.

[19] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for
face recognition and clustering, 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR)doi:10.1109/cvpr.2015.7298682.
URL http://dx.doi.org/10.1109/CVPR.2015.7298682

[20] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, K. P.


Murphy, Semantic instance segmentation via deep metric learning (2017).
arXiv:1703.10277.

[21] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, S. Yan, Proposal-free network


for instance-level object segmentation (2015). arXiv:1509.02636.

[22] S. Kong, C. Fowlkes, Recurrent pixel embedding for instance grouping


(2017). arXiv:1712.08273.

[23] D. Novotny, S. Albanie, D. Larlus, A. Vedaldi, Semi-convolutional opera-


tors for instance segmentation (2018). arXiv:1807.10712.

[24] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance
segmentation (2018). arXiv:1803.01534.

[25] Z. Huang, L. Huang, Y. Gong, C. Huang, X. Wang, Mask scoring r-cnn
(2019). arXiv:1903.00241.

[26] X. Chen, R. Girshick, K. He, P. Dollár, Tensormask: A foundation for


dense object segmentation (2019). arXiv:1903.12174.

[27] E. Xie, P. Sun, X. Song, W. Wang, D. Liang, C. Shen, P. Luo, Polarmask:


Single shot instance segmentation with polar representation (2020). arXiv:
1909.13226.

[28] X. Wang, R. Zhang, T. Kong, L. Li, C. Shen, Solov2: Dynamic and fast
instance segmentation (2020). arXiv:2003.10152.

[29] F. Yu, D. Wang, E. Shelhamer, T. Darrell, Deep layer aggregation (2019).
arXiv:1707.06484.

[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-
scale hierarchical image database, in: 2009 IEEE Conference on Computer
Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.
2009.5206848.

[31] J. Zhang, L. Lin, Y. Li, Y. chen Chen, J. Zhu, Y. Hu, S. C. H. Hoi,
Attribute-aware pedestrian detection in a crowd (2019). arXiv:1910.
09188.

[32] Y. Wang, C. Han, G. Yao, W. Zhou, Mapd: An improved multi-attribute


pedestrian detection in a crowd, Neurocomputing 432 (2021) 101–110. doi:
https://doi.org/10.1016/j.neucom.2020.12.005.

[33] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, S. Savarese, Gen-


eralized intersection over union: A metric and a loss for bounding box
regression (2019). arXiv:1902.09630.

[34] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies


for accurate object detection and semantic segmentation (2014). arXiv:
1311.2524.

[35] A. Neubeck, L. Van Gool, Efficient non-maximum suppression, in: 18th


International Conference on Pattern Recognition (ICPR’06), Vol. 3, 2006,
pp. 850–855. doi:10.1109/ICPR.2006.479.

[36] J. M. Cherry, C. Adler, C. A. Ball, S. A. Chervitz, D. Botstein, Sac-


charomyces genome database, in: C. Guthrie, G. R. Fink (Eds.), Guide
to Yeast Genetics and Molecular and Cell Biology - Part B, Vol. 350 of
Methods in Enzymology, Academic Press, 2002, pp. 329–346. doi:https:
//doi.org/10.1016/S0076-6879(02)50972-1.

[37] X. Zhu, H. Hu, S. Lin, J. Dai, Deformable convnets v2: More deformable,
better results (2018). arXiv:1811.11168.

[38] Y. Li, H. Qi, J. Dai, X. Ji, Y. Wei, Fully convolutional instance-aware


semantic segmentation (2017). arXiv:1611.07709.

[39] J. Richeimer, J. Mitchell, Bounding box embedding for single shot person
instance segmentation (2018). arXiv:1807.07674.

[40] R. Zhang, Z. Tian, C. Shen, M. You, Y. Yan, Mask encoding for single shot
instance segmentation (2020). arXiv:2003.11712.

[41] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, K. Mur-


phy, Personlab: Person pose estimation and instance segmentation with
a bottom-up, part-based, geometric embedding model (2018). arXiv:
1803.08225.

[42] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu,


J. Shi, W. Ouyang, C. C. Loy, D. Lin, Hybrid task cascade for instance
segmentation (2019). arXiv:1901.07518.

[43] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, Y. Yan, Blendmask: Top-
down meets bottom-up for instance segmentation (2020). arXiv:2001.
00309.

[44] S. Qiao, L.-C. Chen, A. Yuille, Detectors: Detecting objects with recursive
feature pyramid and switchable atrous convolution (2020). arXiv:2006.
02334.

[45] X. Wang, T. Kong, C. Shen, Y. Jiang, L. Li, Solo: Segmenting objects by


locations (2020). arXiv:1912.04488.

[46] R. Fan, M.-M. Cheng, Q. Hou, T.-J. Mu, J. Wang, S.-M. Hu, S4net: Single
stage salient-instance segmentation (2019). arXiv:1711.07618.

[47] Y.-H. Wu, Y. Liu, L. Zhang, W. Gao, M.-M. Cheng, Regularized densely-
connected pyramid network for salient instance segmentation, IEEE Trans-
actions on Image Processing 30 (2021) 3897–3907. doi:10.1109/tip.
2021.3065822.
URL http://dx.doi.org/10.1109/TIP.2021.3065822


Yang Wang

Yang Wang received his master's degree from Nanjing University of Aeronautics and Astronautics in 2017. He is currently studying for a Ph.D. at Nanjing University of Aeronautics and Astronautics. His research interest lies in the areas of computer vision.

Wanlin Zhou

Wanlin Zhou, born in January 1964, holds a Ph.D. degree. He is now a professor and doctoral supervisor in the Department of Aerospace Manufacturing Engineering, School of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics. From 2001 to 2005, he worked in the State Key Laboratory of Aeronautical Intelligent Material and Structure, Nanjing University of Aeronautics and Astronautics, where he obtained his doctorate in engineering. In recent years, he has mainly been engaged in research on intelligent materials and structures, artificial intelligence, intelligent manufacturing, industrial image recognition technology, and structural health monitoring in aviation engineering.

Qinwei Lv

Qinwei Lv is studying for a master's degree at Nanjing University of Aeronautics and Astronautics. His research interests lie in the areas of computer vision.

Guangle Yao

Guangle Yao received his Ph.D. degree from the University of Electronic Science and Technology of China in 2019. He is now an associate professor at Chengdu University of Technology. His research interests lie in the areas of computer vision.
