MetricMask: Single Category Instance Segmentation by Metric Learning
Yang Wang, Wanlin Zhou, Qinwei Lv, Guangle Yao
Abstract
∗ Corresponding author
Email address: youngnuaa@gmail.com (Yang Wang)
[Figure 1 diagram: 3×3 Conv+BN+ReLU followed by 1×1 Conv layers map features to the center map and offset map outputs.]
Figure 1: MetricMask head, which consists of two main components: the segmentation head and the detection head. The figure shows the head network structure and its outputs. To visually display the embedding vector map, we use PCA to reduce the embedding vectors to 3 dimensions. In the embedding vector map, the colors within the same instance are close, while the colors of different instances differ strongly. Close colors indicate similar embedding vectors, and large color differences indicate significantly different embedding vectors. This shows that the network outputs distinguishable embedding vectors for different instances.
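For readers who want to reproduce this visualization, the reduction can be sketched as below. This is a minimal NumPy example; the function name and the min-max scaling to RGB are choices of ours, and only the 4-dimensional embedding shape follows the paper.

import numpy as np

def embedding_to_rgb(embedding_map):
    # Project an (H, W, 4) embedding map to 3 channels with PCA for display.
    h, w, d = embedding_map.shape
    flat = embedding_map.reshape(-1, d).astype(np.float64)
    flat -= flat.mean(axis=0)              # center the pixel embeddings
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    rgb = flat @ vt[:3].T                  # keep the top-3 principal components
    # Scale each channel to [0, 1] so it can be shown as a color image (our choice).
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)
    return rgb.reshape(h, w, 3)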
1. Introduction
Object detection [1] provides the classes of the objects in an image and gives their locations in the form of bounding boxes. Semantic segmentation [2] provides fine-grained results by predicting the label of every pixel in the input image; each pixel is labeled according to the object class that encloses it. Furthering this evolution, instance segmentation gives different labels to separate objects belonging to the same category. Hence, instance segmentation may be defined as the technique of simultaneously solving object detection and semantic segmentation.
Intuitively, instance segmentation can be solved by bounding box detection followed by semantic segmentation within each box, as adopted by two-stage methods such as Mask R-CNN [3] and CenterMask [4]. Recently, the vision community has spent more effort on designing simpler pipelines for bounding box detection [5, 6, 7, 8, 9] and semantic segmentation [2, 10, 11]. We aim to use these simple and excellent methods to complete the task of instance segmentation. The recent object detector CenterNet [5] and the semantic segmentation method SFnet [10] are employed to instantiate our instance segmentation method, mainly for their simplicity. Note that it is possible to use other methods such as RetinaNet [12], YOLOF [7], or BiSeNetV2 [13] with minimal modification to our framework. Therefore, we propose a conceptually simple instance-level mask prediction module that can be easily plugged into many off-the-shelf object detection and semantic segmentation methods.
Instance segmentation is usually solved by binary classification in a spatial layout surrounded by bounding boxes. Each object needs to be segmented separately, which is expensive. Instead, we point out that instance-level masks can be recovered successfully and efficiently once the bounding box, the category-level mask, and the per-pixel embedding vectors are obtained. Specifically, we propose MetricMask, which converts instance segmentation into object detection, semantic segmentation, and metric learning of pixel embedding vectors, rather than segmenting individual instances independently. The object detection and semantic segmentation methods are executed in parallel, and a new branch is added on the head of semantic segmentation to predict the embedding vector of each pixel. Figure 1(a) shows the semantic segmentation head, which yields a mask map and an embedding vector map. During training, embedding vectors from the same instance in the embedding vector map are attracted to each other, and embedding vectors from different instances are pushed apart.
When the metric loss is applied directly to all instances, a lot of GPU memory is consumed. Therefore, we introduce a random sampling method that saves GPU memory without losing instance segmentation accuracy.
3. Based on the model outputs (bounding boxes, per-pixel embedding vectors, and the category-level mask), we introduce a Metric Operation to aggregate all pixels belonging to the same instance. The input, output, and final results of MetricMask are shown in Figure 2.
2. Related Work
CenterNet [5] detects the center points of objects and regresses their sizes. FCOS [6] regards all the locations inside a bounding box as positives and predicts the four distances between each positive location and the four sides of the box. All of these detectors can be embedded in our method. We chose CenterNet [5] for object detection because of its simplicity.
Triplet loss takes three samples: an anchor instance, a positive instance, and a negative instance. It encourages the distance between positive pairs (anchor, positive) to be smaller than the distance between negative pairs (anchor, negative) by a margin. In recent years, metric learning has been successfully applied to instance segmentation tasks [20, 21, 22, 23]. Metric learning computes the likelihood that two pixels belong to the same object instance and ensures that pixels belonging to the same object have high similarity, while pixels of different objects are precisely the opposite. In this paper, metric learning is also applied to optimize pixel embedding vectors. Different from the above methods, we only need to optimize the embedding vectors of pixels that come from the intersecting regions of instance bounding boxes.
two-stage instance segmentation method Mask R-CNN [3]. PolarMask [27] introduces a novel method for instance segmentation by improving the anchor-free object detector FCOS [6], segmenting instance masks in polar coordinates. SOLOv2 [28] divides the image into a grid of S×S cells and predicts the semantic category and the instance mask in the grid cell into which the center of an object falls. Different from previous works that typically solve instance segmentation as bounding box detection plus semantic segmentation within each box, we convert the instance segmentation task into three parallel tasks: object bounding box regression, mask regression, and per-pixel embedding vector regression. Moreover, our method can segment all instances at once.
Figure 3: The architecture of MetricMask. s3 to s6 denote the feature maps in the feature pyramid of the backbone network. The detection and segmentation branches predict the bounding boxes of objects, the objects' masks, and the embedding vector of each pixel. Finally, the network's output is passed to the Metric Operation to obtain the instance-level mask.
3. Our Method
In this section, we first briefly introduce the overall architecture of MetricMask, including the detection branch, the segmentation branch, and the loss function. Then, we introduce the core idea of MetricMask and propose a random sampling metric loss. Finally, we propose a new Metric Operation to solve the instance clustering problem.
There are two branches on top of the backbone network, namely the detection and segmentation branches. They are composed of an FPN and task-specific heads. The detection FPN and detection head are shown in Figure 4(a) and Figure 1(b); they are built on the CenterNet [5] object detector. In short, given an input image $I \in \mathbb{R}^{w \times h \times 3}$, the detection branch produces three feature maps, namely a center map $\in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 1}$, a scale map $\in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 4}$, and an offset map $\in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$, which are used to obtain the center position, size, and offset of each object.
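To make the decoding concrete, a single box can be recovered from the three maps as sketched below. This is a hedged NumPy sketch, not the authors' code: the function name and the score threshold are ours, and the reading of the 4-channel scale map as (l, t, r, b) distances to the box sides follows Figure 7 but is our assumption.

import numpy as np

def decode_top_box(center_map, scale_map, offset_map, score_thresh=0.3):
    # Find the strongest peak on the center map (all maps are at 1/4 resolution).
    ys, xs = np.unravel_index(np.argmax(center_map[..., 0]), center_map.shape[:2])
    if center_map[ys, xs, 0] < score_thresh:
        return None
    ox, oy = offset_map[ys, xs]                # sub-pixel offset (assumed (dx, dy) order)
    cx, cy = (xs + ox) * 4.0, (ys + oy) * 4.0  # back to input-image coordinates
    l, t, r, b = scale_map[ys, xs] * 4.0       # distances to the four sides (our reading)
    return cx - l, cy - t, cx + r, cy + b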
The segmentation FPN and segmentation head are shown in Figure 4(b) and Figure 1(a); they are built on the SFnet [10] segmentation method. Compared with the original semantic segmentation method, we add an embedding vector map output on the segmentation head, which is used to predict the embedding vector of each pixel. In short, given an input image $I \in \mathbb{R}^{w \times h \times 3}$, the segmentation branch produces two feature maps, namely a mask map $\in \mathbb{R}^{w \times h \times 1}$ and an embedding vector map $\in \mathbb{R}^{w \times h \times 4}$, which are employed to obtain the category-level mask of the objects and the embedding vector of each pixel.
Figure 4: The FPN modules of MetricMask. (a) The FPN module of the detection branch, which comes from CenterNet. (b) The FPN module of the segmentation branch, which comes from SFnet. (c) The architecture of the FAM, which is applied to fuse feature maps of different scales; the FAM is the core module of SFnet. The numbers inside the boxes represent the sampling rates relative to the input image. The PPM module is widely applied in semantic segmentation to increase the receptive field [1].
CenterNet [5] defines three ground truth targets, namely a center target $C \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 1}$, a scale target $S \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$, and an offset target $O \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 2}$, which are used to compute the network loss against the center map, scale map, and offset map. Compared with the original CenterNet method, we make four modifications.
1. The actual location of an object's center point is $(\frac{x_k}{4}, \frac{y_k}{4})$. Instead of assigning the single position $(\lfloor \frac{x_k}{4} \rfloor, \lfloor \frac{y_k}{4} \rfloor)$ as positive, as in CenterNet [5], we assign the four integer positions around the actual location as positives and all others as negatives in the center target $C \in \mathbb{R}^{\frac{w}{4} \times \frac{h}{4} \times 1}$, reducing the sample imbalance. Figure 5 illustrates our design.
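A minimal sketch of this target assignment is given below; drawing a plain binary target rather than CenterNet's Gaussian heatmap is a simplification of ours, and the function name is hypothetical.

import numpy as np

def make_center_target(centers, out_h, out_w):
    # `centers` holds (x_k, y_k) in input-image coordinates.
    target = np.zeros((out_h, out_w, 1), dtype=np.float32)
    for xk, yk in centers:
        cx, cy = xk / 4.0, yk / 4.0                # real center at stride 4
        x0, y0 = int(np.floor(cx)), int(np.floor(cy))
        for dx in (0, 1):                          # the four positions around (cx, cy)
            for dy in (0, 1):
                x, y = x0 + dx, y0 + dy
                if 0 <= x < out_w and 0 <= y < out_h:
                    target[y, x, 0] = 1.0          # mark as positive
    return target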
4. The GIoU loss [33] has been proven effective. We add a GIoU loss on top of the Smooth L1 loss [34] for network optimization.
The loss function of the detection branch is as follows:
$$L_d = \lambda_{cls} L_{cls} + \lambda_{scale} L_{scale} + \lambda_{offset} L_{offset} + \lambda_{giou} L_{giou}$$
where $\lambda_{cls}$, $\lambda_{scale}$, $\lambda_{offset}$, and $\lambda_{giou}$ are the weights of the respective losses. The classification loss $L_{cls}$, scale regression loss $L_{scale}$, and offset regression loss $L_{offset}$ are the same as those in CenterNet [5]. $L_{giou}$ is the GIoU loss, which is introduced to optimize the offset and scale:
where $\hat{s}_k$ comes from the scale map and represents the predicted scale, $s_k$ comes from the scale target and represents the ground-truth scale, $\hat{o}_k$ comes from the offset map and represents the predicted offset, and $o_k$ comes from the offset target and represents the ground-truth offset.
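Since the exact formula is given in [33], we only sketch the GIoU term here for completeness; boxes are assumed to have been decoded to (x1, y1, x2, y2) corners upstream, and the vectorized NumPy form is ours.

import numpy as np

def giou_loss(pred, gt, eps=1e-7):
    # Intersection of predicted and ground-truth boxes, shape (N, 4) each.
    ix1, iy1 = np.maximum(pred[:, 0], gt[:, 0]), np.maximum(pred[:, 1], gt[:, 1])
    ix2, iy2 = np.minimum(pred[:, 2], gt[:, 2]), np.minimum(pred[:, 3], gt[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / (union + eps)
    # Smallest box enclosing both; the GIoU penalty uses its excess area.
    ex1, ey1 = np.minimum(pred[:, 0], gt[:, 0]), np.minimum(pred[:, 1], gt[:, 1])
    ex2, ey2 = np.maximum(pred[:, 2], gt[:, 2]), np.maximum(pred[:, 3], gt[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / (enclose + eps)
    return np.mean(1.0 - giou)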
Figure 5: Illustration of the center and offset targets. (a) The design of CenterNet. (b) The design of our modification [31, 32]. The yellow circle is the real center, while the square boxes denote the four points next to it. Blue indicates positive, grey negative. The dashed line denotes the offset, which points from the positives to the real center.
SFnet [10] defines a ground truth mask target $Mask \in \mathbb{R}^{w \times h \times c}$, which is used to compute the network loss against the mask map. Compared with the original SFnet [10] method, we add a branch to predict the embedding vector of each pixel, which is optimized by metric learning.
The loss function of the segmentation branch is as follows:
$$L_{seg} = \lambda_{mask} L_{mask} + \lambda_{inst} L_{inst}$$
where $L_{seg}$ is the total loss of the segmentation branch, and $\lambda_{mask}$ and $\lambda_{inst}$ are the weights of the respective losses. $L_{inst}$ is the loss on the embedding vectors, which is introduced in detail in Section 3.4. $L_{mask}$ is the segmentation loss, which is the same as in SFnet [10].
The overall MetricMask loss function is as follows:
$$L_{sum} = L_d + L_{seg}$$
3.4. Metric Mask Segmentation
Figure 6: The positional relationships between two instances, where the green boxes are the objects' bounding boxes and the red box is the intersecting region of the bounding boxes. (a) There is no intersecting region between the two bounding boxes. (b) There is no mask in the intersecting region. (c) The mask of one instance lies in the intersecting region. (d) The masks of two instances lie in the intersecting region. Although the masks in (a) and (b) are category-level, we can obtain the instance-level masks based on the bounding boxes alone. When there is a mask in the intersecting region, as in (c) and (d), we cannot distinguish the instance-level masks based on the bounding boxes.
Conventional object detection and semantic segmentation yield the bounding box and mask of each object. The mask obtained by semantic segmentation is category-level and cannot distinguish individual instances. We add a branch on the head of semantic segmentation that yields the embedding vector of each pixel in the image. During training, we regard all the pixels of each instance as one class, so pixels from different instances belong to different classes. Metric learning is then employed to optimize intra-class compactness and inter-class discrepancy. Finally, we propose a Metric Operation to achieve instance segmentation. This is the core idea of our method. However, applying metric learning to the embedding vectors of all pixels requires a great deal of computing resources. From the bounding box and mask of an instance, we can derive three pieces of prior knowledge. 1. The mask of an instance lies inside its bounding box. 2. If the bounding box of an instance does not intersect with the bounding boxes of other instances, the mask inside the bounding box is the mask of that instance. 3. If the bounding box of an instance intersects with the bounding boxes of other instances, the mask of the disjoint region belongs to that instance, and we only need to determine which instance each mask pixel in the intersecting region comes from. Details are shown in Figure 6. Therefore, only the pixels in the intersecting regions need metric learning, which significantly reduces the demand for computing resources.
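The intersecting regions themselves follow from elementary box geometry; a small sketch (function name ours) is given below. In case (a) of Figure 6 it returns None, and no metric learning is needed for that pair.

import numpy as np

def intersect_region(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns the overlap box or None if disjoint.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if x1 >= x2 or y1 >= y2:
        return None
    return x1, y1, x2, y2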
As explained above, the embedding vector of each pixel in an intersecting region needs to be optimized. The implementation steps are as follows. There are many instances in an image, and each instance is treated as an anchor instance. If there is an intersecting region between the bounding box of the anchor instance and the bounding box of another instance, an instance pair $\{I_i, O_{ij}\}$ is formed, where $I_i$ denotes the $i$-th anchor instance and $O_{ij}$ denotes the $j$-th instance whose bounding box intersects that of the $i$-th anchor instance. Through the above computation, we obtain all instance pairs in an image, expressed as $\{\{I_1, O_{11}\}, \{I_1, O_{12}\}, \ldots, \{I_2, O_{21}\}, \{I_2, O_{22}\}, \ldots, \{I_i, O_{ij}\}\}$. Take the instance pair $\{I_i, O_{ij}\}$ as an example: if pixels belong to the instance pair and lie in the intersecting region, their embedding vectors need to be optimized. We propose the following optimization strategy: the first step defines the embedding vectors, and the second step optimizes them. 1. The embedding vectors at the center points of the instance pair $\{I_i, O_{ij}\}$ are denoted $e_I$ and $e_O$, respectively. If a pixel comes from the intersecting region of the instance pair $\{I_i, O_{ij}\}$ and belongs to instance $I_i$, its embedding vector is denoted $e_{I,i}$; if it belongs to instance $O_{ij}$, its embedding vector is denoted $e_{O,i}$. 2. The optimization of the embedding vectors is divided into two parts. In the first part, we make the distance between $e_I$ and $e_{I,i}$ smaller than the distance between $e_O$ and $e_{I,i}$. In the second part, we make the distance between $e_O$ and $e_{O,i}$ smaller than the distance between $e_I$ and $e_{O,i}$. The loss functions are constructed as follows:
$$L^1_{ij} = \begin{cases} \frac{1}{N} \sum_{i=0}^{N} \max\left(0,\; \Phi + \|e_I - e_{I,i}\|_2^2 - \|e_O - e_{I,i}\|_2^2\right) & \text{if } N > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$L^2_{ij} = \begin{cases} \frac{1}{M} \sum_{i=0}^{M} \max\left(0,\; \Phi + \|e_O - e_{O,i}\|_2^2 - \|e_I - e_{O,i}\|_2^2\right) & \text{if } M > 0 \\ 0 & \text{otherwise} \end{cases}$$
where $L^1_{ij}$ is the loss for instance $I_i$ and $L^2_{ij}$ is the loss for instance $O_{ij}$. $N$ is the number of pixels belonging to instance $I_i$ in the intersecting region of the instance pair $\{I_i, O_{ij}\}$, and $M$ is the number of pixels belonging to instance $O_{ij}$ in that region. $\Phi$ is the margin, which is set to 0.5.
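The two hinge terms can be written compactly as in the following NumPy sketch; the array names are ours, and the margin default matches Φ = 0.5.

import numpy as np

def pair_metric_loss(e_I, e_O, e_I_pix, e_O_pix, margin=0.5):
    # e_I, e_O: center embeddings of the two instances, shape (4,).
    # e_I_pix, e_O_pix: embeddings of intersecting-region pixels belonging
    # to Ii and Oij, shapes (N, 4) and (M, 4).
    def hinge(pix, own, other):
        if len(pix) == 0:
            return 0.0
        d_own = np.sum((pix - own) ** 2, axis=-1)      # squared L2 to own center
        d_other = np.sum((pix - other) ** 2, axis=-1)  # squared L2 to other center
        return np.mean(np.maximum(0.0, margin + d_own - d_other))
    return hinge(e_I_pix, e_I, e_O) + hinge(e_O_pix, e_O, e_I)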
Based on the above notation, we combine the loss functions of the instance pair $\{I_i, O_{ij}\}$:
$$L_p = L^1_{ij} + L^2_{ij}$$
where $L_p$ is the total loss of the instance pair $\{I_i, O_{ij}\}$. There are $P$ groups of instance pairs in an image, so we need to establish the loss over all instance pairs. The simplest method is to sum the losses of all instance pairs. However, when an image has a large number of instance pairs, many embedding vectors may need to be optimized, which requires a large amount of computing resources. Therefore, we design a random sampling method that randomly selects several instance pairs from the image for the loss computation; the unselected instance pairs do not participate in the loss. This dramatically reduces the demand for computing resources. The designed loss function is as follows:
$$L_{inst} = \begin{cases} \frac{1}{K} \sum_{i=0}^{K} \mathrm{random}_i(L_p) & \text{if } P > K \\ \frac{1}{P} \sum_{p=0}^{P} L_p & \text{otherwise} \end{cases}$$
where $L_{inst}$ is the final instance loss, $K$ is the number of randomly sampled instance pairs, which is set to 4, and $\mathrm{random}$ denotes random sampling without replacement.
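The sampling rule can be sketched as follows (names ours). Note that in actual training the K pairs would be chosen before the pair losses are computed, since skipping the unselected pairs is what saves GPU memory.

import numpy as np

def sampled_instance_loss(pair_losses, k=4):
    # pair_losses: list of per-pair losses L_p for the P pairs in an image.
    p = len(pair_losses)
    if p == 0:
        return 0.0
    if p > k:
        idx = np.random.choice(p, size=k, replace=False)  # sample without replacement
        return float(np.mean([pair_losses[i] for i in idx]))
    return float(np.mean(pair_losses))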
Figure 7: Visualization of the object centers of an instance pair. Left: box center; right: mass center. When two instances are very close, the box center may fall inside the other instance.
The center of the object is used by the detection branch and the metric loss. There are several choices for the center of an object, such as the box center or the mass center. Object detection tasks generally select the box center. Which center is better depends on its effect on instance prediction performance. Here we compare the box center with the mass center and conclude that the box center is more disadvantageous: the box center has a greater probability of falling outside the instance, and it may even fall on another instance. These problems do not affect the object detection task, but they are a disaster for predicting the embedding vector at the instance center. Some examples of the box center and mass center are shown in Figure 7. Nevertheless, there are some extreme cases in which the mass center is not located inside the instance either. We will consider this issue in our future work.
The detection branch produces the center map, scale map, and offset map. We apply greedy NMS [35] to obtain the bounding box and the center of each object. The segmentation branch produces the mask map and the embedding vector map. According to the prior knowledge proposed in Section 3.4.1, we only need to distinguish the pixels in the intersecting regions of instances. We first take the embedding vectors of the intersecting region and of the instance centers from the embedding vector map and then visualize them. Figure 8 shows the spatial relationship between the embedding vectors: Figure 8(c) shows that embedding vectors from the same instance are close to each other, while embedding vectors from different instances are far apart. Therefore, combining the outputs of the detection branch and the segmentation branch, we propose the Metric Operation to complete the instance segmentation. Take the Metric Operation on an instance $I_i$ as an example; the steps are as follows.
1. Obtain the relevant parameters of the instance $I_i$: the bounding box of $I_i$, the regions intersecting with $I_i$, the embedding vector at the center of $I_i$, the embedding vectors of the intersecting regions, the mask map, and the embedding vectors at the centers of all instances that intersect with $I_i$.
2. Apply the bounding box of the instance $I_i$ to crop the mask from the mask map. (The cropped mask may contain pixels of other instances in the intersecting regions; we remove the pixels of other instances to obtain the mask of instance $I_i$.)
import numpy as np

def getInstanceMap(bbx, iou_bbxes, center_vector, center_vectors, mask_map,
                   embedding_vector_map):
    """
    :bbx: instance Ii bounding box. (4)
    :iou_bbxes: intersecting regions between instance Ii and other instances. (N×4)
    :center_vector: embedding vector of instance Ii center. (1×4)
    :center_vectors: embedding vectors of the instance centers that intersect with instance Ii. (N×4)
    :mask_map: (H×W)
    :embedding_vector_map: (H×W×4)
    :return instance_map: (H×W)
    """
    instance_map = np.zeros(mask_map.shape)
    x1, y1, x2, y2 = bbx
    # Crop the category-level mask with the bounding box of instance Ii.
    instance_map[y1:y2, x1:x2] = mask_map[y1:y2, x1:x2]
    for index in range(len(iou_bbxes)):
        x1, y1, x2, y2 = iou_bbxes[index]
        region = mask_map[y1:y2, x1:x2]
        if region.sum() == 0:
            continue  # no mask pixels in this intersecting region
        other_center_vector = center_vectors[index]
        region_feature = embedding_vector_map[y1:y2, x1:x2]
        # Squared embedding distances to the two instance centers.
        diff_region_bbx = np.sum(np.power(region_feature - center_vector, 2), -1)
        diff_region_iou = np.sum(np.power(region_feature - other_center_vector, 2), -1)
        # Keep only the mask pixels that are closer to the center of instance Ii.
        min_feature_dist = (diff_region_bbx < diff_region_iou) * region
        instance_map[y1:y2, x1:x2] = instance_map[y1:y2, x1:x2] * min_feature_dist
    return instance_map
3. Classify the pixels in the intersecting regions. Take a pixel in an intersecting region as an example: it falls in the bounding boxes of multiple instances. We compute the distance between the pixel's embedding vector and the embedding vectors at the centers of those instances. If the distance to the center embedding of instance $I_i$ is the smallest and the pixel lies on the mask, then the pixel belongs to $I_i$. We process all pixels in the intersecting regions in this way, and as a result the mask of the instance $I_i$ is obtained. Figure 9 shows the pseudo-code of the Metric Operation in NumPy style.
Steps 1-3 are repeated until the masks of all instances are obtained. The instance segmentation results are shown in Figure 10.
Figure 10: Results of MetricMask on COCO val dataset with DLA-34-DCN, achieving 40.3%
mask AP.
4. Experiments
In this section, we first introduce the datasets and evaluation metrics used in our experiments (Section 4.1). Then, training details are described in Section 4.2. Next, we carefully examine our proposed design and conduct ablation experiments in Section 4.3. Finally, the results of our method and comparisons with previous state-of-the-art methods are provided in Section 4.4.
DLA-34 [29] is employed as our backbone network, and the same hyper-parameters as CenterNet [5] and SFnet [10] are applied. Specifically, our network is trained with stochastic gradient descent (SGD) [36] for 50 epochs, with an initial learning rate of 0.004 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at the 35th and 45th epochs. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We initialize our backbone network with weights pre-trained on ImageNet [30]. The input images are resized to 768×1280.
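For reference, the schedule above corresponds to roughly the following PyTorch setup; the stand-in model and the elided batch loop are placeholders, not the authors' code.

import torch
import torch.nn as nn

# Stand-in module so the sketch is self-contained; not the MetricMask network.
model = nn.Conv2d(3, 8, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.9, weight_decay=1e-4)
# Drop the learning rate 10x at epochs 35 and 45, as described above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[35, 45], gamma=0.1)

for epoch in range(50):
    # ... one pass over mini-batches of 16 images resized to 768x1280 ...
    scheduler.step()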
4.3. Ablation Study
Table 1: Comparison of the modifications of the detection branch on the COCO person category val set. AP^m is mask AP; AP^b is box AP.
Backbone | Size | GT Targets | GIoU | AP^m | AP^m_50 | AP^m_75 | AP^b | AP^b_50 | AP^b_75
The detection branch modification consists of two parts: the ground truth target design and the application of the GIoU loss [33]. We compare the modified detection branch with the typical configuration of CenterNet [5]. The proposed modifications bring significant improvements on both the detection and segmentation tasks. Results are reported in Table 1. Both the ground truth target design and the GIoU loss application yield specific improvements in box AP and mask AP: the ground truth target design increases the box AP by 1.4% and the mask AP by 1.0%, while the GIoU loss application adds 0.3% box AP and 0.4% mask AP. When the two parts are applied together, the improvement is more prominent: the box AP increases by 3.5% and the mask AP by 1.7%. Unless otherwise specified, we apply these modifications in the following experiments.
Table 2: Comparisons of the dimension of the embedding vector. The dimension does not affect the detection branch, but it has a significant effect on instance segmentation.
When the random sample number is 2, the result is the worst. When the random sample number is 4, 8, or 66, the results are similar. During training, as the random sample number increases, the GPU memory usage also increases.
Table 3: Comparison of the random sample number of instance pairs. The occupied GPU memory grows as the number of sampled instance pairs increases, but instance segmentation performance does not improve.
We analyze the influence of the choice of object center on the detection and segmentation results. The comparison experiments are shown in Table 4. The box center performs better on box AP, which is 1% higher than that of the mass center, but worse on mask AP, which is 1.4% lower. With the box center, regressing the object's bounding box is simple, so box AP performs better. However, when the box center is used as the object center, many box centers do not fall on the object's mask and may even fall on other objects, so mask AP performs worse.
Table 4: Comparisons of different object center options. When the mass center is employed as the object center, the performance of instance segmentation is significantly improved.
Backbone | Size | Mass-center | Box-center | AP^m | AP^m_50 | AP^m_75 | AP^b | AP^b_50 | AP^b_75
DLA-34 | 640×640 | | √ | 33.0 | 62.1 | 32.1 | 38.2 | 66.9 | 39.4
DLA-34 | 640×640 | √ | | 34.4 | 61.0 | 35.2 | 37.2 | 65.0 | 39.1
Table 5: Comparison of different training and test input sizes. * denotes multi-scale testing, using a 512×512 input to test large objects and 768×1280 to test small and medium objects. + indicates multi-scale augmentation and a longer training schedule (3×).
The performance is increased by 2.2%. 2. DCN convolution [37] is employed in the detection branch to enlarge the network's receptive field, and all objects are predicted at high resolution; the performance is increased by 3.4%. The second scheme requires only a single model and a single input resolution, which is preferable. Finally, when more data augmentation is applied and the training time is extended, the performance of the model improves by 6.7%.
Figure 11: Comparison with and without DCN on 768×1280 input. (a) Input image. (b) Mask map. (c) Bounding box and instance segmentation result. The first row shows the result without DCN, and the second row shows the result with DCN. The segmentation results of the two rows differ little, while the detection results are quite different: without DCN, the receptive field is insufficient and the bounding box cannot cover the object.
Table 6: Instance segmentation mask AP on the COCO val set.
In addition, our method is also a one-stage and anchor-free instance segmentation method, which is comparable with state-of-the-art methods.
Different from previous works that typically solve mask prediction as binary classification in a spatial layout, MetricMask converts the instance segmentation task into metric learning over pixel embedding vectors. MetricMask is simple and effective, comprising three tasks: bounding box regression, mask regression, and per-pixel embedding vector regression. We hope that the proposed MetricMask framework can provide a new direction for instance segmentation. In future work, we plan to extend our approach to multiple instance categories.
References
[2] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation (2015). arXiv:1411.4038.
[7] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, J. Sun, You only look one-level feature (2021). arXiv:2103.09460.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, Lecture Notes in Computer Science (2016) 21–37. doi:10.1007/978-3-319-46448-0_2.
[10] X. Li, A. You, Z. Zhu, H. Zhao, M. Yang, K. Yang, Y. Tong, Semantic flow
for fast and accurate scene parsing (2021). arXiv:2002.10120.
[11] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network (2017). arXiv:1612.01105.
[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense
object detection (2018). arXiv:1708.02002.
[13] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, N. Sang, Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation (2020). arXiv:2004.02147.
[15] G. Li, Y. Xie, L. Lin, Y. Yu, Instance-level salient object segmentation (2017). arXiv:1704.03604.
[16] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object
detection with region proposal networks (2016). arXiv:1506.01497.
[19] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2015.7298682.
[24] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance
segmentation (2018). arXiv:1803.01534.
[25] Z. Huang, L. Huang, Y. Gong, C. Huang, X. Wang, Mask scoring r-cnn (2019). arXiv:1903.00241.
[28] X. Wang, R. Zhang, T. Kong, L. Li, C. Shen, Solov2: Dynamic and fast
instance segmentation (2020). arXiv:2003.10152.
[29] F. Yu, D. Wang, E. Shelhamer, T. Darrell, Deep layer aggregation (2019). arXiv:1707.06484.
[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[31] J. Zhang, L. Lin, Y. Li, Y. chen Chen, J. Zhu, Y. Hu, S. C. H. Hoi, Attribute-aware pedestrian detection in a crowd (2019). arXiv:1910.09188.
[37] X. Zhu, H. Hu, S. Lin, J. Dai, Deformable convnets v2: More deformable, better results (2018). arXiv:1811.11168.
[39] J. Richeimer, J. Mitchell, Bounding box embedding for single shot person instance segmentation (2018). arXiv:1807.07674.
[40] R. Zhang, Z. Tian, C. Shen, M. You, Y. Yan, Mask encoding for single shot
instance segmentation (2020). arXiv:2003.11712.
[43] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, Y. Yan, Blendmask: Top-down meets bottom-up for instance segmentation (2020). arXiv:2001.00309.
[44] S. Qiao, L.-C. Chen, A. Yuille, Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution (2020). arXiv:2006.02334.
[46] R. Fan, M.-M. Cheng, Q. Hou, T.-J. Mu, J. Wang, S.-M. Hu, S4net: Single
stage salient-instance segmentation (2019). arXiv:1711.07618.
[47] Y.-H. Wu, Y. Liu, L. Zhang, W. Gao, M.-M. Cheng, Regularized densely-connected pyramid network for salient instance segmentation, IEEE Transactions on Image Processing 30 (2021) 3897–3907. doi:10.1109/tip.2021.3065822.
Biography of the author(s)
Yang Wang
Yang Wang received his master's degree from Nanjing University of Aeronautics and Astronautics in 2017. He is currently studying for a Ph.D. at Nanjing University of Aeronautics and Astronautics. His research interest lies in the areas of computer vision.
Wanlin Zhou
Wanlin Zhou, born in January 1964, holds a Ph.D. degree. He is a professor and doctoral supervisor in the Department of Aerospace Manufacturing Engineering, School of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics. From 2001 to 2005, he worked in the State Key Laboratory of Aeronautical Intelligent Material and Structure, Nanjing University of Aeronautics and Astronautics, where he obtained his doctoral degree in engineering. In recent years, he has mainly been engaged in research on intelligent materials and structures, artificial intelligence, intelligent manufacturing, industrial image recognition technology, and structural health monitoring for aviation engineering.
Qinwei Lv
Qinwei Lv is studying for a master's degree at Nanjing University of Aeronautics and Astronautics. His research interests lie in the areas of computer vision.
Guangle Yao
Guangle Yao received his Ph.D. degree from the University of Electronic Science and Technology of China in 2019. He is now an associate professor at Chengdu University of Technology. His research interests lie in the areas of computer vision.