Yue Wu1, Yinpeng Chen2, Lu Yuan2, Zicheng Liu2, Lijuan Wang2, Hongzhi Li2 and Yun Fu1
1 Northeastern University, 2 Microsoft
{yuewu,yunfu}@ece.neu.edu, {yiche,luyuan,zliu,lijuanw,hongzhi.li}@microsoft.com
arXiv:1904.06493v4 [cs.CV] 2 Apr 2020
Abstract

Two head structures (i.e. fully connected head and convolution head) have been widely used in R-CNN based detectors …

[Figure 1: head structure diagram. RoI features pooled by RoIAlign to 7×7×256, followed by two fully connected layers of width 1024, predicting class and box; graphic not recoverable.]
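As a concrete sketch of the fc-head data flow pictured above (the 7×7×256 RoI feature and the two 1024-wide fully connected layers come from the figure; the 81-way classifier, the 4 box deltas, and the random weights are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# RoIAlign output for one RoI: 256 channels on a 7x7 grid (from the figure).
roi_feat = rng.standard_normal((256, 7, 7))

# Two fully connected layers of width 1024, as in the diagram.
W1 = rng.standard_normal((256 * 7 * 7, 1024)) * 0.01
W2 = rng.standard_normal((1024, 1024)) * 0.01

x = roi_feat.reshape(-1)        # flatten to a 256*7*7 = 12544-dim vector
h1 = np.maximum(x @ W1, 0.0)    # fc + ReLU
h2 = np.maximum(h1 @ W2, 0.0)   # fc + ReLU

# Separate linear predictors for classification and box regression
# (81 = 80 COCO classes + background, 4 box deltas; assumed sizes).
W_cls = rng.standard_normal((1024, 81)) * 0.01
W_box = rng.standard_normal((1024, 4)) * 0.01
cls_scores, box_deltas = h2 @ W_cls, h2 @ W_box
print(cls_scores.shape, box_deltas.shape)  # (81,) (4,)
```

Because every flattened position gets its own weights in `W1`, the transformation is unshared across spatial locations, which is the spatial sensitivity discussed in Section 3.4.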
Figure 3. Pearson correlation coefficient (PCC) between classification scores and IoUs. Left: PCC of predefined proposals for large, medium and small objects. Right: PCC of proposals generated by RPN and detected boxes after NMS.
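The PCC reported in Figure 3 measures how monotonically classification scores track proposal IoUs; a minimal, self-contained sketch with synthetic data (the IoU/score values below are made up for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: scores that rise with IoU give a PCC close to 1,
# mimicking the fc-head behavior reported in the paper.
ious   = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [0.2, 0.35, 0.55, 0.8, 0.95]
print(round(pearson(ious, scores), 3))
```

In the paper's analysis, a higher PCC for fc-head means its scores rank high-IoU boxes higher, which directly benefits AP computation.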
Figure 2. Comparison between fc-head and conv-head. Top row: mean and standard deviation of classification scores. Bottom row: mean and standard deviation of IoUs between regressed boxes and their corresponding ground truth. Classification scores in fc-head are more correlated to proposal IoUs, while conv-head has better regression results than fc-head.
and IoUs of regressed boxes. Figure 2 shows the results for small, medium and large objects.

3.2. Comparison on Classification Task

The first row of Figure 2 shows the classification scores for both fc-head and conv-head. Compared to conv-head, fc-head provides higher scores for proposals with higher IoUs. This indicates that classification scores of fc-head are more correlated to IoUs between proposals and corresponding ground truth than those of conv-head, especially for small objects. To validate this, we compute the Pearson correlation coefficient (PCC) between proposal IoUs and classification scores. The results (shown in Figure 3 (Left)) demonstrate that the classification scores of fc-head are more correlated to the proposal IoUs.

We also compute the Pearson correlation coefficient for the proposals generated by RPN [35] and the final detected boxes after NMS. Results are shown in Figure 3 (Right). Similar to the predefined proposals, fc-head has higher PCC than conv-head. Thus, the detected boxes with higher IoUs are ranked higher when calculating AP due to their higher classification scores.

3.3. Comparison on Localization Task

The second row of Figure 2 shows IoUs between the regressed boxes and their corresponding ground truth for both fc-head and conv-head. Compared to fc-head, the regressed boxes of conv-head are more accurate when the proposal IoU is above 0.4. This demonstrates that conv-head has better regression ability than fc-head.

3.4. Discussion

Why does fc-head show more correlation between the classification scores and proposal IoUs, and perform worse in localization? We believe it is because fc-head is more spatially sensitive than conv-head. Intuitively, fc-head applies unshared transformations (fully connected layers) over different positions of the input feature map; thus, the spatial information is implicitly embedded. The spatial sensitivity of fc-head helps distinguish between a complete object and part of an object, but it is not robust for determining the offset of the whole object. In contrast, conv-head uses a shared transformation (convolutional kernels) on all positions of the input feature map, and uses average pooling to aggregate.

Next, we inspect the spatial sensitivity of conv-head and fc-head. For conv-head, whose output feature map is a 7 × 7 grid, we compute the spatial correlation between any pair of locations using the cosine distance between the corresponding two feature vectors. This results in a 7 × 7 correlation matrix per cell, representing the correlation between the current cell and the other cells. Thus, the spatial correlation of an output feature map can be visualized by tiling the correlation matrices of all cells in a 7 × 7 grid. Figure 4 (Left) shows the average spatial correlation of conv-head over multiple objects. For fc-head, whose output is not a feature map but a feature vector with dimension 1024, we reconstruct its output feature map. This can be done by splitting the weight matrix of the fully connected layer (256·7·7×1024) …

Figure 4. Left: Spatial correlation in output feature map of conv-head. Middle: Spatial correlation in output feature map of fc-head. Right: Spatial correlation in weight parameters of fc-head. conv-head has significantly more spatial correlation in output feature map than fc-head. fc-head has a similar spatial correlation pattern in output feature map and weight parameters.

Figure 5. [Components of the convolution head: residual blocks on 1024×H×W features and a non-local block with θ, φ, g as 1×1 convolutions reducing 1024 to 512 channels; diagram not recoverable.]
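The spatial-correlation measurement of Section 3.4 can be sketched as follows; random features stand in for real conv-head outputs here, and cosine similarity between cell feature vectors serves as the correlation measure:

```python
import numpy as np

def spatial_correlation(feat):
    """For a C x H x W feature map, return an (H*W) x (H*W) matrix of
    cosine similarities between the feature vectors at every pair of cells."""
    c, h, w = feat.shape
    v = feat.reshape(c, h * w).T                     # one C-dim vector per cell
    v = v / np.linalg.norm(v, axis=1, keepdims=True) # unit-normalize each vector
    return v @ v.T                                   # cosine similarity matrix

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 7, 7))  # stand-in for a conv-head output map
corr = spatial_correlation(feat)
print(corr.shape)  # (49, 49)
```

Each row of `corr`, reshaped to 7×7, is the per-cell correlation matrix the paper tiles into the visualizations of Figure 4.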
s = s_fc + s_conv(1 − s_fc) = s_conv + s_fc(1 − s_conv),   (4)

where s_fc and s_conv are the classification scores from fc-head and conv-head, respectively. The increment from the first score (e.g. s_fc) is the product of the second score and the complement of the first score (e.g. s_conv(1 − s_fc)). This is different from [3], which combines all classifiers by averaging. Note that this fusion is only applicable when λ_fc ≠ 0 and λ_conv ≠ 1.

5. Experimental Results

We evaluate our approach on the MS COCO 2017 dataset [28] and the Pascal VOC07 dataset [8]. MS COCO 2017 has 80 object categories. We train on train2017 (118K images) and report results on val2017 (5K images) and test-dev (41K images). The standard COCO-style Average Precision (AP) with IoU thresholds from 0.5 to 0.95 is used as the evaluation metric. Pascal VOC07 has 20 object categories. We train on trainval with 5K images and report results on test with 5K images. We perform ablation studies to analyze the different components of our approach, and compare our approach to baselines and the state-of-the-art.

5.1. Implementation Details

Our implementation is based on the Mask R-CNN benchmark in PyTorch 1.0 [31]. Images are resized such that the shortest side is 800 pixels. We use no data augmentation for testing, and only horizontal flipping augmentation for training. The implementation details are described as follows:

Architecture: Our approach is evaluated on two FPN [26] backbones (ResNet-50 and ResNet-101 [15]), which are pretrained on ImageNet [6]. The standard RoI pooling is …

• Double-FC splits the classification and bounding box regression into two fully connected heads, which have identical structures.
• Double-Conv splits the classification and bounding box regression into two convolution heads, which have identical structures.
• Double-Head includes a fully connected head (fc-head) for classification and a convolution head (conv-head) for bounding box regression.
• Double-Head-Reverse switches the tasks between the two heads (i.e. fc-head for bounding box regression and conv-head for classification), compared to Double-Head.

The detection performances are shown in Table 1. The top group shows performances for single head detectors. The middle group shows performances for detectors with double heads. The weight for each loss (classification and bounding box regression) is set to 1.0. Compared to the middle group, the bottom group uses different loss weights for fc-head and conv-head (ω_fc = 2.0 and ω_conv = 2.5), which are set empirically.

Double-Head outperforms single head detectors by a non-negligible margin (2.0+ AP). It also outperforms Double-FC and Double-Conv by at least 1.4 AP. Double-Head-Reverse has the worst performance (drops 6.2+ AP compared to Double-Head). This validates our findings that fc-head is more suitable for classification, while conv-head is more suitable for localization.

Single-Conv performs better than Double-Conv. We believe that the regression task helps the classification task when sharing a single convolution head. This is supported by the sliding window analysis (see details in Section 3.1).
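The complementary score fusion of Eq. (4) can be checked numerically; this sketch verifies that the two orderings agree and that the fused score never falls below either head's score:

```python
def fuse(s_fc, s_conv):
    """Complementary fusion of classification scores, Eq. (4): each head's
    score is increased by the other score scaled by the remaining headroom."""
    return s_fc + s_conv * (1.0 - s_fc)

# The two orderings in Eq. (4) agree for any scores in [0, 1].
s_fc, s_conv = 0.8, 0.6
assert abs(fuse(s_fc, s_conv) - fuse(s_conv, s_fc)) < 1e-12

# The fused score dominates both inputs (0.8 + 0.6 * 0.2 = 0.92 here),
# unlike plain averaging, which pulls the higher score down.
print(fuse(s_fc, s_conv))
```

This is why the paper contrasts it with [3]: averaging two scores of 0.8 and 0.6 gives 0.7, whereas the complementary fusion gives 0.92.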
Table 1. Evaluations of detectors with different head structures on COCO val2017. The backbone is FPN with ResNet-50. The top …

                      fc-head      conv-head
                      cls   reg    cls   reg    AP
Single-FC             1.0   1.0    -     -      36.8
Single-Conv           -     -      1.0   1.0    35.9
Double-FC             1.0   1.0    -     -      37.3
Double-Conv           -     -      1.0   1.0    33.8
Double-Head-Reverse   -     1.0    1.0   -      32.6
Double-Head           1.0   -      -     1.0    38.8
Double-FC             2.0   2.0    -     -      38.1
Double-Conv           -     -      2.5   2.5    34.3
Double-Head-Reverse   -     2.0    2.5   -      32.0
Double-Head           2.0   -      -     2.5    39.5

Figure 6. Comparison between Single-Conv and Double-Conv. Left: mean and standard deviation of classification scores. Right: mean and standard deviation of IoUs between regressed boxes and their corresponding ground truth. Single-Conv has higher classification scores than Double-Conv, while regression results are comparable.

Figure 7. Comparison between Single-FC and Double-FC. (a): mean and standard deviation of classification scores. (b): zooming in of the box in plot (a). (c): mean and standard deviation of IoUs between regressed boxes and their corresponding ground truth. (d): zooming in of the box in plot (c). Double-FC has slightly higher classification scores and better regression results than Single-FC.

Figure 6 shows the comparison between Single-Conv and Double-Conv. Their regression results are comparable, but Single-Conv has higher classification scores than Double-Conv on proposals that have higher IoUs with the ground truth box. Thus, sharing regression and classification in a single convolution head encourages the correlation between classification scores and proposal IoUs. This allows Single-Conv to better determine whether a complete object is covered.

In contrast, sharing the two tasks in a single fully connected head (Single-FC) is not as good as separating them into two heads (Double-FC). We believe that adding the regression task to the same head with equal weight introduces conflict. This is supported by the sliding window analysis. Figure 7 shows that Double-FC has slightly higher classification scores and higher IoUs between the regressed boxes and their corresponding ground truth than Single-FC.

Depth of conv-head: We study the number of blocks for the convolution head. The evaluations are shown in Table 2. The first group has K residual blocks (Figure 5-(a-b)), while the second group has alternating (K + 1)/2 residual blocks and (K − 1)/2 non-local blocks (Figure 5-(c)). When using a single block in conv-head, the performance is slightly behind the FPN baseline (drops 0.1 AP) as it is too shallow. However, adding another convolution block boosts the performance substantially (gains 1.9 AP from the FPN baseline). As the number of blocks increases, the performance improves gradually with a decreasing growth rate. Considering the trade-off between accuracy and complexity, we choose a conv-head with 3 residual blocks and 2 non-local blocks (K = 5 in the second group) for the rest of the paper, which gains 3.0 AP from the baseline.

More training iterations: When increasing training iterations from 180k (1× training) to 360k (2× training), Double-Head gains 0.6 AP (from 39.8 to 40.4).

Balance weights λ_fc and λ_conv: Figure 8 shows APs for different choices of λ_fc and λ_conv. For each (λ_fc, λ_conv) pair, we train a Double-Head-Ext model. The vanilla Double-Head model corresponds to λ_fc = 1 and λ_conv = 1, while the other models involve supervision from …
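A non-local block of the kind used in the second group (Figure 5-(c)) can be sketched with plain matrix operations; the 1024→512 channel reduction and the θ, φ, g 1×1 transforms follow the standard non-local formulation [43], while the exact layer details of the paper's conv-head (e.g. the output projection and normalization) are assumptions here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Non-local block [43] on a C x H x W map: attention over all H*W
    positions, then a residual connection back onto the input."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w).T                  # HW x C, one row per position
    theta = flat @ w_theta                        # HW x C/2
    phi, g = flat @ w_phi, flat @ w_g             # HW x C/2 each
    attn = softmax(theta @ phi.T, axis=-1)        # HW x HW pairwise affinities
    y = (attn @ g) @ w_out                        # aggregate, project back to C
    return x + y.T.reshape(c, h, w)               # residual connection

rng = np.random.default_rng(0)
c = 1024
x = rng.standard_normal((c, 7, 7))                # conv-head feature map
w_theta, w_phi, w_g = (rng.standard_normal((c, c // 2)) * 0.01 for _ in range(3))
w_out = rng.standard_normal((c // 2, c)) * 0.01
out = nonlocal_block(x, w_theta, w_phi, w_g, w_out)
print(out.shape)  # (1024, 7, 7)
```

Because every output position attends to all 49 positions, a non-local block widens the head's receptive field far more cheaply than stacking many 3×3 convolutions, which is consistent with the depth ablation above.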
Figure 8. AP over balance weights λ_fc and λ_conv. For each (λ_fc, λ_conv) pair, we trained a Double-Head-Ext model. Note that the vanilla Double-Head is a special case with λ_fc = 1, λ_conv = 1. For each model, we evaluate AP in four ways: (a) using conv-head alone, (b) using fc-head alone, (c) using classification from fc-head and bounding box from conv-head, and (d) using classification fusion from both heads and bounding box from conv-head. Note that the first row in (a) and (d) is not available, due to the unavailability of classification in conv-head when λ_conv = 1. The last column in (b) is not available, due to the unavailability of bounding box regression in fc-head when λ_fc = 1. [Heatmaps not recoverable; panels: (a) cls: conv-head, reg: conv-head; (b) cls: fc-head, reg: fc-head; (c) cls: fc-head, reg: conv-head; (d) cls: fc-head+conv-head, reg: conv-head.]
Table 5. Object detection results (bounding box AP) on COCO val2017. Note that FPN baseline only has fc-head. Our Double-Head
and Double-Head-Ext outperform both Faster R-CNN and FPN baselines on two backbones (ResNet-50 and ResNet-101).
Table 6. Object detection results (bounding box AP), vs. state-of-the-art on COCO test-dev. All methods are in the family of two-stage
detectors with a single training stage. Our Double-Head-Ext achieves the best performance.
5.3. Main Results

Comparison with Baselines on VOC07: We conduct experiments on the Pascal VOC07 dataset; the results are shown in Table 4. Compared with FPN, our method gains 1.8 AP. Specifically, it gains 3.7 AP0.75 on the higher IoU threshold 0.75 and 1.0 AP0.5 on the lower IoU threshold 0.5.

Comparison with Baselines on COCO: Table 5 shows the comparison between our method and the Faster R-CNN [35] and FPN [26] baselines on COCO val2017. Our method outperforms both baselines on all evaluation metrics. Compared with FPN, our Double-Head-Ext gains 3.5 and 2.8 AP on the ResNet-50 and ResNet-101 backbones, respectively. Specifically, our method gains 3.5+ AP on the higher IoU threshold (0.75) and 1.4+ AP on the lower IoU threshold (0.5) for both backbones. This demonstrates the advantage of our method with double heads.

We also observe that Faster R-CNN and FPN have different preferences over object sizes when using the ResNet-101 backbone: Faster R-CNN has better AP on medium and large objects, while FPN is better on small objects. Even compared with the best performance of FPN and Faster R-CNN across different sizes, our Double-Head-Ext gains 1.7 AP on small objects, 2.1 AP on medium objects and 2.5 AP on large objects. This demonstrates the superiority of our method, which leverages the advantage of fc-head on classification and the advantage of conv-head on localization.

Comparison with State-of-the-art on COCO: We compare our Double-Head-Ext with the state-of-the-art methods on MS COCO 2017 test-dev in Table 6. ResNet-101 is used as the backbone. For a fair comparison, the performance of single-model inference is reported for all methods. Here, we only consider two-stage detectors with a single training stage. Our Double-Head-Ext achieves the best performance with 42.3 AP. This demonstrates the superior performance of our method. Note that Cascade R-CNN [3] is not included as it involves multiple training stages. Even though our method has only one training stage, its performance is only slightly below Cascade R-CNN (42.8 AP).

6. Conclusions

In this paper, we perform a thorough analysis and find the interesting fact that two widely used head structures (convolution head and fully connected head) have opposite preferences towards the classification and localization tasks in object detection. Specifically, fc-head is more suitable for the classification task, while conv-head is more suitable for the localization task. Furthermore, we examine the output feature maps of both heads and find that fc-head has more spatial sensitivity than conv-head. Thus, fc-head has more capability to distinguish a complete object from part of an object, but is not robust for regressing the whole object. Based upon these findings, we propose a Double-Head method, which has a fully connected head focusing on classification and a convolution head for bounding box regression. Without bells and whistles, our method gains +3.5 and +2.8 AP on the MS COCO dataset over FPN baselines with ResNet-50 and ResNet-101 backbones, respectively. We hope that our findings are helpful for future research in object detection.
A. APPENDIX

Figure A.1. Comparison between fc-head and conv-head on easy, medium and hard classes. Top row: mean and standard deviation of classification scores. Bottom row: mean and standard deviation of IoUs between regressed boxes and their corresponding ground truth.

… head on easy, medium and hard classes using sliding window … category. We rank all object classes based upon the AP results of the FPN [26] baseline. Each difficulty group of the object category (easy/medium/hard) takes one third of the classes. The results are shown in Figure A.1. Similarly, classification scores of fc-head are more correlated to IoUs between proposals and corresponding ground truth than those of conv-head, especially for hard objects. The corresponding …

… cantly compared with using conv-head alone and (ii) regres…
Figure A.4. fc-head is more suitable for classification than conv-head. This figure includes three cases. Each case has two rows. The bottom row shows the ground truth and the detection results using conv-head alone, fc-head alone, and our Double-Head-Ext (from left to right). conv-head misses objects in the red box. In contrast, these missing objects are successfully detected by fc-head (shown in the corresponding green box). The top row zooms in on the red and green boxes, and shows the classification scores from the two heads for each object. The objects missed by conv-head have small classification scores, compared to fc-head.
Figure A.5. conv-head is more suitable for localization than fc-head. This figure includes two cases. Each case has two rows. The bottom row shows the ground truth and the detection results using conv-head alone, fc-head alone, and our Double-Head-Ext (from left to right). fc-head has a duplicate detection for the baseball bat in case I and for the surfboard in case II (in the red box). The duplicate detection is generated from an inaccurate proposal (shown in the top row), but is not removed by NMS due to its low IoU with the other detection boxes. In contrast, conv-head has more accurate box regression for the same proposal, with higher IoU with the other detection boxes; thus it is removed by NMS, resulting in no duplication.
References

[1] MSCOCO instance segmentation challenges 2018, Megvii (Face++) team. http://presentations.cocodataset.org/ECCV18/COCO18-Detect-Megvii.pdf, 2018.
[2] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[4] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29, pages 379–387, 2016.
[5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[7] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[8] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[9] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[10] Ross Girshick. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2888–2897, 2019.
[17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[18] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask Scoring R-CNN. In CVPR, 2019.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[20] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–799, 2018.
[21] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
[22] Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng. CornerNet-Lite: Efficient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019.
[23] Ang Li, Xue Yang, and Chongyang Zhang. Rethinking classification and localization for cascade R-CNN. In Proceedings of the British Machine Vision Conference (BMVC), 2019.
[24] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[25] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-Head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
[26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, July 2017.
[27] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[30] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[31] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
[32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[33] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[34] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[37] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Robert Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014), CBLS, April 2014.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[39] Zhiyu Tan, Xuecheng Nie, Qi Qian, Nan Li, and Hao Li. Learning to rank proposals for object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[40] Lachlan Tychsen-Smith and Lars Petersson. DeNet: Scalable real-time object detection with directed sparse sampling. In Proceedings of the IEEE International Conference on Computer Vision, pages 428–436, 2017.
[41] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6877–6885, 2018.
[42] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[44] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, Navaneeth Bodla, and Rama Chellappa. Deep regionlets for object detection. In The European Conference on Computer Vision (ECCV), September 2018.
[45] Haichao Zhang and Jianyu Wang. Towards adversarially robust object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[46] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[47] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[48] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
[49] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, 2019.