
arXiv:1905.10011v1 [cs.CV] 24 May 2019

Light-Weight RetinaNet for Object Detection

Yixing Li (yixingli@asu.edu)
Fengbo Ren (renfengbo@asu.edu)
School of Computing, Informatics, and Decision Systems Engineering
Arizona State University

Abstract
Object detection has made great progress driven by the development of deep learning. Compared with the widely studied classification task, object detection generally needs one to two orders of magnitude more FLOPs (floating-point operations) for inference. To enable practical applications, it is essential to explore effective runtime-accuracy trade-off schemes. Recently, a growing number of studies have targeted object detection on resource-constrained devices, such as YOLOv1, YOLOv2, SSD, and MobileNetv2-SSDLite [11, 14, 16], whose accuracy on COCO test-dev [8] yields mAP around 22-25% (mAP-20-tier). In contrast, very few studies discuss the computation-accuracy trade-off for mAP-30-tier detection networks. In this paper, we illustrate why RetinaNet gives an effective computation-accuracy trade-off for object detection and how to build a light-weight RetinaNet. We propose to reduce FLOPs only in computationally intensive layers and keep the other layers unchanged. Compared with the most common approach for the FLOPs-accuracy trade-off – input image scaling – the proposed solution shows a consistently better FLOPs-mAP trade-off line. Quantitatively, the proposed method results in a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction.

1 Introduction
Object detection plays an important role in computer vision-based tasks [10, 16, 17]. It is a key module in face detection, object tracking, video surveillance, pedestrian detection, etc. [13, 19]. Recent developments in deep learning have boosted the performance of object detection tasks. However, in terms of computational complexity (FLOPs), a detection network can consume three orders of magnitude more FLOPs than a classification network, which makes it much more difficult to move towards low-latency inference.
Recently, there have been a growing number of studies investigating detection on resource-constrained devices, such as mobile platforms. As the main concern of resource-constrained devices is memory consumption, existing solutions such as YOLOv1, YOLOv2, SSD, and MobileNetv2-SSDLite [11, 14, 16] have pushed hard to reduce memory consumption by trading off accuracy. Their accuracy on the large-scale COCO test-dev 2017 dataset [8] yields mAP of 22-25%. Here, we use mAP as the indicator to categorize these solutions as mAP-20-tier. On the other side, in the mAP-30-tier, popular solutions include Faster R-CNN, RetinaNet, and YOLOv3
The source code is available at https://github.com/PSCLab-ASU/LW-RetinaNet.

[10, 15, 17] and their variants. As these solutions are commonly deployed on mid- or high-end GPUs or FPGAs, the memory resource is probably sufficient for preloading the weights. In addition, [7] also verifies the linear relation between the number of FLOPs and inference runtime for the same kind of network. When applying Faster R-CNN, RetinaNet, and YOLOv3 [10, 15, 17] to the COCO detection dataset with an input image around 600x600-800x800, the mAP falls in the range of 33%-36%. However, the FLOPs count of Faster R-CNN [17] is around 850 GFLOPs (gigaFLOPs), which is at least 5x more than that of RetinaNet and YOLOv3 [15]. Apparently, Faster R-CNN is not competitive in computational efficiency. From YOLOv2 [14] to YOLOv3 [15], it is interesting that the authors aggressively increased the number of FLOPs from 30 to 140 GFLOPs to gain an mAP improvement from 21% to 33%. Even so, its mAP is 2.5% lower than that of RetinaNet at 150 GFLOPs. This observation inspires us to take RetinaNet as the baseline to explore a more light-weight version.
There are two common methods to reduce the FLOPs of a detection network. One is to switch to another backbone, while the other is to reduce the input image size. Switching from one of the ResNet backbones [5] to a smaller one typically results in a noticeable accuracy drop, so it is generally not a good accuracy-FLOPs trade-off scheme for small variations. Reducing the input image size is an intuitive way to reduce FLOPs; however, its accuracy-FLOPs trade-off line degrades in an exponential trend [7]. There is an opportunity to find a more linear degradation trend line for a better accuracy-FLOPs trade-off. We propose to replace only certain branches/layers of the detection network with light-weight architectures and keep the rest of the network unchanged. For RetinaNet, the heaviest branch is the set of layers succeeding the finest FPN level (P3 in Fig. 2), which takes up 48% of the total FLOPs. We propose different light-weight architecture variants. More importantly, the proposed method can also be applied to other blockwise-FLOPs-imbalanced detection networks.

The contributions of this paper can be summarized as follows: (1) We propose to reduce only the heaviest bottleneck layer for the light-weight RetinaNet mAP-FLOPs trade-off. (2) The proposed solution shows a consistently better mAP-FLOPs trade-off line with a linear degradation trend, while the input image scaling method degrades in a more exponential trend. (3) Quantitatively, the proposed method results in a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction.

2 Related Works
2.1 High-end object detection networks (mAP-30-tier)
Faster RCNN [17] is an advanced architecture, which boosts both the accuracy and runtime performance over R-CNN and Fast R-CNN [1, 2].
The main body of Faster RCNN [17] is composed of three parts – a Feature Network, a Region Proposal Network (RPN), and a Detection Network. As Faster RCNN [17] replaces the selective search (used by Fast R-CNN) with the RPN, it significantly reduces the runtime of generating region proposals. However, in the Faster R-CNN inference stage, around 256-1000 boxes are still fed into the detection network, and processing this large batch of data is expensive. For a Faster RCNN processing the COCO detection task with an Inception-ResNetV2 [18] backbone, the total number of FLOPs can come up to
over 800 GFLOPs.

Figure 1: mAP/% versus GFLOPs of mAP-20-tier and mAP-30-tier object detection networks, including RetinaNet (input scales 400-800), YOLOv2/YOLOv3 (input scales 320-608), SSD, SSDLite-MobileNet(v2), and Faster RCNN with Inception-ResNetV2.

Compared with Faster RCNN [17], RetinaNet [10] targets a simpler design for gaining speedup. A feature pyramid network (FPN) [9] is attached to its backbone to generate multi-scale pyramid features. Then, the pyramid features go into classification and regression branches, whose weights can be shared across different levels of the FPN. The focal loss is applied to compensate for the accuracy drop, which makes its accuracy comparable with that of Faster RCNN.

2.2 Light-weight object detection networks (mAP-20-tier)


The YOLO network family is among the most popular. The most distinctive feature of YOLO is its predefined grid cells. The input image is divided into SxS grid cells, and each cell only predicts one object. On the plus side, this idea helps to reduce the computational complexity. However, it also increases the chance of undetected objects and gives relatively poor performance in detecting small objects. YOLOv1 [16] is only evaluated on relatively small datasets (PASCAL VOC), with the main contribution of enabling real-time inference. From YOLOv2 [14] to YOLOv3 [15], the mAP on the COCO test-dev 2015 dataset is boosted from 21.6% to 33.0%. It is worth noting that the accuracy gain of YOLOv3 comes along with a FLOPs increase from 63 GFLOPs to 141 GFLOPs. Interestingly, YOLOv3 [15] should not be categorized as a light-weight network anymore, as its FLOPs count and mAP are close to those of RetinaNet-ResNet50-FPN (156 GFLOPs and mAP = 35.7%). We also include other light-weight object detection networks such as SSD and SSDLite [11] in Fig. 1 to provide an overview of mAP-20-tier detection networks.
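To make the grid-cell scheme concrete, the assignment of a ground-truth box to a cell can be sketched as follows (an illustrative Python sketch of ours, not the YOLO implementation; the 448x448 input and S=7 follow YOLOv1's setup).

# Minimal sketch of YOLO-style grid-cell assignment: each ground-truth box is
# assigned to the single S x S cell containing its center, so two objects whose
# centers fall into the same cell compete for one prediction slot.
def assign_to_cell(box_center_xy, image_size=448, S=7):
    cx, cy = box_center_xy                 # box center in pixels
    cell = image_size / S                  # cell width/height in pixels
    col = min(int(cx // cell), S - 1)      # clamp to the grid bounds
    row = min(int(cy // cell), S - 1)
    return row, col

# Two nearby small objects land in the same cell, which is one reason the
# scheme struggles with small or crowded objects.
print(assign_to_cell((100, 120)))  # -> (1, 1)
print(assign_to_cell((110, 125)))  # -> (1, 1) as well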
As RetinaNet wins over YOLOv3 on both FLOPs and mAP, this inspires us to

Figure 2: RetinaNet (ResNet50-FPN-800x800) network architecture.

take RetinaNet as the baseline design to explore a better accuracy-FLOPs trade-off scheme for high-end detection tasks.

3 Light-weight RetinaNet
In Section 2, we explained why the RetinaNet network architecture has the potential to be tailored for a better accuracy-FLOPs trade-off. In this section, we first further analyze the RetinaNet network with a focus on the distribution of the number of floating-point operations (FLOPs) across different layers (Section 3.1). Then, Section 3.2 illustrates the scheme that helps RetinaNet lose weight.

3.1 RetinaNet Primer


The RetinaNet architecture is composed of three parts – a backbone, a feature pyramid network (FPN) [9], and a detection back-end, as shown in Fig. 2. The image is first processed by the backbone, which is usually a ResNet architecture. It is worth noting that although MobileNet [6] is on a par with ResNet in classification tasks, MobileNet is not an equivalent substitute for ResNet. From both [7] and our observations, using MobileNet [6] as the backbone for a detection task suffers a much larger accuracy drop than it does for classification. One main reason is that the confidence scores of a MobileNet-based backbone are reduced as a trade-off for lower computation cost. Therefore, it is not a desirable backbone choice for high-precision object detection networks. The backbone together with the succeeding FPN forms an encoder-decoder-like network. The benefit of the FPN is that it merges the features of consecutive layers from the coarsest to the finest level, which effectively propagates features at different levels and scales to the succeeding layers. The multi-scale pyramid features (P3-P7) are then fed into the back-end, where two detection branches are used for bounding box regression and object classification.

Figure 3: The FLOPs and memory (parameter) distribution of RetinaNet across different
blocks.

Note that the classification and bounding box branches do not share weights, while the weights of each branch are shared across pyramid features (P3-P7).
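The top-down merging described above can be sketched roughly as follows (a minimal PyTorch-style illustration, not the authors' Caffe2/Detectron code; the class name, backbone channel counts, and the omission of P6/P7 are simplifications of ours).

import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Illustrative top-down FPN merge: coarser features are upsampled and added
    to lateral 1x1 projections of finer backbone features (C3-C5), then smoothed
    by 3x3 convolutions to give P3-P5 (P6/P7 are omitted for brevity)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        m5 = self.lateral[2](c5)
        m4 = self.lateral[1](c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral[0](c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        return self.smooth[0](m3), self.smooth[1](m4), self.smooth[2](m5)  # P3, P4, P5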
The FLOPs distribution of RetinaNet across different blocks is shown in Fig. 3. Each block corresponds to the same block in Fig. 2. For the detection backend, D3-D7 refer to the layers succeeding P3-P7, respectively. As D3-D7 share the same weight parameters in the original design, we only show the average memory cost of D3-D7 in Fig. 3. The FLOPs count of the D3 block significantly dominates the total FLOPs count. This unbalanced FLOPs distribution is quite different from that of the ResNet architecture, which has small variation across different blocks. The unbalanced FLOPs distribution gives us the chance to obtain a good overall FLOPs reduction percentage by only reducing the cost of the heaviest layer. Quantitatively, for example, if we can reduce the FLOPs of D3 by half, the total FLOPs can be reduced by 24%. In the following subsections, we discuss the main insights of how to get a tiny back-end.
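The arithmetic behind this estimate is simple enough to sketch (an illustrative helper of ours; the 48.1% share is read from Fig. 3).

# Illustrative helper: overall FLOPs reduction obtained by shrinking a single
# block, given each block's share of the network's total FLOPs.
def overall_reduction(block_shares, target_block, block_reduction):
    """block_shares: {name: fraction of total FLOPs}; block_reduction: fraction
    of the target block's FLOPs that is removed."""
    return block_shares[target_block] * block_reduction

shares = {"D3": 0.481, "rest": 0.519}          # D3 takes ~48.1% of RetinaNet's FLOPs (Fig. 3)
print(overall_reduction(shares, "D3", 0.5))    # ~0.24, i.e. halving D3 removes ~24% overall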

3.2 Tiny back-end solution


3.2.1 Light-weight block

Intuitively, we can reduce the filter size to obtain a FLOPs reduction. As shown in Fig. 4, we propose different block designs for the detection branches of RetinaNet. D-block-v1 applies the MobileNet [6] building block: a 3x3 depth-wise (dw) convolution followed by a 1x1 convolution substitutes one original layer. D-block-v2 alternately places 1x1 and 3x3 kernels. It is inspired by YOLOv1 [16], which uses alternating 1x1 and 3x3 kernels without introducing residual blocks; in our design, we make it even simpler by keeping the number of filters fixed across different layers. D-block-v3 is more aggressive and replaces all the 3x3 convolutions with 1x1 convolutions.
Apparently, the light-weight blocks trade a certain accuracy drop for lower computation cost. Therefore, we introduce limited overheads to compensate for the accuracy drop, as stated in the next subsection.
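For reference, the three block variants in Fig. 4 can be sketched roughly as follows (a minimal PyTorch-style interpretation, not the authors' Caffe2 implementation; the four-layer tower, the ReLU placement, and the omission of the final output convolution are assumptions of ours).

import torch.nn as nn

C = 256  # number of channels of the pyramid features

def d_block_v1(num_layers=4):
    """MobileNet-style: each 3x3 conv is replaced by a 3x3 depth-wise conv
    followed by a 1x1 point-wise conv."""
    layers = []
    for _ in range(num_layers):
        layers += [nn.Conv2d(C, C, 3, padding=1, groups=C), nn.ReLU(inplace=True),
                   nn.Conv2d(C, C, 1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def d_block_v2(num_layers=4):
    """Alternating 1x1 and 3x3 kernels with a fixed number of filters."""
    layers = []
    for i in range(num_layers):
        k = 1 if i % 2 == 0 else 3
        layers += [nn.Conv2d(C, C, k, padding=k // 2), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def d_block_v3(num_layers=4):
    """All 3x3 convolutions replaced by 1x1 convolutions."""
    layers = [m for _ in range(num_layers)
              for m in (nn.Conv2d(C, C, 1), nn.ReLU(inplace=True))]
    return nn.Sequential(*layers)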

Figure 4: Light-weight blocks (D-block-v1, D-block-v2, D-block-v3) for the detection backend.

Figure 5: Fully (a) and partially (b) shared weights for the detection backend.

3.2.2 Partially shared weights

As illustrated in Section 3.2.1, the light-weight detection blocks trade accuracy for lower computational complexity. To compensate for the accuracy drop, we propose to replace the fully shared weight scheme of the original RetinaNet with a partially shared weight scheme. As shown in Fig. 5, P3-P7 are the multi-scale feature map outputs of the FPN, which are then fed into the detection backend blocks D3-D7, respectively. Although D3-D7 share the same weight parameters, they have unique input sizes (P3-P7) and are processed serially. Fig. 5(a) shows the original scheme, in which D3-D7 fully share the weights. In Fig. 5(b), only D4-D7 share the weights with the original configuration, while D3 is processed by the light-weight D-block-v1/v2/v3 proposed in Section 3.2.1.

Light-weight block   scale   mAP    ∆mAP/%   GFLOPs   ∆FLOPs/%
original             800     35.7   0        156      0
D-block-v1           800     34.3   1.4      135      15.4
D-block-v2           800     35.6   0.1      135      6.4
D-block-v3           800     35.1   0.6      89       15.4

Table 1: Comparison between different light-weight blocks.

The partially shared weight scheme has two main advantages. First, as D3 has its own independent weight parameters, it can learn features tailored to its branch, which compensates for the accuracy drop brought by the lower computational complexity. Second, it enables us to leave the rest of the network untouched and address only the heaviest bottleneck block. Also, as the backbone (ResNet-50) dominates the memory consumption (as shown in Fig. 3), the memory overhead here is negligible; quantitatively, the weight parameter increment introduced here is less than 1% of the total weights.
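A minimal sketch of the partially shared scheme in Fig. 5(b) is given below (illustrative PyTorch-style code, not the authors' implementation; make_original_head, the class name, and the handling of the output convolution are assumptions of ours, and light_head can be any of the D-block sketches above).

import torch.nn as nn

def make_original_head(c=256, num_layers=4):
    # Hypothetical stand-in for the standard RetinaNet head tower (4 x 3x3 conv + ReLU).
    layers = []
    for _ in range(num_layers):
        layers += [nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class PartiallySharedBackend(nn.Module):
    """One shared head tower serves D4-D7 (original configuration), while D3
    gets its own independent light-weight tower, so only the heaviest pyramid
    level is changed."""
    def __init__(self, light_head, num_out=36):  # num_out: e.g. 4*A for the box branch
        super().__init__()
        self.shared_head = make_original_head()                 # weights reused for P4-P7
        self.shared_out = nn.Conv2d(256, num_out, 3, padding=1)
        self.d3_head = light_head                               # independent weights for P3
        self.d3_out = nn.Conv2d(256, num_out, 1)                # assumption: 1x1 output conv for D3

    def forward(self, pyramid):                                 # pyramid = [P3, P4, ..., P7]
        outs = [self.d3_out(self.d3_head(pyramid[0]))]
        outs += [self.shared_out(self.shared_head(p)) for p in pyramid[1:]]
        return outs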

4 Results and discussion


4.1 Experimental setup
We perform our experiments in Caffe2 with 4 Titan X GPUs, building upon the open-source RetinaNet code in [3]. As the original work trains with 8 GPUs, we scale down the base learning rate by 2x and extend the training epochs by 2x, as suggested in [4]. Besides, [12] shows that a deep neural network is less prone to overfitting when its computational complexity is reduced by network compression, and suggests extending the training time accordingly for better accuracy. As we reduce the total FLOPs in light-weight RetinaNet, we further extend the training epochs by the same factor as the FLOPs reduction. In all the experiments, we fix the network configuration as RetinaNet-ResNet50-FPN.
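The schedule adjustment described above can be summarized by a small helper (a sketch of the scaling rule we follow; the 8-GPU baseline and base_lr = 0.01 are common Detectron defaults and are assumptions here).

# Illustrative scaling of the training schedule for fewer GPUs and reduced FLOPs.
def scaled_schedule(num_gpus, flops_reduction, base_gpus=8, base_lr=0.01, base_epochs=1.0):
    gpu_factor = base_gpus / num_gpus       # 8 -> 4 GPUs gives a factor of 2
    lr = base_lr / gpu_factor               # scale the learning rate down by 2x [4]
    epochs = base_epochs * gpu_factor       # train 2x longer for the smaller effective batch
    epochs *= flops_reduction               # further extend by the FLOPs reduction factor [12]
    return lr, epochs

print(scaled_schedule(num_gpus=4, flops_reduction=1.4))  # -> (0.005, 2.8), i.e. 2.8x the base schedule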

4.2 Performance on COCO dataset


The COCO dataset [8] is considered the most challenging one for object detection. As we target the trade-off for high-end object detection networks, we only perform experiments on the COCO dataset (as the original RetinaNet does [10]). We train the light-weight RetinaNet on the COCO 2017 training set and test it on COCO test-dev.
Table 1 shows the comparison among the different light-weight blocks proposed in Section 3.2.1. In this set of experiments, we only use the light-weight block in the regression branch (for the bounding box) of the detection backend, which is the upper branch of the detection backend shown in Fig. 2. The results in Table 1 show that D-block-v1 – the one with the MobileNet building block – has 0.8% lower mAP than D-block-v3, which has the same FLOPs reduction percentage. This aligns with our analysis in Section 3.1 that although MobileNet has proven to be a powerful light-weight classification architecture, the MobileNet building block is not guaranteed to be the best substitution for other vision-based tasks. Therefore, at the same scale of FLOPs reduction, we choose D-block-v3 instead of D-block-v1 in the following experiments. As D-block-v2 performs a less aggressive FLOPs reduction, its mAP is only reduced by 0.1%, which is a good trade-off for small-scale FLOPs reduction (15%).
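To see why the depth-wise block (D-block-v1) and the all-1x1 block (D-block-v3) can land at a similar FLOPs reduction, one can compare the per-pixel multiply-accumulates of a single 256-channel layer (a back-of-the-envelope sketch of ours, not a profiler measurement).

# Per-pixel multiply-accumulate counts of one 256-to-256 layer (bias ignored).
C = 256
standard_3x3 = 9 * C * C        # 589,824 (original head layer)
dw_plus_pw   = 9 * C + C * C    #  67,840 (D-block-v1: depth-wise 3x3 + point-wise 1x1)
conv_1x1     = C * C            #  65,536 (D-block-v3: plain 1x1)
print(dw_plus_pw / standard_3x3, conv_1x1 / standard_3x3)  # ~0.115 vs ~0.111 of the original cost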

                     Light-weight block   Detection backend
                                          Classification   Bounding box
LW-RetinaNet-v1      D-block-v2           √
LW-RetinaNet-v2      D-block-v3           √
LW-RetinaNet-v3      D-block-v3           √                √

Table 2: Configurations of different light-weight (LW) RetinaNet variants.

Model              scale   mAP    AP50   AP75   APS    APM    APL    GFLOPs   FLOPs reduction ratio
RetinaNet          800     35.7   55.0   38.5   18.9   38.9   46.3   156      0.0
RetinaNet          700     35.1   54.2   37.7   18.0   39.3   46.4   119      1.3x
RetinaNet          600     34.3   53.2   36.9   16.2   37.4   47.4   88       1.8x
LW-RetinaNet-v1    800     35.4   54.4   38.2   18.3   38.7   46.0   135      1.1x
LW-RetinaNet-v2    800     35.1   54.3   37.7   17.9   38.4   45.7   114      1.4x
LW-RetinaNet-v3    800     34.6   53.1   37.3   15.7   38.7   44.6   89       1.8x

Table 3: Comparison of original RetinaNet and proposed light-weight RetinaNet.

The configurations of the different versions of light-weight RetinaNet with D-block-v2 or D-block-v3 light-weight blocks are shown in Table 2, which indicates whether the light-weight block is applied to the classification branch, the bounding box (regression) branch, or both. The corresponding light-weight RetinaNet performance results are shown in Table 3. As scaling down the input image size is the most common practice for the FLOPs-accuracy trade-off, we also list the results of the original RetinaNet at different input scales (cited directly from the RetinaNet paper). To give a more straightforward view of the proposed method versus input image scaling, we visualize the FLOPs-accuracy trade-off in Fig. 6. Each data point in Fig. 6 corresponds to one row of results in Table 3. We mark the trend line of light-weight RetinaNet as a red dotted line and that of the original RetinaNet as a blue dotted line. As stated before, the upper-left corner means the best trade-off in this kind of mAP-FLOPs graph. As the red line stays consistently closer to the upper-left corner, the proposed method has a better mAP-FLOPs trade-off than the conventional input image scaling method. The difference between the two methods is 0.1% mAP at the same number of FLOPs for a low reduction ratio. However, as we further reduce the number of FLOPs, the proposed method degrades linearly, while the input image scaling method degrades in a more exponential fashion. Fig. 6 clearly shows a divergence at around 90 GFLOPs, where the conventional method suffers a 0.3% larger accuracy drop than the proposed method.
As any detection network with an FPN structure can have an imbalanced FLOPs distribution, the proposed method can potentially be applied to such detection networks as a better choice for the mAP-FLOPs trade-off.

5 Conclusion
In this paper, we proposed to reduce the FLOPs only in the heaviest bottleneck layer of the blockwise-FLOPs-imbalanced RetinaNet to obtain its light-weight version. The proposed solution shows a consistently better mAP-FLOPs trade-off line with a linear degradation trend, while the input image scaling method degrades in a more exponential trend.

Figure 6: FLOPs and mAP trade-off for input image size scaling versus the proposed method.

Quantitatively, the proposed method results in a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction. The proposed method can potentially be applied to any FPN-based blockwise-FLOPs-imbalanced detection network.

References
[1] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on
computer vision, pages 1440–1448, 2015.
[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based con-
volutional networks for accurate object detection and segmentation. IEEE transactions
on pattern analysis and machine intelligence, 38(1):142–158, 2016.
[3] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He.
Detectron. https://github.com/facebookresearch/detectron, 2018.
[4] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo
Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch
sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
[6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Effi-
cient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[7] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara,
Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al.
Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 7310–7311,
2017.
[8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra-
manan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in
context. In European conference on computer vision, pages 740–755. Springer, 2014.
[9] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge
Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss
for dense object detection. In Proceedings of the IEEE international conference on
computer vision, pages 2980–2988, 2017.
[11] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-
Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European
conference on computer vision, pages 21–37. Springer, 2016.
[12] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking
the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
[13] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In
bmvc, volume 1, page 6, 2015.
[14] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 7263–7271,
2017.
[15] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767, 2018.
[16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once:
Unified, real-time object detection. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 779–788, 2016.

[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In Advances in neural information
processing systems, pages 91–99, 2015.
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[19] Xiaogang Wang. Intelligent multi-camera video surveillance: A review. Pattern recog-
nition letters, 34(1):3–19, 2013.
