Kai Ye1, Haoqin Ji1, Yuan Li2, Lei Wang2, Peng Liu2, and Linlin Shen1(B)

1 Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518055, China
yekai2020@email.szu.edu.cn, llshen@szu.edu.cn
2 Qualcomm AI Research, San Diego, USA
{yuali,wlei,peli}@qti.qualcomm.com
1 Introduction
compared with anchor-based detectors. However, they cannot detect large instances well due to the constraint of the receptive field. To address this issue, we introduce the Adaptive Gaussian Mask (AGM) to adaptively re-weight the losses calculated on objects of different scales, such that larger weights are assigned to objects of larger scale. Further, we propose the Cascade Attention Module (CAM) to capture relationships across feature maps at different FPN [10] levels, applying the self-attention mechanism [18] to provide a global receptive field.
Since knowledge distillation is a paradigm for transferring information from a complicated teacher network to a light student network, it is now widely utilized in object detection tasks [3,19,24]. While only [24] experiments on SSD [12] and YOLO [14], most current works use a large backbone (e.g., ResNet50 [4]) as the student model, which can hardly run efficiently on edge devices. In this work, we apply center-based knowledge distillation on the detection output. We use MobileNetV3-large [6] to train a teacher model and transfer its informative knowledge to light models such as MobileNetV3-small. Specifically, we regard the probability of the teacher's output as a soft label to supervise that of the student, and distill the regression output only on positive samples detected by the teacher.
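The distillation targets just described can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the binary cross-entropy form of the soft-label loss, the L1 regression loss, and the 0.5 positive threshold are our assumptions.

```python
import numpy as np

def distill_losses(t_cls, s_cls, t_reg, s_reg, pos_thresh=0.5):
    # Soft-label classification distillation: the teacher's output
    # probabilities supervise the student's (binary cross-entropy here,
    # an assumed loss form).
    eps = 1e-7
    s = np.clip(s_cls, eps, 1 - eps)
    cls_loss = float(-(t_cls * np.log(s) + (1 - t_cls) * np.log(1 - s)).mean())
    # Regression distillation only on positive samples detected by the
    # teacher (positives taken here as teacher scores above pos_thresh).
    pos = t_cls > pos_thresh
    reg_loss = float(np.abs(t_reg - s_reg)[pos].mean()) if pos.any() else 0.0
    return cls_loss, reg_loss
```

Masking the regression term by the teacher's positives keeps the student from imitating box predictions at background locations, where the teacher's output carries little signal.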
Well-labeled data is essential to the performance of deep neural networks. However, since few open-source datasets meet our needs at present, one possible solution is cross-dataset training, which aims to utilize two or more datasets labeled with different object classes to train a single model that performs well on all the classes. Previously, Yao et al. [23] proposed a dataset-aware classification loss, which builds an avoidance relationship across datasets. Different from their work, we propose the Class-specific Branch Mask to mask out the loss from datasets without the required category annotations.
To sum up, this paper makes the following contributions:
2 Related Works
2.1 Lightweight Object Detection Methods
SSD [12] is a pioneering work among one-stage detectors, which has inspired subsequent anchor-based detection frameworks in many aspects. The multiscale prediction it introduced addresses the problem of detecting objects of different scales and further contributed to the idea of the Feature Pyramid Network (FPN) [10]. Another one-stage pioneer is YOLO [13], which directly regresses the bounding boxes of objects. Variants of YOLO [1,14] have been continuously developed in recent years, making numerous modifications to the original YOLO in search of a trade-off between speed and accuracy.
However, both the SSD and YOLO series predefine several prior boxes on each pixel and predict large numbers of overlapping boxes, so they require time-consuming Non-Maximum Suppression (NMS) to suppress duplicates. Different from the methods above, our method is anchor-free, so we neither need to carefully design prior boxes nor run this time-consuming post-processing procedure.
Knowledge distillation was first proposed by Hinton et al. [5] and is a general way to improve the performance of small models. It transfers the dark knowledge of the teacher model to the student model, so that the student learns not only from ground-truth labels but also from the teacher. Recently, knowledge distillation has been widely utilized in object detection tasks. Wang et al. [19] introduced imitation masks to distill the regions of feature maps close to objects, demonstrating that distilling the foreground is the key to boosting detection performance. Du et al. [27] generated attention maps for distillation based on classification scores. Zhang et al. [24] proposed an attention-guided method to distill useful information and introduced a non-local module to capture relations among pixels of backbone feature maps.
Due to the center-based manner of our method, we do not need to carefully design masks according to the ground truth to acquire positive regions, which discards some tedious procedures. Instead, we directly apply distillation on the output of our detection head.
3 Methodology
$$G_i = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma^2}\right) \qquad (1)$$

where $(x^2 + y^2) < r$. $i$ is the index of the ground truth. $r$ and $\sigma$ are hyperparameters to control the area and the response value of the generated Gaussian kernel, respectively. We set $r$ as 2 and $\sigma$ as 1 in all later experiments.
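Eq. (1) with the default hyperparameters can be evaluated as follows. This is a sketch under one assumption: we interpret the support condition as the squared distance to the ground-truth center being at most $r^2$.

```python
import numpy as np

def gaussian_kernel(xi, yi, size=9, r=2, sigma=1.0):
    # Evaluate Eq. (1) on a size x size grid around the ground-truth
    # center (xi, yi); r = 2 and sigma = 1 are the paper's defaults.
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - xi) ** 2 + (ys - yi) ** 2
    g = np.exp(-d2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    # Assumed interpretation of the support condition: zero the kernel
    # outside squared distance r^2 from the center.
    g[d2 > r ** 2] = 0.0
    return g
```

With $\sigma = 1$, the peak value at the center is $1/\sqrt{2\pi} \approx 0.399$, and the kernel vanishes outside the radius-limited support.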
where $L_{min}$ and $L_{max}$ denote the shortest and longest sides of the ground-truth bounding boxes in the training set, respectively. $L_{x_i}$ denotes the longest side of the $i$-th ground-truth bounding box in this image. $\delta$ is also a hyperparameter to control the weight. Unless specified, we set $\delta$ as 10 in all later experiments.
After that, we multiply each generated Gaussian kernel with its corresponding weight and sum them to get the final mask:

$$M_g = \sum_{i=0}^{N} G_i \times W_i \qquad (3)$$

where $N$ denotes the number of objects in this image. For objects of larger size, the activation values of $M_g$ at their spatial positions are higher. Details are shown in Fig. 2. $M_g$ is then used to re-weight the positive classification loss $L_{pos}$ calculated with the ground truth, which can be formulated as $L_{pos}' = M_g \otimes L_{pos}$.
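Eq. (3) and the loss re-weighting can be sketched as below, assuming the per-object kernels have already been rendered to full $H \times W$ maps:

```python
import numpy as np

def adaptive_gaussian_mask(kernels, weights):
    # Eq. (3): sum the per-object Gaussian kernels, each scaled by its
    # scale-dependent weight W_i.
    Mg = np.zeros_like(kernels[0], dtype=float)
    for G, w in zip(kernels, weights):
        Mg = Mg + G * w
    return Mg

# Re-weighting the positive classification loss element-wise:
L_pos = np.ones((4, 4))  # stand-in positive classification loss map
Mg = adaptive_gaussian_mask([np.ones((4, 4)), np.ones((4, 4))], [1.0, 2.0])
L_pos_reweighted = Mg * L_pos
```

Because $W_i$ grows with object size, the summed mask amplifies the loss at large-object centers, countering the receptive-field bias toward small instances.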
$$I = \mathrm{CBA}(\tilde{F}) \qquad (4)$$

where CBA consists of a $1 \times 1$ convolution layer followed by Batch Normalization and a ReLU activation. $I$ contains information from feature maps at different levels and can be reshaped into a feature vector $V \in \mathbb{R}^{HW \times C}$. We then iteratively refine it using a stack of $N$ identical self-attention blocks defined as follows:
$$R = \mathrm{SAB}(V + PE) \qquad (5)$$

where SAB denotes a layer of the self-attention block illustrated in Fig. 3, and $PE$ denotes the positional encoding detailed in [18], which has the same dimension as $V$ so that the two can be summed.
The refined feature vector $R$ can be reshaped into a feature map of the same size as $I$. We upsample it with nearest-neighbor interpolation to 1/4 of the input image resolution, and then use a residual connection between P2 and the refined feature map to enhance the information in the feature map.
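The refinement of Eqs. (4)-(5) can be sketched as below. This is a simplified single-head version: the learned Q/K/V projections, feed-forward sublayer, and normalization that the paper's block in Fig. 3 may contain are omitted, and only the reshape-refine-reshape flow is shown.

```python
import numpy as np

def self_attention_block(V):
    # One SAB: single-head scaled dot-product self-attention with a
    # residual connection (identity projections stand in for learned
    # Q/K/V weights).
    HW, C = V.shape
    scores = V @ V.T / np.sqrt(C)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    return V + attn @ V

def cascade_refine(I, PE, n_blocks=2):
    # Eqs. (4)-(5): reshape the fused map I (C x H x W) into V (HW x C),
    # add the positional encoding, refine with a stack of SABs, then
    # reshape back to a feature map of the same size as I.
    C, H, W = I.shape
    V = I.reshape(C, H * W).T + PE
    for _ in range(n_blocks):
        V = self_attention_block(V)
    return V.T.reshape(C, H, W)
```

Each row of the attention matrix attends over all $HW$ positions, which is what gives the module its global receptive field.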
body) that meet our needs at present, which is an issue often encountered in deep learning. One possible idea is to take advantage of partially labeled data for cross-dataset training. However, directly mixing different datasets for training will mislead the learning of a deep neural network. For example, if a hand and a person are detected on an image from a dataset that only contains face annotations, these detections will be counted as false positives (FPs), since there is no corresponding annotation to supervise the detected hand and person.
In order to make full use of partially labeled data, we propose the Class-specific Branch Mask to promote cross-dataset training. When applying the loss function to the predictions with ground-truth labels, we use $L_{pred} \in \mathbb{R}^{H \times W \times C}$ to denote the output loss, where $C$ represents the number of categories, which in our case is 3. Since we mix different datasets for training, images from different datasets will appear in the same mini-batch. In this case, given an image, the categories that do not exist in its annotation should not produce losses; e.g., an image sampled from WiderFace only produces the loss of the face category. Based on the explanation above, we first generate a binary mask $M \in \{0,1\}^{H \times W \times C}$ for each image:
$$M_{i,j,k} = \begin{cases} 1 & \text{if } I(k) \in C \\ 0 & \text{if } I(k) \notin C \end{cases}, \qquad i \in \mathbb{R}^{H},\ j \in \mathbb{R}^{W} \qquad (9)$$
Here $C$ denotes the set of classes annotated in this image, and $I$ denotes a mapping function from the channel index to its category. The value of every element in the $k$-th channel of $M$ is 1 if its corresponding category appears in this image, and 0 otherwise. Figure 4 illustrates our Class-specific Branch Mask. We then use the generated binary mask to mask the loss produced by the specific channels, obtaining the masked loss by element-wise multiplication of $M$ and $L_{pred}$.
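Under the definitions above, mask generation and loss masking can be sketched as follows; the `channel_to_class` mapping plays the role of $I$, and the concrete names here are illustrative.

```python
import numpy as np

def class_specific_branch_mask(present, H, W, channel_to_class):
    # Eq. (9): M[i, j, k] = 1 iff the category mapped to channel k is
    # annotated for this image (present = set of annotated classes C).
    C = len(channel_to_class)
    M = np.zeros((H, W, C))
    for k, cls in enumerate(channel_to_class):
        if cls in present:
            M[:, :, k] = 1.0
    return M

# Example: an image sampled from WiderFace carries only face annotations,
# so only the face channel of the loss survives the mask.
L_pred = np.ones((4, 4, 3))  # stand-in per-channel loss map
M = class_specific_branch_mask({"face"}, 4, 4, ["person", "face", "hand"])
masked_loss = M * L_pred
```

Zeroing the unannotated channels stops gradients from the missing categories, so detections of hand or person on a face-only dataset are no longer punished as false positives.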
4 Experimental Results
4.1 Datasets
Two public human-parts detection benchmarks, namely Human-Parts [9] and COCO Human Parts [20], are used for evaluation.
The Human-Parts dataset is annotated with three comprehensive human-parts categories: person, face, and hand. It contains 14,962 images, with 12,000 for training and 2,962 for validation. COCO Human Parts is the first instance-level human-parts dataset, containing 66,808 images with 1,027,450 instance annotations, and is a more complex and challenging benchmark.
Table 4. Comparison with other lightweight models. We measure the latency on a Kirin 990 ARM CPU with different input resolutions. MBN denotes MobileNet. Blue and purple text denotes results for inputs with resolutions of 224 and 320, respectively.
We only use AGM and CAM in the SOTA experiments; the Class-specific Branch Mask and Center-based Knowledge Distillation are not applied. As shown in Table 5, with a ResNet50 [4] backbone, CCF-Net achieves a state-of-the-art AP50 of 92.16% on the Human-Parts dataset, surpassing all previous state-of-the-art methods with faster inference speed. In particular, compared with DIDNet, the previous SOTA human-parts detector, our model detects human parts better, outperforming DIDNet by 1.06% in terms of AP50. On the more challenging COCO Human Parts dataset, CCF-Net achieves an AP50 of 64.8%, outperforming the other methods by large margins.
Since we use a single-layer feature map for detection instead of multi-layer feature maps like the other methods, our AP for person is relatively lower than theirs. However, compared with them, our model saves more memory, which is crucial for edge devices. Besides, as the feature map we use is only 4× smaller than the input image, it is beneficial for detecting small objects; thus our performance on face and hand surpasses the others by a considerable margin.
Table 6. Comparison of our CCF-Net with state-of-the-art methods on the COCO Human Parts dataset. R50 denotes ResNet50.
5 Conclusion
In this paper, we propose CCF-Net, a lightweight detection framework for human parts detection. The Cascade Attention Module and Adaptive Gaussian Mask are proposed to bridge the performance gap among objects of different scales. Additionally, we apply Center-based Knowledge Distillation to boost the performance of our light model. Further, we combine several datasets to train the model via the Class-specific Branch Mask, which addresses the issue that currently only a few datasets are annotated with multi-category human parts. Through experiments, we evaluate the proposed method and show that it substantially outperforms state-of-the-art object detectors. In the future, we will continue to explore improving the performance of lightweight models.
References
1. Bochkovskiy, A., et al.: YOLOv4: optimal speed and accuracy of object detection. arXiv (2020)
2. Dai, J., et al.: R-FCN: Object detection via region-based fully convolutional net-
works. In: NIPS (2016)
3. Guo, J., et al.: Distilling object detectors via decoupled features. In: CVPR (2021)
4. He, K., et al.: Deep residual learning for image recognition. In: CVPR (2016)
5. Hinton, G., et al.: Distilling the knowledge in a neural network. arXiv (2015)
6. Howard, A., et al.: Searching for mobilenetv3. In: ICCV (2019)
7. Kim, K., et al.: Probabilistic anchor assignment with IoU prediction for object detection. In: ECCV (2020)
8. Kong, T., et al.: FoveaBox: beyond anchor-based object detection. IEEE Trans. Image Process. (2020)
9. Li, X., et al.: Detector-in-detector: multi-level analysis for human-parts. In: ACCV
(2018)
10. Lin, T.Y., et al.: Feature pyramid networks for object detection. In: CVPR (2017)
11. Lin, T.Y., et al.: Focal loss for dense object detection. In: ICCV (2017)
12. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.:
SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M.
(eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://
doi.org/10.1007/978-3-319-46448-0_2
13. Redmon, J., et al.: You only look once: Unified, real-time object detection. In:
CVPR (2016)
14. Redmon, J., et al.: YOLOv3: an incremental improvement. arXiv (2018)
15. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
16. Rezatofighi, H., et al.: Generalized intersection over union: a metric and a loss for
bounding box regression. In: CVPR (2019)
17. Tian, Z., et al.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
18. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
19. Wang, T., Yuan, L., Zhang, X., Feng, J.: Distilling object detectors with fine-
grained feature imitation. In: CVPR (2019)
20. Yang, L., et al.: Hier R-CNN: instance-level human parts detection and a new benchmark. IEEE Trans. Image Process. 30, 39–54 (2020)
21. Yang, S., et al.: Wider face: A face detection benchmark. In: CVPR (2016)
22. Yang, Z., et al.: Reppoints: point set representation for object detection. In: ICCV
(2019)
23. Yao, Y., et al.: Cross-dataset training for class increasing object detection. arXiv
(2020)
24. Zhang, L., et al.: Improve object detection with feature-based knowledge distilla-
tion: towards accurate and efficient detectors. In: ICLR (2020)
25. Zhang, S., et al.: Bridging the gap between anchor-based and anchor-free detection
via adaptive training sample selection. In: CVPR (2020)
26. Zhang, S., et al.: Distribution alignment: A unified framework for long-tail visual
recognition. In: CVPR (2021)
27. Zhixing, D., et al.: Distilling object detectors with feature richness. In: NIPS (2021)
28. Zhou, X., et al.: Objects as points. arXiv (2019)
29. Zhu, B., et al.: AutoAssign: differentiable label assignment for dense object detection. arXiv (2020)
30. Zhu, C., et al.: Feature selective anchor-free module for single-shot object detection.
In: CVPR (2019)