Kai Ye1, Haoqin Ji1, Yuan Li2, Lei Wang2, Peng Liu2, and Linlin Shen1(B)

1 Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518055, China
yekai2020@email.szu.edu.cn, llshen@szu.edu.cn
2 Qualcomm AI Research, San Diego, USA
{yuali,wlei,peli}@qti.qualcomm.com
1 Introduction
compared with anchor-based detectors. However, they cannot detect large instances well due to the constraint of the receptive field. To address this issue, we introduce the Adaptive Gaussian Mask (AGM) to adaptively re-weight the losses calculated on objects of different scales, such that larger weights are assigned to objects of larger scale. Further, we propose the Cascade Attention Module (CAM) to capture relationships across feature maps at different FPN [10] levels, applying the self-attention mechanism [18] to provide a global receptive field.
Since knowledge distillation is a paradigm for transferring information from a complicated teacher network to a light student network, it is now widely utilized in object detection tasks [3,19,24]. While only [24] experiments on SSD [12] and YOLO [14], most current works use a large backbone (e.g., ResNet50 [4]) as the student model, which can hardly run efficiently on edge devices. In this work, we apply center-based knowledge distillation on the detection output. We use MobileNetV3-large [6] to train a teacher model and transfer its informative knowledge to light models such as MobileNetV3-small. Specifically, we regard the probability of the teacher's output as a soft label to supervise that of the student, and distill the regression output only on positive samples detected by the teacher.
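The distillation targets just described can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the binary cross-entropy form of the soft-label loss, the L1 regression loss, and the 0.5 positive threshold are our assumptions.

```python
import numpy as np

def distill_losses(t_cls, s_cls, t_reg, s_reg, pos_thresh=0.5):
    # Soft-label classification distillation: the teacher's output
    # probabilities supervise the student's (binary cross-entropy here,
    # an assumed loss form).
    eps = 1e-7
    s = np.clip(s_cls, eps, 1 - eps)
    cls_loss = float(-(t_cls * np.log(s) + (1 - t_cls) * np.log(1 - s)).mean())
    # Regression distillation only on positive samples detected by the
    # teacher (positives taken here as teacher scores above pos_thresh).
    pos = t_cls > pos_thresh
    reg_loss = float(np.abs(t_reg - s_reg)[pos].mean()) if pos.any() else 0.0
    return cls_loss, reg_loss
```

Masking the regression term by the teacher's positives keeps the student from imitating box predictions at background locations, where the teacher's output carries little signal.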
Well-labeled data is essential to the performance of deep neural networks. However, since few open-source datasets meet our needs at present, one possible solution is cross-dataset training, which aims to utilize two or more datasets labeled with different object classes to train a single model that performs well on all the classes. Previously, Yao et al. [23] proposed a dataset-aware classification loss, which builds an avoidance relationship across datasets. Different from their work, we propose the Class-specific Branch Mask to mask out the loss from datasets without the required category annotations.
To sum up, this paper makes the following contributions:
2 Related Works
2.1 Lightweight Object Detection Methods
SSD [12] is a pioneering work among one-stage detectors, which has inspired subsequent anchor-based detection frameworks in many aspects. The multiscale prediction it introduced addresses the problem of detecting objects of different scales and further contributed to the idea of the Feature Pyramid Network (FPN) [10]. Another one-stage pioneer is YOLO [13], which directly regresses the bounding boxes of objects. Variants of YOLO [1,14] have been continuously developed in recent years, making numerous modifications to the original YOLO in search of a trade-off between speed and accuracy.
However, both the SSD and YOLO series predefine several prior boxes on each pixel and predict large numbers of overlapping boxes, so they require time-consuming Non-Maximum Suppression (NMS) to suppress duplicates. Different from the methods above, our method is anchor-free, so we neither need to carefully design prior boxes nor run this time-consuming post-processing procedure.
Knowledge distillation was first proposed by Hinton et al. [5] and is a general way to improve the performance of small models. It transfers the dark knowledge of the teacher model to the student model, so that the student learns not only from ground-truth labels but also from the teacher. Recently, knowledge distillation has been widely utilized in object detection tasks. Wang et al. [19] introduced imitation masks to distill the regions of feature maps close to objects, demonstrating that distilling the foreground is the key to boosting detection performance. Du et al. [27] generated attention maps for distillation based on classification scores. Zhang et al. [24] proposed an attention-guided method to distill useful information and introduced a non-local module to capture relations among pixels of backbone feature maps.
Due to the center-based manner of our method, we do not need to carefully design masks according to the ground truth to acquire positive regions, which discards some tedious procedures. Instead, we directly apply distillation on the output of our detection head.
3 Methodology
$$G_i = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma^2}\right) \qquad (1)$$

where $(x^2 + y^2) < r$. $i$ is the index of the ground truth. $r$ and $\sigma$ are hyperparameters to control the area and the response value of the generated Gaussian kernel, respectively. We set $r$ as 2 and $\sigma$ as 1 in all later experiments.
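Eq. (1) with the default hyperparameters can be evaluated as follows. This is a sketch under one assumption: we interpret the support condition as the squared distance to the ground-truth center being at most $r^2$.

```python
import numpy as np

def gaussian_kernel(xi, yi, size=9, r=2, sigma=1.0):
    # Evaluate Eq. (1) on a size x size grid around the ground-truth
    # center (xi, yi); r = 2 and sigma = 1 are the paper's defaults.
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - xi) ** 2 + (ys - yi) ** 2
    g = np.exp(-d2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    # Assumed interpretation of the support condition: zero the kernel
    # outside squared distance r^2 from the center.
    g[d2 > r ** 2] = 0.0
    return g
```

With $\sigma = 1$, the peak value at the center is $1/\sqrt{2\pi} \approx 0.399$, and the kernel vanishes outside the radius-limited support.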
where $L_{min}$ and $L_{max}$ denote the shortest and longest sides of the ground-truth bounding boxes in the training set, respectively. $L_{x_i}$ denotes the longest side of the $i$-th ground-truth bounding box in this image. $\delta$ is also a hyperparameter to control the weight. Unless specified, we set $\delta$ as 10 in all later experiments.
After that, we multiply each generated Gaussian kernel with its corresponding weight and sum them to get the final mask:

$$M_g = \sum_{i=0}^{N} G_i \times W_i \qquad (3)$$

where $N$ denotes the number of objects in this image. For objects of larger size, the activation values of $M_g$ at their spatial positions are higher. Details are shown in Fig. 2. $M_g$ is then used to re-weight the positive classification loss $L_{pos}$ calculated with the ground truth, which can be formulated as $L_{pos}' = M_g \otimes L_{pos}$.
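Eq. (3) and the loss re-weighting can be sketched as below, assuming the per-object kernels have already been rendered to full $H \times W$ maps:

```python
import numpy as np

def adaptive_gaussian_mask(kernels, weights):
    # Eq. (3): sum the per-object Gaussian kernels, each scaled by its
    # scale-dependent weight W_i.
    Mg = np.zeros_like(kernels[0], dtype=float)
    for G, w in zip(kernels, weights):
        Mg = Mg + G * w
    return Mg

# Re-weighting the positive classification loss element-wise:
L_pos = np.ones((4, 4))  # stand-in positive classification loss map
Mg = adaptive_gaussian_mask([np.ones((4, 4)), np.ones((4, 4))], [1.0, 2.0])
L_pos_reweighted = Mg * L_pos
```

Because $W_i$ grows with object size, the summed mask amplifies the loss at large-object centers, countering the receptive-field bias toward small instances.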
$$I = \mathrm{CBA}(\tilde{F}) \qquad (4)$$

where CBA consists of a $1 \times 1$ convolution layer followed by Batch Normalization and a ReLU activation. $I$ contains information from feature maps at different levels and can be reshaped into a feature vector $V \in \mathbb{R}^{HW \times C}$. We then iteratively refine it using a stack of $N$ identical self-attention blocks defined as follows:
$$R = \mathrm{SAB}(V + PE) \qquad (5)$$

where SAB denotes a layer of the self-attention block illustrated in Fig. 3, and $PE$ denotes the positional encoding detailed in [18], which has the same dimension as $V$ so that the two can be summed.
The refined feature vector $R$ can be reshaped into a feature map of the same size as $I$. We upsample it with nearest-neighbor interpolation to 1/4 of the input image resolution, and then use a residual connection between P2 and the refined feature map to enhance the information in the feature map.
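The refinement of Eqs. (4)-(5) can be sketched as below. This is a simplified single-head version: the learned Q/K/V projections, feed-forward sublayer, and normalization that the paper's block in Fig. 3 may contain are omitted, and only the reshape-refine-reshape flow is shown.

```python
import numpy as np

def self_attention_block(V):
    # One SAB: single-head scaled dot-product self-attention with a
    # residual connection (identity projections stand in for learned
    # Q/K/V weights).
    HW, C = V.shape
    scores = V @ V.T / np.sqrt(C)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    return V + attn @ V

def cascade_refine(I, PE, n_blocks=2):
    # Eqs. (4)-(5): reshape the fused map I (C x H x W) into V (HW x C),
    # add the positional encoding, refine with a stack of SABs, then
    # reshape back to a feature map of the same size as I.
    C, H, W = I.shape
    V = I.reshape(C, H * W).T + PE
    for _ in range(n_blocks):
        V = self_attention_block(V)
    return V.T.reshape(C, H, W)
```

Each row of the attention matrix attends over all $HW$ positions, which is what gives the module its global receptive field.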
body) that meet our needs at present, which is an issue often encountered in deep learning. One possible idea is to take advantage of partially labeled data for cross-dataset training. However, directly mixing different datasets for training will mislead the learning of a deep neural network. For example, if a hand and a person are detected on an image from a dataset that only contains face annotations, these detections will be counted as false positives (FPs), since there is no corresponding annotation to supervise the detected hand and person.
In order to make full use of partially labeled data, we propose the Class-specific Branch Mask to promote cross-dataset training. When applying the loss function to the predictions with ground-truth labels, we use $L_{pred} \in \mathbb{R}^{H \times W \times C}$ to denote the output loss, where $C$ represents the number of categories, which in our case is 3. Since we mix different datasets for training, images from different datasets will appear in the same mini-batch. In this case, given an image, the categories that do not exist in its annotation should not produce losses; e.g., an image sampled from WiderFace only produces the loss of the face category. Based on the explanation above, we first generate a binary mask $M \in \{0,1\}^{H \times W \times C}$ for each image:
$$M_{i,j,k} = \begin{cases} 1 & \text{if } I(k) \in C \\ 0 & \text{if } I(k) \notin C \end{cases}, \qquad i \in \mathbb{R}^{H},\ j \in \mathbb{R}^{W} \qquad (9)$$
Here $C$ denotes the set of classes annotated in this image, and $I$ denotes a mapping function from the channel index to its category. The value of every element in the $k$-th channel of $M$ is 1 if its corresponding category appears in this image, and 0 otherwise. Figure 4 illustrates our Class-specific Branch Mask. We then use the generated binary mask to mask the loss produced by the specific channels, obtaining the masked loss by element-wise multiplication of $M$ and $L_{pred}$.
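Under the definitions above, mask generation and loss masking can be sketched as follows; the `channel_to_class` mapping plays the role of $I$, and the concrete names here are illustrative.

```python
import numpy as np

def class_specific_branch_mask(present, H, W, channel_to_class):
    # Eq. (9): M[i, j, k] = 1 iff the category mapped to channel k is
    # annotated for this image (present = set of annotated classes C).
    C = len(channel_to_class)
    M = np.zeros((H, W, C))
    for k, cls in enumerate(channel_to_class):
        if cls in present:
            M[:, :, k] = 1.0
    return M

# Example: an image sampled from WiderFace carries only face annotations,
# so only the face channel of the loss survives the mask.
L_pred = np.ones((4, 4, 3))  # stand-in per-channel loss map
M = class_specific_branch_mask({"face"}, 4, 4, ["person", "face", "hand"])
masked_loss = M * L_pred
```

Zeroing the unannotated channels stops gradients from the missing categories, so detections of hand or person on a face-only dataset are no longer punished as false positives.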
4 Experimental Results
4.1 Datasets
Two public human-parts detection benchmarks, namely Human-Parts [9] and COCO Human Parts [20], are used for evaluation.
The Human-Parts dataset is annotated with three comprehensive human-parts categories: person, face, and hand. It contains 14,962 images, with 12,000 for training and 2,962 for validation. COCO Human Parts is the first instance-level human-parts dataset, containing 66,808 images with 1,027,450 instance annotations, and is a more complex and challenging benchmark.
Table 4. Comparison with other lightweight models. We measure the latency on a Kirin 990 ARM CPU with different input resolutions. MBN denotes MobileNet. Blue and purple text denotes results for inputs with resolutions of 224 and 320, respectively.
We only use AGM and CAM in the SOTA experiments; the Class-specific Branch Mask and Center-based Knowledge Distillation are not applied. As shown in Table 5, with a ResNet50 [4] backbone, CCF-Net achieves a state-of-the-art AP50 of 92.16% on the Human-Parts dataset, surpassing all previous state-of-the-art methods with faster inference speed. In particular, compared with DIDNet, the previous SOTA human-parts detector, our model detects human parts better, outperforming DIDNet by 1.06% in terms of AP50. On the more challenging COCO Human Parts dataset, CCF-Net achieves an AP50 of 64.8%, outperforming the other methods by large margins.
Since we use a single-layer feature map for detection instead of multi-layer feature maps like the other methods, our AP for person is relatively lower than theirs. However, compared with them, our model saves more memory, which is crucial for edge devices. Besides, as the feature map we use is only 4× smaller than the input image, it is beneficial for detecting small objects; thus our performance on face and hand surpasses the others by a considerable margin.
Table 6. Comparison of our CCF-Net with state-of-the-art methods on the COCO Human Parts dataset. R50 denotes ResNet50.
5 Conclusion
In this paper, we propose CCF-Net, a lightweight detection framework for human parts detection. The Cascade Attention Module and Adaptive Gaussian Mask are proposed to bridge the performance gap among objects of different scales. Additionally, we apply Center-based Knowledge Distillation to boost the performance of our light model. Further, we combine several datasets to train the model via the Class-specific Branch Mask, which addresses the issue that currently only a few datasets are annotated with multi-category human parts. Through experiments, we evaluate the proposed method and show that it substantially outperforms state-of-the-art object detectors. In the future, we will continue to explore improving the performance of lightweight models.
References
1. Bochkovskiy, A., et al.: YOLOv4: optimal speed and accuracy of object detection. arXiv (2020)
2. Dai, J., et al.: R-FCN: Object detection via region-based fully convolutional net-
works. In: NIPS (2016)
3. Guo, J., et al.: Distilling object detectors via decoupled features. In: CVPR (2021)
4. He, K., et al.: Deep residual learning for image recognition. In: CVPR (2016)
5. Hinton, G., et al.: Distilling the knowledge in a neural network. arXiv (2015)
6. Howard, A., et al.: Searching for mobilenetv3. In: ICCV (2019)
7. Kim, K., et al.: Probabilistic anchor assignment with IoU prediction for object detection. In: ECCV (2020)
8. Kong, T., et al.: FoveaBox: beyond anchor-based object detection. IEEE Trans. Image Process. (2020)
9. Li, X., et al.: Detector-in-detector: multi-level analysis for human-parts. In: ACCV
(2018)
10. Lin, T.Y., et al.: Feature pyramid networks for object detection. In: CVPR (2017)
11. Lin, T.Y., et al.: Focal loss for dense object detection. In: ICCV (2017)
12. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.:
SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M.
(eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://
doi.org/10.1007/978-3-319-46448-0_2
13. Redmon, J., et al.: You only look once: Unified, real-time object detection. In:
CVPR (2016)
14. Redmon, J., et al.: YOLOv3: an incremental improvement. arXiv (2018)
15. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
16. Rezatofighi, H., et al.: Generalized intersection over union: a metric and a loss for
bounding box regression. In: CVPR (2019)
17. Tian, Z., et al.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
18. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
19. Wang, T., Yuan, L., Zhang, X., Feng, J.: Distilling object detectors with fine-
grained feature imitation. In: CVPR (2019)
20. Yang, L., et al.: Hier R-CNN: instance-level human parts detection and a new benchmark. IEEE Trans. Image Process. 30, 39–54 (2020)
21. Yang, S., et al.: Wider face: A face detection benchmark. In: CVPR (2016)
22. Yang, Z., et al.: Reppoints: point set representation for object detection. In: ICCV
(2019)
23. Yao, Y., et al.: Cross-dataset training for class increasing object detection. arXiv
(2020)
24. Zhang, L., et al.: Improve object detection with feature-based knowledge distilla-
tion: towards accurate and efficient detectors. In: ICLR (2020)
25. Zhang, S., et al.: Bridging the gap between anchor-based and anchor-free detection
via adaptive training sample selection. In: CVPR (2020)
26. Zhang, S., et al.: Distribution alignment: A unified framework for long-tail visual
recognition. In: CVPR (2021)
27. Zhixing, D., et al.: Distilling object detectors with feature richness. In: NIPS (2021)
28. Zhou, X., et al.: Objects as points. arXiv (2019)
29. Zhu, B., et al.: AutoAssign: differentiable label assignment for dense object detection. arXiv (2020)
30. Zhu, C., et al.: Feature selective anchor-free module for single-shot object detection.
In: CVPR (2019)