Professional Documents
Culture Documents
REPRESENTATION
ABSTRACT
The detection of human body and its related parts (e.g., face,
head or hands) have been intensively studied and greatly im-
proved since the breakthrough of deep CNNs. However, most
of these detectors are trained independently, making it a chal-
lenging task to associate detected body parts with people.
This paper focuses on the problem of joint detection of hu-
man body and its corresponding parts. Specifically, we pro-
pose a novel extended object representation that integrates
the center location offsets of body or its parts, and construct
a dense single-stage anchor-based Body-Part Joint Detector
(BPJDet). Body-part associations in BPJDet are embedded
into the unified representation which contains both the seman- Fig. 1. The illustration of the difference between our pro-
tic and geometric information. Therefore, BPJDet does not posed single-stage BPJDet and other two-stage body-part
suffer from error-prone association post-matching, and has a joint detection methods (e.g., JointDet [17], BFJDet [18] and
better accuracy-speed trade-off. Furthermore, BPJDet can be BodyHands [9]). Their two-stage refers to training the detec-
seamlessly generalized to jointly detect any body part. To ver- tion and association modules separately, rather than our joint
ify the effectiveness and superiority of our method, we con- detection and association framework.
duct extensive experiments on the CityPersons, CrowdHu-
performance of human bodies and related parts has been sig-
man and BodyHands datasets. The proposed BPJDet detec-
nificantly improved due to breakthroughs of deep CNN-based
tor achieves state-of-the-art association performance on these
general object detection (e.g., Faster R-CNN [10], FPN [11],
three benchmarks while maintains high accuracy of detection.
RetinaNet [12] and YOLO [13]) and construction of large-
Code is in https://github.com/hnuzhy/BPJDet.
scale high-quality datasets (e.g., COCO [14], CityPersons
Index Terms— Association, body-part joint detection, [15] and CrowdHuman [16]), the joint discovery of human
object detection, object representation body and its parts is still challenging but meaningful.
In this paper, we study the problem of body-part joint de-
1. INTRODUCTION tection and propose a Body-Part Joint Detector (BPJDet). As
shown in Fig. 1, the body part can be a face or hand. Our
Human body detection [1, 2, 3] is a research hotspot in com- goal is to improve the body-part association accuracy on the
puter vision. Accurate and fast human detection in an arbi- premise of ensuring the detection precision. Prior to ours, sev-
trary scene can support many down-stream vision tasks such eral researches have attempted to address the joint body and
as pedestrian re-id, person tracking and human pose estima- part detection problem. DA-RCNN [19] and JointDet [17] fo-
tion. Also, the detection of body parts like face [4, 5], head cus on joint body-head detection. BFJDet [18] concentrates
[6, 7] and hands [8, 9] is equally important. They may serve on the similarity body-face joint detection task. BodyHands
as precursors to specific missions like face recognition, crowd [9] and [8] tackle the problem of detecting hands and finding
counting and hand pose estimation. Although the detection the location of the corresponding person.
Unlike above methods using explicit strategies to model
* Corresponding author: fjiang@mail.ecnu.edu.cn. body-part relationships by exploiting heuristic post-matching
This paper is supported by NSFC (No. 62176155, 62207014), Shang-
hai Municipal Science and Technology Major Project (2021SHZDZX0102).
or learning branched association networks, as illustrated in
Hongtao Lu is also with MOE Key Lab of Artificial Intelligence, AI Institute, Fig. 1, we propose a novel extended object representation that
Shanghai Jiao Tong University. can be harmlessly applied to single-stage anchor-based detec-
tors like YOLO series [13, 20, 21]. Besides bounding boxes, the crowd occlusion problem in human detection by captur-
confidence and objectness contained in the classic object rep- ing and cross-optimizing body and head parts in pairs with
resentation, our extension adds location offsets of body parts. double anchor RPN. JointDet [17] presents a head-body rela-
Regressing offsets is popular in object detection [22, 23] and tionship discriminating module to perform relational learning
human pose estimation [24, 25, 26, 27]. Our simple yet ef- between heads and human bodies. Recently, BFJDet [18] in-
ficient innovation has at least three advantages. 1) The rela- vestigates the performance of body-face joint detection for
tional learning of body-part can benefit from the training of the first time, and proposes a bottom-up scheme that outputs
general detectors without painstakingly designing association body-face pairs for each pedestrian. It also adopts indepen-
subnets. 2) The final prediction naturally implies the detected dent detection followed by pairwise association. For body-
body bounding boxes and their associated parts, which avoids hand joint detection, [8] devises heuristic strategies to match
error-prone and tedious association post-processing. 3) This hand-raising gestures with body skeletons [24] in classroom
general representation enables us to jointly detect arbitrary scenes. BodyHands [9], in unconstrained conditions, pro-
body part such as face and hand without major modifications. poses a novel association network to jointly detect hands and
In experiments, we performed extensive tests on three public the body location, and introduces a new corresponding hand-
datasets to verify the superiority of our method. body association dataset. Unlike all of them, our approach
To sum up, we mainly have following three contributions: BPJDet is not restricted to a certain body part, and tackles
1) We propose a novel end-to-end trainable Body-Part Joint detection and association in an end-to-end way.
Detector (BPJDet) by extending the classic object represen- Representation of Object Detection: Most modern ob-
tation with regressing offsets of body parts. 2) We reveal the ject detection systems including anchor-based [10, 11, 12]
feasibility and expedience of joint training of bounding boxes and anchor-free [13, 20, 23, 21] detectors represent objects
and offsets with designing suitable multi-task loss functions. with bounding boxes. CenterNet [22] models objects using
3) We achieve state-of-the-art body-part association perfor- heatmap-based center points and represents human poses as
mance on three public benchmarks while maintaining high a 2K-dimensional property of the center point. It firstly ex-
detection accuracy of body and parts. tends the traditional object representation, and reveals essen-
tial overlap between the tasks of object detection and human
pose estimation. Similarly, for instance, FCPose [26] adapts
2. RELATED WORK
single-stage object detector FCOS [23] with proposed dy-
namic filters to process person detections by predicting key-
Human Body and Part Detection: We here discuss emerg-
point heatmaps and regressing offsets. Other single-stage
ing powerful CNN-based detectors that achieve promising re-
pose estimators SPM [25] and AdaptivePose [27] also learn
sults over traditional approaches using hand-crafted features.
to regress keypoints as series of joint offsets. These works
For human body detection, it belongs to either general ob-
motivate us to detect human body with regressing its parts
ject detection containing the person category [10, 11, 12, 13,
jointly by an extended object representation.
20, 22, 23, 21] trained on common datasets like COCO [14],
or pedestrian detection [1, 2, 3, 28, 29] trained on specific
benchmarks CityPersons [15] and CrowdHuman [16]. A key 3. OUR METHOD
challenge in human body detection is occlusion. Therefore,
many researches propose customized loss functions [1, 2], im- 3.1. Extended Object Representation
proved NMS [28, 3] or tailored distribution model [29] for
In our proposed BPJDet, we train a dense single-stage anchor-
alleviating problems of crowded people detection. On the
based detector to directly predict a set of objects {O b∈O b k
other hand, detection of body part has also been intensively box b dis b b p
O = cat(O , O ), O = O ∪ O }, which contains human
b b b b
and extensively studied. Face detection [4, 5] and head detec-
body set O b b and body-part set O b p concurrently. A typical
tion [6, 7] are two mostly vigorous fields. Face detectors are
prediction O is concatenated of the bounding box O
b bbox and
usually based on well-designed networks and trained on ded- dis
icated datasets. Head detectors are rarely studied alone, and corresponding center point displacement O . It locates an
b
often used for facilitating crowd counting, enhancing crowd object with a tight bounding box b̂ = (b̂x , b̂y , b̂w , b̂h ) where
human detection or preprocessing of head pose estimation. coordinates (b̂x , b̂y ) are the center position, b̂w and b̂h are the
Besides, some literatures present detection of hands such as width and height of b̂, respectively. It also records the relative
hand-raisers [8] or body-hand pairs [9]. displacement d̂ = (dˆxi , dˆyi )|ki=1 of center point of body-part
Body-Part Joint Detection: We mainly pay attention to (Ob∈O b b ) or affiliated body (O b∈O b p ) to each b̂. Considering
the joint detection of face, head and hands body part. First that body-part may be more than one component (e.g., both
of all, body-head joint detection is the most popular couple, hands part with k = 2), we allow d̂ to be compatible with k-
because the human head is a salient structural body part and dimensional 2D offsets. For these regressed offsets, we will
plays a vital role in recognizing people, especially when it is explain how to decode them for the body-part association with
occluded. For example, DA-RCNN [19] proposes to handle fusing detected body-part set O b p in Section 3.4.
Fig. 2. Left: Our BPJDet adopts YOLOv5 as the backbone N to extract features and predict grids G b from one augmented input
image I. During training, target grids G are used to supervise the loss function L. In inference, NMS and association decoding
are applied on predicted objects O ebox and related offsets set O
b to obtain final boxes set O edis . Right: Examples for grid cell
predictions with human body objects in red color and body part objects (e.g., face) in green color.
Intuitively, we can benefit a lot from this extended repre- chor box at grid cell (x, y). As shown in the example of grid
sentation. On one hand, an appropriate large body bounding cells in Fig. 2 right, one typical predicted O b embedding con-
box O bbox possesses both strong local characteristics and weak sists of four parts: the objectness or probability ô that an ob-
global features (such as surrounding background and anatom- ject exists, the candidate bounding box b̂0 = (b̂0x , b̂0y , b̂0w , b̂0h ),
ical position) for its body part Obdis offset regression. This the object classification score ĉk+1 , and the candidate of
enables the network to learn their intrinsic relationships. On body-part offsets d̂0 = (dˆ0xi , dˆ0yi )|ki=1 . Thus, Ao = 3k + 6.
the other hand, compared to methods [17, 18, 9] training mul- To transform the candidate b̂0 into coordinates b̂ relative to
tiple subnetworks or stages, mixing O bbox and Obdis up can be s
the grid cell Gbi,x,y , we apply conversions as below:
leveraged directly in BPJDet without the need of complicated
post-processing. By designing a one-stage network that uses b̂x = 2φ(b̂0x ) − 0.5, b̂y = 2φ(b̂0y ) − 0.5
shared heads to jointly predict O bbox and Obdis , our approach Biw Bih (1)
b̂w = [2φ(b̂0w )]2 , b̂h = [2φ(b̂0h )]2
can achieve high accuracy with minimal computational bur- s s
den during training and inference. where φ is the sigmoid function that limits model predictions
in the range (0, 1). Similarly, this detection strategy can be
3.2. Overall Network Architecture extended to the offset of body-part. A body-part’s intermedi-
ate offsets d̂0 are predicted in the grid coordinates and relative
Our network structure is shown in Fig. 2 left. We choose the to the grid cell origin (x, y) using:
recently most cost-effective one-stage YOLOv5 [21] as the w h
B B
basic backbone N . Specifically, following YOLOv5, we feed dˆxi = i [4φ(dˆ0xi ) − 2], dˆyi = i [4φ(dˆ0yi ) − 2] (2)
s s
N one RGB image I ∈ Rh×w×3 as the input, keep its bene- w h
B B
ficial data augmentation strategies (e.g., Mosaic and MixUp), In the way, dˆxi and dˆyi are constrained to ±2 si and ±2 si ,
and output four grids G b = {Gbs k s ∈ {8, 16, 32, 64}} from four respectively. To learn b̂ and d̂, losses are applied in the grid
h w
multi-scale heads. Each grid Gb ∈ RAa ×Ao × s × s contains space. Sample targets of b and d are shown in Fig. 2 right.
dense object outputs O produced from Aa anchor channels
b
and Ao output channels. Supposing that one target object O 3.3. Multi-Loss Functions
is centered at (bx , by ) in the feature map Fs , the correspond- For a set of predicted grids G,
b we firstly build target grids
b
ing grid Gbs at cell ( bsx , sy ) should be highly confident. When set G following formats introduced in Section 3.2. Then, we
having defined Aa anchor boxes B s = {(Biw , Bih )|A i } for the
a
mainly compute following four loss components:
s
grid Gb , we will generate Aa anchor channels at each cell X 1 XkG s k
b
( bsx , sy ). Furthermore, to obtain robust capability, YOLOv5 Lbox = [1 − CIoU(b̂i , bi )] (3)
s kG s k i=1
allows detection redundancy of multiple objects and four sur- X ws XkG k s
rounding cells matching for each cell. This redundancy makes Lobj = BCE(ô, o · CIoU(b̂i , bi )) (4)
s kG s k i=1
sense to the detection of body or part position Obox , but it is X 1 XkG s k
not completely facilitative to the regression of offsets Odis . Lcls = BCE(ĉ, c) (5)
s kG s k i=1
We interpret this in Section 3.3.
Then, we explain the arrangement of one prediction O
X 1 XkG s k Xk
Lbpd = ϕ(v > 0)kd̂0i − d0i k2 (6)
b
s
from Ao output channels of Gi,x,y which is related to i-th an-
b s kG s k i=1 1
For the bounding box regression loss Lbox , we adopt the 32
54
31
complete intersection over union (CIoU) across four grids G s . 53
avg-mMR−2
30
52
BCE in the objectness loss Lobj and classification loss Lcls is
MR−2
29
51
the binary cross-entropy. The multiplier o in Lobj is used for 28 −2
MR body↓ 50
27
penalizing candidates without hitting target grid cells (o = 26
MR−2 face↓ 49
0), and encouraging candidates around target anchor ground- avg-mMR−2 ↓ 48
25
0.5 1 1.5 2 2.5 3
truths (o = 1). The ws is a balance weight for different grid
level. Finally, we utilize the mean squared error (MSE) to
measure offset outputs d̂0 and normalized targets d0 in body- Fig. 3. The influence of parameter λ (x-axis, enlarged 100×).
part displacement loss Lbpd . Before that, we apply a filter Table 1. The performance comparison of the joint body-face
ϕ(·) with the visibility label of body-part to remove out those detection task in the val-set of CityPersons.
false-positive offset predictions from O.
b Finally, we calculate
AP↑ AP↑ mMR−2 ↓
the total training loss L = (G,
b G) as follows: Methods
body face Reasonable Partial Bare Heavy
RetinaNet+POS [12, 18] 78.5 35.3 40.0 42.8 38.7 67.0
L = Nbs (αLbox + βLobj + γLcls + λLbpd ) (7) RetinaNet+BFJ [12, 18] 79.3 36.2 39.5 41.5 38.5 63.1
FPN+POS [11, 18] 80.6 65.5 33.5 32.7 34.1 56.6
FPN+BFJ [11, 18] 84.4 68.0 32.7 30.6 33.0 53.5
where Nbs is the batch size. The α, β, γ and λ are weights of BPJDet-S (Ours) 75.1 58.2 29.3 28.9 29.3 57.2
losses Lbox , Lobj , Lcls and Lbpd , respectively. BPJDet-M (Ours) 76.7 58.8 27.5 31.6 24.9 55.8
BPJDet-L (Ours) 75.5 61.0 26.4 27.7 25.5 46.2