Abstract
arXiv:1911.07451v2 [cs.CV] 24 Nov 2019
…
work directly predicts instance-aware keypoints for all the instances from a raw input image, eliminating the need for heuristic grouping in bottom-up methods or bounding-box detection and RoI operations in top-down ones. We also propose a novel Keypoint Alignment (KPAlign) mechanism, which overcomes the main difficulty in this end-to-end framework: the mis-alignment between the convolutional features and the predictions. KPAlign improves the framework's performance by a large margin while keeping the framework end-to-end trainable. With non-maximum suppression (NMS) as the only post-processing, our proposed framework can detect multi-person keypoints with or without bounding-boxes in a single shot. Experiments demonstrate that the end-to-end paradigm can achieve competitive or better performance than previous strong baselines of both bottom-up and top-down methods. We hope that our end-to-end approach can provide a new perspective for the human pose estimation task.

Figure 1 – The naive direct end-to-end keypoint detection framework. As shown in the figure, the framework requires a single feature vector on the final feature maps to encode all the essential information of an instance (e.g., the precise locations of the keypoints for the instance, denoted as (x0, y0), ..., (xK−1, yK−1)). As shown in experiments, this single feature vector faces difficulty in preserving sufficient information for challenging instance-level recognition tasks such as keypoint detection. Here we propose a keypoint alignment (KPAlign) module to overcome the issue.
1. Introduction

Multi-person pose estimation (a.k.a. keypoint detection) is a crucial step towards understanding human behavior in images and videos. Previous methods for the task can be roughly categorized into bottom-up [1, 19, 21, 23] and top-down [9, 26, 6, 3] methods. Bottom-up methods first detect all the possible keypoints in an input image in an instance-agnostic fashion, and then apply a grouping or assembling process to produce the final instance-aware keypoints. The grouping process is often heuristic and many tricks are involved to achieve good performance. In contrast, top-down methods first detect each individual instance with a bounding-box and then reduce the task to single-instance keypoint detection. Although top-down methods can avoid the heuristic grouping process, they come at the price of long computational time, since they cannot fully leverage the shared-computation mechanism of convolutional neural networks (CNNs). Moreover, the running time of top-down methods depends on the number of instances in the image, making them unreliable in some real-time applications such as autonomous vehicles. Importantly, both bottom-up and top-down methods are not end-to-end¹, which conflicts with deep learning's philosophy of learning everything together.

Recently, anchor-free object detection [28, 11] has emerged and demonstrated superior performance over previous anchor-based object detection. These anchor-free object detectors directly regress the two corners of a target bounding-box, without using pre-defined anchor boxes.

¹ Here we mean 'direct end-to-end'; i.e., the model is trained end-to-end with keypoint annotations solely during training, and for inference, the model is able to map an input to keypoints for each individual instance without box detection and grouping post-processing.

∗ The first two authors contributed equally to this work. Corresponding author: chunhua.shen@adelaide.edu.au
(Figure 2: the overall network architecture. Shared heads are applied on the FPN feature maps (e.g., P5, P6, P7, each H×W×256), producing classification (H×W×1) and centerness (H×W×1) outputs.)
The straightforward and effective solution for object detection gives rise to a question: can keypoint detection be solved with this simple framework as well? It is easy to see that the keypoints for an instance can be considered as a special bounding-box with more than two corner points, and thus the task could be solved by attaching more output heads to the object detection networks. This solution is intriguing since 1) it is end-to-end trainable (i.e., it directly maps a raw input image to the desired instance-aware keypoints); 2) it can avoid the shortcomings of both top-down and bottom-up methods as it needs neither grouping nor bounding-box detection; and 3) it can unify object detection and keypoint detection in a single simple and elegant framework.

However, we show that such a naive approach performs unsatisfactorily, mainly due to the fact that these object detectors resort to a single feature vector to regress all the keypoints of interest for an instance, with the hope that the single feature vector can faithfully preserve the essential information (e.g., the precise locations of all the keypoints) in its receptive field, as shown in Fig. 1. While the single feature vector may be sufficiently good to carry information for simple bounding-box detection as shown in [28], where only two corner points are involved in a bounding-box, it has difficulties in encoding rich information for the more challenging keypoint detection task. As shown in our experiments, this straightforward approach yields inferior performance.

In this work, we propose a keypoint alignment (KPAlign) mechanism to largely overcome the aforementioned problem. Instead of using a single feature vector to regress all the keypoints for an instance, the proposed KPAlign aligns the convolutional features with a target keypoint (or a group of keypoints) as much as possible, and then predicts the location of the target keypoint(s) with the aligned features. Since the target keypoints and the features used to predict them are roughly aligned, the features are only required to encode the information in their neighborhood. It is evident that encoding the neighborhood is much easier than encoding the whole receptive field, which thus results in improved performance. Moreover, the KPAlign module is differentiable, thus keeping the model end-to-end trainable.

Additionally, it is well-known that learning a regression-based model is difficult. However, in this work, we find that the regression task can largely benefit from heatmap-based learning. As a result, we propose to jointly learn the two tasks during training. At test time, the heatmap-based branch is disabled and thus does not impose any overhead on the framework.

To summarize, the proposed one-stage regression-based keypoint detection enjoys the following advantages over previous top-down or bottom-up approaches.

• The proposed framework is direct and totally end-to-end trainable. To predict, it maps an input image to keypoints for each individual instance directly, relying on neither intermediate operators such as RoI feature cropping, nor grouping post-processing, which sets our work apart from previous frameworks [9, 1] with multiple steps.

• Our proposed framework can bypass the major shortcomings of both top-down and bottom-up methods. For example, compared to top-down methods, our framework can avoid the issue of early commitment and decouple the computational complexity from the number of instances in an input image. Compared to bottom-up methods, our framework eliminates the heuristic post-processing that assembles the detected keypoints into full-body instances.

• Moreover, unlike previous top-down and bottom-up methods, both of which require heatmap-based FCNs to detect keypoints and thus suffer from quantization error, our proposed framework directly regresses the precise coordinates of keypoints and thus decouples the output resolution of the networks from the precision of keypoint localization. This gives our framework the potential to detect very dense keypoints (i.e., keypoints that crowd together).

• Finally, the framework suggests that the keypoint detection task can also be solved with the same methodology as bounding-box detection (i.e., directly regressing all the keypoints or the corners of bounding-boxes), resulting in a unifying framework for both tasks.

Additionally, both top-down and bottom-up methods require multiple steps to obtain the final keypoint detection results. Some of the steps are non-differentiable and make these methods impossible to train in an end-to-end fashion, which is the major difference between our method and previous ones.

2. Our Approach

Conceptually, our proposed keypoint detection framework is a simple extension of the anchor-free object detector FCOS [28], with one new output branch for keypoint detection. In this section, we first introduce the FCOS detector and show how it can be extended to keypoint detection. Next, we illustrate our proposed KPAlign module, which allows the framework to leverage feature-prediction alignment and improves the performance by a large margin. Finally, we present how the joint learning of the regression-based task and a heatmap-based task can be used to further boost the precision of keypoint localization.
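To make the direct-regression idea above concrete, the decoding step of such a naive regression head can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the per-location offset layout and the function name are assumptions.

```python
import numpy as np

def decode_keypoints(loc_xy, offsets, stride):
    """Decode the K keypoints regressed at a single feature-map location.

    loc_xy:  (x, y) index of the location on the FPN feature map.
    offsets: the 2K values regressed at that location, interpreted as K
             (dx, dy) displacements relative to the location.
    stride:  downsampling ratio of the FPN level.
    Returns the K keypoints in absolute image coordinates, shape (K, 2).
    """
    loc = np.asarray(loc_xy, dtype=float)
    disp = np.asarray(offsets, dtype=float).reshape(-1, 2)
    return (loc + disp) * stride

# One location at (10, 12) on a stride-8 level, predicting two keypoints:
kps = decode_keypoints((10, 12), [0.5, -1.0, -2.0, 3.0], stride=8)
# kps -> [[84., 88.], [64., 120.]]
```

The point of the sketch is that every keypoint of the instance must be recovered from the single feature vector at `loc_xy`, which is exactly the burden that KPAlign relieves.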
…
K offsets (xk-2, yk-2)
Conv1x1
(xk-1, yk-1)
K offsets
Conv1x1
…
Feature
…
…
Conv1x1
Sampler
Conv1x1
Our End-to-End Framework: The keypoint representation results in our naive keypoint detection architecture. As shown in Fig. 2, it is implemented by applying a convolutional branch on all levels of the output feature maps of FPN [14] (i.e., P3, P4, P5, P6 and P7). The downsampling ratios of these feature maps to the input image are 8, 16, 32, 64 and 128, respectively. Note that the parameters of the branch are shared between FPN levels, as in the bounding-box detection branch of the FCOS detector. The number of output channels of the branch is 2K, where K is the number of keypoints for each instance. The original bounding-box branch can be kept for simultaneous keypoint and bounding-box detection. Moreover, for keypoint-only detection, it is worth noting that we only use keypoint annotations without bounding-boxes. However, during training, FCOS requires a bounding-box for each instance to determine a positive or negative label for each location on the FPN feature maps. Here, we employ the minimum enclosing rectangles of the keypoints of the instances as pseudo-boxes for computing the training labels.

2.2. Keypoint Alignment (KPAlign) Module

We conduct preliminary experiments with the aforementioned naive keypoint detection framework. However, as shown in our experiments, it has inferior performance. We attribute the inferior performance to the lack of alignment between the features and the predicted keypoints. Essentially, the naive framework makes use of a single feature vector at a location on the input feature maps to regress all the keypoints for an instance. As a result, the single feature vector is required to encode all the required information for the instance. This is difficult because many keypoints are far away from the center of the feature vector's receptive field, and it has been shown in [18] that the intensity of the feature's response decays quickly as the input signal deviates from the center of its receptive field. As shown in many FCN-based frameworks [9, 17], keeping the feature and the prediction aligned is crucial to good performance: the feature then only needs to encode the information in a local patch, which is much easier.

In this work, we propose a keypoint alignment (KPAlign) module to recover the feature-prediction alignment in the framework. KPAlign replaces the convolutional layer for the final keypoint detection in the naive framework and takes as input the same feature maps, denoted as F ∈ R^{H×W×C}, where C = 256 is the number of channels of the feature maps. Analogous to a convolution operation, KPAlign is densely slid over the input feature maps F. For simplicity, we take a specific location (i, j) on F as an example to illustrate how KPAlign works. As shown in Fig. 3, KPAlign consists of two components: an aligner ζ and a predictor φ. The aligner consists of a locator and a feature sampler, and outputs the aligned feature vectors. The aligner can be formulated as

o_0, o_1, ..., o_{K−1}, v_0, v_1, ..., v_{K−1} = ζ(F),   (1)

where o_t ∈ R², produced by the locator in Fig. 3, is the location where the feature vector used to predict the t-th keypoint of an instance should be sampled, and v_t ∈ R^C is the sampled feature vector. Note that the location o_t is defined over R² and thus can be fractional. Following [4, 9], we make use of bilinear interpolation to compute the features at a fractional location. Additionally, the location is encoded as coordinates relative to (i, j) and thus is translation invariant.

Next, the predictor φ takes the outputs of the aligner as inputs to predict the final coordinates of the keypoints. As shown in Fig. 3, the predictor includes K convolutional layers (i.e., one for each keypoint). Let us assume that we are looking for the t-th keypoint for the instance and let φ_t denote the t-th convolutional layer in the predictor. φ_t takes v_t as input and predicts the coordinates of the t-th keypoint relative to the location where v_t is sampled (i.e., o_t). Finally, the coordinates of the t-th keypoint, denoted as x_t, are the sum of the two sets of coordinates. Formally, x_t = o_t + φ_t(v_t).

In addition, the locator requires low-resolution high-level feature maps with a larger receptive field, while the predictor prefers high-resolution low-level features with a smaller receptive field. To this end, we feed lower levels of feature maps into the sampler. Specifically, if a locator uses feature maps P_L and P_L is not the finest feature maps, the sampler will take P_{L−1} as the input. If P_L is already the finest feature maps, the sampler will still sample on it.
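The data flow of Eq. (1) and the predictor at a single location (i, j) can be sketched as follows. This is an illustrative NumPy mock-up under stated assumptions: the real locator and the K predictors are learned 1×1 convolutions, whereas here they are passed in as plain callables; only the flow (offsets → bilinear sampling → per-keypoint prediction → x_t = o_t + φ_t(v_t)) follows the text.

```python
import numpy as np

def bilinear_sample(F, y, x):
    """Sample feature map F of shape (H, W, C) at a fractional (y, x)."""
    H, W, _ = F.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * F[y0, x0] + (1 - wy) * wx * F[y0, x1] +
            wy * (1 - wx) * F[y1, x0] + wy * wx * F[y1, x1])

def kpalign_at(F, i, j, locator, predictors):
    """KPAlign at location (i, j) of F: the locator produces K offsets
    o_t relative to (i, j); v_t is bilinearly sampled at (i, j) + o_t;
    predictor phi_t maps v_t to coordinates relative to o_t; the final
    keypoint is x_t = o_t + phi_t(v_t), still relative to (i, j)."""
    offsets = locator(F[i, j])                 # (K, 2), may be fractional
    kps = []
    for o_t, phi_t in zip(offsets, predictors):
        v_t = bilinear_sample(F, i + o_t[0], j + o_t[1])
        kps.append(o_t + phi_t(v_t))
    return np.stack(kps)
```

In the paper the sampler additionally draws its features from the next-finer FPN level P_{L−1} when the locator runs on P_L (unless P_L is already the finest level); the single-map version above keeps the sketch short.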
Table 2 – Ablation experiments on COCO minival for the design choices in KPAlign. "+ Grouped": using Grouped KPAlign, which has slightly better performance than the original but largely reduces the number of sampled feature vectors, thus being faster. "+ Sep. features": using separate (but slimmer) feature maps for different keypoint groups, which has similar computational complexity but improved performance. "+ Better sampling": the predictor samples features on finer feature maps (i.e., from P_L to P_{L−1}).

Table 3 – Ablation experiments on COCO minival for DirectPose with heatmap prediction. Baseline: without the heatmap learning. "16× Heatmaps": predicting the heatmaps with downsampling ratio 16. "8× Heatmaps": predicting the heatmaps with downsampling ratio 8 (i.e., using P3). "+ Long sched.": increasing the number of training epochs from 25 to 100. As shown in the table, learning with heatmap prediction can largely improve the performance. Moreover, the design choices of the heatmaps (e.g., the resolution) have a small impact on the final performance, which is one of the advantages of our framework over previous bottom-up methods.

We use the Grouped KPAlign for all the following experiments. We also attempted other ways of forming the groups, but they achieve similar performance.

3.1.4 Using separate convolutional features

As described before, using separate feature maps for these keypoint groups can improve the performance. Here we conduct an experiment for this. As shown in Table 2, using separate feature maps can boost the performance by 0.8% in APkp (from 50.6% to 51.4%). Note that the number of channels of these separate feature maps is reduced from 256 to 64, and thus the model has similar computational complexity to the original one. As a result, the model using separate feature maps achieves a better trade-off between speed and accuracy.

3.1.5 Where to sample features in KPAlign?

In this experiment, the sampler samples on finer feature maps, as described in Sec. 2.2, since the locator requires low-resolution high-level feature maps with a larger receptive field while the predictor prefers high-resolution low-level feature maps. As shown in Table 2 ("+ Better sampling"), using this sampling strategy can improve the performance to 52.2%. Note that using the sampling strategy does not increase the computational complexity of the model. Moreover, the better sampling strategy improves APkp_50 and APkp_75 by 0.1% and 1.0%, respectively, which implies that the sampling strategy results in more accurate keypoint predictions, because the improvement mainly comes from the APkp at higher thresholds.

3.1.6 Regularization from heatmap learning

As shown in Table 3 ("w/ 8× Heatmaps"), by jointly learning the regression-based model with a heatmap prediction task, the performance of the regression-based task can be largely improved, from 52.2% to 58.0%. Note that the heatmap prediction is only used during training to provide the multi-task regularization. Moreover, we also conduct experiments with the heatmap prediction at a lower resolution (i.e., "w/ 16× Heatmaps"). As shown in Table 3, even with the low-resolution heatmaps, the model can still yield a similar performance. This suggests that our method is not sensitive to the design choices of the heatmap learning and thus eliminates the heuristic tuning of the heatmap branch. This sets our method apart from previous heatmap-based bottom-up methods, whose performance highly depends on the design of the heatmap branch (e.g., the heatmaps' resolution).

Moreover, we find that our method is highly under-fitting, and previous methods such as [26] with heatmap learning are trained with many more epochs than ours; therefore, we increase the number of epochs from 25 to 100. As shown in Table 3, this improves the performance by 5.1% in APkp.

w/ BBox | AP^bb | AP^bb_50 | AP^bb_75 | AP^kp | AP^kp_50 | AP^kp_75
   –    |   –   |    –     |    –     | 63.1  |   85.6   |   68.8
   ✓    | 55.3  |   81.5   |   59.9   | 61.5  |   84.3   |   67.5

Table 4 – Our framework with person bounding-box detection on COCO minival. The proposed framework can achieve reasonable person detection results (55.3% in AP). As a reference, the Faster R-CNN person detector in Mask R-CNN [9] achieves 53.7% in AP.

3.2. Combining with Bounding Box Detection

As mentioned before, by simply adding a bounding-box branch, the proposed framework can simultaneously detect bounding boxes and keypoints. Here we confirm it by experiment. The bounding-box detection is implemented by adding the original box detection head of FCOS to the framework. As shown in Table 4, our framework can
Method | APkp | APkp_50 | APkp_75 | APkp_M | APkp_L
Top-down Methods
Mask R-CNN [9] | 62.7 | 87.0 | 68.4 | 57.4 | 71.1
CPN [3] | 72.1 | 91.4 | 80.0 | 68.7 | 77.2
RMPE [6] | 72.3 | 89.2 | 79.1 | 68.0 | 78.6
CFN [12] | 72.6 | 86.1 | 69.7 | 78.3 | 64.1
HRNet-W48 [26] | 75.5 | 92.5 | 83.3 | 71.9 | 81.5
Bottom-up Methods
CMU-Pose∗† [1] | 61.8 | 84.9 | 67.5 | 57.1 | 68.2
AE [19] | 56.6 | 81.8 | 61.8 | 49.8 | 67.0
AE∗ | 62.8 | 84.6 | 69.2 | 57.5 | 70.6
AE∗† | 65.5 | 86.8 | 72.3 | 60.6 | 72.6
PersonLab [21] | 65.5 | 87.1 | 71.4 | 61.3 | 71.5
PersonLab† | 67.8 | 89.0 | 75.4 | 64.1 | 75.5
Direct End-to-end Methods
Ours (R-50) | 62.2 | 86.4 | 68.2 | 56.7 | 69.8
Ours (R-50)† | 63.0 | 86.8 | 69.3 | 59.1 | 69.3
Ours (R-101) | 63.3 | 86.7 | 69.4 | 57.8 | 71.2
Ours (R-101)† | 64.8 | 87.8 | 71.1 | 60.4 | 71.5

Table 5 – The performance of our proposed end-to-end framework on the COCO test-dev split. ∗ and † respectively denote using refining and multi-scale testing. As shown in the table, the new end-to-end framework achieves competitive or better performance than previous strong baselines (e.g., Mask R-CNN and CMU-Pose).

achieve a reasonable person detection performance, which is similar to that of the Faster R-CNN detector in Mask R-CNN (55.3% vs. 53.7%). Although Mask R-CNN can also simultaneously detect bounding-boxes and keypoints, we further unify the two tasks into the same methodology.

3.3. Comparisons with State-of-the-art Methods

In this section, we evaluate the proposed end-to-end keypoint detection framework on the MS-COCO test-dev split and compare it with previous bottom-up and top-down methods. We make use of the best model in the ablation experiments. As shown in Table 5, without any bells and whistles (e.g., multi-scale and flipping testing, the refining in [1, 19], or any other tricks), the end-to-end framework achieves 62.2% and 63.3% in APkp on the COCO test-dev split, with ResNet-50 and ResNet-101 as the backbone, respectively. With multi-scale testing, our framework achieves 63.0% and 64.8% with ResNet-50 and ResNet-101, respectively. Qualitative results are provided in the supplemental material.

Compared to Bottom-up Methods: The performance of our ResNet-50 based end-to-end framework is better (62.2% vs. 61.8%) than the strong baseline CMU-Pose [1], which uses multi-scale testing and post-processing with CPM [29], and filters the results with an object detector. Our framework also achieves much better performance than the bottom-up method AE [19] (63.3% vs. 56.6%) and is even better than the method with refining. Compared to PersonLab, with the same ResNet-101 backbone and single-scale testing, our proposed framework also has competitive performance (63.3% vs. 65.5%). Note that our proposed framework is much simpler than these bottom-up methods, in both training and testing.

Compared to Top-down Methods: With the same ResNet-50 backbone, the proposed method has similar performance to the previous strong baseline Mask R-CNN (62.2% vs. 62.7%). Our model is still behind other top-down methods. However, it is worth noting that these methods often employ a separate bounding-box detector to obtain person instances. These instances are then cropped from the original image, and a single-person pose estimation method is separately applied to each cropped image to obtain the final results. As noted before, this strategy is slow, as it cannot take advantage of the shared-computation mechanism in CNNs. In contrast, our proposed end-to-end framework is much simpler and faster, since it directly maps the raw input images to the final instance-aware keypoint detections with a fully convolutional network.

Timing: The average inference time of our model on the COCO minival split is 74ms and 87ms per image with ResNet-50 and ResNet-101, respectively, which is slightly faster than Mask R-CNN with the same hardware and backbones (Mask R-CNN takes 78ms per image with ResNet-50). Additionally, the running time of Mask R-CNN depends on the number of instances, while our model, similar to one-stage object detectors, has nearly constant inference time for any number of instances.

4. More Discussions and Results

Here 1) we further compare our proposed DirectPose against the recent SPM method [20]; 2) we show the visualization results of the proposed KPAlign module; 3) the loss curves of training with or without the proposed heatmap learning are shown as well; and 4) we show some final detection results with or without the simultaneous bounding-box detection.

4.1. Comparison to SPM

Here we highlight the differences between our proposed DirectPose and SPM [20]. SPM makes use of Hierarchical Structured Pose Representations (Hierarchical SPR) to avoid learning the long-range displacements between the root and the keypoints, which shares a similar motivation with the proposed KPAlign. However, SPM considers all the key nodes (including the root nodes and intermediate nodes) in the hierarchical SPR as regression targets, and instance-agnostic heatmaps are used to predict these nodes.
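For reference, the heatmap regularization discussed in Sec. 3.1.6, whose effect the loss curves in this section visualize, amounts to a standard auxiliary-task objective. The sketch below is an assumption about the general shape only: the paper does not spell out its exact loss forms here, and the L1/L2 terms and the weight `lam` are illustrative.

```python
import numpy as np

def joint_training_loss(pred_kps, gt_kps, pred_hm, gt_hm, lam=1.0):
    """Illustrative joint objective: keypoint regression plus an auxiliary
    heatmap term. The heatmap branch only regularizes training; at test
    time it is disabled, so inference cost is unchanged. The L1/L2 forms
    and the weight lam are assumptions, not the paper's exact losses."""
    reg_loss = np.abs(pred_kps - gt_kps).mean()     # regression branch
    hm_loss = ((pred_hm - gt_hm) ** 2).mean()       # auxiliary heatmap branch
    return reg_loss + lam * hm_loss
```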
(Figure 4: keypoint regression loss curves during training, with and without the heatmap learning.)

5. Conclusion

We have proposed the first direct end-to-end human pose estimation framework, termed DirectPose. Our proposed …
Figure 5 – Visualization results of KPAlign on MS-COCO minival. The first image in each group shows the outputs of the locator in KPAlign (i.e., the locations where the sampler samples the features used to predict the keypoints). The orange point denotes the original location whose features would be used if KPAlign were not used. The second image shows the final keypoint detection results. As shown in the figure, the proposed KPAlign can make use of the features near the keypoints to predict them. The final image shows the ground-truth keypoints. Zoom in for a better view.
Figure 6 – Visualization results of the proposed DirectPose on MS-COCO minival. DirectPose can directly detect a wide range of poses. Note that some small-scale people do not have ground-truth keypoint annotations in the training set of MS-COCO, and thus they might be missed when testing.

Figure 7 – Visualization results of the proposed DirectPose with the simultaneous bounding-box detection on MS-COCO minival.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 770–778, 2016.
[11] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[12] Shaoli Huang, Mingming Gong, and Dacheng Tao. A coarse-fine network for keypoint localization. In Proc. IEEE Int. Conf. Comp. Vis., pages 3028–3037, 2017.
[13] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In Proc. Eur. Conf. Comp. Vis., pages 34–50. Springer, 2016.
[14] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2117–2125, 2017.
[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2980–2988, 2017.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pages 740–755. Springer, 2014.
[17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3431–3440, 2015.
[18] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., pages 4898–4906, 2016.
[19] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Proc. Adv. Neural Inf. Process. Syst., pages 2277–2287, 2017.
[20] Xuecheng Nie, Jiashi Feng, Jianfeng Zhang, and Shuicheng Yan. Single-stage multi-person pose machines. In Proc. IEEE Int. Conf. Comp. Vis., October 2019.
[21] George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, and Kevin Murphy. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proc. Eur. Conf. Comp. Vis., pages 269–286, 2018.
[22] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4903–4911, 2017.
[23] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4929–4937, 2016.
[24] Leonid Pishchulin, Arjun Jain, Mykhaylo Andriluka, Thorsten Thormählen, and Bernt Schiele. Articulated people detection and pose estimation: Reshaping the future. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3178–3185. IEEE, 2012.
[25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pages 91–99, 2015.
[26] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5693–5703, 2019.
[27] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proc. Eur. Conf. Comp. Vis., pages 529–545, 2018.
[28] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.