ARShoe: Real-Time Augmented Reality Shoe Try-On System on Smartphones

Shan An1,2, Guangfu Che1, Jinghao Guo1, Haogang Zhu2,3, Junjie Ye1, Fangru Zhou1, Zhaoqi Zhu1, Dong Wei1, Aishan Liu2, Wei Zhang1
1 JD.COM Inc., Beijing, China
2 State Key Lab of Software Development Environment, Beihang University, Beijing, China
3 Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beijing, China

arXiv:2108.10515v1 [cs.CV] 24 Aug 2021
ABSTRACT
Virtual try-on technology enables users to try various fashion items using augmented reality and provides a convenient online shopping experience. However, most previous works focus on virtual try-on for clothes while neglecting that for shoes, which is also a promising task. To this end, this work proposes ARShoe, a real-time augmented reality virtual shoe try-on system for smartphones. Specifically, ARShoe adopts a novel multi-branch network to realize pose estimation and segmentation simultaneously. A solution to generate realistic 3D shoe model occlusion during the try-on process is presented. To achieve a smooth and stable try-on effect, this work further develops a novel stabilization method. Moreover, for training and evaluation, we construct the first large-scale foot benchmark annotated with multiple labels related to the virtual shoe try-on task. Exhaustive experiments on our newly constructed benchmark demonstrate the satisfying performance of ARShoe. Practical tests on common smartphones validate the real-time performance and stability of the proposed approach.
CCS CONCEPTS
• Human-centered computing → Mixed / augmented reality; • Computing methodologies → Image-based rendering; Image segmentation; Neural networks.

KEYWORDS
Augmented reality; Virtual try-on; Pose estimation; Segmentation; Stabilization; Benchmark
Figure 2: The working procedure of ARShoe. Consisting of an encoder and a decoder, the multi-branch network is trained to predict keypoint heatmaps (heatmap), PAF heatmaps (pafmap), and segmentation results (segmap) simultaneously. Post-processing is then performed for a smooth and realistic virtual try-on.
the authors also follow a similar procedure. Pixel-wise voting network (PVNet) [23] predicts pixel-wise vectors pointing to the object keypoints and uses these vectors to vote for keypoint locations. This method performs well on occluded and truncated objects. Dense pose object detector (DPOD) [37] refines an initial 6-DoF pose computed from 2D-3D correspondences between images and the corresponding 3D models. PVN3D [14] extends 2D keypoint approaches by detecting 3D keypoints of objects and deriving the 6-DoF pose with least-squares fitting. In our case, one branch of our network is trained to predict 2D keypoints of feet, from which the 6-DoF poses can be obtained using the PnP algorithm.
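For illustration, the keypoints-to-pose step can be sketched with OpenCV's PnP solver as below. This is a minimal sketch under stated assumptions: the canonical 3D foot reference points, the use of RANSAC, and the EPnP flag are illustrative choices, not necessarily the exact configuration used by ARShoe.

```python
import numpy as np
import cv2

def estimate_foot_pose(keypoints_2d, model_points_3d, fx, fy, cx, cy):
    """Recover a 6-DoF pose from predicted 2D keypoints via PnP.

    keypoints_2d:    (N, 2) pixel coordinates predicted by the network.
    model_points_3d: (N, 3) corresponding points on a canonical 3D foot
                     model (hypothetical reference coordinates).
    Returns a 3x3 rotation matrix and a translation vector that map
    model coordinates into the camera frame.
    """
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]], dtype=np.float64)
    dist = np.zeros(5)  # assume an undistorted image for simplicity

    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        model_points_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        K, dist, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent keypoints")

    R, _ = cv2.Rodrigues(rvec)  # axis-angle -> rotation matrix
    return R, tvec
```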
2.3 Virtual Try-On Techniques
In recent years, the virtual try-on community has put great effort into virtual try-on for clothes and achieved promising progress. The main approaches can be divided into two categories, namely image-based and video-based methods. For the former, VITON [12] and VTNFP [33] generate a synthesized image of a person wearing a desired clothing item. For the latter, FW-GAN [9] synthesizes a video of virtual clothes try-on based on target images and poses. Virtual try-on for other fashion items, such as cosmetics, glasses, and shoes, has not attracted as much research attention as that for clothes. For glasses try-on, X. Yuan et al. [36] reconstruct 3D eyeglasses from an image and synthesize the eyeglasses on the user's face in a selfie video. In [38], a virtual eyeglasses model is superimposed on the human face according to the estimated head pose. Chao-Te Chou et al. [7] propose to encode feet and shoes and decode them into the desired image. However, this approach only works well in scenes with a simple background and can hardly generalize to practical usage conditions. Moreover, its huge computational cost makes it unsuitable for mobile devices. In contrast, this work focuses on virtual shoe try-on and proposes a neat and novel framework that realizes an appealingly realistic rendering effect while running in real time on smartphones.

3 ARSHOE SYSTEM
As shown in Fig. 2, our ARShoe system has three important components: 1) a fast and accurate network for keypoint prediction, pose estimation, and segmentation; 2) a realistic occlusion generation procedure; and 3) a 3D model stabilization approach. The foot pose is estimated with the help of the predicted part affinity fields (PAFs) [3], which group the estimated keypoints into correct foot instances. Afterward, the 6-DoF pose of each foot is obtained from its 2D keypoints. The 3D shoe models can then be rendered at the corresponding poses. The segmentation results help find the area of the 3D shoe model that should be occluded by the leg. Lastly, the 3D model stabilization module ensures a smooth and realistic virtual try-on effect. Moreover, for effective and accurate annotation, an original 6-DoF foot pose annotation method is also presented in this section.

3.1 Network in ARShoe
Network structure: We devise an encoder-decoder network for 2D foot keypoint localization and human leg and foot segmentation. Inspired by [24], the encoder fuses high-level and low-level features for better representation. The decoder contains three branches: one to predict the heatmap of foot keypoints, another for PAF prediction, and the other for human leg and foot segmentation.

Figure 3: The architecture of the upsample module. It consists of a residual block and two consecutive pixel shuffle layers, which upsample the 16 × 16 feature map to 64 × 64.
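The upsample module of Fig. 3 can be sketched in PyTorch roughly as follows. The residual-block layout and the 1×1 expansion convolutions before each pixel shuffle are assumptions for illustration; only the overall structure (one residual block, two pixel-shuffle layers, 4× spatial upsampling) follows the figure.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Minimal residual block; the exact block used by ARShoe may differ."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class UpsampleModule(nn.Module):
    """Residual block followed by two pixel-shuffle layers (4x upsampling),
    e.g. a 16x16 feature map becomes 64x64 as in Fig. 3."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.res = ResBlock(in_channels)
        # each PixelShuffle(2) trades 4x channels for 2x spatial resolution
        self.expand1 = nn.Conv2d(in_channels, in_channels * 4, 1)
        self.shuffle1 = nn.PixelShuffle(2)
        self.expand2 = nn.Conv2d(in_channels, out_channels * 4, 1)
        self.shuffle2 = nn.PixelShuffle(2)

    def forward(self, x):
        x = self.res(x)
        x = self.shuffle1(self.expand1(x))     # height and width doubled
        return self.shuffle2(self.expand2(x))  # doubled again

# e.g. a fused 128x16x16 feature map -> 14x64x64 PAF maps
pafs = UpsampleModule(128, 14)(torch.randn(1, 128, 16, 16))
```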
3.3 3D Model Stabilization
To stabilize the overlaid 3D shoe model, we propose a method that combines the movement of corner points and the estimated 6-DoF foot pose. As we process an image sequence, we denote each frame as $I_l$, where $l \in \{0, ..., k\}$. Assuming $I_l$ is the current image, the previous frame is $I_{l-1}$.

Firstly, we extract FAST corner points of $I_{l-1}$ and $I_l$ in the foot area according to the foot segmentation. These points are represented as $P_{l-1}$ and $P_l$. Then we use bi-directional optical flow to obtain corner point pairs $P'_{l-1} = [p^1_{l-1}, ..., p^n_{l-1}]$ and $P'_l = [p^1_l, ..., p^n_l]$, which have a matching relationship. Here $n$ is the number of matching points.

The pose of a foot is represented by a $4 \times 4$ homogeneous matrix, which maps its 3D model coordinate frame to the camera coordinate frame:

$$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \in SE(3), \quad R \in SO(3), \quad t \in \mathbb{R}^3 \qquad (1)$$

where $R$ and $t = [t_x, t_y, t_z]^T$ are the rotation matrix and the translation vector. The pose of the current frame is $T_l$, while the pose of the previous frame is $T_{l-1}$. The point clouds of the 3D shoe model are denoted as $C$, and the camera intrinsic matrix is $K$:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3} \qquad (2)$$

where $f_x$ and $f_y$ are the focal lengths, and $c_x$ and $c_y$ are the principal point offsets. Because of keypoint detection noise, there are inevitable jitters when using $T_l$ for the virtual shoe model rendering. We leverage $P'_{l-1}$ and $P'_l$ to update a refined pose $T''_l$ for the stabilization.

The average $L_1$ distance between the point sets $C_1$ and $C_2$ is:

$$D = \|C_1 - C_2\|_1 \qquad (7)$$

The updating weight $w_R$ of the rotation matrix is:

$$w_R = \alpha \ln(D) + \beta \qquad (8)$$

Here $\alpha = 0.432$ and $\beta = 2.388$ are empirical parameters. The updating weight $w_T$ of the translation is:

$$w_T = w_R \times w_R \qquad (9)$$

Finally, we compute the updated $R''_l$ and $T''_l$ using the following steps:
1. The rotation matrices $R_l$ and $R_{l-1}$ are converted to the quaternions $M_l$ and $M_{l-1}$.
2. The quaternion $M''_l$ is updated by:
$$M''_l = w_R \times M_l + (1 - w_R) \times M_{l-1} \qquad (10)$$
3. The quaternion $M''_l$ is transformed back to the rotation matrix $R''_l$.
4. We update the translation at time $l$:
$$T''_l = w_T \times T_l + (1 - w_T) \times T'_l \qquad (11)$$

Between two consecutive frames, the change of the foot pose is continuous and subtle compared with the irregular and significant noise. To ensure the stability and consistency of the foot pose in the sequence, we first calculate the approximate position of the human foot in 3D space according to Eq. (5). When $T$ is updated, the rotation $R$ contributes the most to the reprojection error. After the weighted filtering of the rotation matrix, the change of angle stays consistent with the previous frame. Hence, when the change of foot poses across two consecutive frames is slight, the pose of the previous frame gains more weight than that of the current frame, and vice versa.
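To make the procedure concrete, the sketch below matches FAST corners between consecutive frames with a forward-backward Lucas-Kanade check and then applies the weighted update of Eqs. (7)-(11). It is a minimal illustration, not the production implementation: the FAST/LK parameters are assumptions, clamping $w_R$ to [0, 1] is our own safeguard since $\alpha\ln(D)+\beta$ is unbounded, and the refined translation $T'_l$ is assumed to come from the corner-based position estimate of Eq. (5).

```python
import cv2
import numpy as np
from scipy.spatial.transform import Rotation

ALPHA, BETA = 0.432, 2.388  # empirical parameters of Eq. (8)

def match_foot_corners(prev_gray, cur_gray, foot_mask, fb_thresh=1.0):
    """FAST corners in the foot region, matched by bi-directional LK flow."""
    fast = cv2.FastFeatureDetector_create()
    kps = fast.detect(prev_gray, foot_mask)
    if not kps:
        return np.empty((0, 2)), np.empty((0, 2))
    p_prev = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
    p_cur, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, p_prev, None)
    p_back, st2, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, p_cur, None)
    fb_err = np.linalg.norm((p_prev - p_back).reshape(-1, 2), axis=1)
    keep = (st.ravel() == 1) & (st2.ravel() == 1) & (fb_err < fb_thresh)
    return p_prev.reshape(-1, 2)[keep], p_cur.reshape(-1, 2)[keep]

def stabilize_pose(R_prev, R_cur, t_cur, t_refined, pts_prev, pts_cur):
    """Weighted pose update following Eqs. (7)-(11)."""
    # Eq. (7): average L1 distance between the matched corner sets
    D = float(np.abs(pts_prev - pts_cur).mean()) if len(pts_prev) else 1.0
    # Eq. (8): rotation weight (clamped to [0, 1] as an extra assumption)
    w_R = float(np.clip(ALPHA * np.log(max(D, 1e-6)) + BETA, 0.0, 1.0))
    w_T = w_R * w_R  # Eq. (9)

    # Steps 1-3: matrices -> quaternions, linear blend (Eq. 10), back to matrix
    q_cur = Rotation.from_matrix(R_cur).as_quat()
    q_prev = Rotation.from_matrix(R_prev).as_quat()
    if np.dot(q_cur, q_prev) < 0:  # keep both quaternions in one hemisphere
        q_prev = -q_prev
    q_new = w_R * q_cur + (1.0 - w_R) * q_prev
    R_new = Rotation.from_quat(q_new / np.linalg.norm(q_new)).as_matrix()

    # Step 4 / Eq. (11): weighted translation update
    t_new = w_T * np.asarray(t_cur) + (1.0 - w_T) * np.asarray(t_refined)
    return R_new, t_new
```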
Figure 9: Illustration of the annotation. A 3D shoe model is
adjusted manually to fit the shoe or foot in the image.
YOLACT-Darknet53      16.416  47.763  29.660  1035.180
YOLACT-Resnet50       12.579  31.144  26.595  1116.825
YOLACT-Resnet101      17.441  50.137  40.234  1876.076
SOLOv2-Resnet50-FPN    6.161  26.412  23.313   893.511
SOLOv2-Resnet101-FPN  10.081  45.788  35.411  1041.213
SOLOv2-DCN101-FPN     14.334  55.311  40.211  1322.024
ARShoe                 0.984   1.292  11.844    59.919

Lightweight OpenPose [22]         0.719  0.958  0.788  0.399  68.18
Fast-human-pose-teacher [39]      0.817  0.976  0.884  0.617  15.39
Fast-human-pose-student [39]      0.786  0.974  0.857  0.490  16.33
Fast-human-pose-student-FPD [39]  0.797  0.976  0.863  0.524  16.33
ARShoe                            0.776  0.972  0.855  0.490  84.43

Table 4: Comparative results of foot and leg segmentation.
84.43 FPS on GPU, ARShoe still achieves promising segmentation performance that is not inferior to the SOTA approaches designed specifically for segmentation. The mIoU and AP of ARShoe are 0.901 and 0.927, which are only 0.042 and 0.033 lower than the first place, respectively. Considering the practicability on mobile devices, our approach realizes the best balance between computational efficiency and accuracy.

Remark 4: Since this work targets real-world industrial applications, we argue that the slight margin in performance is insignificant compared to the high practicability that ARShoe brings.

6-DoF pose estimation performance: We also evaluate the performance of our 6-DoF pose estimation method on the test set. The average error of the three Euler angles is 6.625°, and the average distance error is only 0.384 cm. Such satisfying 6-DoF pose estimation performance demonstrates that ARShoe can precisely handle the virtual try-on task in real-world scenarios.

Qualitative evaluation: Some virtual shoe try-on results generated by ARShoe are shown in Fig. 11. It can be seen that our system is robust to various foot poses and realizes a realistic virtual try-on effect in practical scenes. A virtual try-on video generated by ARShoe is also provided in the supplementary material.

5 CONCLUSION
This work presents a highly efficient virtual shoe try-on system, namely ARShoe. A novel multi-branch network is proposed to estimate keypoints, PAFs, and segmentation masks simultaneously and effectively. To realize a smooth and realistic try-on effect, an original 3D shoe overlay generation approach and a novel stabilization method are proposed. Exhaustive quantitative and qualitative evaluations demonstrate the superior virtual try-on effect generated by ARShoe, with real-time speed on smartphones. Apart from the methodology, this work constructs the first large-scale foot benchmark with all virtual shoe try-on task-related information annotated. To sum up, we believe both the ARShoe virtual shoe try-on system and the well-constructed foot benchmark will significantly contribute to the virtual try-on community.

6 ACKNOWLEDGMENTS
This work was supported by grants from the National Key Research and Development Program of China (Grant No. 2020YFC2006200).
REFERENCES
[1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3686–3693. https://doi.org/10.1109/CVPR.2014.471
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. 2019. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 9156–9165. https://doi.org/10.1109/ICCV.2019.00925
[3] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2021. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 1 (2021), 172–186. https://doi.org/10.1109/TPAMI.2019.2929257
[4] João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. 2016. Human Pose Estimation with Iterative Error Feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4733–4742. https://doi.org/10.1109/CVPR.2016.512
[5] Géry Casiez, Nicolas Roussel, and Daniel Vogel. 2012. 1 € Filter: A Simple Speed-Based Low-Pass Filter for Noisy Input in Interactive Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). 2527–2530. https://doi.org/10.1145/2207676.2208639
[6] Yifei Chen, Haoyu Ma, Deying Kong, Xiangyi Yan, Jianbao Wu, Wei Fan, and Xiaohui Xie. 2020. Nonparametric Structure Regularization Machine for 2D Hand Pose Estimation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV). 370–379. https://doi.org/10.1109/WACV45572.2020.9093271
[7] Chao-Te Chou, Cheng-Han Lee, Kaipeng Zhang, Hu-Cheng Lee, and Winston H. Hsu. 2019. PIVTONS: Pose Invariant Virtual Try-On Shoe with Conditional Image Completion. In Proceedings of the Asian Conference on Computer Vision (ACCV). 654–668. https://doi.org/10.1007/978-3-030-20876-9_41
[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 764–773. https://doi.org/10.1109/ICCV.2017.89
[9] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. 2019. FW-GAN: Flow-Navigated Warping GAN for Video Virtual Try-On. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 1161–1170. https://doi.org/10.1109/ICCV.2019.00125
[10] Xiaochuan Fan, Kang Zheng, Yuewei Lin, and Song Wang. 2015. Combining local appearance and holistic view: Dual-Source Deep Neural Networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1347–1355. https://doi.org/10.1109/CVPR.2015.7298740
[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
[12] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. 2018. VITON: An Image-Based Virtual Try-on Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7543–7552. https://doi.org/10.1109/CVPR.2018.00787
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
[14] Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. 2020. PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11629–11638. https://doi.org/10.1109/CVPR42600.2020.01165
[15] Chia-Wei Hsieh, Chieh-Yun Chen, Chien-Lung Chou, Hong-Han Shuai, Jiaying Liu, and Wen-Huang Cheng. 2019. FashionOn: Semantic-Guided Image-Based Virtual Try-on with Detailed Human and Clothing Information. In Proceedings of the ACM International Conference on Multimedia (MM). 275–283. https://doi.org/10.1145/3343031.3351075
[16] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2014. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1325–1339. https://doi.org/10.1109/TPAMI.2013.248
[17] Shuhui Jiang, Yue Wu, and Yun Fu. 2016. Deep Bi-Directional Cross-Triplet Embedding for Cross-Domain Clothing Retrieval. In Proceedings of the ACM International Conference on Multimedia (MM). 52–56. https://doi.org/10.1145/2964284.2967182
[18] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference for Learning Representations (ICLR). 1–15.
[19] Deying Kong, Haoyu Ma, Yifei Chen, and Xiaohui Xie. 2020. Rotation-invariant Mixed Graphical Model Network for 2D Hand Pose Estimation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV). 1535–1544. https://doi.org/10.1109/WACV45572.2020.9093638
[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 936–944. https://doi.org/10.1109/CVPR.2017.106
[21] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV). 483–499. https://doi.org/10.1007/978-3-319-46484-8_29
[22] Daniil Osokin. 2018. Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose. arXiv preprint arXiv:1811.12004.
[23] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. 2019. PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4556–4565. https://doi.org/10.1109/CVPR.2019.00469
[24] Rudra Poudel, Stephan Liwicki, and Roberto Cipolla. 2019. Fast-SCNN: Fast Semantic Segmentation Network. In Proceedings of the British Machine Vision Conference (BMVC). 1–12. https://doi.org/10.5244/C.33.187
[25] Mahdi Rad and Vincent Lepetit. 2017. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 3848–3856. https://doi.org/10.1109/ICCV.2017.413
[26] N Dinesh Reddy, Minh Vo, and Srinivasa G. Narasimhan. 2018. CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1906–1915. https://doi.org/10.1109/CVPR.2018.00204
[27] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. 2015. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 567–576. https://doi.org/10.1109/CVPR.2015.7298655
[28] Difei Tang, Juyong Zhang, Ketan Tang, Lingfeng Xu, and Lu Fang. 2014. Making 3D Eyeglasses Try-on practical. In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW). 1–6. https://doi.org/10.1109/ICMEW.2014.6890545
[29] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. 2018. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 292–301. https://doi.org/10.1109/CVPR.2018.00038
[30] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. 2018. Toward Characteristic-Preserving Image-Based Virtual Try-On Network. In Proceedings of the European Conference on Computer Vision (ECCV). 607–623. https://doi.org/10.1007/978-3-030-01261-8_36
[31] Jiahang Wang, Tong Sha, Wei Zhang, Zhoujun Li, and Tao Mei. 2020. Down to the Last Detail: Virtual Try-on with Fine-Grained Details. In Proceedings of the ACM International Conference on Multimedia (MM). 466–474. https://doi.org/10.1145/3394171.3413514
[32] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. 2020. SOLOv2: Dynamic and Fast Instance Segmentation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 17721–17732.
[33] Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. 2019. VTNFP: An Image-Based Virtual Try-On Network With Body and Clothing Feature Preservation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10510–10519. https://doi.org/10.1109/ICCV.2019.01061
[34] Xuzheng Yu, Tian Gan, Yinwei Wei, Zhiyong Cheng, and Liqiang Nie. 2020. Personalized Item Recommendation for Second-Hand Trading Platform. In Proceedings of the ACM International Conference on Multimedia (MM). 3478–3486. https://doi.org/10.1145/3394171.3413640
[35] Miaolong Yuan, Ishtiaq Rasool Khan, Farzam Farbiz, Arthur Niswar, and Zhiyong Huang. 2011. A Mixed Reality System for Virtual Glasses Try-On. In Proceedings of the International Conference on Virtual Reality Continuum and Its Applications in Industry (VRCAI). 363–366. https://doi.org/10.1145/2087756.2087816
[36] Xiaoyun Yuan, Difei Tang, Yebin Liu, Qing Ling, and Lu Fang. 2017. Magic Glasses: From 2D to 3D. IEEE Transactions on Circuits and Systems for Video Technology 27, 4 (2017), 843–854. https://doi.org/10.1109/TCSVT.2016.2556439
[37] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. 2019. DPOD: 6D Pose Object Detector and Refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 1941–1950. https://doi.org/10.1109/ICCV.2019.00203
[38] Boping Zhang. 2018. Augmented reality virtual glasses try-on technology based on iOS platform. EURASIP Journal on Image and Video Processing 132, 2018 (2018), 1–19. https://doi.org/10.1186/s13640-018-0373-8
[39] Feng Zhang, Xiatian Zhu, and Mao Ye. 2019. Fast Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3512–3521. https://doi.org/10.1109/CVPR.2019.00363
[40] Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. 2019. Virtually Trying on New Clothing with Arbitrary Poses. In Proceedings of the ACM International Conference on Multimedia (MM). 266–274. https://doi.org/10.1145/3343031.3350946
SUPPLEMENTARY MATERIAL

A ABSTRACT
This supplementary material presents the detailed network structure, parameter settings, and training losses of ARShoe.

B DETAILS OF NETWORK ARCHITECTURE
Figure 12 illustrates the detailed network architecture and parameter settings of our multi-branch network. First, the encoder extracts low-level features using a convolution block and two depthwise separable convolution blocks (DSConv). The high-level features are then obtained through three residual bottleneck blocks (Bottleneck), a pyramid pooling block, and a depthwise convolution block (DWConv). Afterward, both low-level and high-level features go through a convolution block to unify their channel numbers. To integrate high-level semantic features and low-level shallow features for better image representation, the high-level and low-level features are summed and passed through a ReLU activation. In the decoding process, three upsample modules decode the fused features into the keypoint heatmap (heatmap), the PAF heatmap (pafmap), and the segmentation results (segmap). The dimension variation of the feature maps is presented in Fig. 12.
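The feature fusion and three-branch decoding described above can be sketched roughly as follows. The channel numbers and output shapes follow the dimensions listed in Fig. 12, but the block internals are assumptions: simple bilinear upsampling heads stand in for the pixel-shuffle upsample module of Fig. 3, and the projection convolutions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAndHeads(nn.Module):
    """Feature fusion plus the three decoder branches (sketch only)."""
    def __init__(self, low_ch=64, high_ch=128, fuse_ch=96):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, fuse_ch, 1)    # unify channel numbers
        self.high_proj = nn.Conv2d(high_ch, fuse_ch, 1)

        def head(out_ch):
            # stand-in for the pixel-shuffle upsample module of Fig. 3
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(fuse_ch, out_ch, 3, padding=1))

        self.heatmap_head = head(8)    # 8 keypoint heatmaps (Fig. 12)
        self.pafmap_head = head(14)    # 14 PAF channels
        self.segmap_head = head(2)     # foot / leg segmentation

    def forward(self, low_feat, high_feat):
        # bring high-level features to the low-level resolution, sum, ReLU
        high = F.interpolate(self.high_proj(high_feat), size=low_feat.shape[-2:],
                             mode='bilinear', align_corners=False)
        fused = F.relu(self.low_proj(low_feat) + high)
        return (self.heatmap_head(fused), self.pafmap_head(fused),
                self.segmap_head(fused))

# e.g. low-level 64x32x32 and high-level 128x16x16 features -> 64x64 outputs
heat, paf, seg = FusionAndHeads()(torch.randn(1, 64, 32, 32),
                                  torch.randn(1, 128, 16, 16))
```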
C TRAINING LOSSES
We use an L2 loss for all three subtasks in our network. The ground truths of the keypoint prediction branch, the PAF estimation branch, and the segmentation branch are the annotated keypoints, the connections between keypoints, and the segmentation of foot and leg, respectively. Denoting the losses of these three branches as $\mathcal{L}_{heatmap}$, $\mathcal{L}_{pafmap}$, and $\mathcal{L}_{segmap}$, the overall loss $\mathcal{L}$ of our network is formulated as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{heatmap} + \lambda_2 \mathcal{L}_{pafmap} + \lambda_3 \mathcal{L}_{segmap} \qquad (12)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are coefficients that balance the contribution of each loss. In our implementation, they are set to 2, 1, and 0.1, respectively.

Figure 12: The detailed network architecture of ARShoe. Layer outputs (channels × height × width): Input 3×256×256 → Conv 32×128×128 → DSConv 48×64×64 → DSConv 64×64×64 → Bottleneck 64×32×32 → Bottleneck 96×16×16 → Bottleneck 128×16×16 → Pyramid Pooling 128×16×16 → Upsample 128×32×32 → DWConv 96×32×32 → Conv 96×32×32, summed with the low-level features after a Conv; three upsample modules then produce heatmap 8×64×64, pafmap 14×64×64, and segmap 2×64×64.
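A minimal PyTorch rendering of the loss in Eq. (12) is given below, assuming the three branches regress dense targets of the shapes listed in Fig. 12; the weighting coefficients are the values reported above.

```python
import torch.nn.functional as F

# balancing coefficients of Eq. (12)
LAMBDA_HEATMAP, LAMBDA_PAF, LAMBDA_SEG = 2.0, 1.0, 0.1

def arshoe_loss(pred_heatmap, gt_heatmap, pred_paf, gt_paf, pred_seg, gt_seg):
    """Combined L2 loss over the three branches (Eq. 12).

    Shapes follow Fig. 12: heatmaps Bx8x64x64, PAF maps Bx14x64x64,
    segmentation maps Bx2x64x64, all regressed against dense ground truth.
    """
    l_heatmap = F.mse_loss(pred_heatmap, gt_heatmap)
    l_paf = F.mse_loss(pred_paf, gt_paf)
    l_seg = F.mse_loss(pred_seg, gt_seg)
    return LAMBDA_HEATMAP * l_heatmap + LAMBDA_PAF * l_paf + LAMBDA_SEG * l_seg
```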