[Figure 1 panels: Input Image → Front & Back Normal Estimation (2D) → Front & Back d-BiNI Surfaces (2.5D) → Full Clothed Human Mesh (3D)]
Figure 1. Human digitization from a color image. ECON combines the best aspects of implicit and explicit surfaces to infer high-fidelity 3D humans, even with loose clothing or in challenging poses. It does so in three steps: (1) It infers detailed 2D normal maps for the front and back sides (Sec. 3.1). (2) The normal maps are converted into detailed, yet incomplete, 2.5D front and back surfaces, guided by a SMPL-X estimate (Sec. 3.2). (3) It then "inpaints" the missing geometry between the two surfaces (Sec. 3.3). If the face or hands are noisy, they can optionally be replaced with the ones from SMPL-X, which have cleaner geometry. See the video on our website for more results.
Figure 3. Overview. ECON takes as input an RGB image, $I$, and a SMPL-X body, $M^b$. Conditioned on the rendered front and back body normal images, $N^b$, ECON first predicts front and back clothing normal maps, $\hat{N}^c$. These two normal maps, along with the body depth maps, $Z^b$, are fed into a d-BiNI optimizer to produce front and back surfaces, $\{M_F, M_B\}$. Based on these partial surfaces and the body estimate $M^b$, IF-Nets+ implicitly completes $R_{IF}$. With optional face or hands from $M^b$, screened Poisson combines all of the above into the final watertight $R$.
body landmarks in an additional loss term $L_{\text{J\_diff}}$, to further optimize the SMPL-X body, $M^b$, inferred from PIXIE [18]. Specifically, we optimize SMPL-X's shape, $\beta$, pose, $\theta$, and translation, $t$, to minimize:

$$L_{\text{SMPL-X}} = L_{\text{N\_diff}} + L_{\text{S\_diff}} + L_{\text{J\_diff}}, \qquad L_{\text{J\_diff}} = \lambda_{\text{J\_diff}}\,\big|\hat{J}^b - \hat{J}^c\big|, \tag{1}$$

where $L_{\text{N\_diff}}$ and $L_{\text{S\_diff}}$ are the normal-map and silhouette losses introduced by ICON [67], and $L_{\text{J\_diff}}$ is the joint loss (L2) between the 2D landmarks $\hat{J}^c$, estimated by an off-the-shelf 2D pose estimator from the RGB image $I$, and the corresponding re-projected 2D joints $\hat{J}^b$ from $M^b$. For more implementation details, see Sec. A.1 in Appx.
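For intuition, here is a minimal PyTorch-style sketch of this refinement loop, minimizing Eq. (1) by gradient descent. The SMPL-X layer, the differentiable renderer `render_fn` (producing body normal maps, silhouettes, and projected joints), and the loss weighting are illustrative assumptions, not ECON's actual implementation; see Sec. A.1 in Appx. for the real details.

```python
import torch

def refine_smplx(smplx_layer, render_fn, n_cloth, mask_cloth, joints_2d,
                 betas, theta, trans, lam_j=1.0, steps=100, lr=1e-2):
    """Sketch: minimize Eq. (1) w.r.t. SMPL-X shape (betas), pose (theta),
    and translation (trans). All helper callables are hypothetical."""
    params = [p.requires_grad_(True) for p in (betas, theta, trans)]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        verts, joints_3d = smplx_layer(betas, theta, trans)   # body mesh M^b
        n_body, mask_body, joints_proj = render_fn(verts, joints_3d)
        loss_n = (n_body - n_cloth).abs().mean()              # L_N_diff
        loss_s = (mask_body - mask_cloth).abs().mean()        # L_S_diff
        loss_j = lam_j * torch.norm(joints_proj - joints_2d, dim=-1).mean()  # L_J_diff
        (loss_n + loss_s + loss_j).backward()
        opt.step()
    return betas.detach(), theta.detach(), trans.detach()
```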
3.2. Front and back surface reconstruction

We now lift the clothed normal maps to 2.5D surfaces. We expect these 2.5D surfaces to satisfy three conditions: (1) high-frequency surface details agree with the predicted clothed normal maps, (2) low-frequency surface variations, including discontinuities, agree with those of SMPL-X, and (3) the depths of the front and back silhouettes are close to each other.

Unlike PIFuHD [61] or ICON [67], which train a neural network to regress an implicit surface from a clothed normal map, we explicitly model the depth-normal relationship using variational normal integration methods [9, 57]. Specifically, we tailor the recent bilateral normal integration (BiNI) method [9] to full-body mesh reconstruction by harnessing the coarse prior, depth maps, and silhouette consistency.

To satisfy the three conditions, we propose a depth-aware silhouette-consistent bilateral normal integration (d-BiNI) method to jointly optimize the front and back clothed depth maps, $\hat{Z}^c_F$ and $\hat{Z}^c_B$:

$$\text{d-BiNI}(\hat{N}^c_F, \hat{N}^c_B, Z^b_F, Z^b_B) \rightarrow \hat{Z}^c_F, \hat{Z}^c_B. \tag{2}$$

Here, $\hat{N}^c_*$ is the front or back clothed normal map predicted by $G_N$ from $\{I, N^b\}$, and $Z^b_*$ is the front or back coarse body depth image rendered from the SMPL-X mesh, $M^b$. Specifically, our objective function consists of five terms:

$$\min_{\hat{Z}^c_F,\, \hat{Z}^c_B} \; L_n(\hat{Z}^c_F; \hat{N}^c_F) + L_n(\hat{Z}^c_B; \hat{N}^c_B) + \lambda_d L_d(\hat{Z}^c_F; Z^b_F) + \lambda_d L_d(\hat{Z}^c_B; Z^b_B) + \lambda_s L_s(\hat{Z}^c_F, \hat{Z}^c_B), \tag{3}$$

where $L_n$ are the front and back BiNI terms introduced by BiNI [9], $L_d$ are the front and back depth prior terms, and $L_s$ is a front-back silhouette consistency term. For a more detailed discussion of these terms, see Sec. A.2 in Appx.

With Eq. (3), we make two technical contributions beyond BiNI [9]. First, we use the coarse depth prior rendered from the SMPL-X body mesh, $Z^b_i$, to regularize BiNI:

$$L_d(\hat{Z}^c_i; Z^b_i) = |\hat{Z}^c_i - Z^b_i|, \quad i \in \{F, B\}. \tag{4}$$

This addresses the key problem of putting the front and back surfaces together coherently to form a full body: optimizing the BiNI terms $L_n$ alone leaves an arbitrary global offset between the front and back surfaces, while the depth prior terms $L_d$ encourage surfaces with otherwise undecided offsets to be consistent with the SMPL-X body. For further intuition on $L_n$ and $L_d$, see Fig. A.3 and Fig. A.4 in Appx.
Second, we use a silhouette consistency term to encourage the front and back depth values to be the same at the silhouette boundary:

$$L_s(\hat{Z}^c_F, \hat{Z}^c_B) = \big|\hat{Z}^c_F - \hat{Z}^c_B\big|_{\text{silhouette}}. \tag{5}$$

The silhouette term improves the physical consistency of the reconstructed front and back clothed depth maps. Without this term, d-BiNI produces intersections of the front and back surfaces around the silhouette, causing "blobby" artifacts and hurting reconstruction quality; see Fig. A.5 in Appx.
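As a concrete reading of Eqs. (4) and (5), the following NumPy sketch evaluates the two terms d-BiNI adds on top of the BiNI terms $L_n$ [9]; the array names and masking conventions are our own illustrative assumptions, not ECON's code.

```python
import numpy as np

def depth_prior_term(z_cloth_hat, z_body, omega_z):
    """L_d, Eq. (4): L1 distance between the optimized clothed depth and the
    SMPL-X body depth prior, over the region omega_z where the prior exists."""
    return np.abs(z_cloth_hat[omega_z] - z_body[omega_z]).mean()

def silhouette_term(z_front_hat, z_back_hat, boundary):
    """L_s, Eq. (5): front and back depths should coincide on the silhouette
    boundary, preventing intersecting, "blobby" surfaces."""
    return np.abs(z_front_hat[boundary] - z_back_hat[boundary]).mean()

# Full objective, Eq. (3):
#   L_n(F) + L_n(B) + lam_d * (L_d(F) + L_d(B)) + lam_s * L_s,
# with lam_d = 1e-4 and lam_s = 1e-6 (Sec. A.2 in Appx.).
```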
3.3. Human shape completion

For simple body poses without self-occlusions, merging the front and back d-BiNI surfaces in a straightforward way, as done in FACSIMILE [62] and Moulding Humans [21], can result in a complete 3D clothed scan. However, poses often cause self-occlusions, which leave large portions of the surfaces missing. In such cases, screened Poisson Surface Reconstruction (sPSR) [35] leads to blobby artifacts.

sPSR completion with SMPL-X (ECON_EX). A naive way to "infill" the missing surface is to make use of the estimated SMPL-X body. We remove the triangles of $M^b$ that are visible to the front or back cameras. The remaining triangle "soup", $M_{\text{cull}}$, contains both side-view boundaries and occluded regions. We apply sPSR [35] to the union of $M_{\text{cull}}$ and the d-BiNI surfaces $\{M_F, M_B\}$ to obtain a watertight reconstruction $R$. This approach is denoted as ECON_EX. Although ECON_EX avoids missing limbs or sides, it does not produce a coherent surface for the originally missing clothing and hair, because of the discrepancy between SMPL-X and the actual clothing or hair; see ECON_EX in Fig. 4.

Inpainting with IF-Nets+ ($R_{IF}$). To improve reconstruction coherence, we use a learned implicit-function (IF) model to "inpaint" the missing geometry given the front and back d-BiNI surfaces. Specifically, we tailor a general-purpose shape completion method, IF-Nets [12], into a SMPL-X-guided one, denoted as IF-Nets+. IF-Nets [12] completes a 3D shape from a deficient 3D input, such as an incomplete 3D human shape or a low-resolution voxel grid. Inspired by Li et al. [42], we adapt IF-Nets by conditioning it on a voxelized SMPL-X body to deal with pose variation; for details see Sec. A.3 in Appx. IF-Nets+ takes voxelized front and back ground-truth clothed depth maps, $\{Z^c_F, Z^c_B\}$, and a voxelized (estimated) body mesh, $M^b$, as input, and is supervised with ground-truth 3D shapes. During training, we randomly mask $\{Z^c_F, Z^c_B\}$ for robustness to occlusions. During inference, we feed the estimated $\hat{Z}^c_F$, $\hat{Z}^c_B$ and $M^b$ into IF-Nets+ to obtain an occupancy field, from which we extract the inpainted mesh, $R_{IF}$, with Marching Cubes [48].
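To illustrate this last step, here is a short sketch of extracting $R_{IF}$ from the predicted occupancy grid with scikit-image's Marching Cubes [48]; the iso-level and voxel spacing are assumptions, not ECON's exact settings.

```python
import numpy as np
from skimage import measure

def extract_inpainted_mesh(occupancy, voxel_size=1.0 / 256, level=0.5):
    """Extract the inpainted mesh R_IF from an IF-Nets+ occupancy field,
    sampled on a (D, H, W) grid of values in [0, 1]."""
    verts, faces, normals, _values = measure.marching_cubes(
        occupancy, level=level, spacing=(voxel_size,) * 3)
    return verts, faces, normals
```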
sPSR completion with SMPL-X and $R_{IF}$ (ECON_IF). To obtain our final mesh, $R$, we apply sPSR to stitch (1) the d-BiNI surfaces, (2) the side and occluded triangle soup $M_{\text{cull}}$ from $R_{IF}$, and, optionally, (3) the face or hands cropped from the estimated SMPL-X body. Although $R_{IF}$ is already a complete human mesh, we only use its side and occluded parts, because its front and back regions lack sharp details compared to the d-BiNI surfaces; this is due to the lossy voxelization of the inputs and the limited resolution of Marching Cubes; see the local difference between ECON_{IF,EX} and $R_{IF}$ in Fig. 4. Further, we use the face or hands cropped from $M^b$, because these parts are often poorly reconstructed in $R_{IF}$; see the difference in Fig. 5. This approach is denoted as ECON_IF.
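The stitching step itself can be sketched with off-the-shelf geometry tooling. Below is a hedged Open3D version of the union-and-sPSR step shared by ECON_EX and ECON_IF; the sampling density and octree depth are illustrative, and producing the culled triangle soup $M_{\text{cull}}$ is assumed to happen upstream.

```python
import open3d as o3d

def stitch_spsr(parts, n_points=200_000, octree_depth=10):
    """Fuse mesh parts (d-BiNI surfaces, M_cull, optional SMPL-X face/hands)
    into one watertight mesh R via screened Poisson reconstruction [35]."""
    pcd = o3d.geometry.PointCloud()
    for mesh in parts:
        # Oriented samples: triangle normals guide the Poisson solve.
        pcd += mesh.sample_points_uniformly(
            number_of_points=n_points, use_triangle_normal=True)
    mesh, _densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=octree_depth)
    return mesh
```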
4. Experiments

4.1. Datasets

Training on THuman2.0. THuman2.0 [70] contains 525 high-quality textured human scans in various poses, captured by a dense DSLR rig, along with their corresponding SMPL-X fits. We use THuman2.0 to train ICON's variants, IF-Nets+, IF-Nets, PIFu, and PaMIR.

Quantitative evaluation on CAPE & Renderpeople. We primarily evaluate on CAPE [50] and Renderpeople [58]. Specifically, we use the "CAPE-NFP" set (100 scans), which is used by ICON to analyze robustness to complex human poses. Moreover, we select another 100 scans from Renderpeople that contain loose clothing, such as dresses, skirts, robes, down jackets, and costumes. With such clothing variance, Renderpeople helps numerically evaluate the flexibility of reconstruction methods w.r.t. shape topology. Samples of the two datasets are shown in Fig. 6.

Table 1. Evaluation against the state of the art. All models use a resolution of 256 for Marching Cubes. *Method is re-implemented in ICON for a fair comparison in terms of network settings and training data. †Official model is trained on the Renderpeople dataset.

Methods     |  OOD poses (CAPE)             |  OOD outfits (Renderpeople)
            |  Chamfer↓  P2S↓    Normals↓   |  Chamfer↓  P2S↓    Normals↓
w/o SMPL body prior
PIFu*       |  1.722     1.548   0.0674     |  1.706     1.642   0.0709
PIFuHD†     |  3.767     3.591   0.0994     |  1.946     1.983   0.0658
w/ SMPL body prior
PaMIR*      |  1.004     1.004   0.0421     |  1.334     1.486   0.0523
ICON        |  1.272     1.139   0.0605     |  1.510     1.653   0.0686
ECON_EX     |  0.837     0.829   0.0331     |  1.409     1.506   0.0521
ECON_IF     |  0.844     0.841   0.0332     |  1.384     1.455   0.0511

Table 2. Perceptual study.

                    |  ICON [67]  |  PIFuHD [61]
Challenging poses   |  0.423      |  0.182
Loose clothing      |  0.213      |  0.457
Fashion images      |  0.335      |  0.519
Figure 7. Qualitative results on in-the-wild images. We show eight examples of reconstructing detailed clothed 3D humans from images with (a) challenging poses and (b) loose clothing. For each example we show the input image along with two views (front and rotated) of the reconstructed 3D human. Our approach is robust to pose variations, generalizes well to loose clothing, and produces detailed geometry.
Table 3. BiNI vs. d-BiNI. Comparison between BiNI and d-BiNI surfaces w.r.t. reconstruction accuracy and optimization speed.

Methods   |  OOD poses (CAPE [50])  |  OOD outfits (Renderpeople [58])  |  Speed
          |  RMSE↓    MAE↓          |  RMSE↓    MAE↓                    |  FPS↑
BiNI [9]  |  27.64    21.11         |  20.61    16.07                   |  0.52
d-BiNI    |  13.43    10.29         |  14.43    11.26                   |  0.69

Table 4. Evaluation for shape completion. Same metrics as Tab. 1; ECON_IF is added as a reference.

Methods       |  OOD poses (CAPE)             |  OOD outfits (Renderpeople)
              |  Chamfer↓  P2S↓    Normals↓   |  Chamfer↓  P2S↓    Normals↓
IF-Nets [12]  |  2.116     1.233   0.075      |  1.883     1.622   0.070
IF-Nets+      |  1.401     1.353   0.056      |  1.477     1.564   0.055
ECON_IF       |  0.844     0.841   0.033      |  1.384     1.455   0.051
Weiyang Liu for their feedback, and Tsvetelina Alexiadis for her help with the perceptual study. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE project).

$$\begin{aligned}
&(A_F \hat{z}_F - b_F)^\top W_F (A_F \hat{z}_F - b_F) \\
&+ (A_B \hat{z}_B - b_B)^\top W_B (A_B \hat{z}_B - b_B) \\
&+ \lambda_d (\hat{z}_F - z_F)^\top M (\hat{z}_F - z_F) \\
&+ \lambda_d (\hat{z}_B - z_B)^\top M (\hat{z}_B - z_B) \\
&+ \lambda_s (\hat{z}_F - \hat{z}_B)^\top S (\hat{z}_F - \hat{z}_B).
\end{aligned} \tag{A.1}$$
back normal maps; $\Omega_z$ is where depth priors are rendered from the SMPL-X mesh, and is identical for the front and back depth maps. $\Omega_n$ is, in general, different from $\Omega_z$. $\partial\Omega_n$ is the silhouette of the normal maps.

Figure A.2. Overview of IF-Nets+. [Panels: Input ($\hat{N}^c_F$, $\hat{N}^c_B$, $Z^b_F$, $Z^b_B$) → Voxelize → Multi-scale voxel encoder (local/global) → Implicit regressor (occupancy 0/1) → Reconstruction.]
necessary condition for the global optimum is obtained by equating the gradient of Eq. (A.4) to 0:

$$(A^\top W A + \lambda_d \widetilde{M} + \lambda_s \widetilde{S})\,\hat{z} = A^\top W b + \lambda_d \widetilde{M} z. \tag{A.5}$$

Equation (A.5) is a large-scale sparse linear system with a symmetric positive definite coefficient matrix. We solve Eq. (A.5) using a CUDA-accelerated sparse conjugate gradient solver with a Jacobi preconditioner³.

³https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.cg.html
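For illustration, here is a sketch of this GPU solve using the CuPy routine referenced in the footnote. The sparse operands $A$, $W$, $\widetilde{M}$, $\widetilde{S}$ are assumed to be pre-assembled CSR matrices following BiNI [9]; the tolerance is our choice, not the paper's.

```python
import cupyx.scipy.sparse.linalg as spla

def solve_eq_a5(A, W, M_t, S_t, b, z, lam_d=1e-4, lam_s=1e-6, tol=1e-5):
    """Solve (A^T W A + lam_d*M~ + lam_s*S~) z_hat = A^T W b + lam_d*M~ z
    (Eq. (A.5)) with Jacobi-preconditioned conjugate gradients."""
    lhs = A.T @ W @ A + lam_d * M_t + lam_s * S_t    # sparse, SPD
    rhs = A.T @ (W @ b) + lam_d * (M_t @ z)
    inv_diag = 1.0 / lhs.diagonal()                  # Jacobi preconditioner
    precond = spla.LinearOperator(lhs.shape, matvec=lambda x: inv_diag * x)
    z_hat, info = spla.cg(lhs, rhs, tol=tol, M=precond)
    assert info == 0, "CG did not converge"
    return z_hat
```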
Hyper-parameters. d-BiNI has three hyper-parameters: $\lambda_d$, $\lambda_s$, and $k$. $\lambda_d$ and $\lambda_s$ are used in the objective function Eq. (3), where they control the influence of the coarse depth prior term Eq. (4) and the silhouette consistency term Eq. (5), respectively. $k$ is used in the original BiNI [9] to control the surface stiffness (see Sup.Mat-A in BiNI [9] for more explanation of $k$). Empirically, we set $\lambda_d = 10^{-4}$, $\lambda_s = 10^{-6}$, and $k = 2$.

Discussion of hyper-parameters. Figure A.3 shows the d-BiNI integration results under different values of $k$. A small $k$ leads to stiffer d-BiNI surfaces, in which discontinuities are not accurately recovered, while a large $k$ softens the surface and introduces redundant discontinuities and noisy artifacts. Figure A.4 shows the effects of $\lambda_d$, which controls how much the d-BiNI surfaces agree with the SMPL-X mesh. A small $\lambda_d$ causes misalignment between the d-BiNI surface and the SMPL-X mesh, which produces stitching artifacts, while an excessively large $\lambda_d$ forces d-BiNI to rely too heavily on SMPL-X, smoothing out the high-frequency details obtained from the normals. Figure A.5 justifies the necessity of the silhouette consistency term: without it, the front and back d-BiNI surfaces intersect each other around the silhouettes, causing "blobby" artifacts after screened Poisson reconstruction [35].
A.3. IF-Nets+

Network structure. As Fig. A.2 shows, similar to IF-Nets [12], IF-Nets+ applies multi-scale voxel 3D CNN encoding to the voxelized d-BiNI surfaces and the SMPL-X surface, namely $F_1^{\text{d-BiNI}}$ and $F_1^{\text{SMPL-X}}$, generating multi-scale deep feature grids that account for both local and global information, $F_1, F_2, \dots, F_n$, with $F_k \in \mathbb{R}^{K \times K \times K \times C_k}$ and $n = 6$. These deep features have decreasing resolution $K = N / 2^{k-1}$, with $N = 256$, and variable channel dimensions $C = \{32, 32, 64, 128, 128, 128\}$. Following Li et al. [14], the positional embedding ($N_{\text{freqs}} = 6$) of the query points is concatenated with the multi-scale deep features to account for high-frequency details. All these features are then fed into an implicit-function regressor, parameterized by a Multi-Layer Perceptron (MLP), to predict the occupancy value of the query point $P$. This MLP regressor is trained with a BCE loss.
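The positional embedding here is the standard sin/cos (Fourier) feature map; a minimal sketch with $N_{\text{freqs}} = 6$ follows. Only the frequency count comes from the paper; the power-of-two frequency scaling is an assumption.

```python
import torch

def positional_embedding(points, n_freqs=6):
    """Map (..., 3) query points to (..., 3 * 2 * n_freqs) sin/cos features,
    injecting the high-frequency signal the MLP regressor consumes."""
    freqs = 2.0 ** torch.arange(n_freqs, device=points.device)  # 1, 2, 4, ...
    angles = points.unsqueeze(-1) * freqs                       # (..., 3, n_freqs)
    emb = torch.cat((angles.sin(), angles.cos()), dim=-1)       # (..., 3, 2*n_freqs)
    return emb.flatten(-2)
```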
Training setting. IF-Nets and IF-Nets+ share the same training setting. The voxelization resolution for both the SMPL-X and d-BiNI surfaces is $256^3$. We use RMSprop as the optimizer, with a learning rate of $10^{-4}$, decayed by a factor of 0.1 every 10 epochs. These networks are trained on an NVIDIA A100 for 20 epochs with a batch size of 48. Following ICON [67], we sample 10,000 points with a mixture of cube-uniform sampling and sampling around the surface, with a standard deviation of 5 cm.

Dataset details. We augment THuman2.0 [70] by (1) rotating the scans every 10 degrees around the yaw axis, generating 525 × 36 = 18,900 samples in total, and (2) randomly selecting a rectangular region of the d-BiNI depth maps and erasing its pixels [77]. In particular, the erasing operation is performed with probability p = 0.8, the aspect ratio of the erased area ranges between 0.3 and 3.3, and its area proportion is chosen from {0.01, 0.05, 0.2}.
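One possible implementation of this erasing augmentation, following Random Erasing [77] with the stated probability, aspect-ratio range, and area fractions, is sketched below; the exact sampling scheme in ECON's code may differ.

```python
import numpy as np

def random_erase(depth, p=0.8, ar_range=(0.3, 3.3),
                 area_fracs=(0.01, 0.05, 0.2), rng=None):
    """With probability p, zero out a random rectangle of the depth map whose
    area fraction is drawn from area_fracs and aspect ratio from ar_range."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return depth
    h_img, w_img = depth.shape
    area = rng.choice(area_fracs) * h_img * w_img
    aspect = rng.uniform(*ar_range)                    # height / width
    h = min(h_img, int(round(np.sqrt(area * aspect))))
    w = min(w_img, int(round(np.sqrt(area / aspect))))
    top = rng.integers(0, h_img - h + 1)
    left = rng.integers(0, w_img - w + 1)
    out = depth.copy()
    out[top:top + h, left:left + w] = 0.0              # erase pixels
    return out
```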
B. Qualitative results

Figures A.6 to A.8 show more comparisons used in our perceptual study, containing results on in-the-wild images with challenging poses, loose clothing, and standard fashion poses, respectively. For each image, we display the results obtained by ECON, PIFuHD [61], and ICON [67] from the top to the bottom row. In each row, we show normal maps rendered in four views {0°, 90°, 180°, 270°}. The video on our website shows more reconstructions with a rotating virtual camera.
Figure A.3. The effects of the hyper-parameter k on d-BiNI results. [Panels: normal prediction; k = 1, 2 (ours), 4, 10, 100.] k controls the stiffness of the target surface [9]. A smaller k leads to smooth d-BiNI surfaces, while a large k introduces unnecessary discontinuities and noise artifacts.
Figure A.4. The effects of the hyperparameter λd on d-BiNI results. λd controls how much d-BiNI surfaces agree with the SMPL-X
mesh. A small λd causes a misalignment between the d-BiNI surface and the SMPL-X mesh, thus it produces stitching artifacts. An
excessively large λd enforces d-BiNI to rely too heavily on SMPL-X, thus it smooths out the high-frequency details obtained from normals.
Figure A.5. Necessity of silhouette consistency. [Panels: image; front and back normal predictions; w/o vs. w/ silhouette consistency; blobby artifacts after screened Poisson.] This term can be regarded as the mediator between the front and back d-BiNI surfaces, preventing them from intersecting. Such intersections cause blobby artifacts after screened Poisson reconstruction [35].
Figure A.6. Results on in-the-wild images with challenging poses. For each example (gray box) the format is as follows. Top → bottom: ECON, PIFuHD [61], and ICON [67]. Left → right: virtual camera rotated by {0°, 90°, 180°, 270°}. Zoom in to see 3D details.
Figure A.7. Results on in-the-wild images with loose clothing. For each example (gray box) the format is as follows. Top → bottom: ECON, PIFuHD [61], and ICON [67]. Left → right: virtual camera rotated by {0°, 90°, 180°, 270°}. Zoom in to see 3D details.
Figure A.8. Results on in-the-wild fashion images. For each example (gray box) the format is as follows. Top → bottom: ECON, PIFuHD [61], and ICON [67]. Left → right: virtual camera rotated by {0°, 90°, 180°, 270°}. Zoom in to see 3D details.
References

[1] HumanAlloy. humanalloy.com, 2018.
[2] Thiemo Alldieck, Marcus A. Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In Computer Vision and Pattern Recognition (CVPR), 2019.
[3] Thiemo Alldieck, Marcus A. Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed human avatars from monocular video. In International Conference on 3D Vision (3DV), 2018.
[4] Thiemo Alldieck, Marcus A. Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3D people models. In Computer Vision and Pattern Recognition (CVPR), 2018.
[5] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus A. Magnor. Tex2Shape: Detailed full human body geometry from a single image. In International Conference on Computer Vision (ICCV), 2019.
[6] AXYZ. secure.axyz-design.com, 2018.
[7] Alexander W. Bergman, Petr Kellnhofer, Yifan Wang, Eric R. Chan, David B. Lindell, and Gordon Wetzstein. Generative neural articulated radiance fields. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[8] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. Multi-Garment Net: Learning to dress 3D people from images. In International Conference on Computer Vision (ICCV), 2019.
[9] Xu Cao, Hiroaki Santo, Boxin Shi, Fumio Okura, and Yasuyuki Matsushita. Bilateral normal integration. In European Conference on Computer Vision (ECCV), 2022.
[10] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2022.
[11] Xu Chen, Tianjian Jiang, Jie Song, Jinlong Yang, Michael J. Black, Andreas Geiger, and Otmar Hilliges. gDNA: Towards generative detailed neural avatars. In Computer Vision and Pattern Recognition (CVPR), 2022.
[12] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Implicit functions in feature space for 3D shape reconstruction and completion. In Computer Vision and Pattern Recognition (CVPR), 2020.
[13] Julian Chibane, Aymen Mir, and Gerard Pons-Moll. Neural unsigned distance fields for implicit function learning. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[14] Vasileios Choutas, Lea Müller, Chun-Hao P. Huang, Siyu Tang, Dimitrios Tzionas, and Michael J. Black. Accurate 3D body shape regression via linguistic attributes and anthropometric measurements. In Computer Vision and Pattern Recognition (CVPR), 2022.
[15] Enric Corona, Albert Pumarola, Guillem Alenyà, Gerard Pons-Moll, and Francesc Moreno-Noguer. SMPLicit: Topology-aware generative model for clothed people. In Computer Vision and Pattern Recognition (CVPR), 2021.
[16] Junting Dong, Qi Fang, Yudong Guo, Sida Peng, Qing Shuai, Xiaowei Zhou, and Hujun Bao. TotalSelfScan: Learning full-body avatars from self-portrait videos of faces, hands, and bodies. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[17] Zijian Dong, Chen Guo, Jie Song, Xu Chen, Andreas Geiger, and Otmar Hilliges. PINA: Learning a personalized implicit neural avatar from a single RGB-D video sequence. In Computer Vision and Pattern Recognition (CVPR), 2022.
[18] Yao Feng, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Collaborative regression of expressive bodies using moderation. In International Conference on 3D Vision (3DV), 2021.
[19] Yao Feng, Jinlong Yang, Marc Pollefeys, Michael J. Black, and Timo Bolkart. Capturing and animation of body and clothing from monocular video. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2022.
[20] Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen Change Loy, Wayne Wu, and Ziwei Liu. StyleGAN-Human: A data-centric odyssey of human generation. In European Conference on Computer Vision (ECCV), 2022.
[21] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3D human shape estimation from single images. In International Conference on Computer Vision (ICCV), 2019.
[22] Yuying Ge, Ruimao Zhang, Lingyun Wu, Xiaogang Wang, Xiaoou Tang, and Ping Luo. A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Computer Vision and Pattern Recognition (CVPR), 2019.
[23] Artur Grigorev, Karim Iskakov, Anastasia Ianina, Renat Bashirov, Ilya Zakharkin, Alexander Vakhitov, and Victor Lempitsky. StylePeople: A generative model of fullbody human avatars. In Computer Vision and Pattern Recognition (CVPR), 2021.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), 2017.
[25] Tong He, John P. Collomosse, Hailin Jin, and Stefano Soatto. Geo-PIFu: Geometry and pixel aligned implicit functions for single-view human reconstruction. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[26] Tong He, Yuanlu Xu, Shunsuke Saito, Stefano Soatto, and Tony Tung. ARCH++: Animation-ready clothed human reconstruction revisited. In International Conference on Computer Vision (ICCV), 2021.
[27] Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, and Ziwei Liu. EVA3D: Compositional 3D human generation from 2D image collections. arXiv:2210.04888, 2022.
[28] Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. ARCH: Animatable reconstruction of clothed humans. In Computer Vision and Pattern Recognition (CVPR), 2020.
[29] Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Computer Vision and Pattern Recognition (CVPR), 2021.
[30] Boyi Jiang, Yang Hong, Hujun Bao, and Juyong Zhang. SelfRecon: Self reconstruction your digital avatar from monocular video. In Computer Vision and Pattern Recognition (CVPR), 2022.
[31] Boyi Jiang, Juyong Zhang, Yang Hong, Jinhao Luo, Ligang Liu, and Hujun Bao. BCNet: Learning body and cloth shape from a single image. In European Conference on Computer Vision (ECCV), 2020.
[32] Suyi Jiang, Haoran Jiang, Ziyu Wang, Haimin Luo, Wenzheng Chen, and Lan Xu. HumanGen: Generating human radiance fields with explicit priors. arXiv:2212.05321, 2022.
[33] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In Computer Vision and Pattern Recognition (CVPR), 2018.
[34] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), 2018.
[35] Michael Kazhdan and Hugues Hoppe. Screened Poisson surface reconstruction. Transactions on Graphics (TOG), 2013.
[36] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. In Computer Vision and Pattern Recognition (CVPR), 2020.
[37] Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. PARE: Part attention regressor for 3D human body estimation. In International Conference on Computer Vision (ICCV), 2021.
[38] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In International Conference on Computer Vision (ICCV), 2019.
[39] Verica Lazova, Eldar Insafutdinov, and Gerard Pons-Moll. 360-degree textures of people in clothing from a single image. In International Conference on 3D Vision (3DV), 2019.
[40] Binh Le and Zhigang Deng. Smooth skinning decomposition with rigid bones. Transactions on Graphics (TOG), 2012.
[41] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. HybrIK: A hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation. In Computer Vision and Pattern Recognition (CVPR), 2021.
[42] Lei Li, Zhizheng Liu, Weining Ren, Liudi Yang, Fangjinhua Wang, Marc Pollefeys, and Songyou Peng. 3D textured shape recovery with learned geometric priors. arXiv:2209.03254, 2022.
[43] Ruilong Li, Kyle Olszewski, Yuliang Xiu, Shunsuke Saito, Zeng Huang, and Hao Li. Volumetric human teleportation. In SIGGRAPH Real-Time Live, 2020.
[44] Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olszewski, and Hao Li. Monocular real-time volumetric performance capture. In European Conference on Computer Vision (ECCV), 2020.
[45] Zhe Li, Tao Yu, Chuanyu Pan, Zerong Zheng, and Yebin Liu. Robust 3D self-portraits in seconds. In Computer Vision and Pattern Recognition (CVPR), 2020.
[46] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Computer Vision and Pattern Recognition (CVPR), 2016.
[47] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. Transactions on Graphics (TOG), 2015.
[48] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1987.
[49] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. MediaPipe: A framework for building perception pipelines. arXiv:1906.08172, 2019.
[50] Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael J. Black. Learning to dress 3D people in generative clothing. In Computer Vision and Pattern Recognition (CVPR), 2020.
[51] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Computer Vision and Pattern Recognition (CVPR), 2019.
[52] Gyeongsik Moon, Hyeongjin Nam, Takaaki Shiratori, and Kyoung Mu Lee. 3D clothed human reconstruction in the wild. In European Conference on Computer Vision (ECCV), 2022.
[53] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Unsupervised learning of efficient geometry-aware neural articulated representations. In European Conference on Computer Vision (ECCV), 2022.
[54] Jeong Joon Park, Peter Florence, Julian Straub, Richard A. Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Computer Vision and Pattern Recognition (CVPR), 2019.
[55] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), 2019.
[56] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J. Black. ClothCap: Seamless 4D clothing capture and retargeting. Transactions on Graphics (TOG), 2017.
[57] Yvain Quéau, Jean-Denis Durou, and Jean-François Aujol. Normal integration: A survey. Journal of Mathematical Imaging and Vision, 2018.
[58] RenderPeople. renderpeople.com, 2018.
[59] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. Transactions on Graphics (TOG), 2017.
[60] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Hao Li, and Angjoo Kanazawa. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In International Conference on Computer Vision (ICCV), 2019.
[61] Shunsuke Saito, Tomas Simon, Jason M. Saragih, and Hanbyul Joo. PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Computer Vision and Pattern Recognition (CVPR), 2020.
[62] David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. FACSIMILE: Fast and accurate scans from an image in less than a second. In International Conference on Computer Vision (ICCV), 2019.
[63] Yu Sun, Wu Liu, Qian Bao, Yili Fu, Tao Mei, and Michael J. Black. Putting people in their place: Monocular regression of 3D people in depth. In Computer Vision and Pattern Recognition (CVPR), 2022.
[64] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
[65] Donglai Xiang, Fabian Prada, Chenglei Wu, and Jessica K. Hodgins. MonoClothCap: Towards temporally coherent clothing capture from monocular RGB video. In International Conference on 3D Vision (3DV), 2020.
[66] Yongqin Xiang, Julian Chibane, Bharat Lal Bhatnagar, Bernt Schiele, Zeynep Akata, and Gerard Pons-Moll. Any-Shot GIN: Generalizing implicit networks for reconstructing novel classes. In International Conference on 3D Vision (3DV), 2022.
[67] Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J. Black. ICON: Implicit Clothed humans Obtained from Normals. In Computer Vision and Pattern Recognition (CVPR), 2022.
[68] Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. In Computer Vision and Pattern Recognition (CVPR), 2020.
[69] Ze Yang, Shenlong Wang, Sivabalan Manivasagam, Zeng Huang, Wei-Chiu Ma, Xinchen Yan, Ersin Yumer, and Raquel Urtasun. S3: Neural shape, skeleton, and skinning fields for 3D human modeling. In Computer Vision and Pattern Recognition (CVPR), 2021.
[70] Tao Yu, Zerong Zheng, Kaiwen Guo, Pengpeng Liu, Qionghai Dai, and Yebin Liu. Function4D: Real-time human volumetric capture from very sparse consumer RGBD sensors. In Computer Vision and Pattern Recognition (CVPR), 2021.
[71] Ilya Zakharkin, Kirill Mazur, Artur Grigorev, and Victor Lempitsky. Point-based modeling of human clothing. In International Conference on Computer Vision (ICCV), 2021.
[72] Hongwen Zhang, Yating Tian, Yuxiang Zhang, Mengcheng Li, Liang An, Zhenan Sun, and Yebin Liu. PyMAF-X: Towards well-aligned full-body model regression from monocular images. arXiv:2207.06400, 2022.
[73] Jianfeng Zhang, Zihang Jiang, Dingdong Yang, Hongyi Xu, Yichun Shi, Guoxian Song, Zhongcong Xu, Xinchao Wang, and Jiashi Feng. AvatarGen: A 3D generative model for animatable human avatars. arXiv:2208.00561, 2022.
[74] Yang Zheng, Ruizhi Shao, Yuxiang Zhang, Tao Yu, Zerong Zheng, Qionghai Dai, and Yebin Liu. DeepMultiCap: Performance capture of multiple characters using sparse multiview cameras. In International Conference on Computer Vision (ICCV), 2021.
[75] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. PaMIR: Parametric model-conditioned implicit representation for image-based human reconstruction. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
[76] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. DeepHuman: 3D human reconstruction from a single image. In International Conference on Computer Vision (ICCV), 2019.
[77] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI Conference on Artificial Intelligence, 2020.
[78] Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In Computer Vision and Pattern Recognition (CVPR), 2019.