Banmo: Building Animatable 3D Neural Models From Many Casual Videos

BANMo: Building Animatable 3D Neural Models from Many Casual Videos
Gengshan Yang2 * Minh Vo3 Natalia Neverova1 Deva Ramanan2 Andrea Vedaldi1 Hanbyul Joo1
1
Meta AI 2 Carnegie Mellon University 3 Meta Reality Labs
arXiv:2112.12761v2 [cs.CV] 24 Dec 2021
Color: Skinning weights Pose 1
BANMo Pose 2
Bone
Casual Videos of An Object Canonical Space View 1 View 2 Canonical Embeddings
Figure 1. Given multiple casual videos capturing a deformable object, BANMo reconstructs an animatable 3D model, including an im-
plicit canonical 3D shape, appearance, skinning weights, and time-varying articulations, without pre-defined shape templates or registered
cameras. Left: Input videos; Middle: 3D shape, bones, and skinning weights (visualized as surface colors) in the canonical space; Right:
Posed reconstruction at each time instance with color and canonical embeddings (correspondences are shown as the same colors).
Abstract can be self-supervised with cycle consistency. On real and

synthetic datasets, BANMo shows higher-fidelity 3D recon-
Prior work for articulated 3D shape reconstruction of- structions than prior works for humans and animals, with
ten relies on specialized sensors (e.g., synchronized multi- the ability to render realistic images from novel viewpoints
camera systems), or pre-built 3D deformable models (e.g., and poses. Project webpage: banmo-www.github.io.
SMAL or SMPL). Such methods are not able to scale to
diverse sets of objects in the wild. We present BANMo,
a method that requires neither a specialized sensor nor a
pre-defined template shape. BANMo builds high-fidelity, 1. Introduction
articulated 3D “models” (including shape and animatable
skinning weights) from many monocular casual videos in We are interested in developing tools that can reconstruct
a differentiable rendering framework. While the use of accurate and animatable models of 3D objects from casu-
many videos provides more coverage of camera views and ally collected videos. A representative application is con-
object articulations, they introduce significant challenges tent creation for virtual and augmented reality, where the
in establishing correspondence across scenes with differ- goal is to 3D-ify images and videos captured by users for
ent backgrounds, illumination conditions, etc. Our key in- consumption in a 3D space or creating animatable assets
sight is to merge three schools of thought; (1) classic de- such as avatars. For rigid scenes, traditional Structure from
formable shape models that make use of articulated bones Motion (SfM) approaches can be used to leverage large col-
and blend skinning, (2) volumetric neural radiance fields lection of uncontrolled images, such as images downloaded
(NeRFs) that are amenable to gradient-based optimization, form the web, to build accurate 3D models of landmarks and
and (3) canonical embeddings that generate correspon- entire cities [1, 41, 42]. However, these approaches do not
dences between pixels and an articulated model. We in- generalize to deformable objects such as family members,
troduce neural blend skinning models that allow for dif- friends or pets, which are often the focus of user content.
ferentiable and invertible articulated deformations. When We are thus interested in reconstructing 3D deformable
combined with canonical embeddings, such models allow objects from casually collected videos. However, individ-
us to establish dense correspondences across videos that ual videos may not contain sufficient information to obtain
good reconstruction of a given subject. Fortunately, we can
* Work done when interning at Meta AI expect that users may collect several videos of the same sub-
1
jects, such as filming a family member over the span of sev- an active sampling strategy for inverse rendering (Sec. 3.4).
eral months or years. In this case, we wish our system to
pool information from all available videos into a single 3D 2. Related work
model, bridging any time discontinuity.
Human and animal body models. A large body of work in
In this paper, we present BANMo, a Builder of
3D human and animal reconstruction uses parametric shape
Animatable 3D Neural Models from multiple casual RGB
models [23, 33, 53, 63, 64], which are built from registered
videos. By consolidating the 2D cues from thousands of
3D scans of real humans or toy animals, and serve to re-
images into a fixed canonical space, BANMo learns a high-
cover 3D shapes given a single image and 2D annotations
fidelity neural implicit model for appearance, 3D shape, and
or predictions (2D keypoints and silhouettes) at test time
articulations of the target non-rigid object. The articulation
[2, 3, 13, 13, 62]. Although parametric body models achieve
of the output model of BANMo is expressed by a neural
great success in reconstructing categories for which large
blend skinning, similar to [5, 56, 57], making the output an-
amounts of ground-truth 3D data are available (mostly in
imatable by manipulating bone transformations. As shown
the case of human reconstruction), it is challenging to apply
in NRSfM [4], reconstructing a freely moving non-rigid ob-
the same methodology to categories with limited 3D data,
ject from monocular video is a highly under-constrained
such as animals and humans in diverse sets of clothing.
task where epipolar constraints are not directly applicable.
Category reconstruction from image/video collections.
In our approach, we address three core challenges: (1) how
A number of recent methods build deformable 3D models
to represent 3D appearance and deformation of the target
of object categories from images or videos with weak 2D
model in a canonical space; (2) how to find the mapping
annotations, such as keypoints, object silhouettes, and op-
between canonical space to each individual frame; (3) how
tical flow, obtained from human annotators or predicted by
to find 2D correspondences across images under view and
off-the-shelf models [6,10,14,18,19,52,60]. Such methods
light changes, and object deformations.
often rely on a coarse shape template [16, 47], and are not
Concretely, we utilize neural implicit functions [27] to able to recover fine-grained details or large articulations.
represent color and 3D surface in the canonical space. This
Category-agnostic video shape reconstruction. Non-
representation enables higher-fidelity 3D geometry recon-
rigid structure from motion (NRSfM) methods [4, 7, 15,
struction compared to approaches based on 3D meshes [56,
17, 39] reconstruct non-rigid 3D shapes from a set of 2D
57]. The use of neural blending skinning in BANMo pro-
point trajectories in a class-agnostic way. However, due
vides a way to constrain the deformation space of the tar-
to difficulties in obtaining accurate long-range correspon-
get object, allowing better handling of pose pose variations
dences [38, 45], they do not work well for videos in the
and deformations with unknown camera parameters, com-
wild. Recent efforts such as LASR and ViSER [56, 57] re-
pared to dynamic NeRF approaches [5, 20, 31, 36]. We also
construct articulated shapes from a monocular video with
present a module for fine-grained registration between pix-
differentiable rendering. As our results show, they may still
els and the canonical space by matching to an implicit fea-
produce blurry geometry and unrealistic articulations.
ture volume. To jointly optimize over a large number of
Neural radiance fields. Prior works on NeRF optimize a
video frames with a manageable computational cost, we ac-
continuous scene function for novel view synthesis given
tively sample pixels locations based on uncertainty. In a
a set of images, often assuming the scene is rigid and
nutshell, BANMo presents a way to merge the recent non-
camera poses can be accurately registered to the back-
rigid object reconstruction approaches [56,57] in a dynamic
ground [9, 21, 25–27, 51]. To extend NeRF to dynamic
NeRF framework [5, 20, 31, 36], to achieve higher-fidelity
scenes, recent works introduce additional functions to de-
non-rigid object reconstruction. We show experimentally
form observed points to a canonical space or over time
that BANMo produces higher-fidelity 3D shape details than
[20, 31, 32, 36, 46, 49]. However, they heavily rely on back-
previous state-of-the art approaches [57], by taking better
ground registration, and fail when the motion between ob-
advantage of the large number of frames in multiple videos.
jects and background is large. Moreover, the deformations
Our contributions are summarized as follows: (1) cannot be explicitly controlled by user inputs. Similar to our
BANMo allows for high-fidelity reconstruction of animat- goal, some recent works [22, 30, 34, 35, 43] produce pose-
able 3D models from casual videos containing thousands of controllable NeRFs, but they rely on a human body model,
frames; (2) To handle large articulated body motions, we or synchronized multi-view video inputs.
propose a novel neural blend skinning model (Sec. 3.2);
(3) To register pixels from different frames, we introduce 3. Method
a method for canonical embedding matching and learning
(Sec. 3.3); (4) To robustify the optimization, we propose an We model the deformable object in a canonical time-
pipeline for rough root body pose initialization (Sec. 3.4); invariant space, i.e. the “rest” body pose space, that can
(5) Finally, to reconstruct detailed geometry, we introduce be transformed to the “articulated” pose in the camera space
2
Volume Rendering Implicit Representations (Sec. 3.1)
(Sec 3.1) Backward Warp. (Sec. 3.2)
<latexit sha1_base64="by2EXrk8ymnCHE/bC17V3YYH0CU=">AAAB7XicbVDLSgNBEOyNrxhfqx69DAbBU9gVQY8BLx4jmIckS5idzCZj5rHMzAphyT948aCIV//Hm3/jJNmDJhY0FFXddHfFKWfGBsG3V1pb39jcKm9Xdnb39g/8w6OWUZkmtEkUV7oTY0M5k7RpmeW0k2qKRcxpOx7fzPz2E9WGKXlvJymNBB5KljCCrZNaPcOGAvf9alAL5kCrJCxIFQo0+v5Xb6BIJqi0hGNjumGQ2ijH2jLC6bTSywxNMRnjIe06KrGgJsrn107RmVMGKFHalbRorv6eyLEwZiJi1ymwHZllbyb+53Uzm1xHOZNpZqkki0VJxpFVaPY6GjBNieUTRzDRzN2KyAhrTKwLqOJCCJdfXiWti1oY1MK7y2r9oYijDCdwCucQwhXU4RYa0AQCj/AMr/DmKe/Fe/c+Fq0lr5g5hj/wPn8Ao0qPOg==</latexit>
Xti
<latexit sha1_base64="76w10YEtETzUXdaT0wTZt0xBig8=">AAAB9XicbVDLSgMxFL1TX7W+qi7dBIvgqsyIoMuCG5cV7EPaacmkmTY0kxmSO0oZ+h9uXCji1n9x59+YtrPQ1gOBwzn3ck9OkEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmjjVjDdYLGPdDqjhUijeQIGStxPNaRRI3grGNzO/9ci1EbG6x0nC/YgOlQgFo2ilXjeiOArCrD3tYV/0yxW36s5BVomXkwrkqPfLX91BzNKIK2SSGtPx3AT9jGoUTPJpqZsanlA2pkPesVTRiBs/m6eekjOrDEgYa/sUkrn6eyOjkTGTKLCTs5Rm2ZuJ/3mdFMNrPxMqSZErtjgUppJgTGYVkIHQnKGcWEKZFjYrYSOqKUNbVMmW4C1/eZU0L6qeW/XuLiu1h7yOIpzAKZyDB1dQg1uoQwMYaHiGV3hznpwX5935WIwWnHznGP7A+fwB59aS1Q==</latexit>
sha1_base64="hP+6LrUf2d3tZaldqaQQvEKMXyw=">AAAB2XicbZDNSgMxFIXv1L86Vq1rN8EiuCozbnQpuHFZwbZCO5RM5k4bmskMyR2hDH0BF25EfC93vo3pz0JbDwQ+zknIvSculLQUBN9ebWd3b/+gfugfNfzjk9Nmo2fz0gjsilzl5jnmFpXU2CVJCp8LgzyLFfbj6f0i77+gsTLXTzQrMMr4WMtUCk7O6oyaraAdLMW2IVxDC9YaNb+GSS7KDDUJxa0dhEFBUcUNSaFw7g9LiwUXUz7GgUPNM7RRtRxzzi6dk7A0N+5oYkv394uKZ9bOstjdzDhN7Ga2MP/LBiWlt1EldVESarH6KC0Vo5wtdmaJNChIzRxwYaSblYkJN1yQa8Z3HYSbG29D77odBu3wMYA6nMMFXEEIN3AHD9CBLghI4BXevYn35n2suqp569LO4I+8zx84xIo4</latexit>
sha1_base64="ALY/6c+yJ5gA/Lj+5R3BD944h3M=">AAAB6nicbVBNSwMxFHxbv2qtWr16CRbBU9n1okfBi8cK9gPabcmm2TY0yS7JW6Us/R9ePCjiD/LmvzHb9qCtA4Fh5j3eZKJUCou+/+2VtrZ3dvfK+5WD6uHRce2k2rZJZhhvsUQmphtRy6XQvIUCJe+mhlMVSd6JpneF33nixopEP+Is5aGiYy1iwSg6adBXFCdRnHfnAxyKYa3uN/wFyCYJVqQOKzSHta/+KGGZ4hqZpNb2Aj/FMKcGBZN8XulnlqeUTemY9xzVVHEb5ovUc3LhlBGJE+OeRrJQf2/kVFk7U5GbLFLada8Q//N6GcY3YS50miHXbHkoziTBhBQVkJEwnKGcOUKZES4rYRNqKENXVMWVEKx/eZO0rxqB3wgefCjDGZzDJQRwDbdwD01oAQMDL/AG796z9+p9LOsqeaveTuEPvM8fj0WRXg==</latexit>
sha1_base64="J0bUOso/uejqONF+gr9Z1NwYuBo=">AAAB9XicbVC7TsMwFL3hWcqrwMhiUSExVQkLjJVYGItEH6hNK8d1WquOE9k3oCrqf7AwgBAr/8LG3+C0GaDlSJaOzrlX9/gEiRQGXffbWVvf2NzaLu2Ud/f2Dw4rR8ctE6ea8SaLZaw7ATVcCsWbKFDyTqI5jQLJ28HkJvfbj1wbEat7nCbcj+hIiVAwilbq9yKK4yDMOrM+DsSgUnVr7hxklXgFqUKBxqDy1RvGLI24QiapMV3PTdDPqEbBJJ+Ve6nhCWUTOuJdSxWNuPGzeeoZObfKkISxtk8hmau/NzIaGTONAjuZpzTLXi7+53VTDK/9TKgkRa7Y4lCYSoIxySsgQ6E5Qzm1hDItbFbCxlRThraosi3BW/7yKmld1jy35t251fpDUUcJTuEMLsCDK6jDLTSgCQw0PMMrvDlPzovz7nwsRtecYucE/sD5/AHmlpLR</latexit>
Forward Warp. (Sec. 3.2)

X⇤i
<latexit sha1_base64="wKKE7yVAX2LXfl0fCkkuip40484=">AAAB9XicbVDLSgMxFL3js9ZX1aWbYBHERZkRQZcFNy4r2Ie005JJM21oJjMkd5Qy9D/cuFDErf/izr8xbWehrQcCh3Pu5Z6cIJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwYeJUM15nsYx1K6CGS6F4HQVK3ko0p1EgeTMY3Uz95iPXRsTqHscJ9yM6UCIUjKKVup2I4jAIs9ake94TvVLZrbgzkGXi5aQMOWq90lenH7M04gqZpMa0PTdBP6MaBZN8UuykhieUjeiAty1VNOLGz2apJ+TUKn0Sxto+hWSm/t7IaGTMOArs5DSlWfSm4n9eO8Xw2s+ESlLkis0PhakkGJNpBaQvNGcox5ZQpoXNStiQasrQFlW0JXiLX14mjYuK51a8u8ty9SGvowDHcAJn4MEVVOEWalAHBhqe4RXenCfnxXl3PuajK06+cwR/4Hz+AHcakos=</latexit>
(X⇤i )
color
density
<latexit sha1_base64="0Q3PNdwUTyjvy3/Zd46cnh2h4C0=">AAACAHicbVDLSsNAFJ34rPUVdeHCzWARqouSiKDLghuXFexDmhgm00k7dGYSZiZCCdn4K25cKOLWz3Dn3zhps9DWAxcO59zLvfeECaNKO863tbS8srq2Xtmobm5t7+zae/sdFacSkzaOWSx7IVKEUUHammpGeokkiIeMdMPxdeF3H4lUNBZ3epIQn6OhoBHFSBspsA89RYccwbrHkR6FUdbLA/pwdhrYNafhTAEXiVuSGijRCuwvbxDjlBOhMUNK9V0n0X6GpKaYkbzqpYokCI/RkPQNFYgT5WfTB3J4YpQBjGJpSmg4VX9PZIgrNeGh6SzOVPNeIf7n9VMdXfkZFUmqicCzRVHKoI5hkQYcUEmwZhNDEJbU3ArxCEmEtcmsakJw519eJJ3zhus03NuLWvO+jKMCjsAxqAMXXIImuAEt0AYY5OAZvII368l6sd6tj1nrklXOHIA/sD5/AM9glfQ=</latexit>
(X⇤i ) canonical
2D flow
embedding
<latexit sha1_base64="zLTJ8G65T9kj2UALAYNiRrSYprA=">AAACC3icbVC7TsMwFHXKq5RXgJHFaoVUGKoEIcFYiYWxSPSBmhA5jtNadeLIdpCqKDsLv8LCAEKs/AAbf4PTZoCWK1k+Oude3XOPnzAqlWV9G5WV1bX1jepmbWt7Z3fP3D/oSZ4KTLqYMy4GPpKE0Zh0FVWMDBJBUOQz0vcnV4XefyBCUh7fqmlC3AiNYhpSjJSmPLPu+JwFchrpL3MSSXPYdCKkxn6YDXKP3p+eeGbDalmzgsvALkEDlNXxzC8n4DiNSKwwQ1IObStRboaEopiRvOakkiQIT9CIDDWMUUSkm81uyeGxZgIYcqFfrOCM/T2RoUgWdnVnYVMuagX5nzZMVXjpZjROUkViPF8UpgwqDotgYEAFwYpNNUBYUO0V4jESCCsdX02HYC+evAx6Zy3batk35432XRlHFRyBOmgCG1yANrgGHdAFGDyCZ/AK3own48V4Nz7mrRWjnDkEf8r4/AEo7Jso</latexit>
2D color c(xt )
2D opacity o(xt )
<latexit sha1_base64="/NxVbjiSFkKRfDP6dqe151Iuji8=">AAAB+HicbVDLSgNBEOz1GeMjqx69DAYhXsKuCHoMePEYwTwkiWF2MpsMmX0w0yvGJV/ixYMiXv0Ub/6Ns8keNLFgoKjqpmvKi6XQ6Djf1srq2vrGZmGruL2zu1ey9w+aOkoU4w0WyUi1Paq5FCFvoEDJ27HiNPAkb3njq8xvPXClRRTe4iTmvYAOQ+ELRtFIfbvEKt2A4sjz08fpPZ727bJTdWYgy8TNSRly1Pv2V3cQsSTgITJJte64Toy9lCoUTPJpsZtoHlM2pkPeMTSkAde9dBZ8Sk6MMiB+pMwLkczU3xspDbSeBJ6ZzELqRS8T//M6CfqXvVSEcYI8ZPNDfiIJRiRrgQyE4gzlxBDKlDBZCRtRRRmaroqmBHfxy8ukeVZ1nap7c16u3eV1FOAIjqECLlxADa6hDg1gkMAzvMKb9WS9WO/Wx3x0xcp3DuEPrM8fmWuTHA==</latexit>
Camera Space
<latexit sha1_base64="NydBMU7obeIRbi2iaJm1iilQleY=">AAAB+HicbVDLSgMxFL3js9ZHR126CRahbsqMCLosuHFZwT6krSWTZtrQTDIkGbEO/RI3LhRx66e482/MtLPQ1gOBwzn3ck9OEHOmjed9Oyura+sbm4Wt4vbO7l7J3T9oapkoQhtEcqnaAdaUM0EbhhlO27GiOAo4bQXjq8xvPVClmRS3ZhLTXoSHgoWMYGOlvluSlW6EzSgI08fpvTntu2Wv6s2AlomfkzLkqPfdr+5AkiSiwhCOte74Xmx6KVaGEU6nxW6iaYzJGA9px1KBI6p76Sz4FJ1YZYBCqewTBs3U3xspjrSeRIGdzELqRS8T//M6iQkveykTcWKoIPNDYcKRkShrAQ2YosTwiSWYKGazIjLCChNjuyraEvzFLy+T5lnV96r+zXm5dpfXUYAjOIYK+HABNbiGOjSAQALP8ApvzpPz4rw7H/PRFSffOYQ/cD5/AKxDkyg=</latexit>
xt
<latexit sha1_base64="fglfFNNfFJ1LSzytA6p8EIsF9U4=">AAAB9XicbVDLSsNAFL3xWeur6tLNYBFclUQEXRbcuKxgH9KmZTKdtEMnD2Zu1BLyH25cKOLWf3Hn3zhps9DWAwOHc+7lnjleLIVG2/62VlbX1jc2S1vl7Z3dvf3KwWFLR4livMkiGamORzWXIuRNFCh5J1acBp7kbW9ynfvtB660iMI7nMbcDegoFL5gFI3U7wUUx56fPmX9FLNBpWrX7BnIMnEKUoUCjUHlqzeMWBLwEJmkWncdO0Y3pQoFkzwr9xLNY8omdMS7hoY04NpNZ6kzcmqUIfEjZV6IZKb+3khpoPU08MxknlIvern4n9dN0L9yUxHGCfKQzQ/5iSQYkbwCMhSKM5RTQyhTwmQlbEwVZWiKKpsSnMUvL5PWec2xa87tRbV+X9RRgmM4gTNw4BLqcAMNaAIDBc/wCm/Wo/VivVsf89EVq9g5gj+wPn8AYeiTJQ==</latexit>
Registration via Canonical Embeddings
(Sec 3.3)
An image at time t Canonical Space

Figure 2. Method overview. BANMo optimizes a set of shape and deformation parameters (Sec. 3.1) that describe the video observations
in pixel colors, silhouettes, optical flow, and higher-order features descriptors, based on a differentiable volume rendering framework.
BANMo uses a neural blend skinning model (Sec. 3.2) to transform 3D points between the camera space and the canonical space, enabling
handling large deformations. To register pixels across videos, BANMo jointly optimizes an implicit canonical embedding (CE) (Sec. 3.3).
at each time instance with forward mappings, and transform The shape is given by MLPSDF , computing the Signed-
back with backward mappings. We use implicit functions to Distance Function (SDF) of a point to the surface. For ren-
represent the 3D shape, color, and dense semantic embed- dering, we follow [50, 58] and convert the SDF to a density
dings of the object. Our neural 3D model can be deformed value σ ∈ [0, 1] using Γβ (·), the cumulative distribution
and rendered into images at each time instance via differ- function of the Laplace distribution with zero mean and β
entiable volume rendering, and optimized to ensure consis- scale. β is a single learnable parameter that controls the
tency between the rendered images and multiple cues in the solidness of the object, approaching zero for solid objects.
observed images, including color, silhouette, optical flow, Compared to learning σ directly, this provides a principled
and 2D pixel feature embeddings. We refer readers to an way of extracting surface as the zero level-set of the SDF.
overview in Fig. 2 and a list of notations in the supplement. Finally, the network ψ maps points to a feature descrip-
We employ neural blend skinning to express object ar- tor (or canonical embedding) that can be matched by pixels
ticulations similarly to [16, 56] but modify it for im- from different viewpoints, enabling long-range correspon-
plicit surface representations rather than meshes. Our self- dence across frames and videos. This feature can be in-
supervised semantic feature embedding produces dense pix- terpreted as a variant of Continuous Surface Embeddings
elwise correspondences across frames of different videos, (CSE) [28] but defined volumetrically and fine-tuned in a
which is critical for optimization on large video collections. self-supervised manner (described in Sec 3.3).
Space-time warping model. We consider a pair of time-
3.1. Shape and Appearance Model
dependent warping functions: forward warping function
We first represent shape and appearance of deformable W t,→ : X∗ → Xt mapping canonical location X∗ to
objects in a canonical time-invariant rest pose space and camera space location Xt at current time and the backward
then model time-dependent deformations by a neural blend warping function W t,← : Xt → X∗ for inverse mapping.
skinning function (Sec. 3.2). Prior work such as Nerfies [31] and Neural Scene Flow
Canonical shape model. In order to model the shape and Fields (NSFF) [20] learn deformations assuming that (1)
appearance of an object in a canonical space, we use a camera transformations are given, and (2) the residual ob-
method inspired by Neural Radiance Fields (NeRF) [27]. ject deformation is small. As detailed in Sec. 3.2 and
A 3D point X∗ ∈ R3 in a canonical space is associated Sec. 3.4, we do not make such assumptions; instead, we
with three properties: color c ∈ R3 , density σ ∈ [0, 1], and adopt a neural blend-skinning model that can handle large
a learned canonical embedding ψ ∈ R16 . These properties deformations, but without assuming a pre-defined skeleton.
are predicted by the Multilayer Perceptron (MLP) networks: Volume rendering. To render images, we use the volume
ct = MLPc (X∗ , vt , ω te ), (1) rendering in NeRF [27], but modified to warp the 3D ray to
account for the deformation [31]. Specifically, let xt ∈ R2
σ = Γβ (MLPSDF (X∗ )), (2) be the pixel location at time t, and Xti ∈ R3 be the i-th 3D
∗
ψ = MLPψ (X ). (3) point sampled along the ray emanates from xt . The color c
As in NeRF, color c depends on a time-varying view di- t and the opacity o ∈ [0, 1] of the pixel are given by:
rection vt ∈R2 and a learnable environment code ω te ∈R64 N N
that is designed to capture environment illumination condi-
X X
c(xt ) = τi ct W t,← Xti , o(xt ) =

τi ,
tions [25], shared across frames in the same video. i=1 i=1
3
where N is the number of samples, τi is the probability Xti Wbt,→ and Wbt,← represent blend skinning weights for
point X∗ and Xt relative to the b-th bone (described fur-
Qi−1
visible to the camera and is given by τi = j=1 pj (1 − pi ).
Here pi = exp (−σi δi ) is the probability that the photon ther below), and ∆Jtb = Jtb J∗b −1 . We represent root pose
is transmitted through the interval δi between the i-th Xti Gt and body pose J∗b , Jtb with angle-axis rotations and 3D
sample and the next, and σi = σ (W t,← (Xti )) is the den- translations, regressed from MLPs:
sity from Eq. 2. Note that we pull back the ray points in the
observed space to the canonical space by using the warp- Gt = MLPG (ω tr ), Jtb = MLPJ (ω tb ) (10)
ing W t,← , as the color and density are only defined in the
canonical space. where ω tr and ω tb are 128-dimensional latent codes for root
We obtain the expected 3D location X∗ of the intersec- pose and body pose at frame t respectively. Similarly, we
tion of the ray with the canonical object: have J∗b = MLPJ (ω ∗b ) and ω ∗b is the 128-dimensional
latent code for the rest body pose. We find such over-
N
X parameterized representations empirically easier to opti-
X∗ (xt ) = τi W t,← (Xti ) ,

(4) mize with stochastic first-order gradient methods.
i=1
Skinning weights. Similar to SCANimate [37], we define
which can be mapped to another time t0 via forward warping a skinning weight function S : (X, ω b ) → W ∈ RB that
W t,→ to find its projected pixel location: assigns X to bones given body pose code ω b . To compute
0 0 the forward and backward skinning weights in Eq. 9, we
xt = Πt W t,→ (X∗ (xt )) ,

(5)
apply S separately at rest pose as well as time t pose, and
0
where Πt is the projection matrix of a pinhole camera. we have Wt,→ = S(X∗ , ω ∗b ), Wt,← = S(Xt , ω tb ).
0
We optimize video-specific Πt given a rough initialization. Directly representing S as neural networks can be diffi-
With this, we compute a 2D flow rendering as: cult to optimize. Therefore, we condition neural skinning
0
weights on explicit 3D Gaussian ellipsoids that move along
F xt , t → t0 = xt − xt .

(6) with the bone coordinates. Following LASR [56], the Gaus-
3.2. Deformation Model via Neural Blend Skinning sian skinning weights are determined by the Mahalanobis
distance between X and the Gaussian ellipsoids:
We define mappings W t,→ and W t,← based on a neural
blend skinning model approximating articulated body mo- Wσ = (X − Cb )T Qb (X − Cb ), (11)
tion. Defining invertible warps for neural deformation rep-
resentations is difficult [5]. Our formulation represents 3D where Cb is the bone center and Qb = VbT Λ0b Vb is the pre-
warps as compositions of weighted rigid-body transforma- cision matrix composed by bone orientation matrix Vb and
tions, each of which is differentiable and invertible. scale Λ0b . Bone centers and orientations are transformed
from the “zero-configurations” as Vb |Cb = Jb Vb0 |C0b .

Blend skinning deformation. Given a 3D point Xt at time
t, we wish to find its corresponding 3D point X∗ in the Λ0b , Vb0 and C0b are learnable parameters and Jb are time-
canonical space. Conceptually, X∗ can be considered as variant body poses parameters determined by body pose
points in the “rest” pose at a fixed camera view point. Our code ω ∗b and ω tb .
formulation finds mappings between Xt and X∗ by blend- To model skinning weights for fine geometry, we
ing the rigid transformations of B bones (3D coordinate learn delta skinning weights with an MLP, W∆ =
systems). Let Gt ∈ SE(3) be a root body transformation of MLP∆ (X, ω b ). The final skinning function is the
the object from canonical space to time t, and Jtb ∈ SE(3) softmax-normalized sum of the coarse and fine components,
be a rigid transformation that moves the b-th bone from its
“zero-configuration” to deformed state t, then we have W = S(X, ωb ) = σsoftmax Wσ + W∆ . (12)
Xt = W t,→ (X∗ ) = Gt Jt,→ X∗ ,

(7) The Gaussian component regularizes the skinning weights
∗ t,← t t,← t −1 t to be spatially smooth and temporally consistent, and

X = W (X ) = J (G ) X , (8)
handles large deformations better than purely implicitly-
where Jt,→ and Jt,← are weighted averages of B rigid defined ones. Furthermore, our formulation of the skinning
transformations {∆Jtb }b∈B that move the bones between weights are dependent on only pose status by construction,
rest configurations (denoted as J∗b ) and time t configura- and therefore regularizes the space of skinning weights.
tions Jtb , following linear blend skinning deformation [8]:
B B
3.3. Registration via Canonical Embeddings
X X
J t,→
= Wbt,→ ∆Jtb , J t,←
= Wbt,← (∆Jtb )−1 . To register pixel observations at different time instances,
b=1 b=1 BANMo maintains a canonical feature embedding that en-
(9) codes semantic information of 3D points in the canonical
4
3.4. Optimization
Given multiple videos, we optimize all parameters
MLP X⇤
<latexit sha1_base64="ncfd9BtbnP7oQWE8ybVWqKMwMKI=">AAAB9XicbVDLSgMxFL3xWeur6tJNsAjiosyIoMuCG5cV7EPaacmkmTY0kxmSjFKG+Q83LhRx67+482/MtLPQ1gOBwzn3ck+OHwuujeN8o5XVtfWNzdJWeXtnd2+/cnDY0lGiKGvSSESq4xPNBJesabgRrBMrRkJfsLY/ucn99iNTmkfy3kxj5oVkJHnAKTFW6vdCYsZ+kHayfnqeDSpVp+bMgJeJW5AqFGgMKl+9YUSTkElDBdG66zqx8VKiDKeCZeVeollM6ISMWNdSSUKmvXSWOsOnVhniIFL2SYNn6u+NlIRaT0PfTuYp9aKXi/953cQE117KZZwYJun8UJAIbCKcV4CHXDFqxNQSQhW3WTEdE0WosUWVbQnu4peXSeui5jo19+6yWn8o6ijBMZzAGbhwBXW4hQY0gYKCZ3iFN/SEXtA7+piPrqBi5wj+AH3+AMBHkrs=</latexit>
described above, including MLPs, {MLPc , MLPSDF ,

<latexit sha1_base64="DkV9+r+2PsJ1e8ywPR1nbyz1vKA=">AAACCHicbVC7TsMwFHXKq5RXgJEBiwqpMFQJQoKxEgtjkegDNaFyHKe16tiR7SBVUUYWfoWFAYRY+QQ2/gan7QAtV7J8dM69uueeIGFUacf5tkpLyyura+X1ysbm1vaOvbvXViKVmLSwYEJ2A6QIo5y0NNWMdBNJUBww0glGV4XeeSBSUcFv9TghfowGnEYUI22ovn3oBYKFahybL/MSRfOaFyM9DKKsm9+fnvTtqlN3JgUXgTsDVTCrZt/+8kKB05hwjRlSquc6ifYzJDXFjOQVL1UkQXiEBqRnIEcxUX42OSSHx4YJYSSkeVzDCft7IkOxKryazsKkmtcK8j+tl+ro0s8oT1JNOJ4uilIGtYBFKjCkkmDNxgYgLKnxCvEQSYS1ya5iQnDnT14E7bO669Tdm/Nq424WRxkcgCNQAy64AA1wDZqgBTB4BM/gFbxZT9aL9W59TFtL1mxmH/wp6/MHNo+aIg==</latexit>
(X⇤ )2 R16
<latexit sha1_base64="VGD13lWEwiGGLvBCUVRgdVu12lU=">AAAB/HicbVDLSsNAFL2pr1pf0S7dDBbBVUlE1GXBjcsq9iFNLJPppB06mYSZiRBC/RU3LhRx64e482+ctllo64GBwzn3cs+cIOFMacf5tkorq2vrG+XNytb2zu6evX/QVnEqCW2RmMeyG2BFORO0pZnmtJtIiqOA004wvpr6nUcqFYvFnc4S6kd4KFjICNZG6ttVjwnkRViPgiC/nTzk7vmkb9ecujMDWiZuQWpQoNm3v7xBTNKICk04VqrnOon2cyw1I5xOKl6qaILJGA9pz1CBI6r8fBZ+go6NMkBhLM0TGs3U3xs5jpTKosBMTmOqRW8q/uf1Uh1e+jkTSaqpIPNDYcqRjtG0CTRgkhLNM0MwkcxkRWSEJSba9FUxJbiLX14m7dO669Tdm7Na476oowyHcAQn4MIFNOAamtACAhk8wyu8WU/Wi/VufcxHS1axU4U/sD5/AFwolKQ=</latexit>
MLPψ , MLPG , MLPJ , MLP∆ }, learnable codes
{ω te , ω tr , ω tb , ω ∗b } and pixel embeddings ψ I .
Losses. The model is learned by minimizing three types
Canonical Space
Fine-tuned Fine-tuned of losses: reconstruction losses, feature registration losses,
DensePose CNN DensePose CNN and a 3D cycle-consistency regularization loss:
x t1
x t2
<latexit sha1_base64="xnbcb3NcIJiA4aP+15D21QhxdTI=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRIRdFlw47KCfUgbw2Q6aYdOJmHmplhC/sSNC0Xc+ifu/BsnbRbaemDgcM693DMnSATX4DjfVmVtfWNzq7pd29nd2z+wD486Ok4VZW0ai1j1AqKZ4JK1gYNgvUQxEgWCdYPJTeF3p0xpHst7mCXMi8hI8pBTAkbybXsQERgHYfaUP2bgu7lv152GMwdeJW5J6qhEy7e/BsOYphGTQAXRuu86CXgZUcCpYHltkGqWEDohI9Y3VJKIaS+bJ8/xmVGGOIyVeRLwXP29kZFI61kUmMkip172CvE/r59CeO1lXCYpMEkXh8JUYIhxUQMecsUoiJkhhCpusmI6JopQMGXVTAnu8pdXSeei4ToN9+6y3nwo66iiE3SKzpGLrlAT3aIWaiOKpugZvaI3K7NerHfrYzFascqdY/QH1ucPCh2T+g==</latexit>

L = Lsil + Lrgb + LOF + Lmatch + L2D-cyc + L3D-cyc .
<latexit sha1_base64="4crcMECC5/6Guukcp6wXZeQY2CY=">AAAB+XicbVDLSsNAFL3xWesr6tLNYBFclaQIuiy4cVnBPqSNYTKdtEMnkzAzKZaQP3HjQhG3/ok7/8ZJm4W2Hhg4nHMv98wJEs6Udpxva219Y3Nru7JT3d3bPzi0j447Kk4loW0S81j2AqwoZ4K2NdOc9hJJcRRw2g0mN4XfnVKpWCzu9SyhXoRHgoWMYG0k37YHEdbjIMye8sdM+43ct2tO3ZkDrRK3JDUo0fLtr8EwJmlEhSYcK9V3nUR7GZaaEU7z6iBVNMFkgke0b6jAEVVeNk+eo3OjDFEYS/OERnP190aGI6VmUWAmi5xq2SvE/7x+qsNrL2MiSTUVZHEoTDnSMSpqQEMmKdF8ZggmkpmsiIyxxESbsqqmBHf5y6uk06i7Tt29u6w1H8o6KnAKZ3ABLlxBE26hBW0gMIVneIU3K7NerHfrYzG6ZpU7J/AH1ucPC6KT+w==</latexit>
Image Space at t1 Image Space at t2

| {z } | {z }
<latexit sha1_base64="SVG2hxvF7EcP+hdssaUPWfkvBZw=">AAAB6nicbVBNS8NAEJ3Ur1q/oh69LBbBU0lE0GPBi8eK9kPaUDbbTbt0swm7E6GE/gQvHhTx6i/y5r9x2+agrQ8GHu/NMDMvTKUw6HnfTmltfWNzq7xd2dnd2z9wD49aJsk0402WyER3Qmq4FIo3UaDknVRzGoeSt8PxzcxvP3FtRKIecJLyIKZDJSLBKFrpHvt+3616NW8Oskr8glShQKPvfvUGCctirpBJakzX91IMcqpRMMmnlV5meErZmA5511JFY26CfH7qlJxZZUCiRNtSSObq74mcxsZM4tB2xhRHZtmbif953Qyj6yAXKs2QK7ZYFGWSYEJmf5OB0JyhnFhCmRb2VsJGVFOGNp2KDcFffnmVtC5qvlfz7y6r9ccijjKcwCmcgw9XUIdbaEATGAzhGV7hzZHOi/PufCxaS04xcwx/4Hz+AAyDjbM=</latexit> <latexit sha1_base64="YX137MIq8yNr4LLnvGCMgoYJ0TI=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKUI8FLx4r2g9pQ9lsN+3SzSbsToQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1DDpVC8hQIl7yaa0yiQvBNMbuZ+54lrI2L1gNOE+xEdKREKRtFK9zioDcoVt+ouQNaJl5MK5GgOyl/9YczSiCtkkhrT89wE/YxqFEzyWamfGp5QNqEj3rNU0YgbP1ucOiMXVhmSMNa2FJKF+nsio5Ex0yiwnRHFsVn15uJ/Xi/F8NrPhEpS5IotF4WpJBiT+d9kKDRnKKeWUKaFvZWwMdWUoU2nZEPwVl9eJ+1a1XOr3t1VpfGYx1GEMziHS/CgDg24hSa0gMEInuEV3hzpvDjvzseyteDkM6fwB87nDw4HjbQ=</latexit>
reconstruction losses feature registration losses
Figure 3. Canonical Embeddings. We jointly optimize an im-

Reconstruction losses are similar to those in existing differ-
plicit function to produce canonical embeddings from 3D canoni-
cal points that match to the 2D DensePose CSE embeddings [28].
entiable rendering pipelines [27, 59]. Besides color recon-
struction loss Lrgb and silhouette reconstruction loss Lsil , we
space, which can be uniquely matched by the pixel features, further compute flow reconstruction losses LOF by compar-
and provide strong cues for registration via a joint optimiza- ing the rendered F defined in Eq. 6 with the observed 2D
tion of shape, articulations, and embeddings (Sec. 3.4). optical flow F̂ computed by an off-the-shelf flow network:
Canonical embeddings matching. Given a pixel at xt of
c(xt ) − ĉ(xt ) 2 , Lsil = o(xt ) − ŝ(xt ) 2 ,
X X
frame t, our goal is to find a point X∗ in the canonical space Lrgb =
whose feature embedding ψ(X∗ ) ∈ R16 best matches the xt xt
pixel feature embedding ψ tI (xt ) ∈ R16 . The pixel embed- X 2

dings ψ tI (of frame t) are computed by a CNN. Different F xt , t → t0 − F̂ xt , t → t0 , (14)

LOF =

from ViSER [57] that learns embeddings from scratch, we xt ,(t,t0 )
initialize pixel embeddings with CSE [28, 29] that produces where ĉ and ŝ are observed color and silhouette.
consistent features for semantically corresponding pixels, Additionally, we define registration losses to enforce 3D
and optimize pixel and canonical embeddings jointly. Re- point prediction via canonical embedding X̂∗ (xt ) (Eq. 13)
call that the embedding of a canonical 3D point is computed to match the prediction from backward warping (Eq. 4):
as ψ(X∗ ) = MLPψ (X∗ ) in Eq. 3. Intuitively, MLPψ is
optimized to ensure the output 3D descriptor matches 2D X
∗ t
2
Lmatch = X (x ) − X̂∗ (xt ) , (15)

descriptors of corresponding pixels across multiple views. 2
xt
To compute the 3D surface point corresponding to a 2D
point xt , we apply soft argmax descriptor matching [11,24]: and a 2D cycle consistency loss [16, 57] that forces the im-
X age projection after forward warping of X̂∗ (xt ) to land back
X̂∗ (xt ) = s̃t (xt , X)X, (13)
on its original 2D coordinates:
X∈V∗
where V∗ are sampled points in a canonical 3D grid, X 2

Π W t,→ (X̂∗ (xt )) − xt .
whose size is dynamically determined during optimiza- t
L2D-cyc = (16)

2
tion (see supplement), and s̃ is a normalized feature xt
t t
matching distribution over the

t t 3D grid: s̃ (x , X) = Similar to NSFF [20], we regularize the deformation func-
σsoftmax αs ψ I (x ), ψ(X∗ ) , where αs is a learnable

tion W t,→ (·) and W t,← (·) by a 3D cycle consistency loss,
scaling
to control the peakness of the softmax function and which encourages a sampled 3D point in the camera coor-
., . is the cosine similarity score. dinates to be backward deformed to the canonical space and
Self-supervised canonical embedding learning. As de- forward deformed to its original location:
scribe later in Eq. 15-16, the canonical embedding is self-
2
supervised by enforcing the consistency between feature X
L3D-cyc = τi W t,→ W t,← (Xti , t), t − Xti , (17)

matching and geometric warping. By jointly optimizing the 2
i
shape and articulation parameters via consistency losses,
canonical embeddings provide strong cues to register pixels where τi is the opacity that weighs the sampled points so
from different time instance to the canonical 3D space, and that a point near the surface receives heavier regularization.
enforce a coherent reconstruction given observations from Our optimization is highly non-linear with local minima,
multiple videos, as validated in ablation studies (Sec. 4.3). and we consider two strategies for robust optimization.
5
Root pose initialization. Due to ambiguities between ob- a human, denoted by casual-cat and casual-human
ject geometry and root poses, we find it helpful to pro- respectively. Object silhouette and two-frame optical flow
vide a rough per-frame initialization of root poses (Gt in images (used for reconstruction losses Eq. 14) are extracted
Eq. 7), similar to NeRS [61]. Specifically, we train a sepa- from off-the-shelf models, PointRend and VCN-robust re-
rate network PoseNet , which is applied to every test video spectively [12, 55]. Two special challenges arise from the
frame. Similar to DenseRaC [54], PoseNet takes a Dense- casual nature of the video captures. First, each video collec-
Pose CSE [28] feature image as input and predicts the root tion contains around 2k images, orders of magnitudes larger
pose Gt0 = PoseNet(ψ tI ), where ψ tI ∈ R112×112×16 is the those used in prior work [20, 27, 31, 57], which requires the
embedding output of DensePose CSE [28] from an RGB method to handle reconstructions at a larger scale. Sec-
image It . We train PoseNet by a synthetic dataset produced ond, the dataset makes no assumption about camera move-
offline. See supplement for details on training. Given the ment or object movement, making registration difficult. In
pre-computed Gt0 , BANMo only needs to compute a delta particular, objects freely moves in a video and background
root pose via MLP: changes across videos, posing challenges to standard SfM
pipelines. We show results on casual-cat below and
Gt = MLPG (ω tr )Gt0 . (18) casual-human in the supplement.
Active sampling over (x, y, t). Inspired by iMAP [44],

Comparison with Nerfies. Nerfies [31] is designed for a
our sampling strategy follows an easy-to-hard curriculum.
single continuously captured video, assuming object root
At the early iterations, we randomly sample a batch of
body does not move and root poses can be registered by
N p = 8192 pixels for volume rendering and compute re-
SfM. However, in our setup object root body moves and
construction losses. At the same time, we optimize a com-
SfM does not provide registration of root poses, causing
pact 5-layer MLP to represent the uncertainty over image
Nerfies to fail. To make a fair comparison, we provide
coordinates and frame index, by comparing against the re-
Nerfies with rough initial root poses (obtained from our
construction errors in the current batch. More details can
PoseNet, Sec. 3.4), and optimize Nerfies on a per-video
be found in the supplement. After a half of the optimiza-
basis. Meshes are extracted by running marching cubes on
tion steps, we select N a = 8192 additional active samples
a 2563 grid. We first compare single-video BANMo with
from pixels with high uncertainties, and combine with the
Nerfies in Fig. 4. Although Nerfies reconstructs reason-
uniform samples. Empirically, active samples dramatically
able 3D shapes of moving objects given rough initial root
improves reconstruction fidelity, as shown in Fig. 8.
pose, it fails to reconstruct large articulations, such as the
fast motion of the cat’s head (2nd row), and in general the
4. Experiments
reconstruction is coarser than single-video BANMo. Fur-
Implementation details. Our implementation of implicit thermove, as shown in Fig. 5, Nerfies is not able to leverage
shape and appearance models follows NeRF [27], except more videos to improve the reconstruction quality, while the
that our shape model outputs SDF, which is transformed to reconstruction of BANMo improves given more videos.
density for volume rendering. To extract the rest surface, we
find the zero-level set of SDF by running marching cubes Comparison with ViSER. Another baseline, ViSER [57],
on a 2563 grid. To obtain articulated shapes at each time in- reconstructs articulated objects given multiple videos.
stance, we articulate points on the rest surface with forward However, it assumes root body poses are given by run-
deformation W t,→ . We use B = 25 bones, whose centers ning LASR, which produces less robust root pose estima-
are initialized around an approximation of object surface. tions than our PoseNet. Therefore, we provide ViSER the
Optimization details. In a single batch, we sample N p = same root poses from our initialization pipeline. As shown
8192 pixels for volume rendering. Empirically, we found in Fig. 4, ViSER produces reasonable articulated shapes.
the number of iterations to achieve the same level of recon- However, detailed geometry, such as ears, eyes, nose and
struction quality scales with the number of input frames. rear limbs of the cat are blurred out. Furthermore, detailed
To determine the number of iterations, we use an empirical articulation, such as head rotation and leg switching are not
equation: N opt = 1000 # frames
N p (k), roughly 60k iterations recovered. In contrast, BANMo faithfully recovers these
for 1k frames (15 hours on 8 V100 GPUs). Please find a high-fidelity geometry and motion. We observed that im-
complete list of hyper-parameters in supplement. plicit shape representation (specifically VolSDF [58] with a
gradually reducing transparency value) is less prune to bad
4.1. Reconstruction from Casual Videos
local optima, which helps recovering body parts and artic-
Casual videos dataset. We demonstrate BANMo’s ability ulations. Furthermore, implicit shape representation allows
to reconstruct 3D models on collections of casual videos, in- for maintaining a continuous geometry (compared to finite
cluding 20 videos of a British Shorthair cat and 10 videos of number of vertices), enabling us to recover detailed shape.
6
Reference image Ours (multi-videos) ViSER (multi-videos) Ours (single video) Nerfies (single video)
Figure 4. Qualitative comparison of our method with prior art [31, 57]. From top to bottom: AMA’s samba, casual-cat, eagle.
Table 1. Quantitative results on AMA and Animated Objects.

3D Chamfer distances (cm) averaged over all frames (↓). The 3D
models for eagle and hands are resized such that the largest
edge of the axis-aligned object bounding box is 2m. * with ground-
truth root poses. (S) refers to single-video results.
1 video 8 videos 1 video 8 videos Input videos
BANMo (Ours) Nerfies
Method swing samba mean eagle hands
∗
Ours 5.95 5.86 5.91 3.93 3.10∗
ViSER 15.16 16.31 15.74 19.72∗ 7.42∗
Using 120 frames Using 400 frames Using 800 frames Ours (S) 6.30 6.31 6.31 6.29∗ 5.96∗
BANMo (Ours)
Nerfies (S) 11.47 11.36 11.42 8.61∗ 10.82∗
Figure 5. Reconstruction completeness vs number of input
videos and video frames. BANMo is capable of registering more
camera poses) and ground-truth silhouettes to BANMo and
input videos if they are available, improving the reconstruction.
baselines. Optical flow is computed by VCN-robust [55].
Metric. We quantify our method and baselines against the
4.2. Quantitative Evaluation on Multi-view Videos 3D ground-truth in terms of per-frame reconstruction errors
We use two sources of data with 3D ground truth for estimated as Chamfer distances between the ground-truth
quantitative evaluation of our method. mesh points S∗ and the estimated mesh points Ŝ:
AMA human dataset. For human reconstruction, we use X X
dCD S∗ , Ŝ = min kx−yk22 + min∗ kx−yk22 .

the Articulated Mesh Animation (AMA) dataset [48]. AMA
y∈Ŝ x∈S
contains 10 sets of multi-view videos captured by 8 syn- x∈S∗ y∈Ŝ
chronized cameras. We select 2 sets of videos of the same
actor (swing and samba), totaling 2600 frames, and treat Before computing dCD , we align the estimated shape to the
them as unsynchronized monocular videos as the input. ground-truth via Iterative Closest Point (ICP) up to a 3D
Time synchronization and camera extrinsics are not used similarity transformation.
for optimization. We use the ground-truth object silhouettes Results. We compare BANMo against ViSER and Nerfies
and the optical flow estimated by VCN-robust [55]. in Tab. 1. To set up a fair comparison with Nerfies, we run
Animated Objects dataset. To quantitatively evaluate on both methods on a per-video basis, denoted as Ours (S) and
other object categories, we download free animated 3D Nerfies (S). We found Nerfies that uses RGB reconstruction
models from TurboSquid, including an eagle model and loss only does not produce meaningful results due to the ho-
a model for human hands. We render them from differ- mogeneous background color. To improve Nerfies results,
ent camera trajectories with partially overlapping motions. we provide it with ground-truth object silhouettes, and op-
Each animated object is rendered as 5 videos with 150 timize a carefully balanced RGB+silhouette loss [59]. Still,
frames per video. We provide ground-truth root poses (or our method produces lower errors than ViSER and Nerfies.
7
4.3. Diagnostics
We ablate the importance of each component, by using a
subset of videos. To also ablate root pose initialization and
registration, we test on AMA’s samba and swing (325
frames in total). We include exhaustive ablations in supple- Reference w/o active sampling (Sec. 3.4) Visualization of active samples (red)
ment, and only highlight crucial aspects of BANMo below.

Root pose initialization. We show the effect of PoseNet Figure 8. Diagnostics of active sampling over (x, y) (Sec. 3.4).
for root pose initialization (Sec 3.4) in Fig. 6: without it, With no active sampling, our method converges slower and misses
details (such as ears and eyes). Active samples focus on face and
the root poses (or equivalently camera poses) collapsed to a
boundaries pixels where the color reconstruction errors are higher.
degenerate solution after optimization.
Initial cameras Final cameras Initial cameras Final cameras ularization from the Gaussian component, and also handle
complex deformation such as close contact of hands.
reconstruction reconstruction
Reference Without root pose initialization (Sec. 3.4)

Reference Neural blend skinning MLP-SE(3) MLP-translation
image (Ours) (Nerfies) (NSFF, D-NeRF)
Figure 6. Diagnostics of root pose initialization (Sec.3.4). With
randomly initialized root poses, the estimated poses (on the right) Figure 9. Diagnostics of deformation modeling (Sec.3.2). Re-
collapsed to a degenerate solution, causing reconstruction to fail. placing our neural blend skinning with MLP-SE(3) results in less
regular deformation in the non-visible region. Replacing our neu-
Registration. In Fig. 7, we show the benefit of using ral blend skinning with MLP-translation as in NSFF and D-Nerf
canonical embeddings (Sec 3.3), and measured 2D flow results in reconstructing ghosting wings due to significant motion.
(Eq. 14) to register observations across videos and within a
video. Without the canonical embeddings and correspond- Ability to leverage more videos. We compare BANMo to
ing losses (Eq. 15-16), each video will be reconstructed sep- Nerfies in terms of the ability to leverage more available
arately. With no flow reconstruction loss, multiple ghosting video observations. To demonstrate this, we compare the
structures are reconstructed due to failed registration. reconstruction quality of optimizing 1 video vs. 8 videos
from the AMA samba sequences. Results are shown in
single coherent recon. samba recon. swing recon. recon. 1 recon. 2 recon. 3
Fig. 5. Given more videos, our method can register them
to the same canonical space, improving the reconstruction
completeness and reducing shape ambiguities. In contrast,
Nerfies does not produce better results given more video
observations.
Reference w/o feature registration (Sec. 3.3) further remove flow loss (Eq. 14)
Figure 7. Diagnostics of registration (Sec. 3.3). Without canon-

ical embeddings (middle) or flow loss (right), our method fails to Driving frame Re-targeted pose rending Re-targeted pose Source model (cat)
register frames to a canonical model, creating ghosting effects.
Figure 10. Motion re-targeting from a pre-optimized cat model
Active sampling. We show the effect of active sampling to a tiger. Color coded by point locations in the canonical space.
(Sec 3.4) on a casual-cat video (Fig. 8): removing it
results in slower convergence and inaccurate geometry. Motion re-targeting. As a distinctive application, we
Deformation modeling. We demonstrate the benefit of us- demonstrate BANMo’s ability of to re-target the articula-
ing our neural blend skinning model (Sec 3.2) on an eagle tions of a driving video to our 3D model by optimizing the
sequence, which is challenging due to its large wing ar- frame-specific root and body pose codes ω tr , ω tb , as shown
ticulations. If we swap neural blend skinning for MLP- in Fig. 10. To do so, we first optimize all parameters over
SE(3) [31], the reconstruction is less regular. If we swap for a set of training videos from our casual-cat dataset.
MLP-translation [20, 36], we observe ghosting wings due Given a driving video of a tiger, we freeze the shared model
to wrong geometric registration (caused by large motion). parameters (including shape, skinning, and canonical fea-
Our method can model large articulations thanks to the reg- tures) of the cat model, and only optimize the video-specific
8
and frame-specific codes, i.e. root and body pose codes ω tr ,
ω tb , as well as the environment lighting code ω te .
5. Discussion
We have presented BANMo, a method to reconstruct
high-fidelity animatable 3D models from a collection of
casual videos, without requiring a pre-defined shape tem-
plate or pre-registered cameras. BANMo registers thou-
sands of unsynchronized video frames to the same canon-
ical space by fine-tuning a generic DensePose prior to spe-
cific instances. We obtain fine-grained registration and re-
construction by leveraging neural implicit representation
for shape, appearance, canonical features, and skinning
weights. On real and synthetic datasets, BANMo shows
strong empirical performance for reconstructing clothed hu-
man and quadruped animals, and demonstrates the ability to
recover large articulations, reconstruct fine-geometry, and
render realistic images from novel viewpoints and poses.
Limitations. BANMo uses a pre-trained DensePose-CSE
(with 2D keypoint annotaionts [29]) to provide rough root
body pose registration, and therefore not currently applica-
ble to categories beyond humans and quadruped animals.
Further investigation is needed to extend our method to ar-
bitrary object categories. Similar to other works in differen-
tiable rendering, BANMo requires a lot of compute, which
increases linearly with the number of input images. We
leave speeding up the optimization as future work.
9
A. Notations Generate random CSE feature
viewpoints rendering
We refer readers to a list of notations in Tab. 4 and a list
of learnable parameters in Tab. 5. Densepose CSE
surface embedding
B. Method details PoseNet
Augmentation:
Random masks
B.1. Root Pose Initialization

As discussed in Sec. 3.4, to make optimization robust,
we train a image CNN (denoted as PoseNet) to initialize Figure 12. Training pipeline of PoseNet. To train PoseNet,
root body transforms Gt that aligns the camera space of we use DesePose CSE surface embeddings, which is pertained on
time t to the canonical space of CSE, as shown in Fig. 11. 2D annotations of human and quadruped animals. We first gen-
erate random viewpoints on a sphere that faces the origin. Then
we render surface embeddings as 16-channel images. We further
PoseNet augment the feature images with random adversarial masks to im-
prove the robustness to occlusions. Finally, the rotations predicted
by PoseNet are compared against the ground-truth rotations with
Input frames
DensePose
ResNet-18 SO(3) initializations geodesic distance.
CNN
Figure 11. Inference pipeline of PoseNet. To initialize the op-

timization, we train a CNN PoseNet to predict root poses given predictions in presence of occlusions and in case of out-of-
a single image. PoseNet uses a DensePose-CNN to extract pixel distribution appearance.
features and decodes the pixel features into root pose predictions Loss and inference. We use the geodesic distance between
with a ResNet-18. We visualize the initial root poses on the right. the ground-truth and predicted rotations as a loss to update
Cyan color represents earlier time stamps and magenta color rep- PoseNet,
resent later timestamps.
Lgeo = || log(R∗ RT )||, R = PoseNet(ψ rnd ), (19)
Preliminary DensePose CSE [28, 29] trains pixel embed-
where we find learning to predict rotation is sufficient for
dings ψ I and surface feature embeddings ψ for humans and
initializing the root body pose. In practice, we set the
quadruped animals using 2D keypoing annotations. It rep-
initial object-to-camera translation to be a constant T =
resents surface embeddings by a canonical surface with N
(0, 0, 3)T . We run pose CNN on each test video frame to
vertices and vertex features ψ ∈ RN ×16 . A SMPL mesh is
obtain the initial root poses Gt0 = R, T , and compute a
used for humans, and a sheep mesh is used for quadraped
delta root pose with the root pose MLP:
animals. The embeddings are trained such that given an
pixel feature, a 3D point on the canonical surface can be Gt = MLPG (ωrt )Gt0 . (20)
uniquely located via feature matching.
Naive PnP solution Given 2D-3D correspondences pro- B.2. Active Sampling Over Pixels
vided by CSE, one way to solve for Gt is to use perspective-
As discussed in Sec. 3.4, we optimize a compact 5-layer
n-points (PnP) algorithm assuming objects are rigid. How-
MLP function to represent the uncertainty over the image
ever, the PnP solution suffers from catastrophic failures
coordinate and frame index: Û(x, y, t) = MLPU (x, y, t).
due to the non-rigidity of the object, which motivates our
The uncertainty MLP is optimized by comparing against the
PoseNet solution. By training a feed-forward network with
color reconstruction errors in the current forward step:
data augmentations, our PoseNet solution produces fewer
gross errors than the naive PnP solution. X
LU = Lrgb (xt ) − Û(xt ) . (21)

Synthetic dataset genetarion. We train separate PoseNet, x,t
one for human, and one for quadruped animals. The train-
ing pipeline is shown in Fig. 12. Specifically, we render sur- Note that the gradient from LU to Lrgb (xt ) is stopped such
face features as feature images ψ rnd ∈ R112×112×16 given that LU does not generate gradients to parameters besides
viewpoints G∗ = (R∗ , T∗ ) randomly generated on a unit MLPU . After a half of the optimization steps, we select
sphere facing the origin. We apply occlusion augmenta- N a = 8192 additional active samples from pixels with high
0
tions [40] that randomly mask out a rectangular region in uncertainties. To do so, we randomly sample N a = 32768
the rendered feature image and replace with mean values pixels, and evaluate their uncertainties by passing their co-
of the corresponding feature channels. The random occlu- ordinates and frame indices to MLPU . The pixels with
sion augmentation forces the network to be robust to out- high uncertainties are selected as active samples and com-
lier inputs, and empirically helps network to make robust bined with the random samples.
10
Table 2. Table of hyper-parameters.
Name Value Description

B 25 Number of bones
N 128 Sampled points per ray
Np 8192 Sampled rays per batch
Na 8192 Active ray samples per batch
(H, W ) (512,512) Resolution of observed images
B.3. Optimization details

Canonical 3D grid. As mentioned in Sec 3.3, we define a
canonical 3D grid V∗ ∈ R20×20×20 to compute the match-
ing costs between pixels and canonical space locations. The
canonical grid is centered at the origin and axis-aligned
with bounds [xmin , xmax ], [ymin , ymax ], and [zmin , zmax ]. The
bounds are initialized as loose bounds and are refined dur-
ing optimization. For every 200 iterations, we update the
bounds of the canonical volume as an approximate bound
of the object surface. To do so, we run marching cubes on
a 643 grid to extract a surface mesh and then set L as the
Table 3. Diagnostics. Averaged 3D Chamfer distances in cm (↓).
axis-aligned (x, y, z) bounds of the extracted surface.
Near-far planes. To generate samples for volume render-
ing, we dynamically compute the depth of near-far planes Method swing samba mean
(dtn , dtf ) of frame t at each iterations of the optimization. Reference 7.05 6.59 6.82
To do so, we compute the projected depth of the canon- w/o canonical feature, Sec 3.3 53.60 56.69 55.15
ical surface points dti = (Πt Gt X∗i )2 . The near plane is w/o root pose init., Sec 3.4 55.07 56.76 55.92
set as dtn = min(di ) − L and the far plane is set as w/o flow loss, Eq. 13 47.41 48.55 47.85

dtf = max(di ) + L , where L = 0.2 max(di ) − min(di ) .
To avoid the compute overhead, we approximate the surface
with an axis-aligned bounding box with 8 points.
Hyper-parameters. We refer readers to a complete list of
hyper-parameters in Tab. 2.
C. Additional results
C.1. Quantitative evaluation
In Sec. 4.3, we presented qualitative results of diagnos-
tics experiments. In Tab. 3, we show quantitative evalu-
ations corresponding to each experiment. As a result, re-
moving canonical features, root pose initialization or flow
loss leads to much higher 3D Chamfer distances.
C.2. Qualitative results
We refer readers to our supplementary webpage for com-
plete qualitative results.
11
Table 4. Table of notations.
Symbol Description
Index
t Frame index, t ∈ {1, . . . , T }
b Bone index b ∈ {1, . . . , B} in neural blend skinning
i Point index b ∈ {1, . . . , N } in volume rendering
Points
x Pixel coordinate x = (x,y)
Xt 3D point locations in the frame t camera coordinate
X∗ 3D point locations in the canonical coordinate
X̂∗ Matched canonical 3D point locations via canonical embedding
Property of 3D points
c ∈ R3 Color of a 3D point
σ∈R Density of a 3D point
ψ ∈ R16 Canonical embedding of a 3D point
W ∈ RB Skinning weights of assigning a 3D point to B bones
Functions on 3D points
W t,← (Xt ) Backward warping function from Xt to X∗
W t,→ (X∗ ) Forward warping function from X∗ to Xt
S(X, ωb ) Skinning function that computes skinning weights of X under body pose ωb
Rendered and Observed Images
c/ĉ Rendered/observed RGB image
o/ŝ Rendered/observed object silhouette image
F/F̂ Rendered/observed optical flow image
12
Table 5. Table of learnable parameters.
Symbol Description
Canonical Model Parameters
MLPc Color MLP
MLPSDF Shape MLP
MLPψ Canonical embedding MLP
Deformation Model Parameters
Λ0 ∈ R3×3 Scale of the bones in the “zero-configuration” (diagonal matrix).
V0 ∈ R3×3 Orientation of the bones in the “zero-configuration”.
C0 ∈ R 3 Center of the bones in the “zero-configuration”.
MLP∆ Delta skinning weight MLP
MLPG Root pose MLP
MLPJ Body pose MLP
Learnable Codes
ω ∗b ∈ R128 Body pose code for the rest pose
ω tb ∈ R128 Body pose code for frame t
ω tr ∈ R128 Root pose code for frame t
ω te ∈ R64 Environment lighting code for frame t, shared across frames of the same video
Other Learnable Parameters
ψI CNN pixel embedding initialized from DensePose CSE
αs Temperature scalar for canonical feature matching
β Scale parameter that controls the solidness of the object surface
Π ∈ R3×3 Intrinsic matrix of the pinhole camera model
13
References adaptation for consistent mesh reconstruction in the wild. In
NeurIPS, 2020. 2
[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Si-
[19] Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello,
mon, Brian Curless, Steven M Seitz, and Richard Szeliski.
Varun Jampani, Ming-Hsuan Yang, and Jan Kautz. Self-
Building rome in a day. Communications of the ACM, 2011.
supervised single-view 3d reconstruction via semantic con-
1
sistency. ECCV, 2020. 2
[2] Marc Badger, Yufu Wang, Adarsh Modh, Ammon Perkes,
[20] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang.
Nikos Kolotouros, Bernd Pfrommer, Marc Schmidt, and
Neural scene flow fields for space-time view synthesis of dy-
Kostas Daniilidis. 3D bird reconstruction: a dataset, model,
namic scenes. In CVPR, 2021. 2, 3, 5, 6, 8
and shape recovery from a single view. In ECCV, 2020. 2
[3] Benjamin Biggs, Ollie Boyne, James Charles, Andrew [21] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Si-
Fitzgibbon, and Roberto Cipolla. Who left the dogs out: 3D mon Lucey. Barf: Bundle-adjusting neural radiance fields.
animal reconstruction with expectation maximization in the In ICCV, 2021. 2
loop. In ECCV, 2020. 2 [22] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu
[4] Christoph Bregler, Aaron Hertzmann, and Henning Bier- Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor:
mann. Recovering non-rigid 3d shape from image streams. Neural free-view synthesis of human actors with pose con-
In CVPR, 2000. 2 trol. SIGGRAPH Asia, 2021. 2
[5] Xu Chen, Yufeng Zheng, Michael J Black, Otmar Hilliges, [23] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard
and Andreas Geiger. Snarf: Differentiable forward skinning Pons-Moll, and Michael J. Black. SMPL: A skinned multi-
for animating non-rigid neural implicit shapes. 2021. 2, 4 person linear model. SIGGRAPH Asia, 2015. 2
[6] Shubham Goel, Angjoo Kanazawa, and Jitendra Malik. [24] Diogo C Luvizon, Hedi Tabia, and David Picard. Human
Shape and viewpoints without keypoints. In ECCV, 2020. pose regression by combining indirect part detection and
2 contextual information. Computers & Graphics, 2019. 5
[7] Paulo FU Gotardo and Aleix M Martinez. Non-rigid struc- [25] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi,
ture from motion with complementary rank-3 spaces. In Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duck-
CVPR, 2011. 2 worth. NeRF in the Wild: Neural Radiance Fields for Un-
[8] Alec Jacobson, Zhigang Deng, Ladislav Kavan, and JP constrained Photo Collections. In CVPR, 2021. 2, 3
Lewis. Skinning: Real-time shape deformation. In ACM [26] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su,
SIGGRAPH 2014 Courses, 2014. 4 Lan Xu, Xuming He, and Jingyi Yu. Gnerf: Gan-based neu-
[9] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima ral radiance field without posed camera. In ICCV, 2021. 2
Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating [27] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
neural radiance fields. In ICCV, 2021. 2 Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
[10] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Representing scenes as neural radiance fields for view syn-
Jitendra Malik. Learning category-specific mesh reconstruc- thesis. In ECCV, 2020. 2, 3, 5, 6
tion from image collections. In ECCV, 2018. 2 [28] Natalia Neverova, David Novotny, Vasil Khalidov, Marc
[11] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Szafraniec, Patrick Labatut, and Andrea Vedaldi. Continu-
Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. ous surface embeddings. In NeurIPS, 2020. 3, 5, 6, 10
End-to-end learning of geometry and context for deep stereo [29] Natalia Neverova, Artsiom Sanakoyeu, Patrick Labatut,
regression. In ICCV, 2017. 5 David Novotny, and Andrea Vedaldi. Discovering rela-
[12] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Gir- tionships between object categories via universal canonical
shick. Pointrend: Image segmentation as rendering. In maps. In CVPR, 2021. 5, 9, 10
CVPR, 2020. 6 [30] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya
[13] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Harada. Neural articulated radiance field. In ICCV, 2021.
Black. Vibe: Video inference for human body pose and 2
shape estimation. In CVPR, June 2020. 2 [31] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien
[14] Filippos Kokkinos and Iasonas Kokkinos. To the point: Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo
Correspondence-driven monocular 3d category reconstruc- Martin-Brualla. Nerfies: Deformable neural radiance fields.
tion. In NeurIPS, 2021. 2 In ICCV, 2021. 2, 3, 6, 7, 8
[15] Chen Kong and Simon Lucey. Deep non-rigid structure from [32] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T.
motion. In ICCV, 2019. 2 Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-
[16] Nilesh Kulkarni, Abhinav Gupta, David F Fouhey, and Shub- Brualla, and Steven M. Seitz. Hypernerf: A higher-
ham Tulsiani. Articulation-aware canonical surface map- dimensional representation for topologically varying neural
ping. In CVPR, pages 452–461, 2020. 2, 3, 5 radiance fields. arXiv, 2021. 2
[17] Suryansh Kumar. Non-rigid structure from motion: Prior- [33] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani,
free factorization method revisited. In WACV, 2020. 2 Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and
[18] Xueting Li, Sifei Liu, Shalini De Mello, Kihwan Kim, Xi- Michael J. Black. Expressive body capture: 3d hands, face,
aolong Wang, Ming-Hsuan Yang, and Jan Kautz. Online and body from a single image. In CVPR, 2019. 2
14
[34] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan [51] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and
Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Ani- Victor Adrian Prisacariu. Nerf–: Neural radiance fields
matable neural radiance fields for modeling dynamic human without known camera parameters. In arXiv preprint
bodies. In ICCV, 2021. 2 arXiv:2102.07064, 2021. 2
[35] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, [52] Shangzhe Wu, Tomas Jakab, Christian Rupprecht, and An-
Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: drea Vedaldi. Dove: Learning deformable 3d objects by
Implicit neural representations with structured latent codes watching videos. arXiv preprint arXiv:2107.10844, 2021.
for novel view synthesis of dynamic humans. In CVPR, 2
2021. 2 [53] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular
[36] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and total capture: Posing face, body, and hands in the wild. In
Francesc Moreno-Noguer. D-NeRF: Neural Radiance Fields CVPR, 2019. 2
for Dynamic Scenes. In CVPR, 2020. 2, 8 [54] Yuanlu Xu, Song-Chun Zhu, and Tony Tung. Denserac: Joint
[37] Shunsuke Saito, Jinlong Yang, Qianli Ma, and Michael J. 3d pose and shape estimation by dense render-and-compare.
Black. SCANimate: Weakly supervised learning of skinned In ICCV, 2019. 6
clothed avatar networks. In CVPR, 2021. 4 [55] Gengshan Yang and Deva Ramanan. Volumetric correspon-
[38] Peter Sand and Seth Teller. Particle video: Long-range mo- dence networks for optical flow. In NeurIPS, 2019. 6, 7
tion estimation using point trajectories. In IJCV, 2008. 2 [56] Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic,
[39] Vikramjit Sidhu, Edgar Tretschk, Vladislav Golyanik, An- Forrester Cole, Huiwen Chang, Deva Ramanan, William T
tonio Agudo, and Christian Theobalt. Neural dense non- Freeman, and Ce Liu. LASR: Learning articulated shape
rigid structure from motion with latent space constraints. In reconstruction from a monocular video. In CVPR, 2021. 2,
ECCV, 2020. 2 3, 4
[40] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: [57] Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic,
Forcing a network to be meticulous for weakly-supervised Forrester Cole, Ce Liu, and Deva Ramanan. Viser: Video-
object and action localization. In ICCV, 2017. 10 specific surface embeddings for articulated 3d shape recon-
[41] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo struction. In NeurIPS, 2021. 2, 5, 6, 7
tourism: exploring photo collections in 3d. In SIGGRAPH.
[58] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Vol-
2006. 1
ume rendering of neural implicit surfaces. arXiv preprint
[42] Noah Snavely, Steven M Seitz, and Richard Szeliski. Mod-
arXiv:2106.12052, 2021. 3, 6
eling the world from internet photo collections. IJCV, 2008.
[59] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan
1
Atzmon, Basri Ronen, and Yaron Lipman. Multiview neu-
[43] Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge
ral surface reconstruction by disentangling geometry and ap-
Rhodin. A-nerf: Articulated neural radiance fields for learn-
pearance. In NeurIPS, 2020. 5, 7
ing human shape, appearance, and pose. In NeurIPS, 2021.
2 [60] Yufei Ye, Shubham Tulsiani, and Abhinav Gupta. Shelf-
supervised mesh prediction in the wild. In CVPR, 2021. 2
[44] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew Davi-
son. iMAP: Implicit mapping and positioning in real-time. [61] Jason Y. Zhang, Gengshan Yang, Shubham Tulsiani, and
In ICCV, 2021. 6 Deva Ramanan. NeRS: Neural reflectance surfaces for
[45] Narayanan Sundaram, Thomas Brox, and Kurt Keutzer. sparse-view 3d reconstruction in the wild. In NeurIPS, 2021.
Dense point trajectories by gpu-accelerated large displace- 6
ment optical flow. In ECCV, 2010. 2 [62] Silvia Zuffi, Angjoo Kanazawa, Tanya Berger-Wolf, and
[46] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Michael Black. Three-d safari: Learning to estimate zebra
Zollhöfer, Christoph Lassner, and Christian Theobalt. Non- pose, shape, and texture from images “in the wild”. In ICCV,
rigid neural radiance fields: Reconstruction and novel view 2019. 2
synthesis of a dynamic scene from monocular video. In [63] Silvia Zuffi, Angjoo Kanazawa, and Michael J. Black. Lions
ICCV, 2021. 2 and tigers and bears: Capturing non-rigid, 3D, articulated
[47] Shubham Tulsiani, Nilesh Kulkarni, and Abhinav Gupta. Im- shape from images. In CVPR, 2018. 2
plicit mesh reconstruction from unannotated image collec- [64] Silvia Zuffi, Angjoo Kanazawa, David Jacobs, and
tions. In arXiv, 2020. 2 Michael J. Black. 3D menagerie: Modeling the 3D shape
[48] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan and pose of animals. In CVPR, 2017. 2
Popović. Articulated mesh animation from multi-view sil-
houettes. TOG, 2008. 7
[49] Chaoyang Wang, Ben Eckart, Simon Lucey, and Orazio
Gallo. Neural trajectory fields for dynamic novel view syn-
thesis. arXiv preprint arXiv:2105.05994, 2021. 2
[50] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku
Komura, and Wenping Wang. Neus: Learning neural implicit
surfaces by volume rendering for multi-view reconstruction.
NeurIPS, 2021. 3
15

Banmo: Building Animatable 3D Neural Models From Many Casual Videos

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Banmo: Building Animatable 3D Neural Models From Many Casual Videos

Uploaded by

Copyright:

Available Formats

BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Color: Skinning weights Pose 1

Abstract can be self-supervised with cycle consistency. On real and

Forward Warp. (Sec. 3.2)

An image at time t Canonical Space

Xt = W t,→ (X∗ ) = Gt Jt,→ X∗ ,

described above, including MLPs, {MLPc , MLPSDF ,

Image Space at t1 Image Space at t2

reconstruction losses feature registration losses

Figure 3. Canonical Embeddings. We jointly optimize an im-

pixel feature embedding ψ tI (xt ) ∈ R16 . The pixel embed- X 2

where V∗ are sampled points in a canonical 3D grid, X 2

Active sampling over (x, y, t). Inspired by iMAP [44],

Table 1. Quantitative results on AMA and Animated Objects.

ment, and only highlight crucial aspects of BANMo below.

Reference Without root pose initialization (Sec. 3.4)

Figure 7. Diagnostics of registration (Sec. 3.3). Without canon-

B.1. Root Pose Initialization

Figure 11. Inference pipeline of PoseNet. To initialize the op-

Name Value Description

B.3. Optimization details

You might also like

Banmo: Building Animatable 3D Neural Models From Many Casual Videos

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Banmo: Building Animatable 3D Neural Models From Many Casual Videos

Uploaded by

Copyright:

Available Formats

BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Color: Skinning weights Pose 1

Abstract can be self-supervised with cycle consistency. On real and

Forward Warp. (Sec. 3.2)

An image at time t Canonical Space

Xt = W t,→ (X∗ ) = Gt Jt,→ X∗ ,

described above, including MLPs, {MLPc , MLPSDF ,

Image Space at t1 Image Space at t2

reconstruction losses feature registration losses

Figure 3. Canonical Embeddings. We jointly optimize an im-

pixel feature embedding ψ tI (xt ) ∈ R16 . The pixel embed- X 2

where V∗ are sampled points in a canonical 3D grid, X   2

Active sampling over (x, y, t). Inspired by iMAP [44],

Table 1. Quantitative results on AMA and Animated Objects.

ment, and only highlight crucial aspects of BANMo below.

Reference Without root pose initialization (Sec. 3.4)

Figure 7. Diagnostics of registration (Sec. 3.3). Without canon-

B.1. Root Pose Initialization

Figure 11. Inference pipeline of PoseNet. To initialize the op-

Name Value Description

B.3. Optimization details

You might also like

where V∗ are sampled points in a canonical 3D grid, X 2