
Anything-3D: Towards Single-view Anything Reconstruction in the Wild

Qiuhong Shen* Xingyi Yang* Xinchao Wang


National University of Singapore
{qiuhong.shen,xyang}@u.nus.edu, xinchao@nus.edu.sg

Figure 1. The Anything-3D framework proficiently recovers the 3D geometry and texture of any object from a single-view image captured in uncontrolled environments. Despite significant variations in camera perspective and object properties, our approach consistently delivers reliable recovery results. Example prompts shown: "a corgi is running with its tongue out", "a silver goddess with wings is over a car", "a brown bull is standing in a field with a blue rope".

Abstract

3D reconstruction from a single RGB image in unconstrained real-world scenarios presents numerous challenges due to the inherent diversity and complexity of objects and environments. In this paper, we introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model to elevate objects to 3D, yielding a reliable and versatile system for the single-view-conditioned 3D reconstruction task. Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field. Demonstrating its ability to produce accurate and detailed 3D reconstructions for a wide array of objects, Anything-3D† shows promise in addressing the limitations of existing methodologies. Through comprehensive experiments and evaluations on various datasets, we showcase the merits of our approach, underscoring its potential to contribute meaningfully to the field of 3D reconstruction. Demos and code will be available at https://github.com/Anything-of-anything/Anything-3D.

* Equal Contribution
† Work in progress

1. Introduction

The quest to reconstruct 3D objects from 2D images is central to the evolution of computer vision and has far-reaching consequences for robotics, autonomous driving, augmented and virtual reality, as well as 3D printing. In spite of considerable progress in recent years, the task of single-image object reconstruction in unstructured settings continues to be an intellectually engaging and daunting problem.
The task at hand entails generating a 3D representation of one or more objects from a single 2D image, potentially encompassing point clouds, meshes, or volumetric representations. However, the problem is fundamentally ill-posed: due to the intrinsic ambiguity that arises from 2D projection, it is impossible to unambiguously determine an object's 3D structure.

Reconstructing objects in the wild is primarily complicated by the vast diversity in shapes, sizes, textures, and appearances. Furthermore, objects in real-world images are frequently subject to partial occlusion, hindering the accurate reconstruction of obscured parts. Variables such as lighting and shadows can drastically affect object appearance, while differences in viewing angles and distances can lead to significant alterations in 2D projections.

In this paper, we present Anything-3D, a pioneering and systematic framework that fuses visual-language models with object segmentation to effortlessly transform objects from 2D to 3D, culminating in a powerful and adaptable system for single-view reconstruction tasks. Our framework overcomes these challenges by incorporating the Segment-Anything Model (SAM) alongside a series of visual-language foundation models to retrieve the 3D texture and geometry of a given image. Initially, a Bootstrapping Language-Image Pre-training (BLIP) model estimates a textual description of the image. Next, SAM identifies the object region of interest. The segmented object and its description are subsequently employed to execute the 3D reconstruction task. Our 3D reconstruction model leverages a pre-trained 2D text-to-image diffusion model to carry out image-to-3D synthesis. In particular, we utilize a text-to-image diffusion model that processes a 2D image and a textual description of an object, employing score distillation to train a neural radiance field specific to the image. Through rigorous experiments on diverse datasets, we showcase the effectiveness and adaptability of our approach, outstripping existing methods in terms of accuracy, robustness, and generalization capability. Moreover, we provide a thorough analysis of the challenges inherent in 3D object reconstruction in the wild and discuss how our framework addresses these obstacles.

By harmoniously melding the zero-shot vision and linguistic comprehension abilities of foundation models, our framework excels at reconstructing an extensive range of objects from real-world images, producing precise and intricate 3D representations applicable to a myriad of use cases. Ultimately, our proposed framework, Anything-3D, represents a groundbreaking stride in the domain of 3D object reconstruction from single images.

2. Related Works

Single-view 3D reconstruction. A 3D object has detailed geometry and texture and can be projected to a 2D image from any viewpoint. Dense multi-view images with camera poses, or depth information, are usually required to rebuild 3D objects. Although general 3D reconstruction has made significant strides in past decades, the reconstruction of 3D objects from single views remains under-explored. While single-view reconstruction has been attempted for specific object categories, such as faces, human bodies, and vehicles, these approaches rely on strong category-wise priors like meshes and CAD models. The reconstruction of arbitrary 3D objects from single-view images remains largely unexplored and requires expert knowledge, often taking several hours.

Reconstructing an arbitrary 3D object from a single-view image is inherently an ill-posed optimization problem, making it extremely challenging. However, recent work has begun incorporating general visual knowledge into 3D reconstruction models, enabling the generation of 3D models from single-view images. NeuralLift-360 [15] employs a pre-trained depth estimator and 2D diffusion priors to recover the coarse geometry and texture of a 3D object from a single image. 3DFuse [11] initializes the geometry with an estimated point cloud [6]; fine-grained geometry and texture are then learned from the single image with a 2D diffusion prior via score distillation [13]. Another work, MCC [14], learns from object-centric videos to generate 3D objects from a single RGB-D image. These works show promising results in this direction but still rely heavily on human-specified text or captured depth information.

Our work aims to reconstruct arbitrary objects of interest from single-perspective, in-the-wild images, which are ubiquitous and easily accessible in the real world. By tackling this challenging problem, we hope to unlock new potential for 3D object reconstruction from limited data sources.

Text-to-3D Generative Models. An alternative approach to generating 3D objects involves using text prompts as input. Groundbreaking works like DreamFusion [7] pioneered text-to-3D generative models using score distillation on pre-trained 2D generative models. This inspired a series of subsequent works, such as SJC [13], Magic3D [5], and Fantasia3D [1], which focus on generating voxel-based radiance fields or high-resolution 3D objects from text using various optimization techniques and neural rendering methods. Point-E [6] represents another approach, generating point clouds from either images or text.

In contrast to these text-based methods, our work targets generating 3D objects from single-view images. We strive to ensure that the generated 3D objects are as consistent as possible with the given single-view images, pushing the boundaries of 3D vision and machine learning research. Our approach demonstrates the potential for reconstructing complex, real-world objects from limited information, inspiring further exploration and innovation in the field.
3. Preliminary

3.1. Problem definition

Our problem refers to the task of automatically generating a 3D model of an object in a real-world image, using only a single 2D image as input and without any prior information about the object category.

Let I ∈ R^{H×W×3} be a real-world image represented as a matrix of RGB values, where H and W denote the height and width of the image, respectively. The goal is to generate a 3D model of the object in the image, represented by a NeRF function C(x; θ), where x ∈ R^5 encodes a point and its view direction in 3D space, and θ denotes the parameters of the neural network that predicts the RGB value c and volume density σ at the sampled location.

3.2. Challenges

The problem of single-view 3D reconstruction in the wild, for images of arbitrary classes, presents a thrilling but formidable challenge, with numerous complexities and constraints to overcome.

1. Arbitrary Class. Previous models for reconstructing 3D objects succeed in specific object categories but often fail when applied to new or unseen categories without a parametric form, highlighting the need for new methods that extract meaningful features and patterns from images without prior assumptions.

2. In the Wild. Real-world images present various challenges, such as occlusions, lighting variations, and complex object shapes, which can affect the accuracy of 3D reconstruction. Overcoming these difficulties requires networks to account for these variations while inferring the 3D structure of the scene.

3. No Supervision. The absence of a large-scale dataset with paired single-view images and 3D ground truth limits the training and evaluation of proposed models, hindering their ability to generalize to a wide range of object categories and poses. The lack of supervision also makes it difficult to validate the accuracy of the reconstructed 3D models.

4. Single-View. Inferring the correct 3D model from a single 2D image is inherently ill-posed, as there are infinitely many possible 3D models that could explain the same image. This leads to ambiguity in the problem and makes it challenging to achieve high accuracy in reconstructed 3D models, requiring networks to make assumptions and trade-offs to arrive at a plausible 3D structure.

3.3. What Do We Mean by "Anything"?

The term "anything" emphasizes the exceptional and ambitious nature of our approach. By using this term, we highlight the capability of our framework to generate 3D models from a single input image, irrespective of the object category or scene complexity. This aspect differentiates our approach from previous works that may be confined to certain object classes or restricted environments.

No prior work, to the best of our knowledge, has successfully tackled the task of Single-view Anything Reconstruction in the Wild, leaving the feasibility of a solution uncertain. In this paper, we boldly address this challenge by devising a unified framework that combines a series of foundation models, each boasting unparalleled generalization abilities. Our framework is designed to enable robust and accurate 3D object reconstruction from a single image, irrespective of object category or environmental complexity.

Our contribution specifically leverages state-of-the-art visual-language models to extract semantic information from the input image. This extracted information is then utilized to inform a high-precision object segmentation algorithm. Through the careful fusion of these components, our Single-view Anything Reconstruction in the Wild framework generates detailed 3D object reconstructions from a single input image, effectively overcoming challenges posed by complex and unstructured environments.

4. Methodology

In this section, we delve into the details of the methodology used in our proposed framework, Anything-3D. Our framework seamlessly integrates visual-language models and object segmentation techniques to address the challenging problem of single-image object reconstruction in unstructured settings. By leveraging the unique strengths of each component, we have developed a powerful approach that can effectively handle the vast variation of images encountered in the wild, across arbitrary categories. Our overall pipeline is shown in Figure 2.

4.1. Identify the Object with Segment-Anything

The first step in our framework is to employ the Segment-Anything Model (SAM) [3] to identify and extract the object region of interest in the input image. Specifically, SAM takes as input the image I and a prompt p specifying a point or bounding-box annotation, and produces an accurate segmentation mask M that masks out the background; an affine transformation is then applied to crop the local image patch I' with the bounding box b:

M = \mathrm{SAM}(I, p), \qquad I' = \mathrm{Affine}(M \odot I;\, b) \qquad (1)

where ⊙ denotes element-wise multiplication for image masking. SAM is a state-of-the-art object segmentation model that can handle a wide variety of object classes, addressing the challenge of object diversity in the wild. A minimal sketch of this step is given below.
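As a concrete illustration of Eq. (1), the Python sketch below uses the released segment_anything package to obtain a mask M from a point prompt and then crops the masked image to its bounding box. The checkpoint name, the single-point prompt, the black background, and the square resize are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM; the checkpoint path is an assumption (any released SAM checkpoint works).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def segment_and_crop(image_rgb: np.ndarray, point_xy, out_size: int = 512) -> np.ndarray:
    """Eq. (1): M = SAM(I, p);  I' = Affine(M ⊙ I; b).

    image_rgb: HxWx3 uint8 RGB array; point_xy: (x, y) prompt on the object.
    Returns the masked object, cropped to its bounding box b and resized.
    """
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point_xy]),
        point_labels=np.array([1]),        # 1 marks a foreground point
        multimask_output=True,
    )
    mask = masks[np.argmax(scores)]        # best-scoring mask M

    masked = image_rgb * mask[..., None]   # M ⊙ I: zero out the background
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1        # bounding box b of the mask
    x0, x1 = xs.min(), xs.max() + 1

    patch = masked[y0:y1, x0:x1]           # crop with b, then rescale (the affine step)
    return np.array(Image.fromarray(patch).resize((out_size, out_size)))
```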

Figure 2. Anything-3D combines visual-language models and object segmentation for efficient single-view 3D reconstruction. The framework employs BLIP for textual description generation and SAM for object segmentation, followed by 3D reconstruction using a pre-trained 2D text-to-image diffusion model. This model processes the 2D image and textual description, utilizing score distillation to train a neural radiance field specific to the image for image-to-3D synthesis. The diagram additionally shows Point-E producing a coarse point cloud that, together with the camera pose π, supplies the geometry constraint for Stable Diffusion during score distillation.
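
To make the Figure 2 pipeline concrete, the short sketch below chains the stages described in Secs. 4.1–4.3. Every callable here is a stand-in for the corresponding component, not the authors' released code.

```python
def anything_3d(image_rgb, point_prompt, segment, caption, invert_prompt,
                coarse_points, distill_nerf):
    """High-level flow of Figure 2; each argument is one pipeline component."""
    patch = segment(image_rgb, point_prompt)       # Sec. 4.1: SAM mask + crop (Eq. 1)
    text = caption(patch)                          # Sec. 4.2: BLIP description (Eq. 2)
    embedding = invert_prompt(patch, text)         # Sec. 4.2: textual inversion (Eq. 3)
    points = coarse_points(patch)                  # Sec. 4.3: Point-E coarse point cloud
    nerf = distill_nerf(patch, embedding, points)  # Sec. 4.3: score distillation (Eq. 6)
    return nerf
```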

4.2. Image-to-Text Semantic Conversion

Following the region-of-interest extraction, we employ a Bootstrapping Language-Image Pre-training (BLIP) model [4] to acquire a comprehensive textual description of the input image. The BLIP model, pre-trained on vast quantities of image-text pairs, develops a profound understanding of object properties, enabling it to provide a reliable description of the target object. Given the image patch I', the BLIP model generates a textual description T:

T = \mathrm{BLIP}(I') \qquad (2)

Although the extracted T serves as an initial approximation of the image semantics, it may not capture the precise object information. Consequently, we use T as the starting point and apply textual inversion [2] to estimate an exact soft prompt embedding. Specifically, the initial prompt embedding e_0 = EMBEDDING(T) is optimized by matching the score function predicted by the text-to-image diffusion model:

e^{*} = \arg\min_{e} \left\| \epsilon_{\phi}(I_{t}, e) - \epsilon \right\|_{2}^{2} \qquad (3)

Here, ε_φ represents the pre-trained diffusion model [9], and I_t denotes the noisy image at step t:

I_{t} = \sqrt{\bar{\alpha}_{t}}\, I' + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I}) \qquad (4)

The resulting textual embedding e* serves as invaluable auxiliary information for the subsequent 3D reconstruction stage, harnessing the text-image consistency inherent in diffusion models to facilitate a more accurate and comprehensive understanding of the object. A sketch of this optimization follows.
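The PyTorch-style sketch below illustrates one way to implement the prompt optimization of Eqs. (3)–(4): only the soft embedding e is updated while the diffusion model stays frozen. The unet(x_t, t, e) handle, the use of latent-space inputs for I', and the optimizer settings are assumptions for illustration, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def invert_prompt(latents0, e_init, unet, alphas_cumprod, steps=1000, lr=1e-3):
    """Eqs. (3)-(4): optimize a soft prompt embedding e* against a frozen diffusion model.

    latents0        : clean (latent) encoding of the image patch I'
    e_init          : initial embedding e_0 = EMBEDDING(T) from the BLIP caption
    unet(x_t, t, e) : frozen noise predictor epsilon_phi (placeholder signature)
    alphas_cumprod  : 1-D tensor of cumulative alpha-bar values of the noise schedule
    """
    e = e_init.clone().requires_grad_(True)   # only the embedding is trained
    opt = torch.optim.Adam([e], lr=lr)

    for _ in range(steps):
        t = torch.randint(0, len(alphas_cumprod), (1,), device=latents0.device)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(latents0)
        # Eq. (4): forward-noise the clean latents to step t
        x_t = a_bar.sqrt() * latents0 + (1.0 - a_bar).sqrt() * eps
        # Eq. (3): match the noise predicted under embedding e to the true noise
        loss = F.mse_loss(unet(x_t, t, e), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e.detach()                          # e*
```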

4.3. 3D Reconstruction through a Diffusion Prior

With the segmented object and its textual embedding as input, we proceed to the 3D reconstruction task through a coarse-to-fine strategy, as described in 3DFuse [11]. Our 3D reconstruction model leverages an image-to-point-cloud model and a depth-aware, pre-trained 2D text-to-image diffusion model to perform image-to-3D synthesis.

Coarse Geometric Recovery. Upon receiving the segmented image I', we exploit the Point-E model [6] to reveal a preliminary estimate of the object's structure. Employing this 3D prior, we generate a sparse point cloud, which is subsequently projected to a sparse depth map P according to the camera pose π:

P = \mathcal{P}(\mathrm{Point\text{-}E}(I'), \pi) \qquad (5)

Here, 𝒫(·) denotes a depth-projection function. This process yields a coarse and sparse depth map, which serves as a geometric constraint, providing crucial guidance for object recovery and laying the foundation for subsequent refinement.

Refined 3D Reconstruction. Armed with the coarse geometry constraint P, the image I', and the text embedding e*, we employ an off-the-shelf text-to-image diffusion model ε_φ, namely Stable Diffusion [10], as a prior to synthesize images from alternative camera views, thereby facilitating the estimation of 3D texture from a single image.

In particular, we employ the sparse depth encoder E_ρ(·) [11], which ingests the sparse depth map P to control the pose of the generated images. The output features E_ρ(P) are then injected into the intermediate features of the diffusion U-Net architecture with ControlNet [16]. Consequently, the depth map operates as a guiding signal, ensuring that the synthesized images adhere to a consistent geometric perspective.

Afterward, we apply score distillation [7, 13] to optimize the NeRF. In this technique, the parameters of the NeRF are updated to follow the score function predicted by the diffusion model. To be more specific, we denote θ as the parameters of the NeRF and R_θ(π) as the rendering function that takes a camera pose π and produces a rendered image. We use the diffusion model to infer the 2D score of the rendered image, which is then used to optimize θ. By disregarding negligible terms in the gradient, we obtain the score distillation gradient with respect to the NeRF parameters:

\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}(\theta, x = R_{\theta}(\pi)) = \mathbb{E}_{t, \epsilon}\!\left[ \tilde{w}(t) \left( \epsilon_{\phi}(x_{t}, E_{\rho}(P), e^{*}) - \epsilon \right) \frac{\partial x}{\partial \theta} \right] \qquad (6)

where ε_φ(x_t, E_ρ(P), e*) represents the diffusion model, which accepts the coarse geometry P, the prompt embedding e*, and the noisy rendered image x_t as inputs, and estimates the score. x_t is computed as in Eq. (4). For an in-depth derivation, please refer to DreamFusion [7] and SJC [13]. The optimized NeRF model ultimately yields a high-quality 3D reconstruction of the object, derived from a single-view image. A hedged sketch of one score-distillation update is given below.
5. Experiment

As a preliminary study, we examine the feasibility of our solution through a series of qualitative visualizations.

Implementation Details. In practice, we employ a depth encoder pre-trained on the Co3D [8] dataset within the 3DFuse [11] framework. We choose a voxel-based neural radiance field [12] to optimize the geometry and texture of 3D objects efficiently. BLIP-2 [4] is employed to generate the text prompt, taking the masked-out image patch as input, and a pre-trained Point-E [6] model is adopted to initialize a coarse 3D model for generating the sparse depth map. We then perform evaluations on selected images from the SA-1B [3] dataset, which comprises 11 million images collected from uncontrolled environments.

Results. The segmentation results and generated 3D objects are presented in Fig. 1, Fig. 3, and Fig. 4. We demonstrate the robustness of our framework by reconstructing objects under various challenging conditions, such as occlusion, varying lighting, and different viewpoints. For instance, we successfully reconstruct irregularly shaped objects like the crane and the cannon. Our model also exhibits competence in handling small objects within cluttered environments, such as the rubber duck and the piggy bank.

These results suggest that our method surpasses existing approaches in terms of accuracy and generalizability, highlighting the effectiveness of our framework in reconstructing 3D objects from single-view images under diverse conditions.

Figure 3. Generated 3D objects of a convertible car, a crane, and a rubber duck, visualized from five viewpoints. Corresponding prompts: "a white porsche convertible car with black detailing", "an orange crane is attached to tracer and digger", "a yellow rubber duck with green hat holding a box of present".

Figure 4. Generated 3D objects of a cannon, a piggy bank, and a stool, visualized from five viewpoints. Corresponding prompts: "an old and faded cannon with two wheels", "an yellow pick bank with 'URLOUB' written on it", "a red stool with its seat up".

6. Conclusion

In this paper, we introduce a novel task aimed at reconstructing arbitrary 3D objects from single-view, in-the-wild images. We commence by delineating the primary challenges that accompany this task. To surmount these challenges, we devise a versatile framework named Anything-3D. Integrating the zero-shot abilities of a series of vision and multimodal foundation models, our framework interactively identifies regions of interest within single-view images and represents 2D objects with optimized text embeddings. Ultimately, we efficiently generate high-quality 3D objects using a 3D-aware score distillation model. In summary, Anything-3D showcases the potential for reconstructing in-the-wild 3D objects from single-view images. We hope our work serves as a solid step towards establishing an agile paradigm for 3D content creation.

7. Limitations and Future Work

This preliminary technical report on Single-view Anything Reconstruction in the Wild presents the initial findings and insights of our approach. The reconstruction quality is not yet perfect, and we are continuously working to enhance it. We have not provided quantitative evaluations on 3D datasets, such as novel-view synthesis and reconstruction error, but we plan to incorporate these assessments in future iterations of our work.

Furthermore, we aim to expand this framework to accommodate more practical setups, including sparse-view object recovery. By addressing these limitations and broadening the scope of our approach, we aspire to refine and augment the performance of our Anything-3D framework.
References

[1] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023.
[2] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[3] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[4] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[5] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
[6] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
[7] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[8] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.
[9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[10] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[11] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937, 2023.
[12] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
[13] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. arXiv preprint arXiv:2212.00774, 2022.
[14] Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3d reconstruction. arXiv preprint arXiv:2301.08247, 2023.
[15] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360 views. arXiv preprint arXiv:2211.16431, 2022.
[16] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.