
Generative Novel View Synthesis with 3D-Aware Diffusion Models

Eric R. Chan*†1,2, Koki Nagano*2, Matthew A. Chan*2, Alexander W. Bergman*1, Jeong Joon Park*1,
Axel Levy1, Miika Aittala2, Shalini De Mello2, Tero Karras2, and Gordon Wetzstein1

1 Stanford University    2 NVIDIA

Abstract
We present a diffusion-based model for 3D-aware gen-
erative novel view synthesis from as few as a single input
image. Our model samples from the distribution of possible
renderings consistent with the input and, even in the presence
of ambiguity, is capable of rendering diverse and plausible
novel views. To achieve this, our method makes use of existing
2D diffusion backbones but, crucially, incorporates geom-
etry priors in the form of a 3D feature volume. This latent
feature field captures the distribution over possible scene rep-
resentations and improves our method’s ability to generate
view-consistent novel renderings. In addition to generating
novel views, our method has the ability to autoregressively
synthesize 3D-consistent sequences. We demonstrate state-of-
the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects.

Figure 1. Our 3D-aware diffusion model synthesizes realistic novel views from as little as a single input image. These results are generated with the ShapeNet [10], Matterport3D [9], and Common Objects in 3D [50] datasets.

1. Introduction
In this work, we challenge ourselves to address multiple
open problems in novel view synthesis (NVS): to design
an NVS framework that (1) operates from as little as a
single image, (2) is capable of generating long-range
sequences far from the input views, and (3) handles
both individual objects and complex scenes (see Fig. 1).
While existing few-shot NVS approaches, trained on a category of objects with a regression objective, can generate geometrically consistent renderings, i.e., sequences whose frames share a coherent scene structure, they are ineffective in handling extrapolation and unbounded scenes (see Fig. 2). Dealing with long-range extrapolation (2) requires using a generative prior to deal with the innate ambiguity that comes with completing portions of the scenes that were unobserved in the input. In this work, we propose a diffusion-based few-shot NVS framework that can generate plausible and competitively geometrically consistent renderings, pushing the boundaries of NVS towards a solution that can operate in a wide range of challenging real-world data.

Figure 2. While regression-based models are capable of effective view synthesis near input views (top row), they blur across ambiguity when extrapolating. Generative approaches can continue to sample plausible renderings far from input views (second row, third column).

Previous approaches to few-shot novel view synthesis can broadly be grouped into two categories. Geometry-prior-based methods [53, 52, 41, 37, 42, 3, 89] have drawn from work on scene representations and neural rendering [78].

* Equal contribution.
† Work was done during an internship at NVIDIA.

While they achieve impressive results on interpolating near input views, most methods are trained purely with regression objectives and struggle in dealing with ambiguity or longer-range extrapolations. When challenged with the task of novel view synthesis from sparse inputs, they can only tackle mildly ambiguous cases, i.e., cases where the conditional distribution of novel renderings is well approximated by the mean estimator of this distribution — obtained by minimizing a pixel-wise L1 or L2 loss [90, 70, 89]. However, in highly ambiguous cases, for example when parts of the scene are occluded in all the given views, the conditional distribution of novel renderings becomes multi-modal and the mean estimator produces blurry novel views (see Fig. 2). Because of these limitations, regression-based approaches are limited to short-range view interpolation of object-centric scenes and struggle in long-range extrapolation of unconstrained scenes.

In contrast, generative approaches rely on generative priors and solve the novel view synthesis problem by generating random plausible samples from this conditional distribution. Existing generative models for view synthesis [56, 84, 51, 38] autoregressively extrapolate one or a few input images with few or no geometry priors. For this reason, most of these methods struggle with generating geometrically consistent sequences — renderings are only approximately consistent between frames and lack a coherent rigid scene structure. In this work, we present an NVS method that bridges the gap between geometry-based and generative view synthesis approaches for both geometrically consistent and generative rendering.

Our method leverages recent developments in diffusion models. Specifically, conditional diffusion models [59, 57, 49, 55, 58] can be directly applied to the task of NVS. Conditioned on input images, these models can sample from the conditional distribution of output renderings. As a generative model, they naturally handle ambiguity and lend themselves to continued autoregressive extrapolation of plausible outputs. However, as we show in Sec. 4 (Tab. 1), an image diffusion framework alone struggles to synthesize 3D-consistent views.

Geometry priors remain valuable for ensuring view consistency when operating on complex scenes, and pixel-aligned features [60, 89, 81] have been shown to be successful for conditioning scene representations on images. We incorporate these ideas into the architecture of our diffusion-based NVS model with the inclusion of a latent 3D feature field and neural feature rendering [44]. Unlike previous view synthesis works that include neural fields, however, our latent feature field captures a distribution of scene representations rather than the representation of a specific scene. A rendering from this latent field is distilled into the rendering of a particular scene realization through diffusion sampling at inference. This novel formulation is able to both handle ambiguity resulting from long-range extrapolation and generate geometrically consistent sequences.

In summary, the contributions of our work include:

• We present a novel view synthesis method that extends 2D diffusion models to be 3D-aware by conditioning them on 3D neural features extracted from input image(s).

• We demonstrate that our 3D feature-conditioned diffusion model can generate realistic novel views given as little as a single input image on a wide variety of datasets, including object-level, room-level, and complex real-world scenes.

• We show that with our proposed method and sampling strategy, our method can generate long trajectories of realistic, multi-view consistent novel views without suffering from the blurring of regression models or the drift of pure generative models.

We will make the code and pre-trained models available.

2. Related work

Focusing on novel view synthesis (NVS) from as little as a single image, our work touches on several areas at the intersection of 3D reconstruction, NVS, and generative models.

Geometry-based novel view synthesis. A large body of prior works for NVS recovers the 3D structure of a scene by estimating the input images' camera parameters [72, 63] and running multi-view stereo (MVS) [1, 20]. The recovered explicit geometry proxies enable NVS but fail to synthesize photorealistic and complete novel views, especially for occluded regions. Some recent methods [52, 53] combine 3D geometry from an MVS pipeline with deep learning–based NVS, but the overall quality may suffer if the MVS pipeline fails. Other explicit geometric representations, such as depth maps [19, 80], multi-plane images [18, 94], or voxels [69, 39], are also used by many recent NVS approaches, as surveyed by Tewari et al. [78].

Regression-based novel view synthesis. Many deep learning–based approaches to NVS are supervised to predict training views with regression. These works often employ 3D representations for scenes and differentiable neural rendering [70, 41]. While many methods are optimized on a per-scene basis with dense input views [41], few-shot NVS approaches are designed to generalize across a class of 3D scenes, enabling them to make predictions from one or a few input images at inference. Among few-shot NVS methods, some rely on test-time optimization [70, 29] or meta-learning [67, 77], while others lift input observations via encoders [80, 45, 94, 89, 79, 11, 81] and predict novel views in a feed-forward fashion. A recent trend has some NVS methods forgoing geometry priors for light fields [68] or transformers [61, 34], but these geometry-free methods are otherwise trained similarly to other regression-based NVS algorithms.

Generative models for novel view synthesis. A separate line of work studies methods for long-range view extrapolation. Because venturing far beyond the observed views requires generating parts of the scene, these methods are typically grounded in generative models. A common thread

amongst these methods is that they often contain only weak geometry priors, e.g., sparse feature point clouds [84, 54, 33], or lack geometry priors altogether [56, 51]. As image-translation-based generative models, they are capable of conditioning on their own previous generations to autoregressively synthesize long camera trajectories, sometimes infinitely [38, 36]. Because the focus is on extrapolating at large scales, these methods ordinarily achieve only approximate view consistency at longer ranges.

3D GANs. 3D GANs [43, 66, 7, 6, 22, 46, 87, 93, 71, 88, 92, 13, 85, 4] combine an adversarial [21] training strategy with implicit neural scene representations to learn generative models for 3D objects. While typically tasked with unconditional synthesis of 3D objects, a trained 3D GAN contains a strong prior for 3D shapes and can be inverted for NVS of detailed scenes [7, 6]. 3D GANs have been extensively developed to achieve compositionality [44], higher rendering resolution [6, 22, 71], video generation [2], and scalability to larger scenes [14]. GANs, however, are notoriously difficult to train, and their 3D inversions from an input image are often brittle without additional 3D priors [86] or an accurate camera input [32]. Moreover, most 3D GANs assume canonical camera poses and limit their optimal operating ranges to single objects.

2D diffusion models. 2D diffusion models [25, 73, 75, 30] have transformed image synthesis. Favorable properties such as mode coverage and a stable training objective have enabled them to outperform [15] previous generative models [21] on unconditional generation. Diffusion models have also been shown to be excellent at modeling conditional distributions of images, where the conditioning information may be a class label [76, 15], text [49, 55, 58], or another image [26, 59, 57, 8].
Recent 3D diffusion works. Recently, DreamFusion [48] and 3DiM [83] apply 2D image diffusion models to build 3D generative models. DreamFusion performs text-guided 3D generation by optimizing a NeRF from scratch. 3DiM performs novel view synthesis conditioned on input images and poses (similar to [51]) and does not employ any explicit geometry priors; it aggregates multiple observations at inference using a unique stochastic conditioning scheme. By contrast, the geometry priors present in our approach enable 3D consistency with a much lighter-weight model (90M for ours vs. 471M or 1.3B for 3DiM [83]), and because our model naturally handles multiple input views, we have the flexibility to choose efficient sampling schemes at inference. While code for 3DiM is unavailable, we compare to a similar geometry-free variant in Sec. 4 (Tab. 1) and to stochastic view conditioning in the supplement.

3. Method

Here we describe the architecture of our NVS model for both single and multiple-view conditioning, and we explain our training and inference methods.

Figure 3. Illustration of our framework D. The pipeline receives as input one or more input views x and the camera parameters associated with input and target views. We extract features from each input view x using T and unproject them into a feature volume W. These volumes are aggregated using a mean-pooling operation, decoded by a small MLP f, and a feature image F is created by projecting into the target view x^target using volume rendering. The U-Net denoiser U then takes in the resulting feature image F as well as a noisy image of the target view y and noise level σ, and produces a denoised image of the target view x^target.

In novel view synthesis, we are given a set of input images x^inputs and camera parameters P^inputs with associated pose and intrinsics, and are tasked with making a prediction for a query view given a set of query camera parameters. Our goal is to sample novel views from the corresponding conditional distribution:

    p(x^target | x^inputs, P^inputs, P^target).    (1)

3.1. 3D-aware diffusion model architecture

Diffusion models rely on a denoiser trained to predict E_{p(x|y)}[x] given y, a noisy version of x with noise standard deviation σ. An image is generated by drawing y_0 ~ N(0, σ_max² I) and iteratively denoising it according to a sequence of noise levels σ_0 = σ_max > ... > σ_N = 0.
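As a concrete illustration of such a noise-level sequence, the sketch below builds a geometrically warped schedule in the style of EDM [30]; the specific values of σ_min, σ_max, and the warping exponent ρ are illustrative assumptions rather than the exact settings of our model.

```python
import numpy as np

def noise_schedule(num_steps=25, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM-style noise levels sigma_0 = sigma_max > ... > sigma_N = 0.

    sigma_min, sigma_max, and rho are assumed values for illustration;
    see Karras et al. [30] for the original formulation.
    """
    i = np.arange(num_steps)
    sigmas = (sigma_max ** (1 / rho)
              + i / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)  # append sigma_N = 0
```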
In our work, we directly repurpose 2D diffusion models to model the distribution in Eq. 1. The intuition is that generative novel view synthesis is identical to any other conditional image generation task — all we need to do is condition a 2D image diffusion model on the input image and the relative camera pose. However, while there are many ways of applying this conditioning, some may be more effective than others (see Tab. 1 and ablation studies of different options in Sec. 4.4). By incorporating geometry priors in the form of a 3D feature field and neural rendering, we give our architecture a strong inductive bias towards geometrical consistency.

Fig. 3 summarizes the design of our conditional-denoiser-based pipeline D that takes as inputs a noisy target view y, conditioning information (x^inputs, P^inputs, P^target), and a noise level σ. Our strategy builds upon pixel-aligned implicit functions [60, 89] and neural rendering. Following Fig. 3, given a single input image x taken from an input view camera P,

we use an image-to-image translation network T to predict a feature image with c×d channels and reshape it into a feature volume W that spans the source camera frustum. d then corresponds to the depth dimension of the volume and c to the number of channels in each cell of the volume (typically, c = 16 and d = 64). Given a query camera P^target, we cast rays in 3D space. Continuing on Fig. 3, for any point r along a ray, we sample the volume W with trilinear interpolation and decode the obtained feature w = W(r) with a small multi-layer perceptron (MLP) f to obtain a density τ and a feature vector c:

    (τ, c) = f(w).    (2)

By projecting this feature field into the target view using volume rendering [40, 41], we obtain a feature image F in Fig. 3:

    F(x, P, P^target) = RENDER(f ∘ T(x), P, P^target).    (3)

In practice, we employ the image segmentation architecture DeepLabV3+ [12, 28] for T, and implement f as a two-layer ReLU MLP with 64 channels. We perform volume rendering over features in the same way as NeRF [41]. We use an input/output image resolution of 128² in all experiments.

The feature image F is concatenated to the noisy image y and passed as input to a denoiser network U to produce the final target view x^target (see Fig. 3). We use DDPM++ [76, 30] for U, where

    D(y; x^inputs, P^inputs, P^target, σ) = U(y, F; σ).    (4)

Fig. 3 and Eq. 4 summarize the design of D. The total number of trainable parameters in D is 90M.
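To make the data flow concrete, the sketch below shows one way the feature rendering of Eqs. 2–4 could be wired together in PyTorch. The tensor shapes, the depth-sampling range, the helper frustum_coords that maps world points into the input camera's normalized frustum, and the module interfaces are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as nnF

C, D_DEPTH = 16, 64  # channels per cell and depth resolution of the volume W

def render_feature_image(T, f, x_input, rays_o, rays_d, frustum_coords, n_samples=64):
    """Volume-render a feature image F from a single input view (Eqs. 2-3).

    T: image-to-image network, x_input (B,3,H,W) -> (B, C*D_DEPTH, H, W).
    f: small MLP mapping a C-dim feature to (density tau, C-dim feature c).
    frustum_coords: assumed helper mapping world points (B,N,S,3) into
        normalized [-1,1]^3 coordinates of the input camera frustum.
    rays_o, rays_d: (B, N, 3) ray origins/directions for the target camera.
    """
    B, N, _ = rays_o.shape
    W_vol = T(x_input).view(B, C, D_DEPTH, *x_input.shape[-2:])       # (B,C,D,H,W)

    t = torch.linspace(0.1, 6.0, n_samples, device=rays_o.device)     # assumed depth range
    pts = rays_o[:, :, None] + rays_d[:, :, None] * t[None, None, :, None]   # (B,N,S,3)

    grid = frustum_coords(pts).view(B, N, n_samples, 1, 3)            # normalized coords
    w = nnF.grid_sample(W_vol, grid, align_corners=True)              # (B,C,N,S,1) trilinear
    w = w.squeeze(-1).permute(0, 2, 3, 1)                             # (B,N,S,C)

    tau, c = f(w).split([1, C], dim=-1)                               # Eq. 2: (tau, c) = f(w)

    # NeRF-style alpha compositing of features along each ray.
    delta = torch.cat([t[1:] - t[:-1], t.new_full((1,), 1e10)])
    alpha = 1.0 - torch.exp(-nnF.softplus(tau.squeeze(-1)) * delta)   # (B,N,S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans
    F_img = (weights[..., None] * c).sum(dim=-2)                      # (B,N,C) feature image F
    # In practice N = H*W target pixels; F_img is reshaped to (B, C, H, W) before Eq. 4.
    return F_img

def denoise(U, y_noisy, F_img, sigma):
    """Eq. 4: the U-Net denoiser sees the noisy target concatenated with F (B,C,H,W)."""
    return U(torch.cat([y_noisy, F_img], dim=1), sigma)
```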
3.2. Incorporating multiple views

The previous section describes our approach to conditioning on a single input view. However, additional information in the form of multiple input views reduces uncertainty and enables our model to sample renderings from a narrower distribution. When multiple conditioning views are available, we process each input image independently into a separate feature volume.

Eq. 2 can be generalized to n conditioning views by averaging the features w_j = W_j(r) obtained for each input image x_j, as in [89]:

    (τ, c) = f( (1/n) Σ_{j=1}^{n} w_j ).    (5)

To leverage this strategy during inference, we train our model by conditioning with multiple (variable) input images. Conditioning using multiple input images helps to ensure smooth, loop-consistent video synthesis. While conditioning on only the previous frame is sufficient for view consistency under a small view change, it does not guarantee loop closure. In practice, we find that conditioning on a subset of previous views helps to enforce correct loop closure while maintaining reasonable view-to-view consistency.
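A minimal sketch of this pooling, reusing the single-view routine above: the per-view volumes are sampled independently at the same ray points and their features are averaged before the MLP decode, which keeps the decoder agnostic to the number of conditioning views. Function and variable names are assumptions.

```python
import torch

def decode_multiview(f, per_view_features):
    """Eq. 5: average pixel-aligned features w_j across n conditioning views.

    per_view_features: list of n tensors of shape (B, N, S, C), each obtained by
    sampling the feature volume W_j of input view j at the same ray points r.
    """
    w_mean = torch.stack(per_view_features, dim=0).mean(dim=0)   # (B, N, S, C)
    tau, c = f(w_mean).split([1, w_mean.shape[-1]], dim=-1)      # decode pooled feature
    return tau, c
```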
3.3. Training

At each iteration during training, we sample a batch of target images, input images, and their associated camera poses, where the targets and inputs are constrained to be from the same scene. Our model is trained end-to-end from scratch to minimize the following objective:

    L := E_{(x^target, x^inputs, P^target, P^inputs) ~ p_data} E_{ε ~ N(0, σ²I)} [ || D(x^target + ε; x^inputs, P^inputs, P^target, σ) − x^target ||₂² ],    (6)

where σ is sampled during training according to the strategy proposed by EDM [30]. The number of conditioning views for a query is drawn uniformly from {1, 2, 3} at every iteration. During training, we apply non-leaking augmentation [30] to U and augment input images with small amounts of random noise. Please see the supplement for hyperparameters and additional training details.
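A sketch of one training step implementing Eq. 6 is shown below, assuming a denoiser D as in the sketches above. The log-normal σ sampling follows the EDM recipe [30], but the distribution parameters, the batch layout, and the omission of EDM's loss weighting and preconditioning are simplifying assumptions.

```python
import torch

def training_step(D, batch, optimizer, p_mean=-1.2, p_std=1.2):
    """One optimization step of the denoising objective in Eq. 6 (simplified).

    batch: dict with 'x_target' (B,3,H,W), 'x_inputs', 'P_inputs', 'P_target';
    the number of conditioning views in 'x_inputs' is drawn from {1, 2, 3} by
    the data loader. p_mean/p_std parameterize EDM-style log-normal sigmas
    (illustrative values).
    """
    x_target = batch["x_target"]
    sigma = torch.exp(p_mean + p_std * torch.randn(x_target.shape[0], 1, 1, 1,
                                                   device=x_target.device))
    eps = torch.randn_like(x_target) * sigma                    # eps ~ N(0, sigma^2 I)

    denoised = D(x_target + eps, batch["x_inputs"],
                 batch["P_inputs"], batch["P_target"], sigma)

    loss = ((denoised - x_target) ** 2).mean()                  # ||D(...) - x_target||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```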
3.4. Generating novel views at inference

Sampling a novel view with our method is identical to sampling an image with a conditional diffusion model. The specific update rule for the denoised image is determined by the choice of sampler. In our experiments, we use a deterministic 2nd-order sampling strategy [30], with 25 or fewer denoising steps. Other sampling strategies [76, 74] can be dropped in if other properties (e.g., stochastic sampling) are desired.

In order to improve efficiency at inference, we decouple the feature rendering branch Γ from the denoiser U. Rather than running both Γ and U at every step during sampling, we first render the feature image F as a preprocessing step and reuse it for each iteration of the sampling loop; while U must run at every step during inference, Γ is run only once.

Alternative "one-step" inference. An alternative to generating an image with iterative denoising is to produce the image with a single step of denoising. Intuitively, the one-step prediction of a model trained with Eq. 6 should behave identically to the prediction of a model trained to minimize pixel-wise MSE. Thus, this alternative inference mode is representative of regression-based methods. A model trained as described is capable of both generative sampling and deterministic one-step inference: no architecture or training modifications are required.
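To illustrate how the precomputed feature image is reused, the following sketch runs the deterministic 2nd-order (Heun) sampler of EDM [30] around the denoiser, rendering F only once; the interface of D, the schedule values, and the one-step shortcut in the trailing comment are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_novel_view(D, render_F, x_inputs, P_inputs, P_target, sigmas, shape):
    """Deterministic 2nd-order (Heun) sampling, reusing one feature image F.

    render_F: the feature-rendering branch (T, f, volume rendering); run once.
    sigmas: decreasing noise levels sigma_0 > ... > sigma_N = 0 (see schedule sketch).
    """
    F_img = render_F(x_inputs, P_inputs, P_target)           # the rendering branch runs only once
    x = torch.randn(shape) * sigmas[0]                       # y_0 ~ N(0, sigma_max^2 I)

    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        d = (x - D(x, F_img, s_cur)) / s_cur                  # denoising direction at s_cur
        x_next = x + (s_next - s_cur) * d                     # Euler step
        if s_next > 0:                                        # 2nd-order (Heun) correction
            d_next = (x_next - D(x_next, F_img, s_next)) / s_next
            x_next = x + (s_next - s_cur) * 0.5 * (d + d_next)
        x = x_next
    return x

# "One-step" regression-like inference (Sec. 3.4): a single denoiser evaluation at
# the largest noise level approximates the conditional mean, e.g.,
#   x_one_step = D(torch.randn(shape) * sigmas[0], F_img, sigmas[0])
```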

3.5. Autoregressive generation

In order to generate consistent sequences, we take an autoregressive approach to synthesizing sequential frames. Instead of independently generating each frame conditioned only on the input images, which would lead to large deviations between frames, we generate each frame conditioned on the inputs as well as a subset of previously generated frames. While there are many possible ways of selecting conditioning views, a reasonable setting that we use in our experiments is to condition on the input image(s), the most recently generated image, and five additional images drawn at random from the set of previously generated frames. We found this default conditioning setting to be a good starting point that balances short-range, frame-to-frame consistency, long-range consistency across the scene, and compute cost, but other variants may be preferred to emphasize specific qualities.

While one might expect errors and artifacts to accumulate throughout long autoregressive sequences, in practice we find that our model effectively suppresses such errors, making it suitable for extended sequence generation. Please see the supplement for alternative autoregressive schemes.
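A sketch of this conditioning-view selection and the resulting autoregressive loop is given below; the sampler from the previous sketch is reused, the choice of five random previous frames mirrors the default described above, and the bookkeeping details are assumptions.

```python
import random

def select_conditioning(inputs, generated, n_random=5):
    """Pick conditioning views: the input image(s), the most recently
    generated frame, and up to five random earlier generations (Sec. 3.5).

    inputs / generated: lists of (image, pose) pairs.
    """
    selected = list(inputs)
    if generated:
        selected.append(generated[-1])                        # most recent frame
        earlier = generated[:-1]
        selected += random.sample(earlier, min(n_random, len(earlier)))
    return selected

def generate_sequence(sample_fn, inputs, target_poses):
    """Autoregressively synthesize one frame per target pose."""
    generated = []
    for P_target in target_poses:
        cond = select_conditioning(inputs, generated)
        images = [im for im, _ in cond]
        poses = [P for _, P in cond]
        frame = sample_fn(images, poses, P_target)            # e.g., sample_novel_view above
        generated.append((frame, P_target))
    return [im for im, _ in generated]
```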

4. Experiments

We evaluate the performance of our generative NVS method on ShapeNet [10] “cars” and Matterport3D [9], two starkly different datasets. ShapeNet is representative of synthetic, object-centric datasets that have long been dominated by regression-based approaches to NVS (e.g., [89, 68]). Meanwhile, long-range NVS on Matterport3D is prototypical of unbounded scene exploration, where generative models with weak geometry priors [84, 56, 51] have seen more success. Finally, we stress-test our method on the challenging Common Objects in 3D (CO3D) [50], an unconstrained real-world dataset — to our knowledge, our work is the first to attempt single-shot NVS on this dataset while including its complex backgrounds. Our method improves upon the state-of-the-art for all tasks. For additional results, please refer to the videos contained in the supplement.

Baselines and implementation details. For ShapeNet and CO3D, we compare our method to PixelNeRF [89], a state-of-the-art NeRF-based method for NVS, and ViewFormer [34], a transformer-based, geometry-free approach to NVS. For ShapeNet, we additionally provide a comparison with EG3D-PTI [6], which is based on a state-of-the-art 3D GAN for object-scale scenes, and a numerical comparison with 3DiM [83], a recent geometry-free diffusion method for NVS. For Matterport3D, we compare our method against the state-of-the-art on this dataset: Look Outside The Room [51], a transformer-based, geometry-free NVS method designed for room-scale scenes, and to additional SOTA methods, including SynSin [84] and GeoGPT [56] in Tab. 2.
Metrics. We evaluate the task of novel view synthesis along three axes: the ability to (1) recreate the image quality and diversity of the ground-truth dataset, (2) generate novel views consistent with the ground truth, and (3) generate sequences that are geometrically consistent. For (1), we use distribution-comparison metrics, FID [24] and KID [5], which are commonly used to evaluate generative models for image synthesis. For (2), we use the perceptual metrics LPIPS [91] and DISTS [17], which measure structural and texture similarity between the synthesized novel view and the ground-truth novel view. For completeness, we include PSNR and SSIM, although the drawbacks of these metrics are well-studied: these raw pixel metrics have been shown to be poor evaluators of generative models as they favor conservative, blurry estimates that lack detail [59, 57]. For (3), we provide COLMAP [64, 65] reconstructions of generated video sequences, a standard evaluation for 3D consistency in 3D GANs [66, 7, 6]. Dense, well-defined point clouds are indicative of geometrically consistent frames. We calculate Chamfer distances between reconstructions of the ground-truth images and reconstructions of generated sequences to quantitatively evaluate geometrical consistency.
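For reference, the symmetric Chamfer distance between two reconstructed point clouds can be computed as in the sketch below; the use of SciPy's KD-tree and the averaging convention are assumptions, and any alignment or normalization of the COLMAP reconstructions is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between (N,3) and (M,3) point clouds."""
    d_ab, _ = cKDTree(points_b).query(points_a)   # nearest neighbor in B for each point of A
    d_ba, _ = cKDTree(points_a).query(points_b)   # nearest neighbor in A for each point of B
    return d_ab.mean() + d_ba.mean()
```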
4.1. ShapeNet

We standardize our training and evaluation on the single-class, single-view NVS benchmark described in [89, 70, 34]. The ShapeNet training set contains 2,458 cars, each with 50
renderings randomly distributed on the surface of a sphere. For evaluation, we use the provided test set with 704 cars, each with 250 rendered images and poses on an Archimedean spiral. All evaluations are conducted with a single input image. For our model, we evaluate both independently generated frames and frames generated with autoregressive conditioning. In addition to our model and the baselines, we provide additional comparisons to several ablative variants of our approach, which are discussed in more detail in Sec. 4.4.

Fig. 4 provides a qualitative comparison against baselines for single-view novel view synthesis on ShapeNet. In contrast to PixelNeRF, which predicts a blurry mean of the conditional distribution, our method (Ours Full) generates sharp realizations. While ViewFormer also produces sharp images due to training with a perceptual loss, its renderings fail to transfer some small details, such as headlight shape, from the input.

Figure 4. Qualitative comparison on ShapeNet [10] with one input view. Unlike regression-based approaches, our method produces sharp realizations. With one-step inference, our approach behaves like a mean estimator of the novel view, similarly to PixelNeRF.

In Tab. 1, we report the quality of novel renderings produced by our method and baselines, as measured by FID [24], LPIPS [91], DISTS [16], PSNR, and SSIM [82]. As a generative model, our method creates sharp, diverse outputs, which closely match the image distribution; it thus scores more favorably in FID than regression baselines [89, 34], which tend to produce less finely detailed renderings. Our method outperforms baselines in LPIPS and DISTS, which indicates that our method produces novel views that achieve greater structural and textural similarity to the ground-truth novel views. We would not expect a generative model to outperform a regression model in PSNR and SSIM, and indeed, renderings from PixelNeRF achieve higher scores in these pixel-wise metrics than realizations from our model. However, we note that the one-step denoised prediction of our model (described in Sec. 3.4) is able to match PixelNeRF's state-of-the-art PSNR and SSIM. While our method with autoregressive conditioning does not surpass 3DiM [83], it achieves competitive scores with a lighter-weight model (90M vs. 471M params) and fewer diffusion steps (25 vs. 512).

Method                        FID↓    LPIPS↓   DISTS↓   PSNR↑   SSIM↑
PixelNeRF [89]                65.83   0.146    0.203    23.2    0.90
ViewFormer [34]               20.82   0.146    0.161    19.0    0.83
EG3D-PTI [6]                  27.23   0.150    0.310    19.0    0.85
3DiM (autoregressive) [83]†    8.99   –        –        21.01   0.57
Ours (Explicit)                8.09   0.129    0.158    19.1    0.86
Ours (Geom.-Free)             16.68   0.342    0.329    13.1    0.74
Ours (One-Step)               42.07   0.150    0.178    23.2    0.91
Ours (Full, autoregressive)   11.08   0.120    0.146    20.6    0.89
Ours (Full)                    6.47   0.104    0.145    20.7    0.89

Table 1. Quantitative comparison of single-view novel view synthesis on ShapeNet cars [10, 70]. † As reported by [83].

In Fig. 5, we demonstrate that for a given observation, our model is capable of producing multiple plausible realizations. When conditioning information is reliable, such as when the query view is close to the input view, ambiguity is low and samples are drawn from a narrow conditional distribution. For more ambiguous inputs, such as when the model is tasked with recreating regions that were occluded in the input image, our model produces plausible realizations with more variation. In contrast, regression-based methods such as PixelNeRF deterministically predict the mean of the conditional distribution and are therefore unable to create high-quality realizations when the target view is far from conditioning information and the conditional distribution is large.

Figure 5. Generating new views from more (bottom) or less (top) ambiguous conditioning information. PixelNeRF [89] is constrained to output deterministic novel views and renders an average of all plausible renderings that are consistent with the input view. In comparison, our method samples the conditional distribution, leading to sharp but different realizations. In the last column, we show the per-pixel standard deviation of the novel view and show that unseen areas are more ambiguous, i.e., vary more from one sample to the other. Pixel-wise standard deviation is computed over 50 samples. Dark pixels indicate higher ambiguity.

Fig. 6 shows that our method can also achieve high geometrical consistency when combined with autoregressive generation, as validated by dense point cloud reconstruction and the Chamfer distance to the ground truth.

Figure 6. COLMAP reconstructions from video sequences produced by our method are dense, well-defined, and highly similar to reconstructions of the ground-truth images, demonstrating a high degree of geometric consistency, as measured by Chamfer distance. The three rows show results on ShapeNet, Matterport3D, and CO3D, respectively.

4.2. Matterport3D

Beyond ShapeNet, we seek to show the effectiveness of our method on the Matterport3D (MP3D) dataset, which features building-scale, real-world scans. We use the provided code of [51] to sample trajectories of embodied agents and generate 6,000 videos for training and 200 videos for testing, using the provided 61/18 training and test splits. We train our model by sampling random pairs of input and target images from the same video sequence, where 50% of input views are drawn from within ten frames of the target view and the rest are sampled randomly from the video sequence. The rest of the training procedure is equivalent to the one we use with ShapeNet.

Figure 7. Qualitative comparison on Matterport3D [9] for NVS. Given a single input image (1st col.), we autoregressively run our method and LOTR [51] for 10 frames to synthesize novel view images (2nd and 3rd columns). Ground truth images for the corresponding query camera poses are shown in the fourth column. Best viewed zoomed-in.
For evaluation, we randomly select an input frame in the test video set (one input frame for each test video), and run ten steps of autoregressive synthesis, following the test camera trajectory; we calculate metrics using all ten synthesized frames. Beyond 10 frames, the input and target frusta rarely overlap, making comparisons against ground truth frames less meaningful. We compare against Look Outside the Room (LOTR) [51], the current state-of-the-art (SOTA) for single-view NVS on Matterport3D that outperforms prior NVS works (i.e., [84, 54, 56, 35]). We additionally compare against SynSin [84] and GeoGPT [56], using the 5-frame renderings provided by the authors of LOTR. Note that, since the trajectories of embodied agents are randomly sampled, the trajectories used for these two baselines are different from those used for our method and LOTR. This comparison measures performance on 200 random trajectories, which is statistically meaningful, and the results align with the trends reported in LOTR. For all baselines, we downsample the outputs to our output resolution, i.e., 128², and compute the aforementioned metrics against the ground truth images. To measure the realism of the outputs, we choose KID [5], as it is known to be less biased than FID when the number of test images is small (we use 2000 images).

Method              KID↓    LPIPS↓   DISTS↓   PSNR↑   SSIM↑
LOTR [51] (10 f.)   0.050   0.33     0.27     16.57   0.49
Ours (10 f.)        0.002   0.14     0.14     20.80   0.71
SynSin-6X* [84]     0.072   0.48     0.34     14.89   0.41
GeoGPT* [56]        0.039   0.33     0.27     16.47   0.49
LOTR [51]           0.027   0.25     0.22     18.00   0.55
Ours                0.002   0.09     0.11     22.79   0.79

Table 2. Quantitative comparison of single-view novel view synthesis on Matterport3D [9]. Here, we use KID since it provides an unbiased estimate when the number of images is small. “10 f.” indicates novel view synthesis for 10 frames from the input image (5 frames are used for the bottom rows). * For SynSin and GeoGPT, we obtained the rendered images from the authors of LOTR.

The results, summarized in Tab. 2, show that our approach generates novel view predictions that outperform baselines in terms of quality and consistency with the input view. Fig. 7 supports the trends observed in the metrics: our NVS is noticeably more accurate and realistic than the current SOTA. In Fig. 9, we compare against LOTR on a cyclic trajectory. Our method produces better loop closure, indicating higher geometric consistency and showing the effectiveness of incorporating 3D priors. Fig. 6 additionally validates the consistency of our results with superior reconstructed point clouds and Chamfer distances.

Figure 8. Regression-based models, such as the one-step variant of our approach, struggle to model ambiguity and therefore fail to create plausible renderings far from the input. Generative sampling enables plausible synthesis under ambiguity. When combined with autoregressive generation, we are able to explore areas that were completely occluded in the input.

Figure 9. Loop closure test on Matterport3D [9]. We run our method and LOTR [51] on a small cyclic rotation angle trajectory (0°→15°→45°→15°). Without 3D representations, transformer-based methods, such as LOTR, rely on interpreting raw camera parameters, resulting in weak spatial awareness. Our 3D feature representation more effectively aggregates past observations and provides better loop closure. Best viewed zoomed-in.

4.3. Common Objects in 3D (CO3D)

We challenge our method with real-world scenes from the Common Objects in 3D (CO3D) [50] dataset with complete backgrounds. To our knowledge, no prior method has attempted single-shot NVS on CO3D without object masks. We train our method on the hydrant category of the CO3D dataset, which contains 726 RGB videos of real-world fire hydrants. Most videos contain a walkaround trajectory looking in at the hydrant spanning between 60 and 360 degrees, and most videos consist of about 200 frames. We use a 95:5 train/test split to train our model. CO3D is a highly unconstrained and extraordinarily difficult benchmark: scene scale, camera intrinsics, complex backgrounds, and lighting conditions are highly variable between (and sometimes within) scenes.

Fig. 10 compares predictions from our method against baselines on CO3D. Our method produces plausible and sharp foregrounds and backgrounds that do not deteriorate in quality with increasing distance from the source pose. While we include a qualitative comparison against ViewFormer for reference, we exclude it from numerical comparisons because of its reliance on object masks. Fig. 6 demonstrates the degree of geometric consistency that is attainable by our approach. Tab. 3 additionally provides a quantitative comparison against PixelNeRF. On complex scenes rife with ambiguity, the generative nature of our approach enables synthesis of plausible realizations.

Figure 10. While PixelNeRF produces severe artifacts when the rendering view is far away from the input and ViewFormer requires masks for training on this dataset, our method generates compelling sequences from single views on challenging, real-world objects of the CO3D dataset [50].

Method            KID↓    LPIPS↓   DISTS↓   PSNR↑   SSIM↑
PixelNeRF [89]    0.210   0.705    0.487    16.26   0.271
Ours (One-Step)   0.106   0.641    0.492    16.78   0.331
Ours (Full)       0.012   0.369    0.446    15.48   0.266

Table 3. Quantitative comparison of single-view novel view synthesis on CO3D [50].

4.4. Ablation Studies

Choice of intermediate representations. Tab. 1 (bottom) compares several choices of intermediate representations within our method. While we have described a specific approach to the task of generative novel view synthesis using diffusion, there is ample freedom to choose how D interprets information from input views. In fact, the simplest approach forgoes any geometry priors and instead directly conditions the model on an input view by concatenation. In our experiments, this geometry-free approach struggled compared to variants that incorporated geometry priors. However, greater model capacity and effective use of cross-attention [83] may be key to making this approach work. We additionally compare against an “Explicit” intermediate representation similar to our described approach but without the MLP decoder; while slightly faster, this representation generally produced worse results. We compare to the one-step inference mode of our method on ShapeNet in Fig. 4 and Tab. 1, on MP3D in Fig. 8, and on CO3D in Tab. 3. Like regression-based methods, it obtains excellent PSNR and SSIM scores but lacks the ability to generate plausible results far from the input. On Matterport3D, Fig. 8 illustrates the motivation for using a generative prior for long-range synthesis. While the quality of regression-based predictions rapidly degrades with increasing ambiguity, a generative model can create a plausible rendering even in regions with little or no conditioning information, such as behind an occlusion.

Effect of autoregressive generation. Although autoregressive conditioning slightly trades off image quality (Tab. 1), Fig. 11 demonstrates the necessity of autoregressive conditioning for generating geometrically consistent multi-view images. Without autoregressive conditioning, independently sampled frames are each plausible, but lack coherence: when conditioning information is ambiguous, e.g., when the model is predicting novel views far from the input view, it samples from a wide conditional distribution and, accordingly, subsequent frames exhibit significant variance. Autoregressive conditioning effectively conditions the network not only on the source image, but also on previously generated frames that closely overlap with the current view, helping narrow this conditional distribution.

Figure 11. Without autoregressive conditioning (top), our method generates plausible, albeit geometrically incoherent, novel views conditioned on the input image. With autoregressive conditioning (bottom), our method generates plausible sequences that achieve greater geometric consistency between frames.

Additional studies. Additional ablations, including experiments that evaluate out-of-distribution extrapolation, classifier-free guidance, the effect of the number of input views, stochastic conditioning, and the effect of distance to input views, can be found in the supplement.

5. Discussion

Conclusion. We proposed a generative novel view synthesis approach that operates from a single image using geometry-based priors and diffusion models. Our hybrid method combines the benefit of explicit 3D representations with the generative power of diffusion models for generating realistic and 3D-aware novel views, demonstrating state-of-the-art performance in both object-scale and room-scale scenes. We also demonstrate compelling results on the challenging real-world CO3D dataset with backgrounds included, a setting that, to our knowledge, has not been attempted before. While our results are not perfect, we believe we have presented a significant step towards a practical NVS solution that can operate on a wide range of real-world data.

Limitations and future work. While our method effectively combines explicit geometry priors with 2D diffusion models, the output resolution is currently limited to 128² and the diffusion-based sampling is not fast enough for interactive visualization. Since our model can leverage existing 2D diffusion architectures for U, it can directly benefit from future advances in the underlying 2D diffusion models. While our method achieves reasonable geometrical consistency, it can still exhibit minor inconsistencies and drift in challenging real-world datasets, which should be addressed by future work. While our method can operate for novel view synthesis from a single view during inference, training the method requires multi-view supervision with accurate camera poses. In this work, we implemented our method using a 3D
feature volume representation. Possible future work includes investigating other types of intermediate 3D representations.

Ethical considerations. Diffusion models could be extended to generate DeepFakes. These pose a societal threat, and we do not condone using our work to generate fake images or videos with the intent of spreading misinformation.

Acknowledgements

We thank David Luebke, Samuli Laine, Tsung-Yi Lin, and Jaakko Lehtinen for feedback on drafts and early discussions. We thank Jonáš Kulhánek and Xuanchi Ren for thoughtful communications and for providing results and data for comparisons. We thank Trevor Chan for help with figures. Koki Nagano and Eric Chan were partially supported by DARPA's Semantic Forensics (SemaFor) contract (HR0011-20-3-0005). J. J. Park was supported by ARL grant W911NF-21-2-0104. This project was in part supported by Samsung, the Stanford Institute for Human-Centered AI (HAI), and a PECASE from the ARO. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. Distribution Statement “A” (Approved for Public Release, Distribution Unlimited).

References

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
[2] Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Hao Tang, Gordon Wetzstein, Leonidas Guibas, Luc Van Gool, and Radu Timofte. 3D-aware video generation. arXiv preprint arXiv:2206.14797, 2022.
[3] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), pages 5855–5864, 2021.
[4] Alexander W. Bergman, Petr Kellnhofer, Wang Yifan, Eric R. Chan, David B. Lindell, and Gordon Wetzstein. Generative neural articulated radiance fields. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[5] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
[6] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3D generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 16123–16133, 2022.
[7] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5799–5809, 2021.
[8] Trevor Chan, Nada Kamona, Brian-Tinh Vu, Felix Wehrli, and Chamith Rajapakse. Deep learning super-resolution of MR images of the distal tibia improves image quality and assessment of bone microstructure.
[9] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[10] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[11] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In IEEE International Conference on Computer Vision (ICCV), 2021.
[12] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[13] Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. GRAM: Generative radiance manifolds for 3D-aware image generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[14] Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W Taylor, and Joshua M Susskind. Unconstrained scene generation with locally conditioned radiance fields. In IEEE International Conference on Computer Vision (ICCV), pages 14304–14313, 2021.
[15] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[16] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[17] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[18] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[19] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deep stereo: Learning to predict new views from the world's imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M Seitz. Multi-view stereo for community photo collections. In IEEE International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[22] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[24] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[25] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[26] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022.
[27] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[28] Pavel Iakubovskii. Segmentation Models PyTorch. https://github.com/qubvel/segmentation_models.pytorch, 2019.
[29] Wonbong Jang and Lourdes Agapito. CodeNeRF: Disentangled neural radiance fields for object categories. In IEEE International Conference on Computer Vision (ICCV), pages 12949–12958, 2021.
[30] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[31] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[32] Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwangrok Ryoo, and Seungryong Kim. 3D GAN inversion with pose optimization. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2023.
[33] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In IEEE International Conference on Computer Vision (ICCV), pages 14738–14748, 2021.
[34] Jonáš Kulhánek, Erik Derner, Torsten Sattler, and Robert Babuška. ViewFormer: NeRF-free neural rendering from few images using transformers. In European Conference on Computer Vision (ECCV), 2022.
[35] Zihang Lai, Sifei Liu, Alexei A Efros, and Xiaolong Wang. Video autoencoder: self-supervised disentanglement of static 3D structure and motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9730–9740, 2021.
[36] Zhengqi Li, Qianqian Wang, Noah Snavely, and Angjoo Kanazawa. InfiniteNature-Zero: Learning perpetual view generation of natural scenes from single images. In European Conference on Computer Vision (ECCV), pages 515–534. Springer, 2022.
[37] David B Lindell, Dave Van Veen, Jeong Joon Park, and Gordon Wetzstein. BACON: Band-limited coordinate networks for multiscale scene representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 16252–16262, 2022.
[38] Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. In IEEE International Conference on Computer Vision (ICCV), pages 14458–14467, 2021.
[39] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, July 2019.
[40] Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108, 1995.
[41] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[42] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989, 2022.
[43] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In IEEE International Conference on Computer Vision (ICCV), pages 7588–7597, 2019.
[44] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11453–11464, 2021.
[45] Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3D Ken Burns effect from a single image. ACM Transactions on Graphics (ToG), 38(6):1–15, 2019.
[46] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. StyleSDF: High-resolution 3D-consistent image and geometry generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13503–13513, 2022.
[47] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[48] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[49] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[50] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In IEEE International Conference on Computer Vision (ICCV), pages 10901–10911, 2021.
[51] Xuanchi Ren and Xiaolong Wang. Look outside the room: Synthesizing a consistent long-term 3D scene video from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3563–3573, 2022.
[52] Gernot Riegler and Vladlen Koltun. Free view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[53] Gernot Riegler and Vladlen Koltun. Stable view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[54] Chris Rockwell, David F Fouhey, and Justin Johnson. PixelSynth: Generating a 3D-consistent experience from a single image. In IEEE International Conference on Computer Vision (ICCV), pages 14104–14113, 2021.
[55] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[56] R. Rombach, P. Esser, and B. Ommer. Geometry-free view synthesis: Transformers and no 3D priors. In IEEE International Conference on Computer Vision (ICCV), 2021.
[57] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM Transactions on Graphics (SIGGRAPH), pages 1–10, 2022.
[58] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[59] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[60] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In IEEE International Conference on Computer Vision (ICCV), pages 2304–2314, 2019.
[61] Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[62] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In IEEE International Conference on Computer Vision (ICCV), pages 9339–9347, 2019.
[63] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[64] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[65] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[66] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
[67] Vincent Sitzmann, Eric Chan, Richard Tucker, Noah Snavely, and Gordon Wetzstein. MetaSDF: Meta-learning signed distance functions. Advances in Neural Information Processing Systems, 33:10136–10147, 2020.
[68] Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[69] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[70] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. Advances in Neural Information Processing Systems, 32, 2019.
[71] Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, and Peter Wonka. EpiGRAF: Rethinking training of 3D GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[72] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3D. In ACM Transactions on Graphics (SIGGRAPH), pages 835–846, 2006.
[73] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), pages 2256–2265. PMLR, 2015.
[74] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[75] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[76] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[77] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P Srinivasan, Jonathan T Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2846–2855, 2021.
[78] Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, W Yifan, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, et al. Advances in neural rendering. In Computer Graphics Forum, volume 41, pages 703–735. Wiley Online Library, 2022.
[79] Alex Trevithick and Bo Yang. GRF: Learning a general radiance field for 3D scene representation and rendering. In IEEE International Conference on Computer Vision (ICCV), 2021. 2
[80] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 2
[81] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul
Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo
Martin-Brualla, Noah Snavely, and Thomas Funkhouser.
IBRNet: Learning multi-view image-based rendering. In
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2021. 2
[82] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P
Simoncelli. Image quality assessment: from error visibility to
structural similarity. IEEE transactions on image processing,
13(4):600–612, 2004. 6
[83] Daniel Watson, William Chan, Ricardo Martin-Brualla,
Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi.
Novel view synthesis with diffusion models. arXiv preprint
arXiv:2210.04628, 2022. 3, 5, 6, 8, 15, 17
[84] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin
Johnson. SynSin: End-to-end view synthesis from a single
image. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 7467–7477, 2020. 2, 3, 5, 7, 19
[85] Jianfeng Xiang, Jiaolong Yang, Yu Deng, and Xin Tong.
GRAM-HD: 3D-consistent image generation at high
resolution with generative radiance manifolds, 2022. 3
[86] Jiaxin Xie, Hao Ouyang, Jingtan Piao, Chenyang Lei, and
Qifeng Chen. High-fidelity 3d gan inversion by pseudo-multi-
view optimization. arXiv preprint arXiv:2211.15662, 2022. 3
[87] Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, and Bolei
Zhou. 3D-aware image synthesis via learning structural and
textural representations. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 18430–18439,
2022. 3
[88] Yang Xue, Yuheng Li, Krishna Kumar Singh, and Yong Jae
Lee. GIRAFFE HD: A high-resolution 3D-aware generative
model. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2022. 3
[89] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.
pixelNeRF: Neural radiance fields from one or few images. In
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 4578–4587, 2021. 1, 2, 3, 4, 5, 6, 8, 17, 18, 19
[90] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
image colorization. In European Conference on Computer
Vision (ECCV), pages 649–666. Springer, 2016. 2
[91] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman,
and Oliver Wang. The unreasonable effectiveness of deep fea-
tures as a perceptual metric. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2018. 5, 6, 13
[92] Xuanmeng Zhang, Zhedong Zheng, Daiheng Gao, Bang
Zhang, Pan Pan, and Yi Yang. Multi-view consistent gener-
ative adversarial networks for 3D-aware image synthesis. In
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2022. 3
[93] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021. 3
[94] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In ACM Transactions on Graphics (SIGGRAPH), 2018. 2, 19

Supplementary Material
In this supplement, we first provide additional experiments
(Sec. A). We follow with details of our implementation
(Sec. B), including further descriptions of the model
architecture and training process, as well as hyperparameters.
We then discuss experimental details (Sec. C). Lastly, we con-
sider artifacts and limitations (Sec. D) that may be targets for
future work. We encourage readers to view the accompanying
supplemental videos, which contain additional visual results.

A. Additional experiments & ablations

Figure 12. New views generated from out-of-distribution poses. Extreme zooms and large translations may lead to unrealistic views.

A.1. Extrapolation to unseen camera poses

In the ShapeNet dataset, cameras are located on a sphere, point towards the centers of the objects, and have the same "up" direction during training. We investigate the results of our method when querying out-of-distribution poses at test time in Fig. 12. From a fixed pose, we generate a zoom, a one-dimensional translation of the camera, and a camera roll. Although novel views deteriorate with large deviations from the training pose distribution, the 3D prior present in our method can reasonably tolerate small extrapolations.

Figure 13. Our synthesized novel views sorted by the percentile of the LPIPS [91] score, with results that scored best according to LPIPS at the top.

A.2. Percentile results based on LPIPS


Fig. 13 shows our synthesized results on ShapeNet ordered
by the percentile of the LPIPS [91] score, with examples that
scored best according to the metric at the top and examples
that scored worst at the bottom. We compute predictions for
the same input and output views across the entire test set. To
reduce the effects of randomness, we evaluate 9 realizations
for each input, and use only the median image/score when
ordering our results. Our method produces consistently sharp
outputs (even at the 10th percentile) and maintains overall
textures and shapes from the input image.
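
As a small illustration of this protocol, the sketch below orders hypothetical per-example LPIPS scores by their per-example median over realizations. The array lpips_scores and its sizes are made up for the example; this is not the paper's evaluation code.

import numpy as np

# Hypothetical input: one LPIPS value per test example and per realization
# (lower LPIPS = better). 704 examples x 9 realizations, as in the protocol above.
rng = np.random.default_rng(0)
lpips_scores = rng.uniform(0.05, 0.45, size=(704, 9))

# Use only the median score per example to suppress randomness across realizations.
median_scores = np.median(lpips_scores, axis=1)

# Order examples from best (lowest LPIPS) to worst and report a few percentiles.
order = np.argsort(median_scores)
for pct in (10, 50, 90):
    idx = order[int(round((pct / 100) * (len(order) - 1)))]
    print(f"{pct}th percentile example: index {idx}, median LPIPS {median_scores[idx]:.3f}")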
Figure 14. Effect of varying the number of input views. Increasing the number of input views reduces uncertainty, decreasing the pixel-wise standard deviation in novel renderings. Dark pixels in the third row represent higher standard deviation and indicate greater variation in the realizations.

A.3. Handling multiple input images

Fig. 14 shows our generated novel views when more than one image is given as the input conditioning information. When only 1 view is given from the back side of the car, the model has the freedom to choose multiple plausible completions for the unseen front side of the car, leading to a high standard deviation (high uncertainty). Adding 2 or 3 views reduces uncertainty (low standard deviation), and the model generates a novel view that is compatible with multiple input views.

A.4. Effect of distance to input view

As Fig. 15 demonstrates, nearby views provide more valuable information than distant views, thus reducing variance in the output rendering. Consequently, by conditioning autoregressively on nearby views, we narrow the conditional distribution of possible outputs, improving geometric consistency compared to non-autoregressive conditioning.

Figure 15. Average pixel variance of generated views vs. the distance between the query camera and the input camera. Input views close to the camera are valuable: the model can directly observe many of the details it must transfer to the output rendering. Input views distant from the camera are more ambiguous: the model is tasked with generating large parts of the rendering from scratch. As the conditioning information gets increasingly ambiguous, novel views get increasingly diverse. Pixel variance is calculated across 50 renderings per pose. Red bars indicate the empirical standard deviation of the moving average.

A.5. Classifier-free guidance

Recently, [27] proposed the classifier-free diffusion guidance technique to effectively trade off diversity and sample quality. At training, we implement classifier-free guidance by dropping out the feature image with 10% probability; in its place, we replace this conditioning image with a sample of Gaussian noise. At inference, we can linearly interpolate between conditional and unconditional predictions of the denoised image in order to boost or decrease the effect of the conditioning information.

Fig. 16 shows the effect of classifier-free guidance [27] (CFG) when making predictions in isolation. In general, positive classifier-free guidance increases the effect of the conditioning information and improves sample quality. With guidance = 0, our model produces greater variation of generated views (note the different realizations of the passenger-side door). However, we would consider some of these realizations to be unlikely given the input. Increased CFG strength narrows the distribution of possible outputs, and while we would consider such a set of realizations to be less diverse, each one is of high fidelity. Excessively high guidance strength begins to introduce artifacts and color saturation. Negative guidance upweights the unconditional prediction; guidance = −1 produces unconditional samples without influence from the input image. In general, when making independent novel view predictions, we find moderate levels of CFG to be beneficial. However, as described in Sec. A.6, CFG has an adverse effect on the quality of autoregressively generated sequences. As a default, we refrain from using CFG in our experiments.

Figure 16. Independent (single-frame input) NVS with various classifier-free guidance (CFG) strengths. For each level of CFG, we show three realizations. With guidance = 0, we sample a "diverse" set of novel views, each plausible, but with variations (e.g. doors). Higher guidance strength reduces diversity but improves sample quality. Excessively high guidance begins to introduce saturation and visual artifacts. Negative guidance upweights the unconditional contribution; with guidance = −1, generation is unconditional.

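As a concrete illustration of the guidance rule above, the following sketch combines conditional and unconditional denoiser outputs with a single scalar weight. The function and variable names are hypothetical stand-ins, and the exact parameterization is inferred from the description in Sec. A.5 (guidance = 0 yields the purely conditional prediction, guidance = −1 the unconditional one); it is not the paper's implementation.

import torch

def guided_denoise(denoiser, x_noisy, sigma, feat_cond, feat_null, w):
    # Classifier-free guidance as a linear combination of two denoiser outputs.
    #   w = 0  -> purely conditional prediction
    #   w = -1 -> purely unconditional prediction
    #   w > 0  -> amplifies the effect of the conditioning feature image
    # `denoiser`, `feat_cond` (rendered feature image) and `feat_null` (the
    # Gaussian-noise placeholder used when conditioning is dropped) are
    # hypothetical stand-ins for the components described in Sec. A.5.
    d_cond = denoiser(x_noisy, sigma, feat_cond)    # conditioned prediction
    d_uncond = denoiser(x_noisy, sigma, feat_null)  # unconditional prediction
    return (1.0 + w) * d_cond - w * d_uncond

# Toy usage with a dummy "denoiser" that ignores the conditioning entirely.
if __name__ == "__main__":
    dummy = lambda x, sigma, feat: x * 0.5
    x = torch.randn(1, 3, 128, 128)
    out = guided_denoise(dummy, x, sigma=10.0,
                         feat_cond=torch.randn(1, 16, 128, 128),
                         feat_null=torch.randn(1, 16, 128, 128), w=2.0)
    print(out.shape)  # torch.Size([1, 3, 128, 128])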
A.6. Extended autoregressive generation

Fig. 17 shows autoregressively generated sequences made with varying levels of classifier-free guidance. When making long autoregressive sequences, the ability to suppress errors and return to the image manifold is an important attribute. Unchecked, gradual accumulation of errors could lead to progressive deterioration in image quality. Intuitively, unconditional samples do not suffer from error buildup, since unconditional (CFG = −1) samples make use of no information from previous frames. On the other end of the spectrum, highly conditioned (CFG >> 0) samples should be more likely to suffer from error accumulation because they emphasize information from previous frames. A happy medium between these two extremes allows the model to use information from previous frames while preventing undesired error accumulation. Empirically, we find that while small positive guidance can reduce frame-to-frame flicker, it enhances the model's tendency to carry over visual errors from previous frames. We observe saturation buildup and artifact accumulation to be significant roadblocks to using CFG when synthesizing long video sequences. For these reasons, we default to using CFG = 0, which we found to enable autoregressive generation of long sequences without significant error accumulation. A solution that enables higher CFG weights for autoregressive generation may make a valuable contribution in the future.

Figure 17. Autoregressive sequence generation with varying CFG strength. With low guidance, we can generate extended autoregressive sequences with little deterioration over time. Higher guidance tends to carry over errors from previous frames, which gradually degrades the quality of subsequent generations.
A.7. Alternative autoregressive conditioning schemes

Baseline strategy. When generating a sequence autoregressively, there are many possible strategies, each with a set of tradeoffs. To produce the visual results presented in our work, we used the following baseline strategy, with minor variations for different datasets. As described in the main paper, our baseline strategy is to condition our model on the input image(s), the most recently generated rendering, and five previously generated images, selected at random. For Matterport3D, when generating long sequences, we select the five previously generated frames from a set of only the 20 most recently generated frames; we additionally condition on every 15th previously generated frame. For CO3D, we use the two-pass conditioning method discussed below to improve temporal consistency.

Alternative strategies and tradeoffs. As described in the main paper, our baseline autoregressive strategy can induce noticeable flickering. One way to reduce flickering is to condition on only the previous frame. Doing so almost completely eliminates frame-to-frame flicker. However, this strategy sacrifices long-term consistency and does little to prevent drift; new renderings might not be consistent with frames rendered at the start of the sequence. By contrast, to promote long-term consistency, one could avoid conditioning on previously-generated frames at all and instead condition on only the input image(s). Because drift is the result of error accumulation from conditioning on previous generations, this strategy eliminates potential for drift. However, it suffers from short-term inconsistency (i.e., frame-to-frame flicker). We found our baseline strategy, which conditions on the inputs, the most recent rendering, and several previous renderings, to be a good compromise between long-term and short-term consistency.

The number of previously generated images we condition upon affects the behavior. Because we equally weight the contribution of all images we condition upon, increasing the number of previous renderings (which are sampled uniformly from the generated sequence) reduces the relative contribution of the most recent rendering. Increasing the size of this "buffer" of previously-generated conditioning images thus improves long-term consistency at the cost of short-term consistency; reducing the size of the buffer has the opposite effect.

One way to suppress flickering is to generate frames in two passes, where in the second pass, we condition on the nearby frames from the first pass in a sliding window fashion. Empirically, conditioning on only the nearest 4 frames during the second pass results in videos with reduced flicker, at the expense of higher inference computation. However, unless otherwise noted, we render all videos shown with our baseline autoregressive strategy, i.e., without these alternative methods.
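For concreteness, the following sketch shows one way to implement the baseline frame-selection rule described above. The frame identifiers and helper name are illustrative, not the paper's code.

import random

def select_conditioning_frames(input_frames, generated_frames, num_random=5):
    # Baseline strategy: always keep the input image(s) and the most recently
    # generated rendering, plus `num_random` previously generated images
    # sampled at random (fewer if the sequence is still short).
    conditioning = list(input_frames)
    if generated_frames:
        conditioning.append(generated_frames[-1])       # most recent rendering
        earlier = generated_frames[:-1]
        k = min(num_random, len(earlier))
        conditioning.extend(random.sample(earlier, k))  # random earlier renderings
    return conditioning

# Example: after 12 generated frames, condition on 1 input + latest + 5 random.
print(select_conditioning_frames(["input_0"], [f"gen_{i}" for i in range(12)]))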
A.8. Stochastic Conditioning

To demonstrate the effectiveness of our autoregressive synthesis method, which aggregates conditioning feature volumes from autoregressively selected generated images, we compare to an adaptation of the stochastic conditioning method proposed in 3DiM [83]. We adapt the stochastic conditioning method to our architecture by replacing the feature volume aggregation from autoregressively selected generated images with a single feature volume generated from an image randomly sampled from all previously generated images. As done in 3DiM, the number of diffusion denoising steps is increased significantly and the randomly sampled image is varied at each individual step of denoising. Each generated final image is then added to the set of all previous images and can be used as conditioning in subsequent view generations. This alternative form of conditioning is also able to provide the model with information from many generated views, but it draws on only a single randomly sampled view at each denoising step rather than aggregating them into a single feature volume.

In Fig. 20, we show 3D reconstruction results from sequences of images generated by our autoregressive synthesis method and with our adaptation of stochastic conditioning [83]. Here, we find that our autoregressive synthesis method performs slightly better than stochastic conditioning in terms of 3D consistency of generated frames, as seen by the COLMAP 3D reconstruction and corresponding Chamfer distance. Additionally, we are able to generate novel views significantly faster: in practice, stochastic conditioning requires 256 denoising steps to generate each novel view while our method only requires 25, leading to a 10x improvement in speed.

Figure 20. Our default autoregressive conditioning strategy, which aggregates information from multiple views within a feature volume, typically performs at least on par with stochastic view conditioning [83] in geometric consistency, but requires many fewer steps of diffusion to remain effective. Here, we compare COLMAP reconstructions of a sequence produced by feature aggregation, using 25 steps of denoising, against a sequence produced by stochastic conditioning, using 256 steps of denoising.

Figure 18. Qualitative comparison for single-view novel view synthesis on CO3D [50] Hydrants.

Figure 19. Additional qualitative comparisons against baselines on ShapeNet [10].

A.9. Additional Common Objects in 3D results

We provide additional results for single-view novel view synthesis (NVS) with real-world objects for CO3D Hydrants in Fig. 18. We compare against ViewFormer [34], which has demonstrated success in few-shot NVS on CO3D, and PixelNeRF [89]. However, we note that ViewFormer is not a 1:1 comparison for two reasons: 1. ViewFormer operates with object masks, whereas our method operates with backgrounds. 2. ViewFormer train/test splits did not align with other methods. For this figure, and for comparison videos, we selected objects that were contained in our test split but were part of ViewFormer's train split. Despite these disadvantages, our method demonstrates a compelling ability to plausibly complete complex scenes.
A.10. Additional ShapeNet results

Fig. 19 provides additional visual comparisons on the ShapeNet [10] dataset against baselines. In general, our method renders images with sharper details and higher perceived quality than PixelNeRF, while better transferring details from the input image than ViewFormer and EG3D. In this figure, renderings from our method are selected from autoregressively-generated sequences.
B. Implementation details

We implemented our 3D-aware diffusion models using the official source code of EDM [30], which is available at https://github.com/NVlabs/edm. Most of our training setup and hyperparameters follow [30]; the exceptions are detailed here.
Feature volume encoder, T. Our encoder backbone is based on DeepLabV3+ [12]. We use a PyTorch reimplementation [28] available at https://github.com/qubvel/segmentation_models.pytorch, and ResNet34 [23] as the encoder backbone. We found unmodified DeepLabV3+ to struggle because the output branch contains several unlearned, bilinear upsampling layers; this resolution bottleneck makes it difficult to effectively reconstruct fine details from the input. We replace these unlearned upsampling layers with learnable convolutional layers and skip connections from previous layers. We disable batchnorm and dropout throughout the feature volume encoder. The feature volume encoder expects as input a 3 × 128 × 128 image; it produces a (16 × 64) × 128 × 128 feature image, which we reshape into a 16 × 64 × 128 × 128 volume.
Multiview aggregation. We aggregate information from multiple input views by predicting a feature volume W_i for each input image independently, projecting the query point into each feature volume, sampling a separate feature vector from each feature volume, and mean-pooling across the sampled feature vectors to produce a single aggregated feature. We experimented with two alternative aggregation strategies: 1. max-pooling, and 2. weighted average pooling, where the feature volumes have an additional channel that is interpreted as a weight by a softmax function. We found these alternative aggregation strategies to perform similarly to mean-pooling.
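The sampling-and-pooling step can be illustrated with the short sketch below. It assumes the query points have already been projected into each volume's normalized [-1, 1] coordinate frame (the camera projection itself is omitted), and the toy tensor sizes are smaller than the real 16 × 64 × 128 × 128 volumes; this is a sketch of the idea, not the authors' implementation.

import torch
import torch.nn.functional as F

def aggregate_features(volumes, coords):
    # volumes: (V, C, D, H, W) - one feature volume W_i per input view.
    # coords:  (V, P, 3) - query points in each volume's normalized [-1, 1]
    #          frame, in the (x, y, z) order expected by grid_sample.
    # Returns: (P, C) mean-pooled feature vectors.
    V, P, _ = coords.shape
    grid = coords.view(V, P, 1, 1, 3)                           # (V, P, 1, 1, 3)
    sampled = F.grid_sample(volumes, grid, align_corners=True)  # (V, C, P, 1, 1)
    sampled = sampled.squeeze(-1).squeeze(-1).permute(0, 2, 1)  # (V, P, C)
    return sampled.mean(dim=0)                                  # mean-pool across views

# Toy usage: 3 input views, 1024 query points, small volumes for illustration.
vols = torch.randn(3, 16, 8, 16, 16)
pts = torch.rand(3, 1024, 3) * 2 - 1
print(aggregate_features(vols, pts).shape)  # torch.Size([1024, 16])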
MLP, f. We use a two-layer ReLU MLP to aggregate features drawn from multiple input images. Our MLP has an input dimension of 16, two hidden layers of dimension 64, and an output dimension of 17, which is interpreted as a 1-channel density τ and a 16-channel feature c. We additionally skip the MLP's input feature to the output feature.
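A minimal PyTorch sketch of such an MLP, assuming the dimensions stated above (the class name and exact layer arrangement are illustrative, not the authors' module):

import torch
import torch.nn as nn

class FeatureMLP(nn.Module):
    # Maps an aggregated 16-d feature to a 1-channel density tau and a
    # 16-channel feature c, with a skip from the input feature to the output.
    def __init__(self, in_dim=16, hidden=64, feat_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + feat_dim),
        )

    def forward(self, x):
        out = self.net(x)
        tau, c = out[..., :1], out[..., 1:]
        return tau, c + x  # skip connection from input feature to output feature

tau, c = FeatureMLP()(torch.randn(1024, 16))
print(tau.shape, c.shape)  # torch.Size([1024, 1]) torch.Size([1024, 16])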
Rendering. We render feature images from the model using neural volume rendering [41] of features [44], from the neural field parameterized by the set of feature volumes W and the MLP f. For computational efficiency, we render at half spatial resolution, i.e., 64 × 64, and use bilinear upsampling to produce a 128 × 128 feature image. We use 64 depth samples by default, scattered along each ray with stratified sampling. We do not use importance sampling.
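The sketch below illustrates this kind of feature volume rendering for a batch of rays: stratified depth samples, a density-to-alpha conversion, and emission-absorption compositing of 16-channel features. The density_fn stand-in, the ReLU used to keep densities non-negative, and the toy near/far values are assumptions; this is not the paper's renderer.

import torch

def render_feature_rays(density_fn, near, far, rays_o, rays_d, num_samples=64):
    # density_fn: callable mapping points (R, S, 3) -> (tau, feat) of shapes
    #             (R, S, 1) and (R, S, 16); stands in for sampling the feature
    #             volumes and evaluating the MLP f.
    # rays_o, rays_d: (R, 3) ray origins and (assumed unit-length) directions.
    # Returns a (R, 16) composited feature per ray.
    R = rays_o.shape[0]
    # Stratified depth samples: one uniform sample per evenly spaced bin.
    bins = torch.linspace(near, far, num_samples + 1)
    lower, upper = bins[:-1], bins[1:]
    t = lower + (upper - lower) * torch.rand(R, num_samples)          # (R, S)

    pts = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]      # (R, S, 3)
    tau, feat = density_fn(pts)                                       # (R, S, 1), (R, S, 16)

    deltas = torch.diff(t, dim=-1, append=torch.full((R, 1), 1e10))   # (R, S)
    alpha = 1.0 - torch.exp(-torch.relu(tau.squeeze(-1)) * deltas)    # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones(R, 1), 1 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                           # (R, S)
    return (weights[..., None] * feat).sum(dim=1)                     # (R, 16)

# Toy usage with a dummy field.
dummy = lambda p: (torch.ones(*p.shape[:-1], 1), torch.randn(*p.shape[:-1], 16))
print(render_feature_rays(dummy, 0.8, 1.8, torch.zeros(4, 3), torch.randn(4, 3)).shape)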

UNet, U. The design of U is based on DDPM++ [76], using the implementation and preconditioning scheme of [30]. U accepts as input 19 total channels (a noisy RGB image plus a 16-channel feature rendering) of spatial dimension 128². It produces a 3-channel 128² denoised rendering. For experiments shown in the manuscript, our models contain five downsampling blocks with channel multipliers of [128, 128, 256, 256, 256]. As in [76], we utilize a residual skip connection from the input of U to each block in the encoder of U.
Training. We use a batch size of 96 for all training runs, split across 8 A100 GPUs, with a learning rate of 2 × 10⁻⁵. During training, we sample the noise level σ according to the method proposed by [30] by drawing σ from the following distribution:

log(σ) ∼ N(Pmean, Pstd²).    (7)

We use Pmean = −1.0, Pstd = 1.4. During training, we randomly drop out the conditioning information with a probability 0.1 to enable classifier-free guidance. In place of the rendered feature image, we insert random noise.
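A minimal sketch of the noise-level sampling and conditioning dropout described above (the function name is hypothetical):

import torch

P_mean, P_std = -1.0, 1.4   # values from Eq. (7) above
p_uncond = 0.1              # probability of dropping the conditioning

def sample_training_noise_and_condition(feature_image):
    # log(sigma) ~ N(P_mean, P_std^2), as in Eq. (7).
    sigma = torch.exp(torch.randn(()) * P_std + P_mean)
    if torch.rand(()) < p_uncond:
        feature_image = torch.randn_like(feature_image)  # replace conditioning with noise
    return sigma, feature_image

sigma, feat = sample_training_noise_and_condition(torch.randn(16, 128, 128))
print(float(sigma), feat.shape)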

Our dataset is composed of posed multi-view images, where for each training image, we are given the 4 × 4 camera pose matrix, the camera field of view, and a near/far plane. For all experiments, we specify a global near/far value for each dataset, where the values are chosen such that a camera frustum with the chosen near/far planes adequately covers the visible portion of the scene. For ShapeNet, near/far = (0.8, 1.8); for MP3D, near/far = (0., 12.5); for CO3D, near/far = (0.5, 40). We found our method to be fairly robust to the chosen values of the near/far planes.

For ShapeNet, we train until the model has processed 140M images, which takes approximately 9 days on eight A100 GPUs. For MP3D, we train for 110M images, which takes approximately 7 days on eight A100 GPUs. For CO3D, we train for 170M images, which takes approximately eleven days on eight A100 GPUs.
Augmentation. During training, we introduce two forms of augmentation. First, with probability 0.5, we add Gaussian white noise to the input images. For input images in the range [−1, 1], we sample the standard deviation of the added noise uniformly from [0, 0.5]. Second, we apply non-leaking augmentation [30] to U. With probability 0.1, we apply random flips, random integer translations (up to 16 pixels), and random 90° rotations, where the transformations are applied to the input noisy image, the input feature image, and the target denoised image. We condition U with a vector that informs it of the currently applied augmentations; we zero this vector at inference.
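A sketch of the first (Gaussian-noise) augmentation, under the stated probabilities and noise range; the geometric non-leaking augmentations of [30] are not reproduced here.

import torch

def augment_input_images(images):
    # images: (B, 3, H, W) in [-1, 1]. With probability 0.5 per batch element,
    # add white noise whose standard deviation is drawn uniformly from [0, 0.5].
    B = images.shape[0]
    apply = (torch.rand(B) < 0.5).float().view(B, 1, 1, 1)
    std = (torch.rand(B) * 0.5).view(B, 1, 1, 1)
    return images + apply * std * torch.randn_like(images)

print(augment_input_images(torch.zeros(4, 3, 128, 128)).abs().mean())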
Inference. We use the deterministic second-order sampler proposed in [30] at inference. As a default, we use N = 25 timesteps, with a noise schedule governed by σmax = 80, σmin = 0.002, and ρ = 7, where ρ is a constant that controls the spacing of noise levels. The noise level at a timestep i is given in Eq. 8:

σ_{i<N} = ( σmax^(1/ρ) + (i / (N − 1)) · ( σmin^(1/ρ) − σmax^(1/ρ) ) )^ρ.    (8)

Rendering an image from scratch with 25 denoising steps takes approximately 1.8 seconds per image at inference on an RTX 3090 GPU.
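For reference, Eq. 8 can be evaluated with a few lines of Python; this reproduces the schedule's endpoints (σmax at i = 0, σmin at i = N − 1) but is only a sketch, not the sampler implementation of [30].

def edm_sigma_schedule(n=25, sigma_max=80.0, sigma_min=0.002, rho=7.0):
    # rho-spaced interpolation between sigma_max and sigma_min, as in Eq. (8).
    sigmas = []
    for i in range(n):
        s = (sigma_max ** (1 / rho)
             + i / (n - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
        sigmas.append(s)
    return sigmas  # descending from sigma_max to sigma_min

sched = edm_sigma_schedule()
print(round(sched[0], 3), round(sched[-1], 4))  # 80.0 0.002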
"Production" settings for CO3D. For rendering videos of CO3D, we use more computationally expensive "production" hyperparameters to obtain better image quality. Seeking better image quality and detail, we use 256 denoising steps instead of the default 25 denoising steps. Seeking better temporal consistency, we increase the number of samples per ray cast through the latent feature field from 64 to 128; we also use the two-pass form of autoregressive conditioning described in Sec. A.7.
C. Experiment details

C.1. Evaluation details

FID Calculation. We compute FID by sampling 30,000 images randomly from both the ground truth testing dataset and the corresponding generated frames. We use an inception network provided in the StyleGAN3 [31] repository for computing image features.

KID Calculation. We compute KID by sampling all images from both the ground truth testing dataset and the corresponding generated frames. We use the implementation of clean-fid [47], available at https://github.com/GaParmar/clean-fid.

COLMAP Reconstructions. We compute COLMAP reconstructions using frames from rendered video sequences. We provide the ground-truth camera pose trajectory as input for all reconstructions. For ShapeNet evaluations, we additionally compute masks by thresholding images to remove white pixels. We leave all settings at their recommended defaults.

Chamfer Distance Calculation. For all datasets, we compute the bi-directional Chamfer distance between the reconstructed point cloud from synthesized images and the reconstructed point cloud from ground truth images. Additionally, for CO3D, we translate and scale the reconstructed point clouds to lie within the unit cube.
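A generic sketch of a bi-directional Chamfer distance and of the unit-cube normalization used for CO3D is shown below; the exact normalization and averaging conventions of our evaluation are not reproduced here.

import torch

def chamfer_distance(p, q):
    # Bi-directional Chamfer distance between point clouds p (N, 3) and q (M, 3):
    # average nearest-neighbor squared distance in both directions.
    d = torch.cdist(p, q)  # (N, M) pairwise Euclidean distances
    return (d.min(dim=1).values.pow(2).mean()
            + d.min(dim=0).values.pow(2).mean())

def normalize_to_unit_cube(points):
    # Translate and scale a point cloud so it fits inside the unit cube.
    points = points - points.min(dim=0).values
    return points / points.max().clamp(min=1e-8)

a, b = torch.rand(2048, 3), torch.rand(1024, 3)
print(float(chamfer_distance(normalize_to_unit_cube(a), normalize_to_unit_cube(b))))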
C.2. Baselines

PixelNeRF [89]. We compare to PixelNeRF for the ShapeNet and CO3D single-image novel view synthesis benchmark. For ShapeNet, we use the official implementation and pre-trained weights for single-category (car), single-image, ShapeNet novel view synthesis evaluation provided at: https://github.com/sxyu/pixel-nerf. We follow the protocol described in the original PixelNeRF paper and SRNs [70] for data pre-processing. We use the provided dataset and splits in the PixelNeRF repository for training and testing of both our method and PixelNeRF (this dataset is slightly different from that used in the SRNs paper due to a bug; see PixelNeRF supplementary information). We follow the same protocol for evaluation as we do for our method and SRNs: view 64 is used as input, and the remaining 249 views are synthesized conditioned on this. For CO3D, we train PixelNeRF from scratch using our train/test splits and using the recommended hyperparameters.

ViewFormer [34]. We compare to ViewFormer on the ShapeNet single-image novel view synthesis benchmark and qualitatively on single-image novel view synthesis for CO3D. We received the data and results for single-image novel view synthesis for the entire ShapeNet testing set
from the authors. We compute metrics using their provided ground truth data and synthesized results. The training and testing splits are the same as those used in our method and in PixelNeRF. They use the previously introduced protocol for single-image novel view synthesis evaluation: view 64 is used as input, and the remaining 249 views are synthesized conditioned on this. For CO3D, we instead condition on the first frame from each shown sequence, and generate a video based on this conditioning information. We use provided source code from the official repository at: https://github.com/jkulhanek/viewformer. We do not generate quantitative metrics, as ViewFormer operates on masked and center-cropped images. Additionally, the images we use for comparison are in the training set for ViewFormer, while for our method they are in the test set.
Look Outside the Room [51]. We compare against Look-outside-the-room (LOTR), the current state-of-the-art method for novel view synthesis on the Matterport3D (MP3D) [9] and RealEstate10K [94] datasets. For LOTR, we obtained the pretrained weights for the MP3D dataset from their official codebase https://github.com/xrenaa/Look-Outside-Room. We match LOTR's data preparation methodology, including identical train/test splits, and we use LOTR's implementation for generating multi-view images from MP3D RGB-D scans. For testing their method, we prepare a common set of 200 input images from the test split with the trajectories and ground truth images for the next 10 frames for each input. Then, we run the LOTR method on the given input using the code from their Github repository, using 3 overlapping frame windows, as stated in their paper. We run LOTR on the next 10 frames, given the input frame, and measure the metrics against the ground truth.
Additional Baselines for MP3D. To further evaluate our method's effectiveness on the novel-view synthesis task on MP3D scenes, we compare against additional baselines of GeoGPT [56] and SynSin [84]. Note that these two baselines, along with another recent work, PixelSynth [54], have already been shown to underperform against LOTR [51]. Since GeoGPT does not provide pre-trained models or rendered images for MP3D, we asked the authors of LOTR for the images they used for their benchmarks. The acquired NVS images of GeoGPT and SynSin are rendered by the exact same protocol as our experiments, except that they advance five frames from the initial input images for 200 sequences (thus we have 1,000 images in total). We note that the trajectories used for these acquired images are different from the trajectories we used for our experiments because the trajectories are generated randomly via the Habitat embodied agent simulation [62]. However, at 1,000 trajectory samples, we believe our comparisons are statistically significant. The final numbers we computed show similar trends to those reported in the LOTR paper, further confirming the validity of the comparisons. Both qualitatively and quantitatively, we observe that our novel-view renderings are significantly more desirable.

C.3. Dataset details

ShapeNet [10]. We extensively evaluate our method on the ShapeNet dataset. The full ShapeNet dataset contains different object categories, each with synthetically generated posed images in pre-defined training, validation, and testing sets. In our work, we specifically evaluate with the "cars" category, and focus on single-image novel view synthesis. We use the version of the dataset provided in PixelNeRF [89] for consistency in training and evaluation, keeping all frames in the dataset at 128² resolution and doing no additional pre-processing. As described in the main paper, the training set contains 2,458 cars, each with 50 renderings randomly distributed on the surface of a sphere. The test split contains 704 cars, each with 250 rendered images and poses on an Archimedean spiral. During the training of our method, we use the defined training split, randomly sampling between one and three input frames with the objective of synthesizing a randomly selected target frame for a specific object instance. In evaluation, we use the defined testing split, use image number 64 as input, and synthesize the other 249 ground truth images. We note that since these images are synthetically generated at only 128², they lack backgrounds and fine detail. However, the accuracy of poses in the constrained environment and the consistent evaluation method between baselines allows for easily providing quantitative benchmarks for single-image novel view synthesis.

Matterport3D [9]. We showcase our algorithm on a highly complex, large-scale indoor dataset, Matterport3D (MP3D). MP3D contains RGB-D scans of real-world building interiors. Scenes are calibrated to metric scale, and thus there is no scale ambiguity. We preprocess MP3D scans into a dataset of posed multi-view images following the procedure detailed in LOTR [51] and SynSin [84]. Specifically, we generate the image sequences by simulating a navigation agent in the room scans, using the popular Habitat [62] API. We randomly select the start and end position within the MP3D scenes and simulate the navigation towards the goal via Habitat. The agent is only allowed to take limited actions, including going forward and rotating 15 degrees. During training, we randomly sample a target frame and then select 1 to 3 random source frames in the neighborhood of 20 frames for conditioning.

Common Objects in 3D [50]. We validate our method on a real-world dataset: Common Objects in 3D (CO3D). The CO3D dataset consists of several categories. We train on CO3D Hydrants, which contains 726 scenes. The average scene consists of around 200 frames of RGB video, object masks, poses, and semi-sparse depth. We note that the CO3D dataset is quite unconstrained: even across scenes within a category, aspect ratio, resolution, FOV, camera trajectory, object scale, and global orientation all vary. Additionally, we note that the dataset is noisy, with several examples of
miscategorized objects and numerous extremely short or low-quality videos. Such noise adds to the challenge of single-image NVS.

In preparing data, we first center-crop to the largest possible square, then resize to 128² using Lanczos resampling. We adjust the camera intrinsics to reflect this change. We also seek to normalize the canonical scale of scenes across the dataset. To do so, we examine the provided depths within each scene, and consider the depth values that fall within the object segmentation mask. For each image, we calculate the median value of the masked depth. Taking the mean of these median values across the scene gives us a rough approximation of the distance between the camera and object. We adjust the scale of the scene so that this camera-object distance is identical across every scene in the dataset.
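A small sketch of this scale estimate is given below; the target canonical distance used in the example is hypothetical.

import numpy as np

def camera_object_distance(depth_maps, masks):
    # depth_maps: list of (H, W) float arrays; masks: list of (H, W) bool arrays.
    # For each frame, take the median depth inside the object mask, then average
    # these medians over the scene, as described above.
    medians = [np.median(d[m]) for d, m in zip(depth_maps, masks) if m.any()]
    return float(np.mean(medians))

# Toy example with two frames of a hypothetical scene.
depth = [np.full((8, 8), 3.0), np.full((8, 8), 5.0)]
mask = [np.ones((8, 8), bool), np.ones((8, 8), bool)]
dist = camera_object_distance(depth, mask)
scale = 4.0 / dist  # rescale so every scene sits at a (hypothetical) canonical distance of 4
print(dist, scale)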
To help resolve scale, which is highly variable across the dataset, and to provide information parity with PixelNeRF, which has access to a global reference frame, we provide our feature encoder, T, with the location of the global origin. In addition to each input RGB image, we concatenate a channel that contains a depth rendering of the three coordinate planes, as rendered from the input camera. We modify T to accept the four-channel input. We find this input augmentation to improve our model's ability to localize objects.

D. Discussion
D.1. Alternative approaches
GAN-based generative novel view synthesis. We have presented a diffusion-based generative model for novel view synthesis, but in principle, it is possible to construct a similar framework around other types of generative models. Generative Adversarial Networks (GANs) [21] are a natural fit, and adversarial training could drop in to replace our diffusion objective with minor changes. While recent work [15] has demonstrated that diffusion models often outperform GANs in mode coverage and image quality, GANs have a major advantage in speed. Future work that aims for real-time synthesis may prefer a GAN-based 3D-aware NVS approach.

Transformer-based, geometry-free multi-view aggregation strategies. A promising alternative to explicit geometry priors, such as the type we have presented in this work, is to instead make use of powerful attention mechanisms for effectively combining multiple observations. Scene Representation Transformers [61] utilize a transformer-based approach to merge information from multiple views, which is effective for NVS on both simple and complex scenes. We explored an SRT-based variant of Γ, which would forego explicit geometry priors for a transformer and light field [68] based conditioning scheme. However, we had difficulty achieving sufficient convergence and in justifying the additional compute cost. Nevertheless, related approaches could be a promising area for future study.

D.2. Limitations

We believe our method to be a valuable step towards in-the-wild single-view novel view synthesis, but we acknowledge several limitations. While we demonstrate our method to be competitively geometrically consistent, it is not inherently 3D or temporally consistent. Noticeable flicker and other artifacts are sometimes visible in rendered sequences. While our model generally produces plausible renderings, it may not always perfectly transfer details from the input. On ShapeNet, this sometimes manifests as an inability to replicate the angle of car tires or the style of windows across the line of symmetry; on more complex datasets, the model sometimes struggles to transfer fine details. We use a relatively lightweight, ResNet-backed Deeplab feature encoder. A more powerful encoder, potentially one that makes use of attention to improve long-range information flow, may resolve these issues.
