
LooseControl: Lifting ControlNet for Generalized Depth Conditioning

Shariq Farooq Bhat (KAUST)    Niloy J. Mitra (University College London, Adobe Research)    Peter Wonka (KAUST)

arXiv:2312.03079v1 [cs.CV] 5 Dec 2023

Figure 1. Our framework LooseControl enables multiple ways to control the generative image modeling process. Left: (C1) Scene boundary control lets the user specify the boundary of the scene. We show the control inputs on top, the ControlNet results in the middle, and our results at the bottom. Middle: (C2) 3D box control can additionally specify object locations with the help of approximate 3D bounding boxes. We again show the control inputs on top, the ControlNet results in the middle, and our results at the bottom. Right-Top: (E1) 3D box editing. Note how the overall style of the scene is preserved across edits. Right-Bottom: (E2) Attribute editing: examples of swapping a couch for a different one and of changing the overall furniture density in a room.

Abstract

We present LooseControl to allow generalized depth conditioning for diffusion-based image generation. ControlNet, the SOTA for depth-conditioned image generation, produces remarkable results but relies on having access to detailed depth maps for guidance. Creating such exact depth maps, in many scenarios, is challenging. This paper introduces a generalized version of depth conditioning that enables many new content-creation workflows. Specifically, we allow (C1) scene boundary control for loosely specifying scenes with only boundary conditions, and (C2) 3D box control for specifying layout locations of the target objects rather than the exact shape and appearance of the objects. Using LooseControl, along with text guidance, users can create complex environments (e.g., rooms, street views, etc.) by specifying only scene boundaries and locations of primary objects. Further, we provide two editing mechanisms to refine the results: (E1) 3D box editing enables the user to refine images by changing, adding, or removing boxes while freezing the style of the image, yielding minimal changes apart from those induced by the edited boxes; (E2) attribute editing proposes possible editing directions to change one particular aspect of the scene, such as the overall object density or a particular object. Extensive tests and comparisons with baselines demonstrate the generality of our method. We believe that LooseControl can become an important design tool for easily creating complex environments and can be extended to other forms of guidance channels. Code and more information are available at https://shariqfarooq123.github.io/loose-control/. (The project was supported in part by "NTGC-AI" funding at KAUST.)

1. Introduction
Diffusion-based generative models now produce images with a remarkable degree of photorealism. ControlNet [51], trained on top of StableDiffusion [37], is the most powerful way to control such a generation process. Specifically, ControlNet allows guidance in the form of one or more of depth, edge, normal, or semantic channels. This ability to accurately control the image generation process enables various creative applications without requiring different architectures for different applications. However, providing perfect guidance for ControlNet can itself be challenging. For example, imagine a scenario where a user wants to create living room images using depth guidance. She is now expected to supply a depth map containing information about walls, room furniture, all the way down to even smaller objects and decorations. This is non-trivial: producing realistic depth maps, especially for cluttered scenes, is probably as challenging as solving her original task. On the one hand, providing rough guidance, e.g., depth information for walls, floor, and ceiling, results in unsatisfactory images; while one may expect a furnished room, the room will be empty. On the other hand, providing approximate depth using target bounding boxes for furniture does not yield the desired result either, because only boxy objects are generated. Figure 1 shows some examples using ControlNet [51].

We present LooseControl, which allows controlling image generation using generalized guidance. In this work, we focus on guidance through depth maps. We consider two types of specifications. (C1) Scene boundary control, where a layout can be given by its boundaries, e.g., walls and floors, but the final scene can be filled by an arbitrary number of objects that are not part of the depth conditioning, as long as these additional objects are closer to the camera than the scene boundary. (C2) 3D box control, where, in addition to layout boundaries, users can provide finer-scale guidance in the form of approximate bounding boxes for target objects. In the final generation, however, there can be additional secondary objects that do not need to strictly adhere to the given boxes. We demonstrate that many layout-guided image generation scenarios (e.g., rooms, streets, underwater) can be addressed in this framework. Figure 1 (left, middle) shows some of our generations.

We also provide two interactive editing modes to refine the results. (E1) 3D box editing enables the user to change, add, and remove boxes while freezing the style of the resulting image. The goal is to obtain minimal changes apart from the changes induced by the edited boxes. We describe how a notion of style can be formalized and preserved in our diffusion setup. (E2) 3D attribute editing enables users to change one particular attribute of the result, such as the type of one particular piece of furniture. Since diffusion processes return a distribution of results, we perform local analysis around any generation to reveal dominant modes of variation. Figure 1 (right) shows a few editing examples.

The technical realization of these four components is built on ControlNet with a frozen StableDiffusion backbone. We first propose a low-rank (LoRA) based network adaptation. The LoRA-based architecture allows the network to be fine-tuned in a few steps with only a small amount of training data, preventing the network from forgetting the original generation weights. A second important component is automatically synthesizing the necessary training data without requiring manual annotations. Finally, the edits are realized by manipulating the "keys" and "values" in attention layers, and by a singular value analysis of the ControlNet Jacobian.

We evaluate a range of use cases. Since we enable a strictly more general depth control, there are no direct competitors. Hence, we compare against strict and weaker guidance by varying the control weighting in ControlNet, as well as against retraining ControlNet on synthetic data. We evaluate our setup on a variety of different scenes and also perform a user study that reveals over 95% preference for our method compared to baselines. In summary, we are the first to allow image generation from loose depth specifications and to provide multiple types of edits to semantically explore variations in generations.

2. Related Work

Diffusion models have rapidly evolved into a leading generative approach, demonstrating remarkable success both in 2D image generation [6, 14, 18, 32, 33, 35, 37, 39] and in 3D shape generation [11, 12, 15, 20, 21, 40, 50, 52]. While there are many similarities and synergies between 2D and 3D diffusion, in this paper we only focus on the problem of adding conditional control to text-to-image diffusion models. In this context, many existing methods have proposed solutions for direct guidance through well-known condition inputs like inpainting masks [45, 48], sketches [44], scene graphs [49], color palettes [19, 30, 43], 2D bounding boxes [27], segmentation maps [13, 37, 46], compositions of multiple text descriptions [28], depth maps [30, 51], or by fine-tuning these models on a few subject-specific images [16, 29, 38, 42]. We focus on introducing new types of control. ControlNet [51] stands out as a key method in spatial control, supporting a diverse array of conditions like edge maps, depth maps, segmentation masks, normal maps, and OpenPose [8, 9, 41, 47] under a single framework. It is notably based on the widely-used open-source model Stable Diffusion [37], which contributes to its popularity. However, creating spatial control conditions manually for ControlNet can be challenging, often resulting in indirect control where conditions are derived from another image source, which can stifle the creative generation process. Additionally, this approach can be restrictive in terms of the diversity of generations per condition, posing limitations in various scenarios. Our work builds on top of ControlNet and introduces novel conditioning mechanisms that are not only easy to construct manually by a user but also offer enhanced control and flexibility. In particular, we contribute the novel forms of loose control described in the next section.

3. Problem Setup - LooseControl

We formally introduce the core problem statement proposed in this work - loose depth control. Ordinary depth control, as implemented by the original ControlNet, can be formally described as follows: Given an input condition depth map $D_c$ and access to an off-the-shelf monocular depth estimator $f_D$, generate an image $I_{gen}$ such that the estimated depth $f_D(I_{gen})$ respects the input depth condition, i.e.:

    Generate $I_{gen}$ such that $f_D(I_{gen}) = D_c$.    (1)

Thus, by design, the conditioning imposed by ControlNet is strict: as per training, the equality in Eq. (1) must exactly hold. Our goal is to extend this notion of control to a more flexible, generalized form. We therefore define generalized control via the following setup: Given an input condition depth map $D_c$, access to an off-the-shelf monocular depth estimator $f_D$, and an arbitrary Boolean condition function $\phi(\cdot,\cdot)$:

    Generate $I_{gen}$ such that $\phi(f_D(I_{gen}), D_c)$ is TRUE.    (2)

It is easy to observe that Eq. (1) is a special case of Eq. (2). In this work, we propose to consider two other cases: scene boundary control and 3D box control, as described next.

(C1) Scene boundary control. In this case, we impose the condition such that the input depth condition $D_c$ only specifies a tight upper bound of the depth at each pixel:

    $\phi: f_D(I_{gen}) \le D_c$.    (3)

(C2) 3D box control. In this case, we let the condition $D_c$ control only the approximate position, orientation, and size of objects by specifying their approximate 3D bounding boxes. This leads to finer control than (C1), yet still avoids specifying the exact shape and appearance of the objects. Essentially, we design a condition function $\phi$ that ensures that the objects $O^i_{gen}$ generated in an image conform to their respective specified 3D bounding boxes $B_i$:

    $\phi: B_i \sim \mathrm{3DBox}(O^i_{gen}) \ \forall i$,    (4)

where $\mathrm{3DBox}(O^i_{gen})$ represents the oriented 3D bounding box of the $i$-th object segment in the generated image. This means that the position, size, and orientation of each object will be approximately within the bounds set by its corresponding 3D bounding box. Although our 3D box control training is strict, we show that the boxes can be treated as only approximate, which proves to be highly beneficial in practice.

Both Eq. (3) and Eq. (4) specify a form of depth condition that is strictly more general than the ordinary control realized by ControlNet. We term such conditioning LooseControl.
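To make the two condition types concrete, the following minimal NumPy sketch (an illustration added here, not the authors' code) phrases Eq. (1) and Eq. (3) as Boolean condition functions over depth maps; the tolerance `tol` is an assumed parameter.

```python
import numpy as np

def phi_exact(d_gen: np.ndarray, d_c: np.ndarray, tol: float = 1e-2) -> bool:
    """Ordinary ControlNet-style condition (Eq. 1): estimated depth must match D_c."""
    return bool(np.allclose(d_gen, d_c, atol=tol))

def phi_upper_bound(d_gen: np.ndarray, d_c: np.ndarray, tol: float = 1e-2) -> bool:
    """Scene boundary condition (Eq. 3): D_c is only a per-pixel upper bound,
    so generated content may lie anywhere closer to the camera than the boundary."""
    return bool(np.all(d_gen <= d_c + tol))

# Example: a generated object in front of a wall satisfies the loose condition
# but violates the strict one.
wall = np.full((4, 4), 5.0)              # boundary depth D_c (e.g., a wall at 5 m)
gen = wall.copy()
gen[1:3, 1:3] = 2.0                      # estimated depth with a generated object at 2 m
assert phi_upper_bound(gen, wall) and not phi_exact(gen, wall)
```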
4. Realizing LooseControl

To realize ordinary depth control of text-to-image diffusion models, one needs access to triplets of the form $(T, D_c, I)$ for training, where $T$ is the text description of the image $I$ and $D_c$ represents the depth condition. Generally, one has access to the pairs $(T, I)$, and the ordinary depth condition $D_c$ is obtained by applying an off-the-shelf depth estimator $f_D$ to the given image (i.e., $D_c = f_D(I)$) to obtain the triplets $(T, D_c = f_D(I), I)$. However, in our case, the depth condition $D_c$ must take a more generalized form. For our goal, the form of $D_c$ is constrained by three main requirements: (i) compatibility with ControlNet, where the depth condition should resemble a conventional depth map; (ii) efficient to extract from a given image without manual annotation; and (iii) easy to construct manually by a user. We describe below how to obtain the appropriate form of $D_c$ such that LooseControl is realized.

4.1. How to represent scenes as boundaries?

We begin by outlining the estimation of the depth condition $D_c$ from a given image for implementing scene boundary control as described in Eq. (3). In this context, we seek a depth condition that acts as an upper depth limit for individual pixels. We propose to extract the scene boundary surface for upper-bound training in the following manner. Specifically, we define the boundary as a set of planar surfaces encompassing the scene that accurately delineates the scene's spatial extent. For a given bounded 3D scene and a specific camera viewpoint, the boundary depth is defined as the z-depth of these boundary surfaces. In practice, this means that all pixels in the image inherit the depth value associated with the boundary surface, even if the boundary surfaces are occluded by objects within the scene's interior. To provide a concrete example, consider an indoor scene (refer to Fig. 3). The boundary typically encompasses the wall, ceiling, and floor planes. Consequently, a boundary depth map for an indoor scene exclusively reflects the depths of walls, ceiling, and floor, irrespective of the presence of room furniture.

A naive approach to extracting boundary depth involves leveraging annotated room layout datasets. However, such datasets often contain ambiguous annotations in multi-room images, and the room-level annotations directly violate our "upper" bound condition (see Supplementary Material). Additionally, this approach necessitates manual annotation for boundary extraction and confines the scope to room-centric scenarios. Our objective is to devise a strategy that is applicable across diverse scene types and readily applicable to extract boundary depth from any given image. To this end, we propose the following algorithm:

We use a multi-step approach to efficiently extract boundary depth from a given image; refer to Fig. 2 for the outline of our approach. We begin by estimating the depth map of the given image using an off-the-shelf monocular depth estimator [7]. We then back-project the image into a 3D triangular mesh in world space. Our goal is to extract the planar surfaces that encompass this mesh. For efficiency, we only make use of vertical planes during training. This reduces the 3D boundary extraction problem to 2D. We project the 3D mesh of the scene onto a horizontal plane using orthographic projection. This projection facilitates the precise delineation of the 2D boundary that encapsulates the scene. We then approximate the 2D boundary of the projection with a polygon and extrude its sides into planes matching the scene height. For an indoor scene, for example, these "poly"-planes represent a good approximation of the wall planes. These poly-planes, and optionally the ground and ceiling planes (if available), together form the boundary surface. We then render the depth of this boundary surface from the original camera view to get the boundary depth map $D_c$, serving as a proxy for the scene boundary condition.

Figure 2. Pipeline for extracting boundary depth from an image. From left to right: input image and its estimated depth map, back-projected 3D mesh, orthographic projection of the mesh onto a horizontal plane, polygon approximation of the 2D boundary, extrusion of the polygon sides, and the resulting boundary depth map. For ease of visualization, the ceiling is not shown.
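As a concrete illustration of this multi-step procedure, here is a condensed NumPy/OpenCV sketch of the vertical-plane variant under simplifying assumptions (known pinhole focal length fx and principal point cx, a convex scene footprint, ground/ceiling planes omitted); the authors' implementation instead works on a triangle mesh and renders the extruded planes with a standard mesh rasterizer.

```python
import cv2
import numpy as np

def boundary_depth(depth: np.ndarray, fx: float, cx: float, eps: float = 0.01) -> np.ndarray:
    """Approximate scene-boundary depth from an estimated depth map (cf. Fig. 2).

    Simplifications: only vertical wall planes, a convex footprint
    (convex hull + polygon simplification), and a pinhole camera.
    """
    h, w = depth.shape
    u = np.tile(np.arange(w, dtype=np.float64), (h, 1))
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx                                  # back-project (x-z plane only)
    pts = np.stack([x.ravel(), z.ravel()], axis=1).astype(np.float32)

    hull = cv2.convexHull(pts)                             # 2D footprint of the scene
    poly = cv2.approxPolyDP(hull, eps * cv2.arcLength(hull, True), True).reshape(-1, 2)

    ray_x = (u - cx) / fx                                  # camera rays, z-component = 1
    out = np.zeros_like(z)
    for i in range(len(poly)):                             # each polygon edge -> a vertical wall
        p1 = poly[i].astype(np.float64)
        p2 = poly[(i + 1) % len(poly)].astype(np.float64)
        e = p2 - p1
        n = np.array([e[1], -e[0]])                        # wall normal in the x-z plane
        denom = n[0] * ray_x + n[1]                        # n . ray
        denom = np.where(np.abs(denom) > 1e-8, denom, np.inf)
        t = (n @ p1) / denom                               # z-depth of the ray/wall intersection
        s = ((t * ray_x - p1[0]) * e[0] + (t - p1[1]) * e[1]) / (e @ e + 1e-12)
        valid = (t > 0) & (s >= 0) & (s <= 1)              # hit lies on this wall segment
        out = np.where(valid & (t > out), t, out)          # keep the farthest (visible) wall
    return out
```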
Figure 3. Illustration of proxy depth for scene boundary control training (left) and 3D box control training (right).

4.2. How to represent scenes as 3D boxes?

We now describe the estimation of $D_c$ from a given image to realize 3D box control (Eq. (4)). For a given image, our goal is to construct $D_c$ such that it contains the information about the 3D bounding boxes of the objects present in the scene. At the same time, we need to ensure compatibility with depth-conditioned frameworks, such as ControlNet, where the depth condition should resemble a conventional depth map. To achieve this, our idea is to obtain the 3D bounding boxes of the objects in the image and render the depth of the boxes. The resulting depth map can serve as a proxy for specifying 3D bounding boxes (see Fig. 3). However, this task necessitates a robust monocular 3D object detector, which, currently, presents challenges. Existing monocular 3D object detection pipelines tend to be either sparse, targeting specific object categories, or domain-specific, focusing on scenarios like road scenes, limiting their versatility. To address these limitations, we introduce a custom monocular pipeline that approximates the 3D bounding boxes of objects and is simultaneously dense (capable of recovering any object type) and generalizable (no domain preference).

For a given image, we first obtain its depth map and segmentation map using an off-the-shelf monocular depth estimator, ZoeDepth [7], and SAM [24], respectively. The depth map provides 3D spatial information, while the segmentation map delineates object boundaries; we use them together to approximate the 3D bounding boxes of objects. For each segment, we perform back-projection, transforming the image segment using its depth map into a point cloud in 3D world space. We then estimate the corresponding oriented 3D bounding box with minimal volume for each back-projected segment point cloud and represent it as a cuboidal mesh. Finally, to obtain $D_c$, we render this scene of boxes using the original camera parameters used in back-projection and extract the depth map. This process is summarized in Fig. 5.

Figure 5. Pipeline for extracting proxy depth for 3D box control from an image. Left to right: input image, its estimated depth and segmentation maps, back-projected 3D mesh, 3D lifting of the segmentation, 3D bounding boxes for segments, and the resulting proxy depth map.
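A minimal sketch of the per-segment box fitting described above, assuming Open3D for the geometry step. The paper fits a minimal-volume oriented box and renders the cuboids with PyTorch3D (see the supplementary implementation details); this sketch uses Open3D's PCA-based oriented bounding box as a stand-in and stops before the rendering step.

```python
import numpy as np
import open3d as o3d

def segment_boxes(depth, masks, fx, fy, cx, cy):
    """For each segmentation mask, back-project its pixels into a point cloud and
    fit an oriented 3D bounding box, returned as a cuboid mesh.

    Rendering the z-depth of these cuboids from the original camera (to obtain the
    proxy condition D_c) is done with a standard mesh rasterizer and omitted here.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    meshes = []
    for m in masks:                                              # boolean (h, w) masks, e.g. from SAM
        z = depth[m]
        pts = np.stack([(u[m] - cx) * z / fx, (v[m] - cy) * z / fy, z], axis=1)
        pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
        obb = pcd.get_oriented_bounding_box()                    # PCA-based OBB (stand-in for min-volume)
        box = o3d.geometry.TriangleMesh.create_box(*obb.extent)  # axis-aligned cuboid of the right size
        box.translate(-0.5 * np.asarray(obb.extent))             # center it at the origin
        box.rotate(obb.R, center=(0, 0, 0))                      # apply the box orientation
        box.translate(obb.center)                                # move it to the box center
        meshes.append(box)
    return meshes
```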

Figure 4. Training pipelines for ControlNet (top) and LooseControl (bottom). "Proxy Estimator" represents our proxy depth extraction algorithms. During inference, we enable an option for a user to design the condition depth manually via a UI.

4.3. How to train for LooseControl?

We build our framework on StableDiffusion [37] and ControlNet [51]. In both scene boundary control and 3D box control, the constructed depth condition $D_c$ serves as a proxy for specifying a more generalized control - scene boundary and 3D bounding box, respectively - while retaining compatibility with ordinary depth-conditioned frameworks such as ControlNet. This ensures a smooth transition and facilitates the easy adoption of these approaches. Thanks to this backward compatibility, we can efficiently fine-tune a pre-trained depth ControlNet to achieve generalized depth control. We prepare the triplets $(T, D_c, I)$ from the given $(T, I)$ pairs, where $D_c$ is constructed using the algorithms presented in the previous sections. We explore two primary options for fine-tuning: (i) naive fine-tuning, where we fine-tune the entire ControlNet; and (ii) LoRA fine-tuning, where we employ LoRA-based fine-tuning of ControlNet (see Fig. 4). Specifically, for every attention block in ControlNet, we learn a low-rank update:

    $W'x = Wx + BAx$,    (5)

where $W \in \mathbb{R}^{M \times N}$ is the original frozen projection, and $B \in \mathbb{R}^{M \times r}$ and $A \in \mathbb{R}^{r \times N}$ are trainable low-rank matrices of rank $r \ll \min(M, N)$. It is important to note that in both cases the Stable Diffusion U-net remains frozen during fine-tuning. For both control types (C1 and C2), we use the NYU-Depth-v2 [31] dataset for fine-tuning and obtain textual image descriptions using BLIPv2 [26] captioning.

Inference time adjustments. We observe that naive fine-tuning results in saturated generations (Fig. 6). Interestingly, the color artifacts are eliminated if the residuals injected into the Stable Diffusion U-net from ControlNet are reweighted at inference time. We observe that the residual to the bottleneck block is responsible for most of the condition adherence, while the other, higher-resolution skip residuals largely add only local texture information. On the other hand, LoRA fine-tuning proves to be robust without any color issues. However, we introduce a controllable scale factor $\gamma$ at inference time such that the update is given as $W'x = Wx + \gamma BAx$. We observe that controlling $\gamma$ can lead to higher quality results, especially for $\gamma > 1$. We use $\gamma = 1.2$ as the default.

Figure 6. Given a scene boundary (left), comparing naive fine-tuning, adjusted skip weights, and our LoRA-based training, respectively.
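The low-rank update of Eq. (5), together with the inference-time scale γ, can be written as a small PyTorch module; this is a sketch we provide for illustration (the layer and parameter names are assumptions, not the released training code).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen projection W with a trainable low-rank update (Eq. 5):
    W'x = Wx + gamma * B A x, with rank r << min(M, N).

    The paper applies such updates to every attention block of ControlNet
    (rank r = 8), trains with gamma = 1, and uses gamma = 1.2 at inference."""

    def __init__(self, base: nn.Linear, rank: int = 8, gamma: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # keep the original weights frozen
            p.requires_grad_(False)
        n, m = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, n) * 0.01)   # r x N
        self.B = nn.Parameter(torch.zeros(m, rank))          # M x r (zero-init: no change at start)
        self.gamma = gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.gamma * (x @ self.A.t() @ self.B.t())

# Usage sketch: replace, e.g., the attention projections of ControlNet with
# LoRALinear(original_linear, rank=8), train only A and B, and set
# layer.gamma = 1.2 at inference time.
```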
4.4. 3D Box Editing

Both the scene boundary condition and the box condition are easy for a user to manipulate. This opens up possibilities where a user can interactively manipulate the scene and objects, for example, by changing, adding, or removing the boxes. However, it is desirable to "lock in" and maintain the style and composition of the scene while manipulating the condition. A naive way to maintain the overall style would be to fix the seed, but we observe that changing the condition even when fixing the seed can lead to diverging generations. To this end, we propose Style Preserving Edits, inspired by video diffusion models [22], which allows users to maintain the desired style of the scene while conducting sequential edits. This feature is realized by replacing the "keys" and "values" for the given image in the attention layers of the Stable Diffusion U-net with those from the source image. However, in our case, sharing "keys" and "values" through all layers of the U-net leads to undesirable results, producing images identical to the source images. We find that sharing "keys" and "values" from only the last two decoder layers ensures the preservation of the desired identity or style while adhering to the new target condition.
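The key/value sharing can be implemented as a thin wrapper around the affected attention blocks; the sketch below is schematic (single-head attention, and the to_q/to_k/to_v/to_out projection names are assumptions), intended only to illustrate the record-then-reuse mechanism applied to the last two decoder layers.

```python
import torch
import torch.nn.functional as F

class KVShareAttention(torch.nn.Module):
    """Schematic self-attention wrapper for Style Preserving Edits.

    mode == "record": cache the keys/values produced for the source image.
    mode == "reuse":  replace the current keys/values with the cached ones.
    """

    def __init__(self, attn: torch.nn.Module):
        super().__init__()
        self.attn = attn              # wrapped block assumed to expose to_q/to_k/to_v/to_out
        self.mode = "off"             # "off" | "record" | "reuse"
        self.cache = {}

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        q = self.attn.to_q(hidden_states)
        k = self.attn.to_k(hidden_states)
        v = self.attn.to_v(hidden_states)
        if self.mode == "record":
            self.cache["k"], self.cache["v"] = k.detach(), v.detach()
        elif self.mode == "reuse":
            k, v = self.cache["k"], self.cache["v"]   # inject the source image's keys/values
        out = F.scaled_dot_product_attention(q, k, v)
        return self.attn.to_out(out)
```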

4.5. Attribute Editing

LooseControl's generalized nature compared to ordinary depth control allows for a wider range of possible scenes under a given condition. To explore this expansive space, we conduct various forms of latent exploration.

ControlNet produces two types of residuals for a given condition: residuals added to the SD U-net bottleneck and residuals added to the decoder skip connections. Our focus centers on the bottleneck feature space ($\mathcal{H}$-space), known for its semantic behavior [17, 25] and, from our experiments, its pivotal role in condition adherence.

In our exploration, we examine how the latent input $x$ affects ControlNet's output for a given fixed condition. We extract the most influential directions in $x$-space, which cause substantial changes in $\Delta h \in \mathcal{H}$, and find their counterparts in $\mathcal{H}$-space, using a Singular Value Decomposition (SVD) of the ControlNet Jacobian, $\partial \Delta h / \partial x$. To keep the SVD computation feasible, we only extract the first $N$ directions, labeled $\{e_i\}$, which guide our modification of the bottleneck residual via $\Delta h' = \Delta h + \beta e_i$, where the scalar $\beta$ controls the magnitude of the edit.

Per-condition latent exploration enhances user control by revealing various semantic edit directions, for example, sizes, types, and even the number of objects in the scene. This gives rise to a continuous control space, allowing for fine-grained adjustments that were challenging with conventional conditioning methods. This continuous control space operates orthogonally to the input depth condition, enabling the maintenance of the specified condition, whether scene boundary control or 3D box control, while simultaneously exploring a wide range of generated images that adhere to the given condition.
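A sketch of how such directions could be extracted and applied, assuming a wrapper function that maps the latent x to the ControlNet bottleneck residual Δh for a fixed condition (the wrapper, the dense Jacobian, and the helper names are assumptions for illustration; a low-rank or randomized SVD would be used in practice to keep the computation feasible).

```python
import torch

@torch.no_grad()
def apply_edit(delta_h: torch.Tensor, direction: torch.Tensor, beta: float) -> torch.Tensor:
    """Edit the bottleneck residual: delta_h' = delta_h + beta * e_i (Sec. 4.5)."""
    return delta_h + beta * direction.view_as(delta_h)

def top_h_directions(f, x: torch.Tensor, n_dirs: int = 8):
    """Find dominant modes of variation of the ControlNet bottleneck residual.

    f: a function x -> delta_h computing the bottleneck residual for a *fixed*
       condition (an assumed wrapper around ControlNet, not a released API).
    Returns the first n_dirs right-singular vectors (directions in x-space) and
    the corresponding left-singular vectors (their counterparts in H-space) of
    the Jacobian d(delta_h)/dx. The dense Jacobian below is only tractable for
    small toy problems.
    """
    flat = lambda t: t.reshape(-1)
    J = torch.autograd.functional.jacobian(lambda z: flat(f(z.view_as(x))), flat(x))
    U, S, Vh = torch.linalg.svd(J, full_matrices=False)
    return Vh[:n_dirs], U[:, :n_dirs]        # x-space directions, H-space directions
```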

Figure 7. Qualitative results of (C1) scene boundary control (rows 1-3) and (C2) 3D box control (rows 4-6). Top: input depth condition. Middle: ControlNet generations. Bottom: ours. Our generations are more realistic and adhere to the prompt better than ControlNet. Prompts for (C1): "a clothing store", "seating area of charming Irish pub with dark wood paneling", "a heavily equipped kitchen with appliances", "a living room with sofa, house plants and a table", "inside of an old cathedral with pew", "crowd celebrating Guy Fawkes's Night on Trafalgar Square", "busy street in Tokyo at night with a lot of pedestrians". Prompts for (C2): "swivel office sofa chair with a round backrest", "a white English arm roll sofa", "a living room", "four office chairs", "a helicopter flying over a pine forest in winter", "a bedroom with a canopy bed with puffy bedding & nightstands", "a set of conical icebergs in the ocean".

5. Applications and Results

We developed an interface to enable users to conveniently design a scene boundary and 3D boxes and generate images interactively (see supplementary material). Here, we present qualitative and quantitative results.

(C1) Scene Boundary Control. Scene boundary control enables diverse applications across various domains. We present some examples and comparisons between LooseControl and ordinary control in Fig. 7. One particularly noteworthy application of scene boundary control is in indoor scene generation.

Figure 8. Qualitative results of (E1) 3D box editing and (E2) attribute editing. (E1) 3D box editing enables us to move and re-orient objects in the scene, change their size, split them, and more. (E2) Attribute editing lets us change properties like the density of furniture (e.g., increase or decrease density) and the material of objects (e.g., fabric to leather or vice versa).

Users can provide somewhat abstract specifications for room layouts, generally confined to walls and floor, and still generate images of fully furnished rooms. This is different from ordinary depth conditioning, which demands precise depth information.

Another scenario involves (partially) bounded outdoor scenes, where the user can provide the scene boundary depth, for example, in terms of the locations of buildings (see Fig. 1). Unlike the ordinary depth control implemented by the original ControlNet, which leads to empty scenes, our LooseControl model excels at generating realistic scenes, inclusive of objects like cars, traffic lights, and pedestrians, all while adhering to the user-provided scene boundary condition. This underscores the reliability and generalization capability of LooseControl via proxy fine-tuning.

In Fig. 9, we show the effect of varying the conditioning scale, λ, in the context of scene boundary control. We observe that ControlNet still produces "empty" rooms with weaker prompt adherence for most of the conditioning scales.

(C2) 3D Box Control. We present generations and comparisons with ControlNet for 3D box control for a variety of scenes in Fig. 7. ControlNet with ordinary control produces "boxy" generations, which are not only unrealistic but also result in weaker prompt adherence. In contrast, our method produces more realistic images and shows better adherence to the given prompt, demonstrating the utility of such control.

(E1) 3D Box Editing. We show qualitative results for 3D box editing in Fig. 8. 3D box editing allows users to manipulate objects, for example, by changing their position in 3D space, their orientation, and their size. Users can perform 3D-aware edits, such as relocating furniture within a room or converting a small sofa chair into a big sofa. The Style Preserving Edit mechanism facilitates sequential editing of object poses and locations, preserving the scene's essence while introducing creative modifications. Since the 3D box control implicitly specifies the rough shape and size of the objects, we can also edit the boxes themselves, such as splitting a box into two, resulting in semantic edits on objects.

(E2) Attribute Editing. Extracting the edit directions for a fixed scene boundary leads to continuous control directions for attributes such as the density of furniture and the furniture type. For a given 3D box control condition (see Fig. 8), the edit directions represent attributes like the shape, thickness, and material properties such as "leather". We observe that the edit directions are smooth and continuous for the most part, and varying the magnitude β directly reflects the magnitude of change in the attribute.

User Study. We conducted a user study to quantify the improvement in control and quality of generations. 41 users were presented with the prompt and condition and asked to pick the image they preferred between ours and ControlNet. A majority of users, more than 95%, ranked our result better. See the inset for results.

Figure 9. Effect of the conditioning control scale λ. Top: ControlNet. Bottom: Ours. The results were produced for the prompt "A cozy coastal bedroom with a sea view".

6. Conclusion

We have presented LooseControl to support generalized depth control for image generation. We introduced two types of control, (C1) scene boundary control and (C2) 3D box control, to generate initial images. To refine initial results, we proposed (E1) 3D box editing and (E2) attribute editing. Our framework provides new modes of creative generation and editing, allowing users to more effectively explore the design space with depth guidance. A user study revealed over 95% preference for LooseControl compared to previous work.

Limitations. Although our method works well for primary objects, control over secondary objects is harder to achieve; we attribute this to the training data, where secondary objects are less common. We expect a scene-based generator may lead to improved results. Also, similar to the original ControlNet, we find that providing too many constraints as input reduces the diversity of the results.

Future work. We would like to use LooseControl in conjunction with masks and also with MultiControlNet through the generalized specification of other maps. For example, in the context of sketch-based generation, specifying only a selection of (dominant) edges is easier for users than drawing all the Canny edges. Further, we would like to explore temporally consistent generation to produce output videos under smoothly changing guidance, as in an interpolated 6DOF object specification.

References

[1] https://huggingface.co/.
[2] https://www.babylonjs.com/.
[3] https://huggingface.co/docs/diffusers/index.
[4] https://streamlit.io/.
[5] Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv preprint arXiv:1906.02569, 2019.
[6] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[7] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
[8] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[9] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[10] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[11] Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion NeRF: A unified approach to 3D generation and reconstruction. In ICCV, 2023.
[12] Gene Chou, Yuval Bahat, and Felix Heide. Diffusion-SDF: Conditional generative modeling of signed distance functions, 2023.
[13] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
[14] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[15] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. HyperDiffusion: Generating implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015, 2023.
[16] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[17] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. arXiv preprint arXiv:2303.11073, 2023.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[19] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
[20] Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3D shape generation. In SIGGRAPH Asia 2022 Conference Papers. Association for Computing Machinery, 2022.
[21] Animesh Karnewar, Niloy J Mitra, Andrea Vedaldi, and David Novotny. HoloFusion: Towards photo-realistic 3D generative modeling. In ICCV, 2023.
[22] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
[25] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. arXiv preprint arXiv:2210.10960, 2022.
[26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[27] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
[28] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
[29] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[30] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[31] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[32] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[35] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[36] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[38] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[40] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3D neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875–20886, 2023.
[41] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
[42] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
[43] Vaibhav Vavilala and David Forsyth. Applying a color palette with local control using diffusion models, 2023.
[44] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
[45] Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18359–18369, 2023.
[46] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.
[47] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
[48] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. SmartBrush: Text and shape guided object inpainting with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22428–22437, 2023.
[49] Ling Yang, Zhilin Huang, Yang Song, Shenda Hong, Guohao Li, Wentao Zhang, Bin Cui, Bernard Ghanem, and Ming-Hsuan Yang. Diffusion-based scene graph to image generation with masked contrastive pre-training. arXiv preprint arXiv:2211.11138, 2022.
[50] Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 3DShape2VecSet: A 3D shape representation for neural fields and generative diffusion models. ACM Trans. Graph., 42(4), 2023.
[51] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[52] Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional SDF diffusion for controllable 3D shape generation. ACM Trans. Graph., 42(4):91:1–91:13, 2023.
[53] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv preprint arXiv:1801.09847, 2018.

LooseControl: Lifting ControlNet for Generalized Depth Conditioning
Supplementary Material

Figure 10. Upper-bound violations in Matterport3D [10]. Top: RGB panorama images. Bottom: Depth rendered using the layout labels
from the dataset.

7. Implementation details

For all our experiments, we use Stable Diffusion v1.5 [37] and the corresponding ControlNet [51] checkpoint "lllyasviel/control_v11f1p_sd15_depth" hosted on HuggingFace [1]. We use the PyTorch [34] framework and the diffusers [3] library as the framework for diffusion models. We use ZoeDepth [7] and SAM [24] for extracting depth maps and segmentation maps, respectively. For SAM, we use min_mask_area=1e4. We make use of PyTorch3D [36] and Open3D [53] in our 3D framework, and PyTorch3D for rendering the depth maps to obtain the proxy depths for fine-tuning. For back-projection, we randomly choose an FOV in the range of 43 to 57 degrees. For both Scene Boundary Control and 3D Box Control, we use LoRA rank r = 8 and fine-tune only the LoRA layers for 200 steps with a learning rate of 0.0001 and a batch size of 12, using the Adam [23] optimizer. We use controlnet_conditioning_scale=1.0 and a LoRA scale factor γ = 1.2 as defaults. We built our 3D editor user interface using Gradio [5] and BabylonJS [2].
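For orientation, a minimal diffusers sketch that loads the base models named above; the Stable Diffusion repository id, the placeholder depth image, and the prompt are assumptions for illustration, and attaching the fine-tuned LooseControl LoRA (γ = 1.2) is project-specific and therefore omitted here.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Base depth ControlNet checkpoint named in Sec. 7.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16)

# Stable Diffusion v1.5 backbone (repository id assumed).
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Placeholder proxy depth; in practice this is the boundary / box proxy depth map
# constructed as in Secs. 4.1-4.2 of the main paper or drawn in the 3D editor UI.
proxy_depth = Image.new("RGB", (512, 512), (128, 128, 128))

image = pipe(
    "a living room with sofa, house plants and a table",
    image=proxy_depth,
    controlnet_conditioning_scale=1.0,
).images[0]
image.save("depth_controlnet_result.png")
```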
8. Issues with Room Layout Datasets

As discussed in the main paper, a possible alternative for extracting the scene boundary for the preparation of the dataset for Scene Boundary Control could be room layout datasets. However, we note that these datasets often contain ambiguous layout labels that directly violate the tight upper-bound condition required to implement scene boundary control. We provide some examples of these violations for the popular Matterport3D dataset as a representative in Fig. 10.

9. User study additional details

We created an anonymous user study as a web form using streamlit [4]. Users were presented with a two-alternative forced choice (2-AFC): the two options were images generated by our baseline (ControlNet) and by our method, and users were asked to respond with their preference for the given text prompt and condition image. The options were anonymized and did not indicate the name of the method. Each user was asked to respond to 10 randomized questions in total (5 for Scene Boundary Control and 5 for 3D Box Control). The order of the options was also randomized. As mentioned in the main paper, over 95% of responses were in favor of our method.
