
Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models

Naoki Matsunaga* Masato Ishii Akio Hayakawa Kenji Suzuki Takuya Narihira
Sony Group Corporation
Tokyo, Japan

[Figure 1: (a) edits on a real face image ("Original", "+ Change Hairstyle", "+ Move Eyes", "+ Close Mouth"); (b) comparison with the GAN-based method EditGAN for "+ Move Eyes" and "+ Shorten Arm" edits, showing the original, the edited map, and the edited images of ours and EditGAN.]

Figure 1. Our approach with pixel-wise segmentation guidance enables fine-grained real image editing. An existing GAN-based method,
called EditGAN, fails to reconstruct detailed features, which are less observed in the training data (e.g., a pacifier in the left case). In
contrast, our diffusion-based method enables pixel-wise editing while preserving detailed features.

Abstract

Generative models, particularly GANs, have been utilized for image editing. Although GAN-based methods perform well on generating reasonable contents aligned with the user's intentions, they struggle to strictly preserve the contents outside the editing region. To address this issue, we use diffusion models instead of GANs and propose a novel image-editing method based on pixel-wise guidance. Specifically, we first train pixel classifiers with few annotated data and then estimate the semantic segmentation map of a target image. Users then manipulate the map to instruct how the image is to be edited. The diffusion model generates an edited image via guidance by the pixel-wise classifiers, such that the resultant image aligns with the manipulated map. As the guidance is conducted pixel-wise, the proposed method can create reasonable contents in the editing region while preserving the contents outside this region. The experimental results validate the advantages of the proposed method both quantitatively and qualitatively.

* Corresponding author: Naoki.Matsunaga@sony.com

1. Introduction

Recently, generative models have been widely utilized in image-editing methods [15, 54] thanks to their potential to generate high-quality visual contents. Several techniques, such as neural photo-editing filters, have already been applied to real-world applications by digital artists [7] and have contributed to expanding their creativity. To more effectively reflect the user's intention in the details of edited images, we focus on fine-grained image editing. In this paper, we refer to fine-grained image editing as an editing process
with pixel-level control that does not generate new content, but rather makes small and localized changes to the appearance of existing contents in the image, such as manipulating facial parts or an animal's pose, as shown in Fig. 1. Fine-grained image editing particularly requires two properties: (1) visual contents in an edited region should be reasonably and locally created according to the user's intention, and (2) the contents outside the edited region should be precisely preserved. Jointly satisfying these two properties leads to higher controllability and a better user experience in the editing process.

In the literature, GANs (Generative Adversarial Networks) [18] have been widely adopted for image editing [54]. These methods generally perform well in a global manner [9, 10, 17, 27, 36, 37, 43–45, 52, 53, 55] by manipulating latent features corresponding to a target image. For fine-grained editing, those methods have been extended to precisely reflect fine-grained instructions by users, such as a manipulated semantic segmentation map [29, 30, 35, 58] or sketches [12], in the edited images. Although these methods can provide high-quality visual contents in the edited area, it is essentially difficult to strictly preserve the other contents in the target image, because it is difficult to find the latent features corresponding to the edited image while satisfying such a constraint. As shown at the bottom of Fig. 1, GAN-based methods may cause unintentional changes in visual contents outside the edited region, which are less observed in the training data (e.g., a pacifier in the left case).

In this work, we adopt diffusion models instead of GANs to address the above-mentioned issue and propose a novel fine-grained image editing method based on pixel-wise guidance. For the guidance, we first train pixel classifiers that estimate the semantics of each pixel. The classifiers are then used to guide the generation process of the diffusion model so that the generated image aligns with an edited semantic segmentation map. Since this guidance is conducted in a pixel-wise manner, our method can precisely preserve contents outside an edited area while creating reasonable contents in the edited area. Experimental results validate the advantage of our method compared with GAN-based methods both quantitatively and qualitatively.

In summary, our contributions are three-fold:

• We propose a novel image editing method based on pixel-wise guidance using diffusion models. Our method is specifically designed to create reasonable contents in the editing region while preserving the contents outside this region.
• We empirically demonstrate that our method outperforms GAN-based methods both quantitatively and qualitatively.
• Further empirical analysis reveals that our method is capable of generating semantically interpolated images between the original image and its edited one.

2. Related Works

2.1. Diffusion models

The Denoising Diffusion Probabilistic Model (DDPM) [16, 20, 33, 46] is a generative model that aims to generate data by reversing a diffusion process. The forward diffusion process is defined as a Markov chain over time, where a small amount of Gaussian noise is added to the samples over T steps, resulting in x_T following an isotropic Gaussian distribution. A single step of this process can be represented by q(x_t | x_{t-1}) := \mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1}, \beta_t I), where β_t is a hyperparameter that defines the process. If β_t is sufficiently small, a step of the reverse diffusion process q(x_{t-1} | x_t) can also be parameterized by a Gaussian transition, p_\theta(x_{t-1} | x_t) := \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)), where the mean µ_θ and covariance Σ_θ are estimated from x_t, which is the input of the noise estimator ε_θ(x_t, t); ε_θ is parameterized by a U-Net encoder and decoder. The model is trained to obtain µ via noise estimation for x_t, while Σ is assumed to be fixed in DDPM. Consequently, we can represent a single step of the reverse process as follows:

x_{t-1} = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t \epsilon,   (1)

where α_t := 1 − β_t, \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s, and ε ∼ N(0, I). Given an initial x_T, we generate a sample by repeating the above step from t = T to t = 0.

Data generation through DDPM is stochastic, as shown in Eq. 1. Therefore, given a fixed x_T, we can obtain various x_0 through this process. In contrast, DDIM [47] extends DDPM to obtain a deterministic generation process by utilizing a non-Markovian forward process. A single step of the reverse process in DDIM is described as follows:

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, f_\theta(x_t, t) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t),   (2)

f_\theta(x_t, t) := \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}.   (3)

In this setting, given target data x_0, we can obtain the latent variable x_T that generates x_0 through DDIM by repeating the following computation:

x_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\, f_\theta(x_t, t) + \sqrt{1 - \bar{\alpha}_{t+1}}\, \epsilon_\theta(x_t, t).   (4)

Using Eq. 4 and Eq. 2, DDIM achieves accurate reconstruction (inversion).
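To make Eqs. 2–4 concrete, the following is a minimal PyTorch-style sketch of a deterministic DDIM reverse step and of the corresponding inversion loop. The names `eps_model` (the noise estimator ε_θ) and `alphas_bar` (the schedule ᾱ_t, stored as a tensor) are placeholders for illustration only; they do not refer to any specific library or to the paper's released implementation.

```python
import torch

def ddim_reverse_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step x_t -> x_{t-1} (Eqs. 2-3)."""
    # Predicted clean sample f_theta(x_t, t), Eq. 3.
    f = (x_t - (1.0 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    # Eq. 2: re-noise the prediction to the previous time step.
    return alpha_bar_prev.sqrt() * f + (1.0 - alpha_bar_prev).sqrt() * eps

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_bar, t0):
    """Map a real image x0 to its latent x_{t0} by iterating Eq. 4 forward."""
    x = x0
    for t in range(t0):
        eps = eps_model(x, t)                                    # eps_theta(x_t, t)
        f = (x - (1.0 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
        x = alphas_bar[t + 1].sqrt() * f + (1.0 - alphas_bar[t + 1]).sqrt() * eps
    return x
```

Running `ddim_invert` followed by repeated `ddim_reverse_step` calls reproduces the reconstruction (inversion) property mentioned above.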
Classifier Guidance  Classifier guidance [16] is a technique that enables pre-trained unconditional diffusion models to generate images corresponding to a specified class label. It requires a pre-trained classifier to specify the desired class and uses it to guide the generation process of the diffusion model. Specifically, a classifier p_φ(y | x_t, t) is first trained with noisy images x_t that correspond to each time step t in the diffusion process. Then, in the generation process, the mean of the distribution of x_t is shifted at each time step by adding the gradient of the classifier with respect to the data, as follows:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta(x_t, t) + s \Sigma \nabla_{x_t} \log p_\phi(y | x_t, t), \Sigma),   (5)

where s is the guidance scale parameter. When we set a large value to s, the generated images are highly aligned with the desired class, but they are likely to lose their diversity.
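As an illustration of the mean shift in Eq. 5, here is a minimal sketch under the assumption of a hypothetical `classifier(x, t)` that returns class logits for noisy inputs; `mu` and `sigma` stand for µ_θ(x_t, t) and the (diagonal) covariance Σ_θ(x_t, t).

```python
import torch

def classifier_guided_mean(x_t, t, y, mu, sigma, classifier, s):
    """Shift the reverse-process mean by the classifier gradient (Eq. 5).
    classifier(x, t): hypothetical model returning class logits for noisy x_t."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = classifier(x_in, t).log_softmax(dim=-1)
    selected = log_probs[range(len(y)), y].sum()      # log p_phi(y | x_t, t)
    grad = torch.autograd.grad(selected, x_in)[0]     # nabla_{x_t} log p_phi
    return mu + s * sigma * grad                      # guided mean of Eq. 5
```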
Semantic Segmentation with Diffusion Models  The proposed method builds on DatasetDDPM [8], which is a label-efficient semantic segmentation model that uses pixel-level representations of the noise estimator ε_θ. In [8], the model trained on synthetic images is called DatasetDDPM; therefore, in this paper, the model trained on real images is named DatasetDDPM-R. DatasetDDPM-R estimates a segmentation label for each pixel via a three-layer multi-layer perceptron (MLP), which is referred to as a pixel classifier, using pixel-level representations (the U-Net's intermediate activations, upsampled and concatenated) as inputs. Because each pixel-wise feature can be used as one piece of training data, a model can be built with few annotated data. Furthermore, it also outperformed DatasetGAN [57] because real images can be used for training.
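The pixel-level representation described above can be sketched as follows: intermediate decoder activations are upsampled to the image resolution and concatenated, and a small MLP classifies each pixel's feature vector independently. The feature-extraction interface and the hidden size are illustrative assumptions, not the exact configuration of DatasetDDPM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelClassifier(nn.Module):
    """Three-layer MLP applied independently to each pixel feature vector."""
    def __init__(self, feat_dim, num_classes, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats):          # feats: (..., feat_dim)
        return self.mlp(feats)

def pixel_features(activations, size):
    """Upsample a list of U-Net decoder activations to `size` and concatenate
    them along channels, giving one feature vector per pixel."""
    ups = [F.interpolate(a, size=size, mode="bilinear", align_corners=False)
           for a in activations]
    z = torch.cat(ups, dim=1)                  # (B, D, H, W)
    return z.permute(0, 2, 3, 1)               # (B, H, W, D)
```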
2.2. Image Editing Methods

GAN-based methods  Generative Adversarial Network (GAN) [18]-based editing methods fall into various categories. There are those, for example, that find a meaningful and disentangled direction in the latent space in an unsupervised manner [9, 10, 37, 43, 44, 52, 55] and those that use text [17, 27, 36, 53]. These methods are useful for concept-level editing, such as style transfer and attribute transformation, but are unsuitable for localized and fine-grained editing.

To achieve fine-grained editing, some approaches use segmentation maps and sketches as user-provided inputs [12, 29, 35, 58]. These methods achieve precise editing of details corresponding to the manipulation of the user-provided inputs, but their disadvantage is that a large amount of annotated data is required for conditional training. EditGAN [30] solves this problem by building on DatasetGAN [57], which consists of simple three-layer MLP classifiers that take the middle features of StyleGAN [22, 23] as inputs. These classifiers predict the segmentation label of each pixel, and as a result, EditGAN achieves high-precision editing in a label-efficient manner.

However, when editing real images, the GAN-based methods listed above first need to search for appropriate latent variables that can reconstruct the images via a generator. Although there are many inversion techniques [1–4, 40, 50, 58] for searching for such latent variables, these GAN-based methods, including EditGAN, suffer from incomplete reconstruction of features that are scarce in the training data. This can be a major drawback limiting their real-world editing applications.

Diffusion-based methods  Diffusion models [16, 20, 33, 46] and score-based generative models [48, 49] are suitable for real image editing because they have outperformed existing generative models [11, 26, 32, 39, 51] in image synthesis quality [16] and allow accurate reconstruction, as described in Sec. 2.1. Some works operate in a global manner [24, 38], which can be applied to style and attribute transformations but is unsuitable for localized, fine-grained editing.

For fine-grained editing, existing works can be divided into two categories. One line of research trains conditional diffusion models [34, 41]. We can edit images locally through an inpainting technique [41] or text prompts [34] by training diffusion models that take such conditions as inputs. However, these methods incur high training costs because they require training the diffusion models themselves.

On the other hand, methods based on unconditional models [5, 6, 13, 31] can use pre-trained diffusion models and achieve editing with a low training cost. ILVR [13] is an image translation method in which the low-frequency component is used for conditioning the sampling process, and it achieves scribble-based editing. In SDEdit [31], user-provided strokes on images are first noised in the latent space by a stochastic SDE process. The edited images are then generated by denoising via the reverse SDE process. These methods achieve user-provided localized editing, but they cannot easily edit fine-grained features. In particular, for SDEdit, there is no guarantee that the edited content will be preserved through the generation process, because no conditions related to the edit are given in a simple stochastic reverse SDE process. Blended Diffusion [5, 6] is a technique for localized editing that corresponds to user-provided text prompts, but it is unsuitable for fine-grained edits, such as "growing the hair a little longer", specified via text descriptions. Therefore, we propose a localized and fine-grained image editing method that works in a label-efficient manner.

3. Fine-grained Edit with Pixel-wise Guidance

3.1. Overview

The overall flow of the proposed method for image editing is shown in Fig. 2. First, we train pixel classifiers [8] g_multi and g_t, which are used for estimating a segmentation map for editing and for pixel-wise guidance, respectively.
[Figure 2: workflow diagram with a training phase, (a) Training Pixel Classifiers (g_t / g_multi), and an inference phase, (b) Estimating Map for Real Images with g_multi, (c) Map manipulation and mask preparation (e.g., "change hairstyle", yielding y^edited and the binary ROI mask m), and (d) Pixel-wise gradient-based guidance using ∇L_seg to produce x^edited.]

Figure 2. Overall workflow of the proposed method. (a) For preparation, we train the pixel classifiers of DatasetDDPM-R [8], which are MLP-based segmentation models, and save their parameters. (b) After training, we estimate the segmentation map using intermediate representations across multiple time steps (t = 50, 150, 250, concatenated). (c) We then manipulate the segmentation map into the desired one. (d) In the iterative reverse denoising process, we guide the sampling process with the gradient of the pixel classifiers.

For map manipulation, we generate the original segmentation map y of a target real image x by applying the pixel classifier g_multi to x and then manipulate it into the desired map y^edited. Furthermore, we guide the reverse diffusion process with the gradient of the pixel classifiers g_t with respect to the data. Finally, we obtain the edited image x^edited. Detailed descriptions are given in the following sections.

3.2. Editing Framework

Training Pixel Classifiers  We use the pixel classifier of DatasetDDPM-R for gradient-based guidance. This is because we can use it to (a) build segmentation models in a label-efficient manner, (b) reduce the computational cost because noise and label estimation can be computed simultaneously on the same forward path, and (c) apply it to any pre-trained unconditional diffusion model.

When training a pixel classifier g_t, we use image-label pairs, which require human annotation. We use noisy samples x_t, obtained directly from the data x_0, as training data for each time step t:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).   (6)

Here, the label annotated on x_0 is associated with x_t and used for training as an image-label pair. Then, we save the parameters of g_t. As mentioned in Sec. 2.1, g_t estimates the segmentation label of each pixel, and the intermediate activations are extracted from the U-Net decoder.
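As a minimal sketch of the training-data construction in Eq. 6, the annotated clean image x_0 is noised to time step t, and the clean image's label map is simply reused as the target for g_t. The name `alphas_bar` is again a hypothetical placeholder for the schedule ᾱ stored as a tensor.

```python
import torch

def noisy_training_pair(x0, label_map, t, alphas_bar):
    """Create an (x_t, label) pair for training the pixel classifier g_t (Eq. 6)."""
    eps = torch.randn_like(x0)                         # eps ~ N(0, I)
    a_bar = alphas_bar[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # The annotation of x0 is carried over unchanged to the noisy sample.
    return x_t, label_map
```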
For estimating the original map y, as described in Fig. 2 (b), we train another pixel classifier g_multi with the representations across multiple time steps, which are extracted at t = {50, 150, 250} and concatenated, following the original paper [8].

Map manipulation and mask preparation  After estimating the original segmentation map y using g_multi, the user can manipulate the map y and thereby obtain the target map y^edited, as described in Fig. 2 (c). Simultaneously, we set the binary ROI mask m by picking up all edit-related pixels from the original map y and the manipulated map y^edited as follows [30]:

m_{ij} = \begin{cases} 1 & \text{if } y_{ij} \in Q_{edit} \text{ or } y^{edited}_{ij} \in Q_{edit}, \\ 0 & \text{otherwise}. \end{cases}   (7)

Here, m ∈ R^{H×W} is a binary mask defined by all pixels p whose segmentation labels y_p are included in the edit-related class list Q_edit, which is to be specified manually in advance. To seamlessly blend the edited region with the outside of the region, m is dilated by three pixels as a buffer.
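As a sketch of the mask preparation in Eq. 7, the following marks every pixel whose label belongs to the edit-related class list Q_edit in either the original or the edited map, then dilates the mask by three pixels via max pooling. The function name and the use of max pooling for dilation are illustrative assumptions; the user-supplied class list is passed as `q_edit`.

```python
import torch
import torch.nn.functional as F

def build_roi_mask(y, y_edited, q_edit, dilate_px=3):
    """Binary ROI mask m (Eq. 7) with a small dilation as a blending buffer.
    y, y_edited: (H, W) integer label maps; q_edit: list of edit-related labels."""
    edit_ids = torch.tensor(q_edit, device=y.device)
    m = (torch.isin(y, edit_ids) | torch.isin(y_edited, edit_ids)).float()
    # Dilate by `dilate_px` pixels using max pooling over a (2k+1)x(2k+1) window.
    k = 2 * dilate_px + 1
    m = F.max_pool2d(m[None, None], kernel_size=k, stride=1, padding=dilate_px)
    return m[0, 0]          # (H, W), values in {0, 1}
```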
[Figure 3: diagram of a single guided denoising step. The U-Net encoder/decoder predicts (µ, Σ) and pixel-level features for x_t; the pixel classifier g_t computes the segmentation loss L_seg on the features inside the ROI mask m; the foreground sample x_fg ∼ N(µ + sΣ∇_{x_t}L_seg, Σ) is combined with the background sample x_bg ∼ N(√ᾱ_t x_0, (1−ᾱ_t)I) as x_{t−1} = x_fg ⊙ m + x_bg ⊙ (1−m).]

Figure 3. Overview of the pixel-wise segmentation guidance at a time step t. The input of the segmentation model is the pixel features of the masked region (m). We obtain the mean and variance from the normal denoising process and the gradient of the segmentation loss from the segmentation model's output. We thereby obtain guided foreground images inside the ROI via pixel-wise guidance.

Pixel-wise gradient-based guidance  To generate images corresponding to the edited map y^edited, we guide the reverse diffusion process using this map. First, we obtain the starting noise sample x_{t_0}, which is determined by x_0 using Eq. 4. Then, we guide the sampling process at each time step t for t = t_0, ..., 1; an overview of the pixel-wise guidance at each time step is shown in Fig. 3.

Along with the noise estimation, we extract the intermediate activations z ∈ R^{H×W×d}, which are appropriately upsampled and concatenated. Following [8], we set d = 2816, the dimension of the intermediate representations at a single time step, in this paper. For this guidance, we use only the pixels inside the binary ROI mask m as input to the pixel classifier g_t and then obtain the gradient of the cross-entropy loss L_ce, averaged over the number of pixels N = \sum_{ij} \mathbb{I}(m_{ij} = 1), as follows:

L_{seg}(z, y^{edited}) = \frac{1}{N} \sum_{i=1}^{H} \sum_{j=1}^{W} m_{ij}\, L_{ce}\!\left( g_t(z_{ij}), y^{edited}_{ij} \right).   (8)

After obtaining the gradient, we guide the sampling process according to Eq. 5 as follows:

x_{fg} \sim \mathcal{N}(\mu + s \Sigma \nabla_{x_t} L_{seg}, \Sigma),   (9)

where µ = µ_θ(x_t, t) and Σ = Σ_θ(x_t, t). This foreground term corresponds to the edited region. Following [6], we obtain the background term by simply adding noise to the original image x_0 to preserve the outside of the edited region, and we combine these two terms to construct x_{t−1} as follows:

x_{t-1, bg} \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I),   (10)

x_{t-1} = x_{t-1, fg} \odot m + x_{t-1, bg} \odot (1 - m),   (11)

where ⊙ denotes element-wise multiplication. This guidance procedure is summarized in Algorithm 1.

Algorithm 1 Pixel-wise gradient-based guidance
Input: original image x, target map y^edited, ROI mask m, guidance scale s, start denoising step t_0, pixel classifiers g_t
Output: edited image x^edited
  N ← Σ_{ij} 𝕀(m_{ij} = 1)
  x_0 ← x
  for t = 0 to t_0 − 1 do
    x_{t+1} ← √ᾱ_{t+1} f_θ(x_t, t) + √(1 − ᾱ_{t+1}) ε_θ(x_t, t)
  end for
  for t = t_0 to 1 do
    µ, Σ ← µ_θ(x_t, t), Σ_θ(x_t, t)
    L_seg ← (1/N) Σ_{i=1}^{H} Σ_{j=1}^{W} m_{ij} L_ce(g_t(z_{ij}), y^edited_{ij})
    x_{t−1, fg} ∼ N(µ + sΣ∇_{x_t} L_seg, Σ)
    x_{t−1, bg} ∼ N(√ᾱ_t x_0, (1 − ᾱ_t) I)
    x_{t−1} ← x_{t−1, fg} ⊙ m + x_{t−1, bg} ⊙ (1 − m)
  end for
  return x^edited
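To relate Algorithm 1 to an implementation, here is a minimal sketch of the guided reverse loop under the same illustrative assumptions as the earlier sketches: `diffusion_model(x, t)` returns (µ, Σ, z) for a noisy input, `pixel_classifiers[t]` is the trained g_t, and `alphas_bar` is the noise schedule; the update signs follow Eqs. 9–11 as written. None of these names refer to a specific library or to the paper's released code.

```python
import torch
import torch.nn.functional as F

def pixel_wise_guided_edit(x0, x_t0, y_edited, m, s, t0,
                           diffusion_model, pixel_classifiers, alphas_bar):
    """Sketch of Algorithm 1: pixel-wise gradient-based guidance."""
    n_pix = m.sum().clamp(min=1.0)
    x = x_t0                                            # latent from DDIM inversion (Eq. 4)
    for t in range(t0, 0, -1):
        x_in = x.detach().requires_grad_(True)
        mu, sigma, z = diffusion_model(x_in, t)          # mean, variance, pixel features
        logits = pixel_classifiers[t](z)                 # (B, H, W, C) class scores
        ce = F.cross_entropy(logits.permute(0, 3, 1, 2), y_edited, reduction="none")
        l_seg = (m * ce).sum() / n_pix                   # Eq. 8
        grad = torch.autograd.grad(l_seg, x_in)[0]
        # Foreground sample from N(mu + s*Sigma*grad, Sigma), sign as in Eq. 9.
        x_fg = mu + s * sigma * grad + sigma.sqrt() * torch.randn_like(x)
        # Background sample from the noised original image, Eq. 10.
        x_bg = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * torch.randn_like(x0)
        x = x_fg * m + x_bg * (1.0 - m)                  # Eq. 11
    return x
```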

4. Experiments

4.1. Experimental Setting

Datasets  We compared our method with EditGAN [30] and SDEdit [31]. In all editing experiments, we used 256×256 images. We used FFHQ-256 [22], LSUN-Cat [56], and LSUN-Horse [56] for the EditGAN experiments, and CelebA-HQ [21] for the SDEdit experiments. For training the pixel classifiers, we used the same number of annotated images and classes as DatasetDDPM [8]; in the CelebA-HQ experiments, however, we used the backbone diffusion model and pixel classifiers trained on FFHQ-256, because its annotation contains more classes than the default CelebA-HQ class count of 19 and thus provides higher editability. These numbers are listed in Table 1. We used the annotated dataset published in the official DatasetDDPM repository.*

Table 1. Number of annotated images and classes for each dataset used in the training of the pixel classifier and in the manipulation.

Dataset          Classes  Annotated images
FFHQ-256 [22]    34       20
LSUN-Horse [56]  19       30
LSUN-Cat [56]    15       30
CelebA-HQ [21]   34       -

* https://github.com/yandex-research/ddpm-segmentation
[Figure 4: qualitative comparison for edits such as "+ Short Hair", "+ Up Tail", "+ Open Mouth", "+ Close Mouth", "+ Remove Person", and "+ Enlarge Ear"; for each case, the original image, the guide input (edited map), and the edited images of ours and EditGAN are shown.]

Figure 4. Qualitative comparisons with EditGAN. These results were obtained by editing the segmentation map to follow the written text (e.g., "+ Open Mouth"). The left two samples show the edited map, which is included to clarify the specified manipulation. These results confirm that our method achieves the desired editing performance while preserving the details of the original images, whereas EditGAN fails to reconstruct some features that are less observed in the training data, such as facial painting, a person, and hands in the original images.

Implementation  The proposed method uses ADM [16] as the backbone diffusion model, and we used the pre-trained ADM models provided in [8]. To train the pixel classifiers, we trained only a single pixel classifier g_t at each time step t as a guidance model, whereas the original DatasetDDPM trained an ensemble of multiple models. In this work, we used the intermediate representations extracted from the decoder blocks B = {5, 6, 7, 8, 12} at each time step; therefore, the total dimension of the pixel-wise features was 2816. For the guidance, we set the hyperparameters {t_0, s} = {500, 100} for the manipulation of small parts and {750, 40} for large parts on FFHQ-256, CelebA-HQ, and LSUN-Cat. The part size was determined by a threshold of 5000 on the number of pixels in the binary ROI mask m. On LSUN-Horse, we set {t_0, s} = {800, 25}. Additionally, in all experiments, we set the batch size to four. With these settings, the overall edit finishes in 2–6 minutes on an NVIDIA A100-SXM4-40GB. A sensitivity analysis of these parameters is given in Sec. 4.4.

In the EditGAN experiments, we used a pre-trained StyleGAN2 as the backbone model and trained the image encoder to initialize the latent codes corresponding to real images, as in their paper [30]. For editing, we used the optimization-based editing in [30] because this approach, while increasing the editing time, offers the most accurate editing performance among the approaches proposed in EditGAN. We then performed 100 steps of optimization using Adam [25] to optimize the latent codes.

In the SDEdit experiments, we used VP-SDE [31], pre-trained on the CelebA-HQ dataset, for localized image manipulation on CelebA-HQ. For all manipulations, we used stroke-based editing with t_0 = 0.5, N = 500, and K = 1. In all manipulations, we used a paint application to create the guide inputs; this could be any other application, such as Photoshop. The other implementation details are described in the appendix.

Evaluation Metrics  Our goal is to achieve fine-grained editing while preserving contents in regions outside of the edited region. Therefore, for the quantitative evaluation, we evaluated the reconstruction performance, the accuracy of the manipulation, and the quality of the whole edited images. We used MAE and PSNR to evaluate the reconstruction performance. These metrics are calculated for pixels outside the editing region, defined by the binary ROI mask m in Sec. 3.2. For the manipulation performance, we evaluated the prediction accuracy against the target map y^edited by predicting the map corresponding to the edited images x^edited with DatasetGAN and DatasetDDPM-R. Additionally, the Inception Score (IS) [42] and Fréchet Inception Distance (FID) [19] were used to assess semantic consistency between the reconstructed and manipulated regions. For the FID, we measured the distance between the two distributions of the original images before editing and the edited images.
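To illustrate how the reconstruction metrics are restricted to the region outside the ROI, here is a minimal sketch of masked MAE and PSNR; it assumes images scaled to [0, 1] and a binary mask m with value 1 inside the editing region, and is not taken from the paper's evaluation code.

```python
import torch

def masked_mae_psnr(original, edited, m, eps=1e-12):
    """MAE and PSNR computed only over pixels outside the ROI (m == 0).
    original, edited: (C, H, W) tensors in [0, 1]; m: (H, W) binary mask."""
    keep = 1.0 - m                                   # 1 outside the editing region
    n = keep.sum() * original.shape[0] + eps         # number of evaluated values
    diff = (original - edited) * keep
    mae = diff.abs().sum() / n
    mse = diff.pow(2).sum() / n
    psnr = 10.0 * torch.log10(1.0 / (mse + eps))     # peak value of 1.0
    return mae.item(), psnr.item()
```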
4.2. Evaluation and Comparison

Quantitative Comparison  The results of the quantitative comparison with EditGAN are presented in Table 2. The evaluation metrics are described in Sec. 4.1. In this experiment, we randomly selected 50 images from FFHQ-256, predicted the original segmentation maps with the segmentation models (i.e., DatasetGAN and DatasetDDPM-R), and applied one of the following manipulations to the segmentation map: {"close mouth", "open mouth", "move eye", "close eye", "edit eyebrow", "change hairstyle"}.
[Figure 5: qualitative comparison for "+ Close Eyes", "+ Move Eyes", and "+ Close Mouth"; for each case, the original image, the guide input, and the edited images of our method and SDEdit are shown.]

Figure 5. Qualitative comparison of our proposed method and SDEdit. We adopted stroke-based editing in SDEdit and experimented using VP-SDE (t_0 = 0.5, N = 500). The guide inputs, which were edited with strokes using image editing software (e.g., a paint app or Photoshop), were used for editing. As a result, we confirmed that SDEdit failed to edit details (e.g., moving the eyes) that our proposed method edited.

As for the proposed method, we can obtain multiple outcomes owing to the probabilistic process of diffusion models, whereas EditGAN produces only one result with the same parameters. Therefore, we selected a qualitatively superior sample within the batch for evaluation.

As shown in Table 2, the results revealed that the proposed method outperformed EditGAN on all of these metrics. These results demonstrate that the proposed method achieves more fine-grained editing, while better preserving the outside of the edited region, than EditGAN.

Table 2. Quantitative evaluation with EditGAN on FFHQ-256. We selected 50 samples randomly and manipulated the corresponding maps. The results reveal that the proposed method outperforms EditGAN on all evaluation metrics.

Method   | m=0: MAE ↓ | m=0: PSNR ↑ | whole image: FID ↓ | whole image: IS ↑ | m=1: Accuracy ↑
EditGAN  | 0.07485    | 67.24       | 79.67              | 3.765             | 0.8138
Ours     | 0.01380    | 81.53       | 15.07              | 4.416             | 0.8290

Qualitative Comparison  First, we show a comparison with EditGAN [30] in Fig. 1 and Fig. 4. In these figures, we show the results obtained by manipulating the segmentation map as described by the written text (e.g., "+ Open Mouth"). From these figures, we confirmed that EditGAN failed to reconstruct some features that are less observed in the training data, such as the pacifier, face paintings, and the hands around the face, because of inaccurate inversion performance. This could be a significant challenge for applications that edit photos captured in the real world. In addition, for some operations, such as editing a ponytail or short hair in Fig. 4, EditGAN fails to achieve the desired editing with the default set of parameters. In contrast, our method achieves the desired fine-grained editing naturally while preserving other features that are not reconstructed by EditGAN.

We then show a comparison with SDEdit [31] in Fig. 5. For SDEdit, we use stroke-based edited images (Guide Input) as input to the SDE. The manipulations are shown in Fig. 5, and our aim is to compare the fine-grained editability. The results reveal that SDEdit fails to edit precise features of images that can be edited with our method.

[Figure 6: representative interpolated samples between input images for "Closing Eyes", "Moving Eyes", and "Elongate Bang" edits.]

Figure 6. Results of interpolation between edited and original images. The results demonstrate the capability of meaningful interpolation in the latent space of diffusion models.
4.3. Further Applications

The proposed method can be utilized for further applications such as interpolation and inpainting. For interpolation, we use a real image and its manipulated result (or two manipulated results) as the two inputs. We then interpolate between the latent variables x_{t_0} of these inputs, which are obtained using Eq. 4, over 50 samples, and reconstruct their images with Eq. 2. In Fig. 6, we show representative samples, interpolating the latent variables at time step t_0 = 500. These results demonstrate that we can obtain meaningful interpolated samples in the latent space of the diffusion model if the changes are minute. However, if there is a large difference between the inputs, as shown in the hairstyle-change examples, the interpolated samples become slightly blurred. This is because there is no guarantee that the manipulated samples follow a linear relationship with the original images. Nevertheless, these editing vectors might be useful for controlling the intensity of editing, as in EditGAN.
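A minimal sketch of the interpolation described above: both inputs are inverted to x_{t_0} with Eq. 4, the latents are linearly interpolated, and each interpolated latent is decoded with the deterministic DDIM steps of Eq. 2. It reuses the hypothetical `ddim_invert` and `ddim_reverse_step` helpers from the sketch in Sec. 2.1, and assumes simple linear interpolation of the latents.

```python
import torch

@torch.no_grad()
def interpolate_edits(x_a, x_b, eps_model, alphas_bar, t0, n_samples=50):
    """Interpolate between two images (e.g., original and edited) in the
    latent space x_{t0} of the diffusion model and decode each sample."""
    z_a = ddim_invert(x_a, eps_model, alphas_bar, t0)
    z_b = ddim_invert(x_b, eps_model, alphas_bar, t0)
    outputs = []
    for w in torch.linspace(0.0, 1.0, n_samples):
        z = (1.0 - w) * z_a + w * z_b                 # linear interpolation of latents
        x = z
        for t in range(t0, 0, -1):                    # deterministic decoding (Eq. 2)
            eps = eps_model(x, t)
            x = ddim_reverse_step(x, eps, alphas_bar[t], alphas_bar[t - 1])
        outputs.append(x)
    return outputs
```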
[Figure 7: (a) sensitivity analysis on (t_0, s) with example edits of a small part ("+ Close Eyes": original, (t_0: 500, s: 100), (500, 200)) and a large part ("+ Elongate Bang": original, (500, 200), (750, 40)); (b) accuracy versus signal-to-noise ratio (SNR) for samples with a small ROI (pixel number ≤ 5000); (c) accuracy versus SNR for samples with a large ROI (pixel number > 5000).]

Figure 7. Analysis of the effect of the hyperparameters, the start denoising step t_0 and the guidance scale s, on the prediction accuracy. This figure shows the relationship between the signal-to-noise ratio (SNR) and the accuracy on 50 experimental samples, which were divided into two groups according to the size of the editing region. The accuracy is averaged over the samples in each group and calculated against the estimated map at each step. When the SNR magnitude is between 10^{-2} and 10^{0}, the perceptually recognizable content is generated [14].

For inpainting, one of the results is presented in Fig. 4 ("+ Remove Person"). It requires a large start step of t_0 = 800, but we can confirm that the region containing the person has been replaced naturally with the background.

4.4. Detailed Analysis

Sensitivity Analysis  In this section, we analyze the sensitivity of the editing performance to the hyperparameters, namely the start denoising step t_0 and the guidance scale s. We show the relationship between the signal-to-noise ratio (SNR) at each time step and the prediction accuracy within the binary ROI mask m in Fig. 7. We experimented with t_0 ∈ {300, 500, 750} and s ∈ {40, 100, 200}. Values of s larger than 40 at t_0 = 750 produced artifacts, because a large value of s can guide the samples outside the learned distribution, so they are not shown in the figure. In this experiment, we divided the 50 experimental samples, which were the same samples as in Sec. 4.2, into two groups: those with more than 5000 pixels within the binary ROI mask m (Fig. 7 (c)) and those with 5000 or fewer (Fig. 7 (b)). The accuracy is computed between the target map y^edited and the output of g_t at each step, and averaged over the samples in each group.
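For reference, the SNR on the horizontal axis of Fig. 7 can be computed directly from the noise schedule using the standard definition under Eq. 6; the following one-liner is a hedged sketch under that assumption, not code from the paper.

```python
def snr(alphas_bar, t):
    """Signal-to-noise ratio of x_t under Eq. 6: SNR(t) = alpha_bar_t / (1 - alpha_bar_t)."""
    a = alphas_bar[t]
    return a / (1.0 - a)
```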
First, we discuss t_0. From Fig. 7 (b) and (c), we confirmed that the manipulation of large regions requires a large t_0, while that of a small region can be achieved with a small t_0. This is because the large-scale content of an image is created in the early steps of the reverse diffusion process [8, 14, 28], which is visualized with a green background in [14], and small-scale content is created after that. In Fig. 7 (a), the results of manipulating a large region demonstrate that, no matter how large the guidance scale s is (e.g., s = 200), it is difficult to change global features with a small start step t_0 = 500, as this results in an unnatural edit.

Next, we analyze the effect of the guidance scale s. The results of manipulating a small region in Fig. 7 (a) also reveal that a large scale s generates images that are too strongly aligned with the specified class and fails to produce realistic outcomes. Therefore, we considered s = 100 suitable when t_0 = 500. We therefore set (t_0, s) = (500, 100) for the manipulation of small regions and (750, 40) for the manipulation of large regions on the face and cat datasets. However, for LSUN-Horse manipulations, where an object is replaced by the background or an object is generated on the background, we set a large value of t_0 = 800, and s was set to 25, because a large value of s is likely to shift the mean of x_t outside the learned distribution at large t_0.

Limitations  There are a few cases in which a large denoising step is necessary even if the manipulation is small, for example when we manipulate an eyebrow or a mouth with orthodontics. In such cases, the user is required to manually adjust the step after observing the outcomes. Additionally, as the proposed edit is based on class labels, the manipulation is restricted to the learned class labels, and there are cases in which the original content's color is changed through editing (e.g., eye color, hair color).

5. Conclusion

In this paper, we proposed a novel fine-grained image editing method. We achieve precise editing by pixel-wise guidance with pixel classifiers. Since the guidance is conducted in a pixel-wise manner, the proposed method can jointly achieve both creating reasonable contents in the editing region and preserving the other contents outside it. The experimental results demonstrate the advantage of our method both qualitatively and quantitatively.
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4431–4440, 2019.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8293–8302, 2020.
[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6691–6700, 2021.
[4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H. Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18490–18500, 2022.
[5] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ArXiv, abs/2206.02779, 2022.
[6] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
[7] Jason Bailey. The tools of generative art, from flash to neural networks. Art in America, 8, 2020.
[8] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. ArXiv, abs/2112.03126, 2022.
[9] David Bau, Hendrik Strobelt, William S. Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics (TOG), 38:1–11, 2019.
[10] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. Gan dissection: Visualizing and understanding generative adversarial networks. ArXiv, abs/1811.10597, 2019.
[11] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[12] Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu. Deepfacedrawing: deep generation of face images from sketches. ACM Trans. Graph., 39:72, 2020.
[13] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14347–14356, 2021.
[14] Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo J. Kim, and Sung-Hoon Yoon. Perception prioritized training of diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11462–11471, 2022.
[15] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. ArXiv, abs/2209.04747, 2022.
[16] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. ArXiv, abs/2105.05233, 2021.
[17] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ArXiv, abs/2108.00946, 2021.
[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Proceedings of the International Conference on Neural Information Processing Systems, pages 2672–2680, 2014.
[19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
[20] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.
[21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. ArXiv, abs/1710.10196, 2018.
[22] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2019.
[23] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8107–8116, 2020.
[24] Gwanghyun Kim, Taesung Kwon, and Jong-Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2416–2425, 2022.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
[26] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] Gihyun Kwon and Jong-Chul Ye. Clipstyler: Image style transfer with a single text condition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18041–18050, 2022.
[28] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. ArXiv, abs/2210.10960, 2022.
[29] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image manipulation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5548–5557, 2020.
[30] Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler. Editgan: High-precision semantic image editing. In NeurIPS, 2021.
[31] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. ArXiv, abs/2108.01073, 2021.
[32] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608, 2018.
[33] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. ArXiv, abs/2102.09672, 2021.
[34] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[35] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2332–2341, 2019.
[36] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2065–2074, 2021.
[37] Antoine Plumerault, Hervé Le Borgne, and Céline Hudelot. Controlling generative models with continuous factors of variations. ArXiv, abs/2001.10238, 2020.
[38] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10609–10619, 2022.
[39] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[40] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2287–2296, 2021.
[41] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. ACM SIGGRAPH 2022 Conference Proceedings, 2022.
[42] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. ArXiv, abs/1606.03498, 2016.
[43] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9240–9249, 2020.
[44] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. Interfacegan: Interpreting the disentangled face representation learned by gans. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:2004–2018, 2022.
[45] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1532–1540, 2021.
[46] Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. ArXiv, abs/1503.03585, 2015.
[47] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. ArXiv, abs/2010.02502, 2021.
[48] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. ArXiv, abs/1907.05600, 2019.
[49] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ArXiv, abs/2011.13456, 2021.
[50] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40:1–14, 2021.
[51] Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[52] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12858–12867, 2021.
[53] Weihao Xia, Yujiu Yang, Jing Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2256–2265, 2021.
[54] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion: A survey. ArXiv, abs/2101.05278, 2022.
[55] Yanbo Xu, Yueqin Yin, Liming Jiang, Qianyi Wu, Chengyao Zheng, Chen Change Loy, Bo Dai, and Wayne Wu. Transeditor: Transformer-based dual-space gan for highly controllable facial editing. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7673–7682, 2022.
[56] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ArXiv, abs/1506.03365, 2015.
[57] Yuxuan Zhang, Huan Ling, Jun Gao, K. Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10140–10150, 2021.
[58] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. ArXiv, abs/2004.00049, 2020.
[Figure 8: additional comparisons for "+ Bigger Eyes", "+ Remove Person", "+ Remove Mane", "+ Close Mouth", "+ Elongate Bangs", and "+ Move Eyes"; rows show the original images, our results, and the EditGAN results.]

Figure 8. Additional qualitative comparison with EditGAN.

Supplementary Material

A. Implementation Details

A.1. Pixel Classifiers

We trained a single neural network for each of g_t and g_multi, whereas the original DatasetDDPM [8] used an ensemble of ten individual networks. We used the same training conditions and model architecture as in their paper [8]. We trained them for 4 epochs using the Adam [25] optimizer with a learning rate of 0.001. The batch size was set to 64. These settings were used for all datasets.
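A minimal sketch of this training configuration (4 epochs, Adam with learning rate 0.001, batch size 64), reusing the hypothetical `PixelClassifier` and `noisy_training_pair` sketches from the main text; the data loader yielding per-pixel features and labels is an assumed placeholder.

```python
import torch
import torch.nn as nn

def train_pixel_classifier(classifier, feature_loader, epochs=4, lr=1e-3):
    """Train a single pixel classifier g_t on (pixel feature, label) pairs."""
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in feature_loader:    # feats: (B, D), labels: (B,)
            loss = ce(classifier(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```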
A.2. EditGAN

EditGAN [30] consists of three components: a StyleGAN2 generator, an image encoder for inversion, and DatasetGAN [57] for estimating segmentation maps.

As for the generator, we used the pretrained model that is publicly available at the official repository.† Only for FFHQ-256, we used the pretrained model available at another repository,‡ because it is not provided at the official one.

Regarding the image encoder, we used the same model architecture as in the literature [30]. We trained the encoder using the Adam optimizer with a learning rate of 3 × 10^{-5}. In their implementation, the encoder is trained only on samples generated from the GAN for the first 20,000 iterations as a warm-up and is then trained jointly on real and generated images on the LSUN-Cat dataset. We applied this warm-up training not only to LSUN-Cat but also to LSUN-Horse. On FFHQ, we trained without the warm-up. For inference, we refine the latent code, which is initialized by this encoder, via 500 steps of optimization as described in [30].

As for the pixel classifiers of DatasetGAN, we used the same annotated data for training. As in their setup [30], we trained ten independent segmentation models, each of which is a three-layer MLP classifier [57], using the Adam optimizer with a learning rate of 0.001. The other training implementation details were the same as the original configurations [30].

† https://github.com/NVlabs/stylegan2
‡ https://github.com/rosinality/stylegan2-pytorch

B. Additional Results

B.1. Reconstruction Performance with EditGAN

Figure 9 shows example images on LSUN-Cat and FFHQ-256 that StyleGAN2 reconstructs based on the latent codes initialized by the image encoder in EditGAN. Figure 9 (a) and (b) show representative success and failure cases, respectively. Although the reconstruction is accurate in the case of relatively simple scenes, it tends to fail when the image contains objects that are rarely seen in the training dataset (e.g., face painting and a microphone).

B.2. On strategies for selecting edited images

In this work, we assume that the users select the qualitatively best image among the several edited images provided by our method. Consequently, the performance of our method may depend on this selection strategy, because our method produces some variety in the edited images, as shown in Fig. 11.
[Figure 9: original images and their EditGAN reconstructions; (a) success example, (b) failure example.]

Figure 9. Results of the reconstruction performance of EditGAN. The latent codes corresponding to these original images were obtained by optimization after initialization by the trained image encoder [30]. These results demonstrate that contents that are less observed in the training data (e.g., a microphone, another cat) were inaccurately reconstructed by EditGAN.

Table 3. We present an analysis of performance changes for the selection strategy of edited images. We investigated several selection strategies in the proposed method: (a) selecting a qualitatively superior sample, (b) selecting a quantitatively superior sample, and (c) random selection. In (c), we also report the standard deviation of each performance metric. We confirmed that the proposed method is not sensitive to the selection strategies and outperformed EditGAN in all metrics.

Method                         | m=0: MAE ↓         | m=0: PSNR ↑     | whole image: FID ↓ | whole image: IS ↑ | m=1: Accuracy ↑
EditGAN                        | 0.07485            | 67.24           | 79.67              | 3.765             | 0.8138
(a) Ours (qualitative choice)  | 0.01380            | 81.53           | 15.07              | 4.416             | 0.8290
(b) Ours (quantitative choice) | 0.01425            | 81.44           | 15.81              | 4.377             | 0.8302
(c) Ours (random choice)       | 0.01393 ± 0.001742 | 81.59 ± 0.02958 | 14.62 ± 1.005      | 4.459 ± 0.03649   | 0.8195 ± 0.001742

We investigated the performance of our method with several selection strategies: (a) selecting a qualitatively superior sample, as in Sec. 4.2, (b) selecting a quantitatively superior sample in terms of the error of the segmentation map estimated from the edited images, and (c) random selection. The results are shown in Table 3. In (c), we also report the standard deviation of each performance metric. We confirmed that the performance of the proposed method is not sensitive to the choice of strategy.

B.3. Further Comparison with SDEdit

In SDEdit, there is a realism-faithfulness trade-off when we vary the value of t_0 [31]. Therefore, we also investigated how the edited images change depending on the value of t_0. In this experiment on SDEdit, we varied the value of t_0 from 0.1 to 0.5 with the corresponding total denoising steps N, using VP-SDE (K = 1). Figure 10 shows several examples of the edited images. These results were obtained by selecting a qualitatively superior sample within a mini-batch for each image, for both our method and SDEdit. These results support that there is a strict trade-off between faithfulness and realism in SDEdit, as reported in the paper [31]. It is difficult for users to tune t_0 manually because the appropriate t_0 varies from image to image. Moreover, even when t_0 is set optimally, the proposed method achieves more natural editing.

B.4. Failure Cases

There were a few failure cases in which we could not obtain natural results with the proposed method under the default setting of t_0 and s. We show such examples in Fig. 11. In this figure, we show all the edited images within the mini-batch, whose size we set to 4. The original image, the original segmentation map, and the edited map are shown in the first row, and the second and third rows show the edited images with the default setting and the tuned setting, respectively. From the samples shown on the left, we confirmed that images with features that are less observed in the training data, such as teeth with orthodontics, require a larger t_0, such as t_0 = 750, to produce naturally edited images. From the samples on the right, we found that setting a large value of s occasionally changed the hair color. In these cases, the user is required to manually tune the values of t_0 and s after observing the outcomes.
[Figure 10: for "+ Close Eyes", "+ Close Mouth", and "+ Move Eyes", the original image, SDEdit results with (t_0: 0.1, N: 100), (0.2, 200), (0.3, 300), (0.4, 400), and (0.5, 500), and the guide input and edited image of our method are shown.]

Figure 10. Comparison of stroke-based image editing performance with respect to t_0 in SDEdit. We show the edited images with different values of t_0 in SDEdit. For SDEdit, we use VP-SDE [31] and set K = 1. These results reveal that SDEdit has a strict trade-off between faithfulness and realism, and it is difficult for users to manually choose an appropriate t_0.

[Figure 11: failure cases for "+ Elongate Bangs" and "+ Close Mouth"; for each, the original image, original map, and edited map are shown, followed by the four edited images within the batch obtained with the default setting and with a tuned setting of (t_0, s) (settings shown include (500, 100), (750, 40), and (750, 25)).]

Figure 11. Failure cases. There are a few exceptions that cannot be edited naturally with the default setting of t_0 and s. We show the edited examples obtained stochastically within the batch, whose size we set to 4. Both examples can produce a natural image with some of the default settings; however, for the sample shown on the left, more denoising steps are more stable, while for the sample shown on the right, a smaller guidance scale is more stable. In these cases, the user is required to manually tune the values of t_0 and s after observing the outcomes.

B.5. Running Time

We measured the average processing time of the proposed method over the 50 samples used for the quantitative comparison in Sec. 4.1. The average editing time was 147 seconds when we set the batch size to 1. It increased to 341 seconds with the larger batch size of 4.

B.6. On the threshold for the size of the editing region

As described in Sec. 4.4, we set a threshold on the size of the editing region and change the value of t_0 based on this thresholding. In the analysis in Sec. 4.4, we set the threshold to 5000 pixels, because there was a large gap in the size of the editing region between the manipulations of large and small parts. In Fig. 12, we show the statistics of the number of pixels in the ROI over all datasets used in the experiments of Sec. 4.2 and Sec. B.7, which are FFHQ-34, CelebA-HQ, and LSUN-Cat. It demonstrates that most of the samples have a fairly small number of pixels within m = 1, while the others have exceptionally large ones. To adopt the specific configuration for such samples with a large ROI, we set the threshold to 5000.
[Figure 12: histogram of the number of samples versus the number of pixels within m = 1, separating small part's manipulations from large part's manipulations.]

Figure 12. Statistics of the number of pixels in the ROI over all datasets used in Sec. 4.2 and Sec. B.7 (FFHQ-34, CelebA-HQ, and LSUN-Cat). We confirmed that this threshold is not a sensitive parameter.

This value is not very sensitive with respect to the performance of the proposed method, as shown in Fig. 12.

B.7. Additional Results of Our Method

We present additional examples of the edited images produced by our method on FFHQ-256, LSUN-Cat, and LSUN-Horse (Figs. 13–19). In all examples, the original real images and the corresponding maps are shown on the left, and the edited maps and the resultant images are shown on the right. The experimental settings are the same as those in Sec. 4.2 and Sec. 4.3.
Figure 13. Hairstyle Editing.

Figure 14. Mouth Editing. In these examples, we show the manipulation of "+ Open Mouth" in the top two rows and examples of "+ Close Mouth" in the bottom two rows.

Figure 15. Closing Eyes. We show some examples of the manipulation "+ Close Eyes". We confirmed that eyes behind glasses, eyes with shadows, etc., can also be generated naturally.

Figure 16. Moving Eyes. We demonstrate some examples corresponding to ”+Moving Eyes”.

Figure 17. Eyebrow Editing. We show examples of editing that changes the shape or position of eyebrows. We confirmed that the proposed method allows editing of only one of the semantically related parts (such as lifting the right eyebrow while keeping the left eyebrow unchanged), which EditGAN struggles with.

[Figure 18: edits "+ Close Eyes", "+ Shrink Eyes", "+ Open Mouth", "+ Reshape Right Ear", and "+ Shrink Ear"; (a) original, (b) edited.]

Figure 18. Editing cat images. We show some edited examples following the written text.

[Figure 19: edits "+ Remove Mane", "+ Close Right Horse Eyes", "+ Shrink Ear", "+ Remove Person", and "+ Raise Tail"; (a) original, (b) edited.]

Figure 19. Editing horse images. We show some edited examples following the written text. In the above three samples, we set {t_0, s} = {500, 100} according to the threshold setting. In the two samples below, we set {t_0, s} = {800, 25}, because these manipulations, which force some objects to be replaced with the background or generated on the background, require a large number of steps, as discussed in Sec. 4.4. In the bottom example, the semantic editing succeeded, but the color changed, as described in Sec. 4.4.
