…2020), we observe trends that suggest predictable improvements in performance as we increase training compute.

2. Denoising Diffusion Probabilistic Models

We briefly review the formulation of DDPMs from Ho et al. (2020). This formulation makes various simplifying assumptions, such as a fixed noising process q which adds diagonal Gaussian noise at each timestep. For a more general derivation, see Sohl-Dickstein et al. (2015).

2.1. Definitions

Given a data distribution x0 ∼ q(x0), we define a forward noising process q which produces latents x1 through xT by adding Gaussian noise at time t with variance βt ∈ (0, 1) as follows:

q(x_1, \ldots, x_T | x_0) := \prod_{t=1}^{T} q(x_t | x_{t-1})    (1)
q(x_t | x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)    (2)

Given sufficiently large T and a well-behaved schedule of βt, the latent xT is nearly an isotropic Gaussian distribution. Thus, if we know the exact reverse distribution q(xt−1|xt), we can sample xT ∼ N(0, I) and run the process in reverse to get a sample from q(x0). However, since q(xt−1|xt) depends on the entire data distribution, we approximate it using a neural network as follows:

p_\theta(x_{t-1} | x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))    (3)

The combination of q and p is a variational auto-encoder (Kingma & Welling, 2013), and we can write the variational lower bound (VLB) as follows:

L_{\mathrm{vlb}} := L_0 + L_1 + \ldots + L_{T-1} + L_T    (4)
L_0 := -\log p_\theta(x_0 | x_1)    (5)
L_{t-1} := D_{KL}\big(q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t)\big)    (6)
L_T := D_{KL}\big(q(x_T | x_0) \,\|\, p(x_T)\big)    (7)

The noising process defined in Equation 2 allows us to sample an arbitrary step of the noised latents directly conditioned on the input x0. With αt := 1 − βt and ᾱt := ∏_{s=0}^{t} αs, we can write the marginal

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)    (8)
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon    (9)

where ε ∼ N(0, I). Here, 1 − ᾱt tells us the variance of the noise for an arbitrary timestep, and we could equivalently use this to define the noise schedule instead of βt.

Using Bayes' theorem, one can calculate the posterior q(xt−1|xt, x0) in terms of β̃t and µ̃t(xt, x0), which are defined as follows:

\tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t    (10)
\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t    (11)
q(x_{t-1} | x_t, x_0) = \mathcal{N}\big(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\big)    (12)

2.2. Training in Practice

The objective in Equation 4 is a sum of independent terms Lt−1, and Equation 9 provides an efficient way to sample from an arbitrary step of the forward noising process and estimate Lt−1 using the posterior (Equation 12) and prior (Equation 3). We can thus randomly sample t and use the expectation E_{t, x0, ε}[Lt−1] to estimate Lvlb. Ho et al. (2020) uniformly sample t for each image in each mini-batch.

There are many different ways to parameterize µθ(xt, t) in the prior. The most obvious option is to predict µθ(xt, t) directly with a neural network. Alternatively, the network could predict x0, and this output could be used in Equation 11 to produce µθ(xt, t). The network could also predict the noise ε and use Equations 9 and 11 to derive

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)    (13)

Ho et al. (2020) found that predicting ε worked best, especially when combined with a reweighted loss function, the mean-squared error E_{t, x0, ε}[‖ε − ε_θ(x_t, t)‖²].
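To make the notation above concrete, the following is a minimal NumPy sketch of the forward process and the ε-parameterization (Equations 9 to 13). The linear βt schedule endpoints and the helper names (`sample_xt`, `posterior_mean_variance`, `mu_from_eps`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Illustrative linear beta schedule (the endpoints are an assumption, not taken from this excerpt).
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # beta_t
alphas = 1.0 - betas                        # alpha_t
alphas_bar = np.cumprod(alphas)             # \bar{alpha}_t = prod_s alpha_s

def sample_xt(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) via Eq. 9: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

def posterior_mean_variance(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from Eqs. 10-11."""
    abar_t = alphas_bar[t]
    abar_prev = alphas_bar[t - 1] if t > 0 else 1.0
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]                 # Eq. 10
    mean = (np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)) * x0 \
         + (np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)) * xt      # Eq. 11
    return mean, beta_tilde

def mu_from_eps(xt, eps_pred, t):
    """Turn a predicted noise eps_theta into mu_theta via Eq. 13."""
    return (xt - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_pred) / np.sqrt(alphas[t])

# Example: noise a dummy "image" and check that, with the true eps, Eq. 13 reproduces Eq. 11.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))
t = 500
xt, eps = sample_xt(x0, t, rng)
mean_posterior, _ = posterior_mean_variance(x0, xt, t)
mean_from_eps = mu_from_eps(xt, eps, t)
print(np.max(np.abs(mean_posterior - mean_from_eps)))  # small, up to floating-point error
```

With the true noise ε, Equation 13 is algebraically identical to the posterior mean in Equation 11, which the final check verifies numerically.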
…sample quality using either σt² = βt or σt² = β̃t, which are…

3. Improving the Log-likelihood

While Ho et al. (2020) found that DDPMs can generate high-fidelity samples according to FID (Heusel et al., 2017) and Inception Score (Salimans et al., 2016), they were unable to achieve competitive log-likelihoods with these models. Log-likelihood is a widely used metric in generative modeling, and it is generally believed that optimizing log-likelihood forces generative models to capture all of the modes of the data distribution (Razavi et al., 2019). Additionally, recent work (Henighan et al., 2020) has shown that small improvements in log-likelihood can have a dramatic impact on sample quality and learnt feature representations. Thus, it…

Figure 1. The ratio β̃t/βt for every diffusion step for diffusion processes of different lengths (curves for T = 100, 1000, and 10000 steps; x-axis: diffusion step t/T).
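The quantity plotted in Figure 1 follows directly from Equation 10. The sketch below assumes a linear βt schedule (an illustrative assumption; the schedules behind the figure are not specified in this excerpt) and computes β̃t/βt for several process lengths T.

```python
import numpy as np

def beta_tilde_over_beta(T, beta_min=1e-4, beta_max=0.02):
    """Ratio beta_tilde_t / beta_t (Eq. 10) for an assumed linear beta schedule of length T."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas_bar = np.cumprod(1.0 - betas)
    abar_prev = np.concatenate(([1.0], alphas_bar[:-1]))   # \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1
    return (1.0 - abar_prev) / (1.0 - alphas_bar)           # beta_tilde_t / beta_t

for T in (100, 1000, 10000):
    ratio = beta_tilde_over_beta(T)
    # The ratio is 0 at t = 1 (Eq. 10 gives beta_tilde_1 = 0) and approaches 1 late in the process.
    print(T, ratio[0], ratio[T // 2], ratio[-1])
```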
Figure 3. Latent samples from the linear (top) and cosine (bottom) schedules respectively at linearly spaced values of t from 0 to T. The latents in the last quarter of the linear schedule are almost purely noise, whereas the cosine schedule adds noise more slowly.

[Figure: ᾱt as a function of diffusion step (t/T) for the linear and cosine schedules.]
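The difference between the two schedules in Figure 3 comes down to how quickly ᾱt decays. The sketch below contrasts a linear βt schedule with a squared-cosine ᾱt schedule; both the linear endpoints and the exact cosine form (including the offset s) are illustrative assumptions, since the precise definitions are not reproduced in this excerpt.

```python
import numpy as np

def alphas_bar_linear(T, beta_min=1e-4, beta_max=0.02):
    """\\bar{alpha}_t from a linear beta_t schedule (endpoints are an assumed example)."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def alphas_bar_cosine(T, s=0.008):
    """A squared-cosine \\bar{alpha}_t schedule (illustrative form; offset s assumed)."""
    t = np.arange(1, T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    f0 = np.cos((s / (1 + s)) * np.pi / 2) ** 2
    return f / f0

T = 1000
lin, cos_ = alphas_bar_linear(T), alphas_bar_cosine(T)
# Late in the process the linear schedule has already destroyed nearly all signal,
# while the cosine schedule still retains some (cf. Figure 3).
for frac in (0.25, 0.5, 0.75, 1.0):
    i = int(frac * T) - 1
    print(f"t/T={frac:.2f}  linear abar={lin[i]:.4f}  cosine abar={cos_[i]:.4f}")
```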
[Table: results by training iterations (Iters), T, noise schedule, and objective, reporting NLL and FID.]

Table. …competitive with the best convolutional models, but is worse than fully transformer-based architectures.

Model                                        ImageNet 64×64 NLL   CIFAR-10 NLL
Glow (Kingma & Dhariwal, 2018)               3.81                 3.35
Flow++ (Ho et al., 2019)                     3.69                 3.08
PixelCNN (van den Oord et al., 2016c)        3.57                 3.14
SPN (Menick & Kalchbrenner, 2018)            3.52                 -
NVAE (Vahdat & Kautz, 2020)                  -                    2.91
Very Deep VAE (Child, 2020)                  3.52                 2.87
PixelSNAIL (Chen et al., 2018)               3.52                 2.85
Image Transformer (Parmar et al., 2018)      3.48                 2.90
Sparse Transformer (Child et al., 2019)      3.44                 2.80
Routing Transformer (Roy et al., 2020)       3.43                 -

[Figure: FID versus number of sampling steps.]
Figure 10. FID and validation NLL throughout training on ImageNet 64 × 64 for different model sizes (64 ch/30M, 96 ch/68M, 128 ch/120M, and 192 ch/270M parameters; compute measured in FLOPs; NLL trend line 3.40 + (3.000e-15 · C)^-0.17). The constant for the FID trend line was approximated using the FID of in-distribution data. For the NLL trend line, the constant was approximated by rounding down the current state-of-the-art NLL (Roy et al., 2020) on this dataset.
…than FID. This could be caused by a variety of factors, such as 1) an unexpectedly high irreducible loss (Henighan et al., 2020) for this type of diffusion model, or 2) the model overfitting to the training distribution. We also note that these models do not achieve optimal log-likelihoods in general, because they were trained with our Lhybrid objective and not directly with Lvlb, in order to keep both good log-likelihoods and sample quality.
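For reference, the NLL trend line given in the legend of Figure 10, 3.40 + (3.000e-15 · C)^-0.17 with C the training compute in FLOPs, can be evaluated directly. The sketch below simply takes that fitted form from the figure legend as given; it is illustrative, not a re-derivation of the fit.

```python
def nll_trend(compute_flops: float) -> float:
    """NLL trend line as stated in the legend of Figure 10: 3.40 + (3.000e-15 * C)^-0.17."""
    return 3.40 + (3.000e-15 * compute_flops) ** -0.17

for c in (1e17, 1e18, 1e19, 1e20):
    print(f"C = {c:.0e} FLOPs -> predicted NLL ~= {nll_trend(c):.3f} bits/dim")
```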
7. Related Work

Chen et al. (2020b) and Kong et al. (2020) are two recent works that use DDPMs to produce high-fidelity audio conditioned on mel-spectrograms. Concurrent to our work, Chen et al. (2020b) use a combination of an improved schedule and an L1 loss to allow sampling with fewer steps with very little reduction in sample quality. However, compared to our unconditional image generation task, their generative task has a strong input conditioning signal provided by the mel-spectrograms, and we hypothesize that this makes it easier to sample with fewer diffusion steps.

Jolicoeur-Martineau et al. (2020) explored score matching in the image domain, and constructed an adversarial training objective to produce better x0 predictions. However, they found that choosing a better network architecture removed …

In parallel to our work, Song et al. (2020a) and Song et al. (2020b) propose fast sampling algorithms for models trained with the DDPM objective by leveraging different sampling processes. Song et al. (2020a) does this by deriving an implicit generative model that has the same marginal noise distributions as DDPMs while deterministically mapping noise to images. Song et al. (2020b) model the diffusion process as the discretization of a continuous SDE, and observe that there exists an ODE that corresponds to sampling from the reverse SDE. By varying the numerical precision of an ODE solver, they can sample with fewer function evaluations.

Also in parallel, Gao et al. (2020) develop a diffusion model with reverse diffusion steps modeled by an energy-based model. A potential implication of this approach is that fewer diffusion steps should be needed to achieve good samples.

8. Conclusion

We have shown that, with a few modifications, DDPMs can sample much faster and achieve better log-likelihoods with little impact on sample quality. The likelihood is improved by learning Σθ using our parameterization and Lhybrid objective. This brings the likelihood of these models much closer to other likelihood-based models. We surprisingly discover that this change also allows sampling from these models with many fewer steps.

We have also found that DDPMs can match the sample quality of GANs while achieving much better mode coverage as measured by recall. Furthermore, we have investigated how DDPMs scale with the amount of available training compute, and found that more training compute trivially leads to better sample quality and log-likelihood.

The combination of these results makes DDPMs an attractive choice for generative modeling, since they combine good log-likelihoods, high-quality samples, and reasonably fast sampling with a well-grounded, stationary training objective that scales easily with training compute. These results indicate that DDPMs are a promising direction for future research.
References

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Dhariwal, P., Luan, D., and Sutskever, I. Generative pretraining from pixels, 2020a. URL https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf.

Child, R. Very deep VAEs generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers, 2019.

Gao, R., Song, Y., Poole, B., Wu, Y. N., and Kingma, D. P. Learning energy-based models by diffusion recovery likelihood, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. Scaling laws for autoregressive generative modeling, 2020.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models, 2020.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Jolicoeur-Martineau, A., Piché-Taillefer, R., des Combes, R. T., and Mitliagkas, I. Adversarial score matching and improved sampling for image generation, 2020.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis, 2020.

Krizhevsky, A. Learning multiple layers of features from tiny images, 2009. URL http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models, 2019.

McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training, 2018.

Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling, 2018.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

Ravuri, S. and Vinyals, O. Classification accuracy score for conditional generative models. arXiv preprint arXiv:1905.10887, 2019.

Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop, 2015.
A. Hyperparameters

For all of our experiments, we use a UNet model architecture similar to that used by Ho et al. (2020). We changed the attention layers to use multi-head attention (Vaswani et al., 2017), and opted to use four attention heads rather than one (while keeping the same total number of channels). We employ attention not only at the 16×16 resolution, but also at the 8×8 resolution. Additionally, we changed the way the model conditions on t. In particular, instead of computing a conditioning vector v and injecting it into the hidden state h as GroupNorm(h + v), we compute conditioning vectors w and b and inject them into the hidden state as GroupNorm(h)(w + 1) + b. We found in preliminary experiments on ImageNet 64 × 64 that these modifications slightly improved FID.
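As a concrete illustration of the conditioning change described above, the following is a minimal PyTorch-style sketch of the GroupNorm(h)(w + 1) + b injection. The module name, the single linear projection from the timestep embedding, and all shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaptiveGroupNorm(nn.Module):
    """Inject a timestep embedding as GroupNorm(h) * (w + 1) + b (scale and shift),
    instead of GroupNorm(h + v). Names and shapes here are illustrative assumptions."""

    def __init__(self, channels: int, emb_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        # One linear layer produces both the scale (w) and shift (b) vectors.
        self.to_scale_shift = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # h: (N, C, H, W) feature map; emb: (N, emb_dim) timestep embedding.
        w, b = self.to_scale_shift(emb).chunk(2, dim=1)
        w = w[:, :, None, None]
        b = b[:, :, None, None]
        return self.norm(h) * (w + 1) + b

# Usage sketch: condition a 64-channel feature map on a 128-dimensional embedding.
block = AdaptiveGroupNorm(channels=64, emb_dim=128)
h = torch.randn(2, 64, 16, 16)
emb = torch.randn(2, 128)
out = block(h, emb)   # same shape as h
```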
For ImageNet 64 × 64, the architecture we use is described as follows. The downsampling stack performs four steps of downsampling, each with three residual blocks (He et al., 2015). The upsampling stack is set up as a mirror image of the downsampling stack. From highest to lowest resolution, the UNet stages use [C, 2C, 3C, 4C] channels, respectively. In our ImageNet 64 × 64 ablations, we set C = 128, but we experiment with scaling C in a later section. We estimate that, with C = 128, our model comprises 120M parameters and requires roughly 39 billion FLOPs in the forward pass.

…the FID to be higher, but requires much less compute for sampling and helps do large ablations. Since we mainly use FID for relative comparisons on unconditional ImageNet 64 × 64, this bias is acceptable. For computing the reference distribution statistics, we follow prior work (Ho et al., 2020; Brock et al., 2018) and use the full training set for CIFAR-10 and ImageNet, and 50K training samples for LSUN. Note that unconditional ImageNet 64 × 64 models are trained and evaluated using the official ImageNet-64 dataset (van den Oord et al., 2016a), whereas for class-conditional ImageNet 64 × 64 and 256 × 256 we center crop and area downsample images (Brock et al., 2018).

[Figure: FID over training for the Lsimple objective (with σt² = βt or σt² = β̃t) and the Lhybrid objective, at batch size 64 with learning rate 2e-5 and batch size 128 with learning rate 1e-4.]

Model                                         FID
VQ-VAE-2 (Razavi et al., 2019), two-stage     38.1
Improved Diffusion (ours, single-stage)       31.5
Improved Diffusion (ours, two-stage)          12.3
BigGAN (Brock et al., 2018)                   7.7
BigGAN-deep (Brock et al., 2018)              7.0

B. Fast Sampling on LSUN 256 × 256
D. Combining Lhybrid and Lvlb Models

Figure 13. The ratio between VLB terms for each diffusion step of θhybrid and θvlb. Values less than 1.0 indicate that θhybrid is "better" than θvlb for that timestep of the diffusion process.

Figure 14. Samples from θvlb and θhybrid, as well as an ensemble produced by using θvlb for the first and last 100 diffusion steps. For these samples, the seed was fixed, allowing a direct comparison between models.

To understand the trade-off between Lhybrid and Lvlb, we show in Figure 13 that the model resulting from Lvlb (referred to as θvlb) is better at the start and end of the diffusion process, while the model resulting from Lhybrid (referred to as θhybrid) is better throughout the middle of the diffusion process. This suggests that θvlb is focusing more on imperceptible details, hence the lower sample quality.

Given the above observation, we performed an experiment on ImageNet 64 × 64 to combine the two models by constructing an ensemble that uses θhybrid for t ∈ [100, T − 100) and θvlb elsewhere. We found that this model achieved an FID of 19.9 and an NLL of 3.52 bits/dim. This is only slightly worse than θhybrid in terms of FID, while being better than both models in terms of NLL.

E. Log-likelihood with Fewer Diffusion Steps

Figure 15. NLL versus number of evaluation steps, for models trained on ImageNet 64 × 64 (top) and CIFAR-10 (bottom). All models were trained with 4000 diffusion steps.

Figure 15 plots negative log-likelihood as a function of the number of sampling steps for both ImageNet 64 × 64 and CIFAR-10. In initial experiments, we found that although constant striding did not significantly affect FID, it drastically reduced log-likelihood. To address this, we use a strided subset of timesteps as for FID, but we also include every t from 1 to T/K. This requires T/K extra evaluation steps, but greatly improves log-likelihood compared to the uniformly strided schedule. We did not attempt to calculate NLL using DDIM, since Song et al. (2020a) does not present NLL results or a simple way of estimating likelihood under DDIM.
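A minimal sketch of the evaluation-step selection described above: take a strided subset of timesteps (as for FID), and additionally include every timestep from 1 up to T/K. The helper name and the use of evenly spaced striding are assumptions for illustration.

```python
def nll_eval_timesteps(T: int, K: int) -> list[int]:
    """Timesteps (1-indexed) used for NLL evaluation: an evenly strided subset of K steps,
    plus every t from 1 to T // K (the extra low-t steps whose inclusion greatly improves NLL)."""
    strided = {round(i * T / K) for i in range(1, K + 1)}   # assumed even striding, as for FID
    dense_prefix = set(range(1, T // K + 1))                # every t from 1 to T/K
    return sorted(strided | dense_prefix)

steps = nll_eval_timesteps(T=4000, K=100)
print(len(steps))   # roughly K + T/K minus overlaps, i.e. about 40 extra evaluations here
```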
Figure 17. A sweep of dropout and EMA hyperparameters on class-conditional ImageNet-64. (Legend entries are (dropout, EMA rate) pairs over dropout ∈ {0.1, 0.3} and EMA ∈ {0.99, ..., 0.99999}, plus the original best configuration; y-axis: FID; x-axis: training iterations in thousands.)

[Figure: train and test NLL versus training iterations (thousands) for the linear and cosine schedules.]

Like on CIFAR-10, we surprisingly observed overfitting on class-conditional ImageNet 64 × 64, despite it being a much larger and more diverse dataset. The main observable result of this overfitting was that FID started becoming …
Figure 18. 50 sampling steps on unconditional ImageNet 64 × 64
Figure 19. 100 sampling steps on unconditional ImageNet 64 × 64
Figure 20. 200 sampling steps on unconditional ImageNet 64 × 64
Figure 21. 400 sampling steps on unconditional ImageNet 64 × 64
Figure 22. 1000 sampling steps on unconditional ImageNet 64 × 64
Figure 23. 4000 sampling steps on unconditional ImageNet 64 × 64
Figure 24. 50 sampling steps on unconditional CIFAR-10
Figure 25. 100 sampling steps on unconditional CIFAR-10
Figure 26. 200 sampling steps on unconditional CIFAR-10
Figure 27. 400 sampling steps on unconditional CIFAR-10
Figure 28. 1000 sampling steps on unconditional CIFAR-10
Figure 29. 4000 sampling steps on unconditional CIFAR-10
Figure 30. Unconditional ImageNet 64 × 64 samples generated from Lhybrid (top) and Lvlb (bottom) models using the exact same random noise. Both models were trained for 1.5M iterations.

Figure 31. Unconditional CIFAR-10 samples generated from Lhybrid (top) and Lvlb (bottom) models using the exact same random noise. Both models were trained for 500K iterations.