Abstract

Diffusion models have made significant breakthroughs in image, audio, and video generation,
For example, consistency models define a one-to-one mapping from a Gaussian noise vector to a data sample. Similar to latent variable models like GANs, VAEs, and normalizing flows, consistency models can easily interpolate between samples by traversing the latent space (Fig. 11). As consistency models are trained to recover $x$ from any noisy input $x_t$ where $t \in [\epsilon, T]$, they can perform denoising for various noise levels (Fig. 12). Moreover, the multistep generation procedure in Algorithm 1 is useful for solving certain inverse problems in zero shot by using an iterative replacement procedure similar to that of diffusion models (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2022b). This enables many applications in the context of image editing, including inpainting (Fig. 10), colorization (Fig. 8), super-resolution (Fig. 6b) and stroke-guided image editing (Fig. 13) as in SDEdit (Meng et al., 2021). In Section 6.3, we empirically demonstrate the power of consistency models on many zero-shot image editing tasks.

4. Training Consistency Models via Distillation

We present our first method for training consistency models based on distilling a pre-trained score model $s_\phi(x, t)$. Our discussion revolves around the empirical PF ODE in Eq. (3), obtained by plugging the score model $s_\phi(x, t)$ into the PF ODE. Consider discretizing the time horizon $[\epsilon, T]$ into $N - 1$ sub-intervals, with boundaries $t_1 = \epsilon < t_2 < \cdots < t_N = T$.

We first sample $x$ from the dataset, followed by sampling $x_{t_{n+1}}$ from the transition density of the SDE, $\mathcal{N}(x, t_{n+1}^2 I)$, and then compute $\hat{x}_{t_n}^\phi$ using one discretization step of the numerical ODE solver according to Eq. (6). Afterwards, we train the consistency model by minimizing its output differences on the pair $(\hat{x}_{t_n}^\phi, x_{t_{n+1}})$. This motivates our following consistency distillation loss for training consistency models.

Definition 1. The consistency distillation loss is defined as
$$\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) := \mathbb{E}\big[\lambda(t_n)\, d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{x}_{t_n}^\phi, t_n)\big)\big], \tag{7}$$
where the expectation is taken with respect to $x \sim p_{\mathrm{data}}$, $n \sim \mathcal{U}[\![1, N-1]\!]$, and $x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I)$. Here $\mathcal{U}[\![1, N-1]\!]$ denotes the uniform distribution over $\{1, 2, \cdots, N-1\}$, $\lambda(\cdot) \in \mathbb{R}^+$ is a positive weighting function, $\hat{x}_{t_n}^\phi$ is given by Eq. (6), $\theta^-$ denotes a running average of the past values of $\theta$ during the course of optimization, and $d(\cdot, \cdot)$ is a metric function that satisfies $\forall x, y: d(x, y) \geq 0$ and $d(x, y) = 0$ if and only if $x = y$.

Unless otherwise stated, we adopt the notations in Definition 1 throughout this paper, and use $\mathbb{E}[\cdot]$ to denote the expectation over all relevant random variables. In our experiments, we consider the squared $\ell_2$ distance $d(x, y) = \|x - y\|_2^2$, the $\ell_1$ distance $d(x, y) = \|x - y\|_1$, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)). We find $\lambda(t_n) \equiv 1$ performs well across all
tasks and datasets. In practice, we minimize the objective by stochastic gradient descent on the model parameters $\theta$, while updating $\theta^-$ with an exponential moving average (EMA). That is, given a decay rate $0 \leq \mu < 1$, we perform the following update after each optimization step:
$$\theta^- \leftarrow \operatorname{stopgrad}(\mu\theta^- + (1 - \mu)\theta). \tag{8}$$
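To make the procedure concrete, below is a minimal PyTorch-style sketch of one optimization step implementing the loss in Definition 1 together with the EMA update of Eq. (8). The wrappers `f_online`, `f_target`, and `ode_solver_step` are hypothetical stand-ins for $f_\theta$, $f_{\theta^-}$, and one step of the numerical ODE solver in Eq. (6); the squared $\ell_2$ metric with $\lambda(t_n) \equiv 1$ is used for simplicity.

```python
import torch

def cd_training_step(f_online, f_target, ode_solver_step, x, t, t_next, opt, mu=0.999):
    """One consistency distillation step (Definition 1 plus the EMA update, Eq. (8)).
    f_online / f_target play the roles of f_theta and f_theta^-; ode_solver_step
    maps (x_{t_{n+1}}, t_{n+1}, t_n) to x_hat_{t_n} as in Eq. (6)."""
    # Perturb the data sample: x_{t_{n+1}} ~ N(x, t_{n+1}^2 I).
    x_next = x + t_next * torch.randn_like(x)
    with torch.no_grad():
        x_hat = ode_solver_step(x_next, t_next, t)   # one solver step toward t_n
        target = f_target(x_hat, t)                  # f_{theta^-}(x_hat_{t_n}, t_n)
    pred = f_online(x_next, t_next)                  # f_theta(x_{t_{n+1}}, t_{n+1})
    loss = ((pred - target) ** 2).mean()             # squared l2 metric, lambda = 1
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update of the target network, Eq. (8).
    with torch.no_grad():
        for p_t, p_o in zip(f_target.parameters(), f_online.parameters()):
            p_t.mul_(mu).add_(p_o, alpha=1 - mu)
    return loss.item()
```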
The overall training procedure is summarized in Algorithm 2. In alignment with the convention in deep reinforcement learning (Mnih et al., 2013; 2015; Lillicrap et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020), we refer to $f_{\theta^-}$ as the "target network" and $f_\theta$ as the "online network". We find that compared to simply setting $\theta^- = \theta$, the EMA update and "stopgrad" operator in Eq. (8) can greatly stabilize the training process and improve the final performance of the consistency model.

Below we provide a theoretical justification for consistency distillation based on asymptotic analysis.

Theorem 1. Let $\Delta t := \max_{n \in [\![1, N-1]\!]}\{|t_{n+1} - t_n|\}$, and $f(\cdot, \cdot; \phi)$ be the consistency function of the empirical PF ODE in Eq. (3). Assume $f_\theta$ satisfies the Lipschitz condition: there exists $L > 0$ such that for all $t \in [\epsilon, T]$, $x$, and $y$, we have $\|f_\theta(x, t) - f_\theta(y, t)\|_2 \leq L\|x - y\|_2$. Assume further that for all $n \in [\![1, N-1]\!]$, the ODE solver called at $t_{n+1}$ has local error uniformly bounded by $O((t_{n+1} - t_n)^{p+1})$ with $p \geq 1$. Then, if $\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta; \phi) = 0$, we have
$$\sup_{n, x}\|f_\theta(x, t_n) - f(x, t_n; \phi)\|_2 = O((\Delta t)^p).$$

Proof. The proof is based on induction and parallels the classic proof of global error bounds for numerical ODE solvers (Süli & Mayers, 2003). We provide the full proof in Appendix A.2.

Since $\theta^-$ is a running average of the history of $\theta$, we have $\theta^- = \theta$ when the optimization of Algorithm 2 converges. That is, the target and online consistency models will eventually match each other. If the consistency model additionally achieves zero consistency distillation loss, then Theorem 1 implies that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate, as long as the step size of the ODE solver is sufficiently small.

The consistency distillation loss $\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi)$ can be extended to hold for infinitely many time steps ($N \to \infty$) if $\theta^- = \theta$ or $\theta^- = \operatorname{stopgrad}(\theta)$. The resulting continuous-time loss functions do not require specifying $N$ nor the time steps $\{t_1, t_2, \cdots, t_N\}$. Nonetheless, they involve Jacobian-vector products and require forward-mode automatic differentiation for efficient implementation, which may not be well-supported in some deep learning frameworks. We provide these continuous-time distillation loss functions in Theorems 3 to 5, and relegate details to Appendix B.1.

5. Training Consistency Models in Isolation

Consistency models can be trained without relying on any pre-trained diffusion models. This differs from diffusion distillation techniques, making consistency models a new independent family of generative models.

In consistency distillation, we use a pre-trained score model $s_\phi(x, t)$ to approximate the ground truth score function $\nabla\log p_t(x)$. To get rid of this dependency, we need to seek other ways to estimate the score function. In fact, there exists an unbiased estimator of $\nabla\log p_t(x_t)$ due to the following identity (Lemma 1 in Appendix A):
$$\nabla\log p_t(x_t) = -\mathbb{E}\left[\left.\frac{x_t - x}{t^2}\,\right|\, x_t\right],$$
where $x \sim p_{\mathrm{data}}$ and $x_t \sim \mathcal{N}(x; t^2 I)$. That is, given $x$ and $x_t$, we can form a Monte Carlo estimate of $\nabla\log p_t(x_t)$ with $-(x_t - x)/t^2$. We now show that this estimate actually suffices to replace the pre-trained diffusion model in
consistency distillation, when using the Euler method (or any higher order method) as the ODE solver in the limit of $N \to \infty$.

More precisely, we have the following theorem.

Theorem 2. Let $\Delta t := \max_{n \in [\![1, N-1]\!]}\{|t_{n+1} - t_n|\}$. Assume $d$ and $f_{\theta^-}$ are both twice continuously differentiable with bounded second derivatives, the weighting function $\lambda(\cdot)$ is bounded, and $\mathbb{E}[\|\nabla\log p_{t_n}(x_{t_n})\|_2^2] < \infty$. Assume further that we use the Euler ODE solver, and the pre-trained score model matches the ground truth, i.e., $\forall t \in [\epsilon, T]: s_\phi(x, t) \equiv \nabla\log p_t(x)$. Then,
$$\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) = \mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) + o(\Delta t), \tag{9}$$
where the expectation is taken with respect to $x \sim p_{\mathrm{data}}$, $n \sim \mathcal{U}[\![1, N-1]\!]$, and $x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I)$. The consistency training objective, denoted by $\mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-)$, is defined as
$$\mathbb{E}\big[\lambda(t_n)\, d\big(f_\theta(x + t_{n+1}z, t_{n+1}),\, f_{\theta^-}(x + t_n z, t_n)\big)\big], \tag{10}$$
where $z \sim \mathcal{N}(0, I)$.

Similar to the distillation case, the continuous-time extension of the CT loss involves Jacobian-vector products and requires forward-mode automatic differentiation for efficient implementation. Unlike the discrete-time CT loss, there is no undesirable "bias" associated with the continuous-time objective, as we effectively take $\Delta t \to 0$ in Theorem 2. We relegate more details to Appendix B.2.

6. Experiments

We employ consistency distillation and consistency training to learn consistency models on real image datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet 64 × 64 (Deng et al., 2009), LSUN Bedroom 256 × 256, and LSUN Cat 256 × 256 (Yu et al., 2015). Results are compared according to Fréchet Inception Distance (FID, Heusel et al. (2017), lower is better), Inception Score (IS, Salimans et al. (2016), higher is better), Precision (Prec., Kynkäänniemi et al. (2019), higher is better), and Recall (Rec., Kynkäänniemi et al. (2019), higher is better). Additional experimental details are provided in Appendix C.
(a) Metric functions in CD. (b) Solvers and N in CD. (c) N with Heun solver in CD. (d) Adaptive N and µ in CT.
Figure 3: Various factors that affect consistency distillation (CD) and consistency training (CT) on CIFAR-10. The best configuration for CD is LPIPS, the Heun ODE solver, and N = 18. Our adaptive schedule functions for N and µ make CT converge significantly faster than fixing them to be constants during the course of optimization.

(a) CIFAR-10 (b) ImageNet 64 × 64 (c) Bedroom 256 × 256 (d) Cat 256 × 256
Figure 4: Multistep image generation with consistency distillation (CD). CD outperforms progressive distillation (PD) across all datasets and sampling steps. The only exception is single-step generation on Bedroom 256 × 256.
corroborates with Theorem 1, which states that the optimal consistency models trained by higher order ODE solvers have smaller estimation errors with the same $N$. The results of Fig. 3c also indicate that once $N$ is sufficiently large, the performance of CD becomes insensitive to $N$. Given these insights, we hereafter use LPIPS and the Heun ODE solver for CD unless otherwise stated. For $N$ in CD, we follow the suggestions in Karras et al. (2022) on CIFAR-10 and ImageNet 64 × 64. We tune $N$ separately on other datasets (details in Appendix C).

Due to the strong connection between CD and CT, we adopt LPIPS for our CT experiments throughout this paper. Unlike CD, there is no need for using Heun's second order solver in CT, as the loss function does not rely on any particular numerical ODE solver. As demonstrated in Fig. 3d, the convergence of CT is highly sensitive to $N$: smaller $N$ leads to faster convergence but worse samples, whereas larger $N$ leads to slower convergence but better samples upon convergence. This matches our analysis in Section 5, and motivates our practical choice of progressively growing $N$ and $\mu$ for CT to balance the trade-off between convergence speed and sample quality. As shown in Fig. 3d, adaptive schedules of $N$ and $\mu$ significantly improve the convergence speed and sample quality of CT. In our experiments, we tune the schedules $N(\cdot)$ and $\mu(\cdot)$ separately for images of different resolutions, with more details in Appendix C.

6.2. Few-Step Image Generation

Distillation In the current literature, the most directly comparable approach to our consistency distillation (CD) is progressive distillation (PD, Salimans & Ho (2022)); both are thus far the only distillation approaches that do not construct synthetic data before distillation. In stark contrast, other distillation techniques, such as knowledge distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022), have to prepare a large synthetic dataset by generating numerous samples from the diffusion model with expensive numerical ODE solvers. We perform a comprehensive comparison between PD and CD on CIFAR-10, ImageNet 64 × 64, and LSUN 256 × 256, with all results reported in Fig. 4. All methods distill from an EDM (Karras et al., 2022) model that we pre-trained in-house. We note that across all sampling iterations, using the LPIPS metric uniformly improves PD compared to the squared $\ell_2$ distance in the original paper of Salimans & Ho (2022). Both PD and CD improve as we take more sampling steps. We find that CD uniformly outperforms PD across all datasets, sampling steps, and metric functions considered, except for single-step generation on Bedroom 256 × 256, where CD with $\ell_2$ slightly underperforms PD with $\ell_2$. As shown in Table 1, CD even outperforms distillation approaches that require synthetic dataset construction, such as knowledge distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022).
Table 1: Sample quality on CIFAR-10. *Methods that require synthetic data construction for distillation.

METHOD | NFE (↓) | FID (↓) | IS (↑)

Diffusion + Samplers
DDIM (Song et al., 2020) | 50 | 4.67 | –
DDIM (Song et al., 2020) | 20 | 6.84 | –
DDIM (Song et al., 2020) | 10 | 8.23 | –
DPM-solver-2 (Lu et al., 2022) | 12 | 5.28 | –
DPM-solver-3 (Lu et al., 2022) | 12 | 6.03 | –
3-DEIS (Zhang & Chen, 2022) | 10 | 4.17 | –

Diffusion + Distillation
Knowledge Distillation* (Luhman & Luhman, 2021) | 1 | 9.36 | –
DFNO* (Zheng et al., 2022) | 1 | 4.12 | –
1-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 6.18 | 9.08
2-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 4.85 | 9.01
3-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 5.21 | 8.79
PD (Salimans & Ho, 2022) | 1 | 8.34 | 8.69
CD | 1 | 3.55 | 9.48
PD (Salimans & Ho, 2022) | 2 | 5.58 | 9.05
CD | 2 | 2.93 | 9.75

Direct Generation
BigGAN (Brock et al., 2019) | 1 | 14.7 | 9.22
CR-GAN (Zhang et al., 2019) | 1 | 14.6 | 8.40
AutoGAN (Gong et al., 2019) | 1 | 12.4 | 8.55
E2GAN (Tian et al., 2020) | 1 | 11.3 | 8.51
ViTGAN (Lee et al., 2021) | 1 | 6.66 | 9.30
TransGAN (Jiang et al., 2021) | 1 | 9.26 | 9.05
StyleGAN2-ADA (Karras et al., 2020) | 1 | 2.92 | 9.83
StyleGAN-XL (Sauer et al., 2022) | 1 | 1.85 | –
Score SDE (Song et al., 2021) | 2000 | 2.20 | 9.89
DDPM (Ho et al., 2020) | 1000 | 3.17 | 9.46
LSGM (Vahdat et al., 2021) | 147 | 2.10 | –
PFGM (Xu et al., 2022) | 110 | 2.35 | 9.68
EDM (Karras et al., 2022) | 36 | 2.04 | 9.84
1-Rectified Flow (Liu et al., 2022) | 1 | 378 | 1.13
Glow (Kingma & Dhariwal, 2018) | 1 | 48.9 | 3.92
Residual Flow (Chen et al., 2019a) | 1 | 46.4 | –
GLFlow (Xiao et al., 2019) | 1 | 44.6 | –
DenseFlow (Grcić et al., 2021) | 1 | 34.9 | –
DC-VAE (Parmar et al., 2021) | 1 | 17.9 | 8.20
CT | 1 | 8.70 | 8.49
CT | 2 | 5.83 | 8.85

Table 2: Sample quality on ImageNet 64 × 64, and LSUN Bedroom & Cat 256 × 256. †Distillation techniques.

METHOD | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑)

ImageNet 64 × 64
PD† (Salimans & Ho, 2022) | 1 | 15.39 | 0.59 | 0.62
DFNO†* (Zheng et al., 2022) | 1 | 8.35 | – | –
CD† | 1 | 6.20 | 0.68 | 0.63
PD† (Salimans & Ho, 2022) | 2 | 8.95 | 0.63 | 0.65
CD† | 2 | 4.70 | 0.69 | 0.64
ADM (Dhariwal & Nichol, 2021) | 250 | 2.07 | 0.74 | 0.63
EDM (Karras et al., 2022) | 79 | 2.44 | 0.71 | 0.67
BigGAN-deep (Brock et al., 2019) | 1 | 4.06 | 0.79 | 0.48
CT | 1 | 13.0 | 0.71 | 0.47
CT | 2 | 11.1 | 0.69 | 0.56

LSUN Bedroom 256 × 256
PD† (Salimans & Ho, 2022) | 1 | 16.92 | 0.47 | 0.27
PD† (Salimans & Ho, 2022) | 2 | 8.47 | 0.56 | 0.39
CD† | 1 | 7.80 | 0.66 | 0.34
CD† | 2 | 5.22 | 0.68 | 0.39
DDPM (Ho et al., 2020) | 1000 | 4.89 | 0.60 | 0.45
ADM (Dhariwal & Nichol, 2021) | 1000 | 1.90 | 0.66 | 0.51
EDM (Karras et al., 2022) | 79 | 3.57 | 0.66 | 0.45
SS-GAN (Chen et al., 2019b) | 1 | 13.3 | – | –
PGGAN (Karras et al., 2018) | 1 | 8.34 | – | –
PG-SWGAN (Wu et al., 2019) | 1 | 8.0 | – | –
StyleGAN2 (Karras et al., 2020) | 1 | 2.35 | 0.59 | 0.48
CT | 1 | 16.0 | 0.60 | 0.17
CT | 2 | 7.85 | 0.68 | 0.33

LSUN Cat 256 × 256
PD† (Salimans & Ho, 2022) | 1 | 29.6 | 0.51 | 0.25
PD† (Salimans & Ho, 2022) | 2 | 15.5 | 0.59 | 0.36
CD† | 1 | 11.0 | 0.65 | 0.36
CD† | 2 | 8.84 | 0.66 | 0.40
DDPM (Ho et al., 2020) | 1000 | 17.1 | 0.53 | 0.48
ADM (Dhariwal & Nichol, 2021) | 1000 | 5.57 | 0.63 | 0.52
EDM (Karras et al., 2022) | 79 | 6.69 | 0.70 | 0.43
PGGAN (Karras et al., 2018) | 1 | 37.5 | – | –
StyleGAN2 (Karras et al., 2020) | 1 | 7.25 | 0.58 | 0.43
CT | 1 | 20.7 | 0.56 | 0.23
CT | 2 | 11.7 | 0.63 | 0.36
Figure 5: Samples generated by EDM (top), CT + single-step generation (middle), and CT + 2-step generation (bottom). All corresponding images are generated from the same initial noise.
(a) Left: The gray-scale image. Middle: Colorized images. Right: The ground-truth image.
(b) Left: The downsampled image (32 × 32). Middle: Full resolution images (256 × 256). Right: The ground-truth image (256 × 256).
(c) Left: A stroke input provided by users. Right: Stroke-guided image generation.
Figure 6: Zero-shot image editing with a consistency model trained by consistency distillation on LSUN Bedroom 256 × 256.
Direct Generation In Tables 1 and 2, we compare the sample quality of consistency training (CT) with other generative models using one-step and two-step generation. We also include PD and CD results for reference. Both tables report PD results obtained with the $\ell_2$ metric function, as this is the default setting used in the original paper of Salimans & Ho (2022). For fair comparison, we train both PD and CD to distill the same EDM models. In Tables 1 and 2, we observe that CT outperforms all single-step, non-adversarial generative models, i.e., VAEs and normalizing flows, by a significant margin on CIFAR-10. Moreover, CT obtains comparable quality to PD for single-step generation without relying on distillation. In Fig. 5, we provide EDM samples (top), single-step CT samples (middle), and two-step CT samples (bottom). In Appendix E, we show additional samples for both CD and CT in Figs. 14 to 21. Importantly, all samples obtained from the same initial noise vector share significant structural similarity, even though the CT and EDM models are trained independently from one another. This indicates that CT is unlikely to suffer from mode collapse, as EDMs do not.

6.3. Zero-Shot Image Editing

Similar to diffusion models, consistency models allow zero-shot image editing by modifying the multistep sampling process in Algorithm 1. We demonstrate this capability with a consistency model trained on the LSUN Bedroom dataset using consistency distillation. In Fig. 6a, we show that such a consistency model can colorize gray-scale bedroom images at test time, even though it has never been trained on colorization tasks. In Fig. 6b, we show the same consistency model can generate high-resolution images from low-resolution inputs. In Fig. 6c, we additionally demonstrate that it can generate images based on stroke inputs created by humans, as in SDEdit for diffusion models (Meng et al., 2021). Again, this editing capability is zero-shot, as the model has not been trained on stroke inputs. In Appendix D, we additionally demonstrate the zero-shot capability of consistency models on inpainting (Fig. 10), interpolation (Fig. 11) and denoising (Fig. 12), with more examples on colorization (Fig. 8), super-resolution (Fig. 9) and stroke-guided image generation (Fig. 13).

7. Conclusion

We have introduced consistency models, a type of generative model specifically designed to support one-step and few-step generation. We have empirically demonstrated that our consistency distillation method outshines the existing distillation techniques for diffusion models on multiple image benchmarks and various sampling iterations. Furthermore, as a standalone generative model, consistency models outdo other available models that permit single-step generation, barring GANs. Similar to diffusion models, they also allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation.

In addition, consistency models share striking similarities with techniques employed in other fields, including deep Q-learning (Mnih et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020). This offers exciting prospects for cross-pollination of ideas and methods among these diverse fields.
References
Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 2021.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 2020.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho, J., Salimans, T., Gritsenko, A. A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022b. URL https://openreview.net/forum?id=BBelR2NdDZ5.

Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research (JMLR), 6(4), 2005.

Jiang, Y., Chang, S., and Wang, Z. TransGAN: Two pure transformers can make one strong GAN, and that can scale up. Advances in Neural Information Processing Systems, 34:14745–14758, 2021.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. 2020.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

Kawar, B., Vaksman, G., and Elad, M. SNIPS: Solving noisy inverse problems stochastically. arXiv preprint arXiv:2105.14951, 2021.

Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10215–10224. 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. ViTGAN: Training GANs with vision transformers. arXiv preprint arXiv:2107.04589, 2021.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.

Luhman, E. and Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

Meng, C., Gao, R., Kingma, D. P., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Parmar, G., Li, D., Lee, K., and Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 823–832, 2021.
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad-TTS: A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286, 2014.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Sauer, A., Schwarz, K., and Geiger, A. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33, 2020.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, pp. 204, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.

Song, Y., Shen, L., Xing, L., and Ermon, S. Solving inverse problems in medical imaging with score-based generative models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vaRCHVj0uGI.

Süli, E. and Mayers, D. F. An Introduction to Numerical Analysis. Cambridge University Press, 2003.

Tian, Y., Wang, Q., Huang, Z., Li, W., Dai, D., Yang, M., Wang, J., and Fink, O. Off-policy reinforcement learning for efficient and effective GAN architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pp. 175–192. Springer, 2020.

Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, pp. 2175–2183, 2013.

Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.

Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.

Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pp. 1747–1756. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045575.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel,
D. P., and Gool, L. V. Sliced wasserstein generative
models. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 3713–
3722, 2019.
Xiao, Z., Yan, Q., and Amit, Y. Generative latent flow. arXiv
preprint arXiv:1905.10485, 2019.
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
Zhang, H., Zhang, Z., Odena, A., and Lee, H. Consistency
regularization for generative adversarial networks. arXiv
preprint arXiv:1910.12027, 2019.
Zhang, Q. and Chen, Y. Fast sampling of diffusion
models with exponential integrator. arXiv preprint
arXiv:2204.13902, 2022.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. The unreasonable effectiveness of deep features as a
perceptual metric. In CVPR, 2018.
Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., and
Anandkumar, A. Fast sampling of diffusion models
via operator learning. arXiv preprint arXiv:2211.13449,
2022.
Contents

1 Introduction
2 Diffusion Models
3 Consistency Models
6 Experiments
6.1 Training Consistency Models
6.2 Few-Step Image Generation
6.3 Zero-Shot Image Editing
7 Conclusion
Appendices
Appendix A Proofs
A.1 Notations
A.2 Consistency Distillation
A.3 Consistency Training
Appendices
A. Proofs
A.1. Notations
We use $f_\theta(x, t)$ to denote a consistency model parameterized by $\theta$, and $f(x, t; \phi)$ the consistency function of the empirical PF ODE in Eq. (3). Here $\phi$ symbolizes its dependency on the pre-trained score model $s_\phi(x, t)$. For the consistency function of the PF ODE in Eq. (2), we denote it as $f(x, t)$. Given a multi-variate function $h(x, y)$, we let $\partial_1 h(x, y)$ denote the Jacobian of $h$ over $x$, and analogously $\partial_2 h(x, y)$ denote the Jacobian of $h$ over $y$. Unless otherwise stated, $x$ is supposed to be a random variable sampled from the data distribution $p_{\mathrm{data}}(x)$, $n$ is sampled uniformly at random from $[\![1, N-1]\!]$, and $x_{t_n}$ is sampled from $\mathcal{N}(x; t_n^2 I)$. Here $[\![1, N-1]\!]$ represents the set of integers $\{1, 2, \cdots, N-1\}$. Furthermore, recall that we define
$$\hat{x}_{t_n}^\phi := x_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(x_{t_{n+1}}, t_{n+1}; \phi),$$
where $\Phi(\cdots; \phi)$ denotes the update function of a one-step ODE solver for the empirical PF ODE defined by the score model $s_\phi(x, t)$. By default, $\mathbb{E}[\cdot]$ denotes the expectation over all relevant random variables.
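For concreteness, a minimal sketch of this update function for the Euler solver follows, assuming the empirical PF ODE takes the form $\mathrm{d}x_t/\mathrm{d}t = -t\,s_\phi(x_t, t)$ used in this paper; `score_model` is a hypothetical stand-in for $s_\phi$.

```python
def euler_step(score_model, x_next, t_next, t):
    """One Euler step of the empirical PF ODE dx/dt = -t * s_phi(x, t),
    i.e. Phi(x, t; phi) = -t * s_phi(x, t), producing x_hat_{t_n}
    from x_{t_{n+1}} as in the display above."""
    d = -t_next * score_model(x_next, t_next)   # Phi(x_{t_{n+1}}, t_{n+1}; phi)
    return x_next + (t - t_next) * d            # x_hat_{t_n}
```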
Proof. From $\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta; \phi) = 0$, we have
$$\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta; \phi) = \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}_{t_n}^\phi, t_n))] = 0. \tag{11}$$
According to the definition, we have $p_{t_n}(x_{t_n}) = p_{\mathrm{data}}(x) \otimes \mathcal{N}(0, t_n^2 I)$ where $t_n \geq \epsilon > 0$. It follows that $p_{t_n}(x_{t_n}) > 0$ for every $x_{t_n}$ and $1 \leq n \leq N$. Therefore, Eq. (11) entails
$$\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}_{t_n}^\phi, t_n)) \equiv 0. \tag{12}$$
Because $\lambda(\cdot) > 0$ and $d(x, y) = 0$ if and only if $x = y$, this implies
$$f_\theta(x_{t_{n+1}}, t_{n+1}) \equiv f_\theta(\hat{x}_{t_n}^\phi, t_n). \tag{13}$$
Now let $e_n$ denote the error vector at $t_n$, defined as $e_n := f_\theta(x_{t_n}, t_n) - f(x_{t_n}, t_n; \phi)$. We have
$$\begin{aligned}
e_{n+1} &= f_\theta(x_{t_{n+1}}, t_{n+1}) - f(x_{t_{n+1}}, t_{n+1}; \phi)\\
&\overset{(i)}{=} f_\theta(\hat{x}_{t_n}^\phi, t_n) - f(x_{t_n}, t_n; \phi)\\
&= f_\theta(\hat{x}_{t_n}^\phi, t_n) - f_\theta(x_{t_n}, t_n) + f_\theta(x_{t_n}, t_n) - f(x_{t_n}, t_n; \phi)\\
&= f_\theta(\hat{x}_{t_n}^\phi, t_n) - f_\theta(x_{t_n}, t_n) + e_n,
\end{aligned}\tag{14}$$
where (i) is due to Eq. (13) and $f(x_{t_{n+1}}, t_{n+1}; \phi) = f(x_{t_n}, t_n; \phi)$. Because $f_\theta(\cdot, t_n)$ has Lipschitz constant $L$, we have
$$\begin{aligned}
\|e_{n+1}\|_2 &\leq \|e_n\|_2 + L\,\|\hat{x}_{t_n}^\phi - x_{t_n}\|_2\\
&\overset{(i)}{=} \|e_n\|_2 + L \cdot O((t_{n+1} - t_n)^{p+1})\\
&= \|e_n\|_2 + O((t_{n+1} - t_n)^{p+1}),
\end{aligned}$$
where (i) holds because the ODE solver has local error bounded by $O((t_{n+1} - t_n)^{p+1})$. In addition, we observe that $e_1 = 0$, because
$$e_1 = f_\theta(x_{t_1}, t_1) - f(x_{t_1}, t_1; \phi) \overset{(i)}{=} x_{t_1} - f(x_{t_1}, t_1; \phi) \overset{(ii)}{=} x_{t_1} - x_{t_1} = 0.$$
Here (i) is true because the consistency model is parameterized such that $f_\theta(x_{t_1}, t_1) = x_{t_1}$, and (ii) is entailed by the definition of $f(\cdot, \cdot; \phi)$. This allows us to perform induction on the recursion formula Eq. (14) to obtain
$$\begin{aligned}
\|e_n\|_2 &\leq \|e_1\|_2 + \sum_{k=1}^{n-1} O((t_{k+1} - t_k)^{p+1})\\
&= \sum_{k=1}^{n-1} O((t_{k+1} - t_k)^{p+1})\\
&= \sum_{k=1}^{n-1} (t_{k+1} - t_k)\, O((t_{k+1} - t_k)^p)\\
&\leq \sum_{k=1}^{n-1} (t_{k+1} - t_k)\, O((\Delta t)^p)\\
&= O((\Delta t)^p) \sum_{k=1}^{n-1} (t_{k+1} - t_k)\\
&= O((\Delta t)^p)(t_n - t_1)\\
&\leq O((\Delta t)^p)(T - \epsilon)\\
&= O((\Delta t)^p),
\end{aligned}$$
which completes the proof.
$$\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) = \mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) + o(\Delta t),$$
where the expectation is taken with respect to $x \sim p_{\mathrm{data}}$, $n \sim \mathcal{U}[\![1, N-1]\!]$, and $x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I)$. The consistency training objective, denoted by $\mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-)$, is defined as
$$\mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))].$$

Then, we apply Lemma 1 to Eq. (15) and use Taylor expansion in the reverse direction to obtain
$$\begin{aligned}
&\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi)\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))]\\
&\quad + \mathbb{E}\Big\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))\Big[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})\, t_{n+1}\, \mathbb{E}\Big[\frac{x_{t_{n+1}} - x}{t_{n+1}^2}\,\Big|\, x_{t_{n+1}}\Big]\Big]\Big\}\\
&\quad + \mathbb{E}\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1} - t_n|)]\\
&\overset{(i)}{=} \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))]\\
&\quad + \mathbb{E}\Big\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))\Big[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})\, t_{n+1}\Big(\frac{x_{t_{n+1}} - x}{t_{n+1}^2}\Big)\Big]\Big\}\\
&\quad + \mathbb{E}\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1} - t_n|)]\\
&= \mathbb{E}\Big[\lambda(t_n)\, d\Big(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}\Big(x_{t_{n+1}} + (t_n - t_{n+1})\, t_{n+1}\,\frac{x_{t_{n+1}} - x}{t_{n+1}^2},\, t_n\Big)\Big)\Big] + \mathbb{E}[o(|t_{n+1} - t_n|)]\\
&= \mathbb{E}\Big[\lambda(t_n)\, d\Big(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}\Big(x_{t_{n+1}} + (t_n - t_{n+1})\,\frac{x_{t_{n+1}} - x}{t_{n+1}},\, t_n\Big)\Big)\Big] + \mathbb{E}[o(|t_{n+1} - t_n|)]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}),\, f_{\theta^-}(x + t_{n+1}z + (t_n - t_{n+1})z,\, t_n))] + \mathbb{E}[o(|t_{n+1} - t_n|)]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}),\, f_{\theta^-}(x + t_n z, t_n))] + \mathbb{E}[o(\Delta t)]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}),\, f_{\theta^-}(x + t_n z, t_n))] + o(\Delta t)\\
&= \mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) + o(\Delta t),
\end{aligned}\tag{16}$$
where (i) is due to the law of total expectation, and $z := \frac{x_{t_{n+1}} - x}{t_{n+1}}$ is distributed according to the standard Gaussian distribution. This implies $\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) = \mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) + o(\Delta t)$ and thus completes the proof for Eq. (9). Moreover, we have $\mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) \geq O(\Delta t)$ whenever $\inf_N \mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) > 0$. Otherwise, $\mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) < O(\Delta t)$ and thus $\lim_{\Delta t \to 0}\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) = 0$, a clear contradiction to $\inf_N \mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi) > 0$.
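In practice, the discrete-time objective $\mathcal{L}_{\mathrm{CT}}^N$ above is straightforward to estimate by Monte Carlo: a single Gaussian $z$ perturbs the same data point to the two adjacent noise levels, with no score model or ODE solver involved. A minimal PyTorch-style sketch, with `f_online` and `f_target` as hypothetical stand-ins for $f_\theta$ and $f_{\theta^-}$:

```python
import torch

def ct_loss(f_online, f_target, x, t, t_next, weight=1.0):
    """Monte Carlo estimate of the consistency training loss (Eq. (10)):
    the same z perturbs x to adjacent noise levels t_n and t_{n+1}."""
    z = torch.randn_like(x)                      # z ~ N(0, I)
    pred = f_online(x + t_next * z, t_next)      # f_theta(x + t_{n+1} z, t_{n+1})
    with torch.no_grad():
        target = f_target(x + t * z, t)          # f_{theta^-}(x + t_n z, t_n)
    return weight * ((pred - target) ** 2).mean()   # squared l2 metric shown
```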
B. Continuous-Time Extensions

The consistency distillation and consistency training objectives can be generalized to hold for infinitely many time steps ($N \to \infty$) under suitable conditions.

The matrices $G$ and $H$ play a crucial role in forming continuous-time objectives for consistency distillation. Additionally, we denote the Jacobian of $f_\theta(x, t)$ with respect to $x$ as $\frac{\partial f_\theta(x, t)}{\partial x}$.
When $\theta^- = \theta$ (with no stopgrad operator), we have the following theoretical result.

Theorem 3. Let $t_n = \frac{n-1}{N-1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N-1}$, where $n \in [\![1, N]\!]$. Assume $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t \in [\epsilon, T]} \|s_\phi(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver. Then,
$$\lim_{\Delta t \to 0} \frac{\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta; \phi)}{(\Delta t)^2} = \mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta; \phi), \tag{17}$$
where
$$\mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta; \phi) := \mathbb{E}\left[\frac{\lambda(t)}{2}\left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)^{\mathsf{T}} G(f_\theta(x_t, t)) \left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{18}$$
Here the expectation above is taken over $x \sim p_{\mathrm{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.
Proof. First, we can derive the following equation with Taylor expansion:
$$\begin{aligned}
f_\theta(\hat{x}_{t_n}^\phi, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1}) &= f_\theta(x_{t_{n+1}} + t_{n+1}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t,\, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})\\
&= t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\Delta t + O((\Delta t)^2),
\end{aligned}\tag{19}$$
where we obtain (i) by first expanding $d(f_\theta(x_{t_{n+1}}, t_{n+1}), \cdot)$ to second order, and then noting that $d(x, x) \equiv 0$ and $\nabla_y d(x, y)|_{y=x} \equiv 0$. We obtain (ii) using Eq. (19). By taking the limit for both sides of Eq. (20) as $\Delta t \to 0$, we arrive at Eq. (17), which completes the proof.
Remark 1. Although Theorem 3 assumes uniform time steps and the Euler ODE solver for technical simplicity, we believe an analogous result can be derived without those assumptions, as all time boundaries and ODE solvers should perform similarly as $N \to \infty$. We leave a more general version of Theorem 3 as future work.

Remark 2. Theorem 3 implies that consistency models can be trained by minimizing $\mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta; \phi)$. In particular, when $d(x, y) = \|x - y\|_2^2$, we have
$$\mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta; \phi) = \mathbb{E}\left[\lambda(t)\left\|\frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right\|_2^2\right]. \tag{21}$$
However, this continuous-time objective requires computing Jacobian-vector products as a subroutine to evaluate the loss function, which can be slow and laborious to implement in deep learning frameworks that do not support forward-mode automatic differentiation.
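As an illustration of that subroutine, here is a hedged sketch evaluating the inner term of Eq. (21) with a single Jacobian-vector product via `torch.func.jvp`; `f` and `score_model` are hypothetical stand-ins for $f_\theta$ and $s_\phi$, assumed to accept a broadcastable time tensor.

```python
import torch
from torch.func import jvp

def continuous_cd_l2(f, score_model, x, t, weight=1.0):
    """Sketch of Eq. (21): the bracketed term is the total derivative of
    f_theta along the empirical PF ODE, computed with one forward-mode
    Jacobian-vector product."""
    xt = x + t * torch.randn_like(x)             # x_t ~ N(x, t^2 I)
    v = -t * score_model(xt, t)                  # dx_t/dt along the empirical PF ODE
    # jvp with tangent (v, 1) yields (df/dx) v + df/dt = df/dt - t (df/dx) s_phi.
    _, total_deriv = jvp(f, (xt, t), (v, torch.ones_like(t)))
    return weight * (total_deriv ** 2).sum(dim=tuple(range(1, xt.ndim))).mean()
```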
Remark 3. If $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE of $s_\phi(x, t)$, then
$$f_\theta(x_t, t) \equiv x \iff \frac{\partial f_\theta(x_t, t)}{\partial x_t}\,\frac{\mathrm{d}x_t}{\mathrm{d}t} + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0.$$

For some metric functions, such as the $\ell_1$ norm, the Hessian $G(x)$ is zero so Theorem 3 is vacuous. Below we show that a non-vacuous statement holds for the $\ell_1$ norm with just a small modification of the proof for Theorem 3.
Theorem 4. Let $t_n = \frac{n-1}{N-1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N-1}$, where $n \in [\![1, N]\!]$. Assume $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t \in [\epsilon, T]}\|s_\phi(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and set $d(x, y) = \|x - y\|_1$. Then,
$$\lim_{\Delta t \to 0}\frac{\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta; \phi)}{\Delta t} = \mathcal{L}_{\mathrm{CD}, \ell_1}^\infty(\theta, \theta; \phi), \tag{22}$$
where
$$\mathcal{L}_{\mathrm{CD}, \ell_1}^\infty(\theta, \theta; \phi) := \mathbb{E}\left[\lambda(t)\left\|t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t) - \frac{\partial f_\theta(x_t, t)}{\partial t}\right\|_1\right],$$
where the expectation above is taken over $x \sim p_{\mathrm{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.
Proof. We have
$$\begin{aligned}
\frac{1}{\Delta t}\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta; \phi) &= \frac{1}{\Delta t}\mathbb{E}[\lambda(t_n)\|f_\theta(x_{t_{n+1}}, t_{n+1}) - f_\theta(\hat{x}_{t_n}^\phi, t_n)\|_1]\\
&\overset{(i)}{=} \frac{1}{\Delta t}\mathbb{E}\left[\lambda(t_n)\left\|t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\Delta t + O((\Delta t)^2)\right\|_1\right]\\
&= \mathbb{E}\left[\lambda(t_n)\left\|t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1}) - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} + O(\Delta t)\right\|_1\right],
\end{aligned}\tag{23}$$
where (i) is obtained by plugging Eq. (19) into the previous equation. Taking the limit for both sides of Eq. (23) as $\Delta t \to 0$ leads to Eq. (22), which completes the proof.
Remark 4. According to Theorem 4, consistency models can be trained by minimizing $\mathcal{L}_{\mathrm{CD}, \ell_1}^\infty(\theta, \theta; \phi)$. Moreover, the same reasoning in Remark 3 can be applied to show that $\mathcal{L}_{\mathrm{CD}, \ell_1}^\infty(\theta, \theta; \phi) = 0$ if and only if $f_\theta(x_t, t) = x$ for all $x_t \in \mathbb{R}^d$ and $t \in [\epsilon, T]$.
In the second case where $\theta^- = \operatorname{stopgrad}(\theta)$, we can derive a pseudo-objective whose gradient matches the gradient of $\mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi)$ in the limit of $N \to \infty$. Minimizing this pseudo-objective with gradient descent gives another way to train consistency models via distillation. This pseudo-objective is provided by the theorem below.
Theorem 5. Let $t_n = \frac{n-1}{N-1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N-1}$, where $n \in [\![1, N]\!]$. Assume $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\sup_{x, t \in [\epsilon, T]}\|s_\phi(x, t)\|_2 < \infty$, and $\sup_{x, t \in [\epsilon, T]}\|\nabla_\theta f_\theta(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and $\theta^- = \operatorname{stopgrad}(\theta)$. Then,
$$\lim_{\Delta t \to 0}\frac{\nabla_\theta \mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi)}{\Delta t} = \nabla_\theta \mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta^-; \phi), \tag{24}$$
where
$$\mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta^-; \phi) := \mathbb{E}\left[\lambda(t)\, f_\theta(x_t, t)^{\mathsf{T}} H(f_{\theta^-}(x_t, t))\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\,\frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{25}$$
Here the expectation above is taken over $x \sim p_{\mathrm{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.
Proof. The proof mostly follows that of Theorem 5. First, we leverage Taylor series expansion to obtain
$$\begin{aligned}
\frac{1}{\Delta t}\mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) &= \frac{1}{\Delta t}\mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))]\\
&\overset{(i)}{=} \frac{1}{2\Delta t}\mathbb{E}\big\{\lambda(t_n)[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\\
&\qquad\cdot [f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\big\} + \frac{1}{\Delta t}\mathbb{E}[O(|\Delta t|^3)]\\
&= \frac{1}{2\Delta t}\mathbb{E}\big\{\lambda(t_n)[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\\
&\qquad\cdot [f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\big\} + \mathbb{E}[O(|\Delta t|^2)],
\end{aligned}\tag{31}$$
where (i) is derived by first expanding $d(\cdot, f_{\theta^-}(x + t_n z, t_n))$ to second order, and then noting that $d(x, x) \equiv 0$ and $\nabla_y d(y, x)|_{y=x} \equiv 0$. Next, we compute the gradient of Eq. (31) with respect to $\theta$ and simplify the result to obtain
$$\begin{aligned}
\nabla_\theta\frac{1}{\Delta t}\mathcal{L}_{\mathrm{CT}}^N(\theta, \theta^-) &= \frac{1}{2\Delta t}\nabla_\theta\mathbb{E}\big\{\lambda(t_n)[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\\
&\qquad\cdot [f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\big\} + \mathbb{E}[O(|\Delta t|^2)]\\
&\overset{(i)}{=} \frac{1}{\Delta t}\mathbb{E}\big\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1}z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\\
&\qquad\cdot [f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\big\} + \mathbb{E}[O(|\Delta t|^2)]\\
&\overset{(ii)}{=} \frac{1}{\Delta t}\mathbb{E}\big\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1}z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\big[\Delta t\,\partial_1 f_{\theta^-}(x + t_n z, t_n)z\\
&\qquad + \Delta t\,\partial_2 f_{\theta^-}(x + t_n z, t_n)\big]\big\} + \mathbb{E}[O(|\Delta t|)]\\
&= \mathbb{E}\big\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1}z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\big[\partial_1 f_{\theta^-}(x + t_n z, t_n)z\\
&\qquad + \partial_2 f_{\theta^-}(x + t_n z, t_n)\big]\big\} + \mathbb{E}[O(|\Delta t|)]\\
&= \nabla_\theta\mathbb{E}\big\{\lambda(t_n)[f_\theta(x + t_{n+1}z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))\big[\partial_1 f_{\theta^-}(x + t_n z, t_n)z\\
&\qquad + \partial_2 f_{\theta^-}(x + t_n z, t_n)\big]\big\} + \mathbb{E}[O(|\Delta t|)]\\
&= \nabla_\theta\mathbb{E}\Big\{\lambda(t_n)[f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x_{t_n}, t_n))\Big[\partial_1 f_{\theta^-}(x_{t_n}, t_n)\,\frac{x_{t_n} - x}{t_n} + \partial_2 f_{\theta^-}(x_{t_n}, t_n)\Big]\Big\} + \mathbb{E}[O(|\Delta t|)].
\end{aligned}\tag{32}$$
Here (i) results from the chain rule, and (ii) follows from Taylor expansion. Taking the limit for both sides of Eq. (32) as $\Delta t \to 0$ yields Eq. (29), which completes the proof.
Remark 10. Under suitable regularity conditions, we can combine the insights of Theorems 2 and 6 to derive
$$\lim_{\Delta t \to 0}\frac{\nabla_\theta \mathcal{L}_{\mathrm{CD}}^N(\theta, \theta^-; \phi)}{\Delta t} = \nabla_\theta \mathcal{L}_{\mathrm{CT}}^\infty(\theta, \theta^-).$$
As in Theorems 2 and 6, here $\phi$ represents diffusion model parameters that satisfy $s_\phi(x, t) \equiv \nabla\log p_t(x)$, and $\theta^- = \operatorname{stopgrad}(\theta)$.
Remark 11. Similar to $\mathcal{L}_{\mathrm{CD}}^\infty(\theta, \theta^-; \phi)$ in Theorem 5, $\mathcal{L}_{\mathrm{CT}}^\infty(\theta, \theta^-)$ is a pseudo-objective; one cannot track training by monitoring the value of $\mathcal{L}_{\mathrm{CT}}^\infty(\theta, \theta^-)$, but can still apply gradient descent on this loss function to train a consistency model $f_\theta(x, t)$ directly from data. Moreover, the same observation in Remark 7 holds true: $\mathcal{L}_{\mathrm{CT}}^\infty(\theta, \theta^-) = 0$ and $\nabla_\theta \mathcal{L}_{\mathrm{CT}}^\infty(\theta, \theta^-) = 0$ if $f_\theta(x, t)$ matches the ground truth consistency function for the PF ODE.
We did not investigate using the LPIPS metric in Theorem 3 because minimizing the resulting objective would require back-propagating through second order derivatives of the VGG network used in LPIPS, which is computationally expensive and prone to numerical instability. As revealed by Fig. 7a, the stopgrad version of continuous-time distillation (Theorem 5) works better than the non-stopgrad version (Theorem 3) for both the LPIPS and $\ell_2$ metrics, and the LPIPS metric works the best for all distillation approaches. Additionally, discrete-time consistency distillation outperforms continuous-time consistency distillation, possibly due to the larger variance in continuous-time objectives, and the fact that one can use effective higher-order ODE solvers in discrete-time objectives.
For consistency training (CT), we find it important to initialize consistency models from a pre-trained EDM model in order
to stabilize training when using continuous-time objectives. We hypothesize that this is caused by the large variance in our
continuous-time loss functions. For fair comparison, we thus initialize all consistency models from the same pre-trained
EDM model on CIFAR-10 for both discrete-time and continuous-time CT, even though the former works well with random
initialization. We leave variance reduction techniques for continuous-time CT to future research.
We empirically compare the following objectives:

• CT (LPIPS): Consistency training $\mathcal{L}_{\mathrm{CT}}^N$ with $N = 120$ and the LPIPS metric. We set the learning rate to 4e-4, and the EMA decay rate for the target network to 0.99. We do not use the schedule functions for $N$ and $\mu$ here because they cause slower learning when the consistency model is initialized from a pre-trained EDM model.

• CT$^\infty$ ($\ell_2$): Consistency training $\mathcal{L}_{\mathrm{CT}}^\infty$ with the $\ell_2$ metric. We set the learning rate to 5e-6.
As shown in Fig. 7b, the LPIPS metric leads to improved performance for continuous-time CT. We also find that continuous-time CT outperforms discrete-time CT with the same LPIPS metric. This is likely due to the bias in discrete-time CT, as $\Delta t > 0$ in Theorem 2 for discrete-time objectives, whereas continuous-time CT has no bias since it implicitly drives $\Delta t$ to 0.
Parameterization for Consistency Models We use the same architectures for consistency models as those used for EDMs. The only difference is that we slightly modify the skip connections in EDM to ensure the boundary condition holds for consistency models. Recall that in Section 3 we propose to parameterize a consistency model in the following form:
$$f_\theta(x, t) = c_{\mathrm{skip}}(t)\, x + c_{\mathrm{out}}(t)\, F_\theta(x, t),$$
where $c_{\mathrm{skip}}(\epsilon) = 1$ and $c_{\mathrm{out}}(\epsilon) = 0$, so that $f_\theta(x, \epsilon) = x$.
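A minimal sketch of this parameterization follows. The specific coefficient formulas are assumptions modeled on EDM's skip connections, shifted so that the boundary condition above holds; `F` is a hypothetical stand-in for the free-form network $F_\theta$.

```python
def consistency_model(F, x, t, sigma_data=0.5, eps=0.002):
    """Sketch of f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t).
    The coefficients below are assumptions (EDM-style, shifted by eps) chosen
    so that c_skip(eps) = 1 and c_out(eps) = 0, i.e. f_theta(x, eps) = x."""
    c_skip = sigma_data**2 / ((t - eps) ** 2 + sigma_data**2)
    c_out = sigma_data * (t - eps) / (sigma_data**2 + t**2) ** 0.5
    return c_skip * x + c_out * F(x, t)
```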
Schedule Functions for Consistency Training As discussed in Section 5, consistency training requires specifying schedule functions $N(\cdot)$ and $\mu(\cdot)$ for best performance. Throughout our experiments, we use schedule functions that take the form below:
$$N(k) = \left\lceil\sqrt{\frac{k}{K}\big((s_1 + 1)^2 - s_0^2\big) + s_0^2} - 1\right\rceil + 1,$$
$$\mu(k) = \exp\left(\frac{s_0 \log \mu_0}{N(k)}\right),$$
where $K$ denotes the total number of training iterations, $s_0$ denotes the initial discretization steps, $s_1 > s_0$ denotes the target discretization steps at the end of training, and $\mu_0 > 0$ denotes the EMA decay rate at the beginning of model training.
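In code, these schedules might look like the following sketch; the default values of `s0`, `s1`, and `mu0` are illustrative placeholders rather than recommendations (the paper tunes them per dataset, see Table 3).

```python
import math

def N_schedule(k, K, s0=2, s1=150):
    """Discretization schedule N(k) from the display above."""
    return math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0**2) + s0**2) - 1) + 1

def mu_schedule(k, K, mu0=0.9, s0=2, s1=150):
    """EMA decay schedule mu(k) = exp(s0 * log(mu0) / N(k))."""
    return math.exp(s0 * math.log(mu0) / N_schedule(k, K, s0, s1))
```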
Training Details In both consistency distillation and progressive distillation, we distill EDMs (Karras et al., 2022). We trained these EDMs ourselves according to the specifications given in Karras et al. (2022). The original EDM paper did not provide hyperparameters for the LSUN Bedroom 256 × 256 and Cat 256 × 256 datasets, so we mostly used the same hyperparameters as those for the ImageNet 64 × 64 dataset. The difference is that we trained for 600k and 300k iterations for the LSUN Bedroom and Cat datasets respectively, and reduced the batch size from 4096 to 2048.

We used the same EMA decay rate for LSUN 256 × 256 datasets as for the ImageNet 64 × 64 dataset. For progressive distillation, we used the same training settings as those described in Salimans & Ho (2022) for CIFAR-10 and ImageNet 64 × 64. Although the original paper did not test on LSUN 256 × 256 datasets, we used the same settings as for ImageNet 64 × 64 and found them to work well.

In all distillation experiments, we initialized the consistency model with pre-trained EDM weights. For consistency training, we initialized the model randomly, just as we did for training the EDMs. We trained all consistency models with the Rectified Adam optimizer (Liu et al., 2019), with no learning rate decay or warm-up, and no weight decay. Following Karras et al. (2022), we also applied EMA to the weights of the online consistency models in both consistency distillation and consistency training. For LSUN 256 × 256 datasets, we chose the EMA decay rate to be the same as that for ImageNet 64 × 64, except for consistency distillation on LSUN Bedroom 256 × 256, where we found that using zero EMA worked better.

When using the LPIPS metric on CIFAR-10 and ImageNet 64 × 64, we rescale images to resolution 224 × 224 with bilinear upsampling before feeding them to the LPIPS network. For LSUN 256 × 256, we evaluated LPIPS without rescaling inputs. In addition, we performed horizontal flips for data augmentation for all models and on all datasets. We trained all models on a cluster of Nvidia A100 GPUs. Additional hyperparameters for consistency training and distillation are listed in Table 3.
and stroke-guided image generation (SDEdit, Meng et al. (2021), Fig. 13). The consistency model used here is trained via consistency distillation on LSUN Bedroom 256 × 256.

All these image editing tasks, except for image interpolation and denoising, can be performed via a small modification to the multistep sampling algorithm in Algorithm 1. The resulting pseudocode is provided in Algorithm 4. Here $y$ is a reference image that guides sample generation, $\Omega$ is a binary mask, $\odot$ computes element-wise products, and $A$ is an invertible linear transformation that maps images into a latent space where the conditional information in $y$ is infused into the iterative generation procedure by masking with $\Omega$. Unless otherwise stated, we choose
$$t_i = \left(T^{1/\rho} + \frac{i-1}{N-1}\big(\epsilon^{1/\rho} - T^{1/\rho}\big)\right)^{\rho}$$
in our experiments, where $N = 40$ for LSUN Bedroom 256 × 256.
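A small sketch of these time steps; the default values of `T`, `eps`, and `rho` are assumptions following Karras et al. (2022).

```python
def edit_timesteps(N=40, T=80.0, eps=0.002, rho=7.0):
    """Time steps t_i from the display above: t_1 = T decreasing to t_N = eps."""
    return [
        (T ** (1 / rho) + (i - 1) / (N - 1) * (eps ** (1 / rho) - T ** (1 / rho))) ** rho
        for i in range(1, N + 1)
    ]
```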
Below we describe how to perform each task using Algorithm 4.
Inpainting When using Algorithm 4 for inpainting, we let y be an image where missing pixels are masked out, Ω be a
binary mask where 1 indicates the missing pixels, and A be the identity transformation.
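Since Algorithm 4 itself is not reproduced here, the following is a hedged sketch of its overall structure, specialized by the choices above (inpainting shown; the other tasks swap in their own $A$, $A^{-1}$, and $\Omega$). The initialization and re-noising details are our assumptions about the algorithm's structure, not a verbatim transcription.

```python
import torch

def zero_shot_edit(f, y, omega, A, A_inv, ts, eps=0.002):
    """Hedged sketch of the modified multistep sampling loop (Algorithm 4).
    f: consistency model f_theta; y: reference image; omega: binary mask
    (1 marks content to generate, e.g. missing pixels for inpainting);
    A / A_inv: the invertible linear map and its inverse (both the identity
    for inpainting); ts: decreasing time steps t_1 > ... > t_N."""
    x = y + ts[0] * torch.randn_like(y)    # start from the reference noised to t_1
    for i, t in enumerate(ts):
        x = f(x, t)                        # one-step denoise at noise level t
        # Iterative replacement: keep generated content where omega = 1,
        # re-impose the conditional information from y elsewhere.
        x = A_inv(omega * A(x) + (1 - omega) * A(y))
        if i + 1 < len(ts):                # perturb to the next noise level
            x = x + (ts[i + 1] ** 2 - eps ** 2) ** 0.5 * torch.randn_like(x)
    return x
```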
Colorization The algorithm for image colorization is similar, as colorization becomes a special case of inpainting once we transform data into a decoupled space. Specifically, let $y \in \mathbb{R}^{h\times w\times 3}$ be a gray-scale image that we aim to colorize, where all channels of $y$ are assumed to be the same, i.e., $y[:, :, 0] = y[:, :, 1] = y[:, :, 2]$ in NumPy notation. In our experiments, each channel of this gray-scale image is obtained from a colorful image by averaging the RGB channels with the weights $(0.2989, 0.5870, 0.1140)$.

Let $Q \in \mathbb{R}^{3\times 3}$ be an orthogonal matrix whose first column is proportional to the vector $(0.2989, 0.5870, 0.1140)$. This orthogonal matrix can be obtained easily via QR decomposition, and we use the following in our experiments:
$$Q = \begin{pmatrix} 0.4471 & -0.8204 & 0.3563\\ 0.8780 & 0.4785 & 0\\ 0.1705 & -0.3129 & -0.9343 \end{pmatrix}.$$
The transformation $A$ maps $x$ into the decoupled space via $y[i, j, k] = \sum_{l=0}^{2} x[i, j, l]\, Q[l, k]$. Because $Q$ is orthogonal, the inversion $A^{-1}: y \in \mathbb{R}^{h\times w\times 3} \mapsto x \in \mathbb{R}^{h\times w\times 3}$ is easy to compute, where
$$x[i, j, k] = \sum_{l=0}^{2} y[i, j, l]\, Q[k, l].$$
With $A$ and $\Omega$ defined as above, we can now use Algorithm 4 for image colorization.
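A sketch of how such a $Q$ and the maps $A$, $A^{-1}$ can be built in NumPy; the random completion of the basis is an illustrative assumption, and QR may flip the sign of the first column, which the sketch corrects.

```python
import numpy as np

def colorization_transforms():
    """Build an orthogonal Q whose first column is proportional to the
    luminance weights, then define A and its inverse."""
    w = np.array([[0.2989], [0.5870], [0.1140]])
    Q, _ = np.linalg.qr(np.concatenate([w, np.random.randn(3, 2)], axis=1))
    Q[:, 0] *= np.sign(Q[0, 0] * w[0, 0])    # make the first column positively proportional to w
    A = lambda x: x @ Q                      # y[i, j, k] = sum_l x[i, j, l] Q[l, k]
    A_inv = lambda y: y @ Q.T                # inverse uses the orthogonality of Q
    return Q, A, A_inv
```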
Super-resolution With a similar strategy, we employ Algorithm 4 for image super-resolution. For simplicity, we assume that the down-sampled image is obtained by averaging non-overlapping patches of size $p \times p$. Suppose the shape of full-resolution images is $h \times w \times 3$. Let $y \in \mathbb{R}^{h\times w\times 3}$ denote a low-resolution image naively up-sampled to full resolution, where pixels in each non-overlapping patch share the same value. Additionally, let $\Omega \in \{0, 1\}^{h/p\times w/p\times p^2\times 3}$ be a binary mask such that
$$\Omega[i, j, k, l] = \begin{cases} 1, & k \geq 1\\ 0, & k = 0 \end{cases}.$$
Similar to image colorization, super-resolution requires an orthogonal matrix $Q \in \mathbb{R}^{p^2\times p^2}$ whose first column is $(1/p, 1/p, \cdots, 1/p)$. This orthogonal matrix can be obtained with QR decomposition. To perform super-resolution, we define the linear transformation $A: x \in \mathbb{R}^{h\times w\times 3} \mapsto y \in \mathbb{R}^{h/p\times w/p\times p^2\times 3}$, where
$$y[i, j, k, l] = \sum_{m=0}^{p^2-1} x\big[i\times p + \lfloor m/p\rfloor,\ j\times p + m \bmod p,\ l\big]\, Q[m, k].$$
The inverse transformation $A^{-1}: y \in \mathbb{R}^{h/p\times w/p\times p^2\times 3} \mapsto x \in \mathbb{R}^{h\times w\times 3}$ is easy to derive, with
$$x\big[i\times p + \lfloor m/p\rfloor,\ j\times p + m \bmod p,\ l\big] = \sum_{k=0}^{p^2-1} y[i, j, k, l]\, Q[m, k].$$
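The corresponding patch transforms can be sketched as follows; again, the random completion of the basis is an illustrative assumption.

```python
import numpy as np

def sr_transforms(p):
    """Patch transforms for super-resolution. A maps (h, w, 3) to
    (h/p, w/p, p*p, 3); channel k = 0 of the output carries the (scaled)
    patch mean, i.e. the low-resolution content."""
    first = np.full((p * p, 1), 1.0 / p)
    Q, _ = np.linalg.qr(np.concatenate([first, np.random.randn(p * p, p * p - 1)], axis=1))
    Q[:, 0] *= np.sign(Q[0, 0])              # first column = (1/p, ..., 1/p)

    def A(x):
        h, w, c = x.shape
        patches = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
        return np.einsum('ijmc,mk->ijkc', patches.reshape(h // p, w // p, p * p, c), Q)

    def A_inv(y):
        patches = np.einsum('ijkc,mk->ijmc', y, Q)
        hp, wp, _, c = y.shape
        return patches.reshape(hp, wp, p, p, c).transpose(0, 2, 1, 3, 4).reshape(hp * p, wp * p, c)

    return Q, A, A_inv
```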
Stroke-guided image generation We can also use Algorithm 4 for stroke-guided image generation as introduced in SDEdit (Meng et al., 2021). Specifically, we let $y \in \mathbb{R}^{h\times w\times 3}$ be a stroke painting. We set $A = I$, and define $\Omega \in \mathbb{R}^{h\times w\times 3}$ as a matrix of ones. In our experiments, we set $t_1 = 5.38$ and $t_2 = 2.24$, with $N = 2$.
Denoising It is possible to denoise images perturbed with various scales of Gaussian noise using a single consistency model. Suppose the input image $x$ is perturbed with $\mathcal{N}(0; \sigma^2 I)$. As long as $\sigma \in [\epsilon, T]$, we can evaluate $f_\theta(x, \sigma)$ to produce the denoised image.
Interpolation We can interpolate between two images generated by consistency models. Suppose the first sample $x_1$ is produced by noise vector $z_1$, and the second sample $x_2$ is produced by noise vector $z_2$. In other words, $x_1 = f_\theta(z_1, T)$ and $x_2 = f_\theta(z_2, T)$. To interpolate between $x_1$ and $x_2$, we first use spherical linear interpolation to get
$$z = \frac{\sin[(1 - \alpha)\psi]}{\sin\psi}\, z_1 + \frac{\sin(\alpha\psi)}{\sin\psi}\, z_2,$$
where $\alpha \in [0, 1]$ and $\psi = \arccos\big(\frac{z_1^{\mathsf{T}} z_2}{\|z_1\|_2 \|z_2\|_2}\big)$, then evaluate $f_\theta(z, T)$ to produce the interpolated image.
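A sketch of the spherical linear interpolation step:

```python
import torch

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two noise vectors; the
    interpolated sample is then f_theta(slerp(z1, z2, alpha), T)."""
    psi = torch.arccos((z1 * z2).sum() / (z1.norm() * z2.norm()))
    return (torch.sin((1 - alpha) * psi) * z1 + torch.sin(alpha * psi) * z2) / torch.sin(psi)
```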
Figure 8: Gray-scale images (left), colorized images by a consistency model (middle), and ground truth (right).
Figure 9: Downsampled images of resolution 32 × 32 (left), full resolution (256 × 256) images generated by a consistency model (middle), and ground truth images of resolution 256 × 256 (right).
Figure 10: Masked images (left), imputed images by a consistency model (middle), and ground truth (right).
Figure 11: Interpolating between leftmost and rightmost images with spherical linear interpolation. All samples are generated by a consistency model trained on LSUN Bedroom 256 × 256.
Figure 12: Single-step denoising with a consistency model. The leftmost images are ground truth. For every two rows, the
top row shows noisy images with different noise levels, while the bottom row gives denoised images.
Figure 13: SDEdit with a consistency model. The leftmost images are stroke painting inputs. Images on the right side are
the results of stroke-guided image generation (SDEdit).
Figure 14: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise.
Figure 15: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise.
Figure 16: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise.
Figure 17: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise.
Figure 18: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise.
Figure 19: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise.
Figure 20: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise.
Figure 21: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise.