Abstract
Diffusion models have significantly advanced the
arXiv:2303.01469v2 [cs.LG] 31 May 2023
Consistency Models
Algorithm 1 Multistep Consistency Sampling
  Input: Consistency model $f_\theta(\cdot, \cdot)$, sequence of time points $\tau_1 > \tau_2 > \cdots > \tau_{N-1}$, initial noise $\hat{x}_T$
  $x \leftarrow f_\theta(\hat{x}_T, T)$
  for $n = 1$ to $N - 1$ do
    Sample $z \sim \mathcal{N}(0, I)$
    $\hat{x}_{\tau_n} \leftarrow x + \sqrt{\tau_n^2 - \epsilon^2}\, z$
    $x \leftarrow f_\theta(\hat{x}_{\tau_n}, \tau_n)$
  end for
  Output: $x$

We find the time points $\{\tau_1, \tau_2, \cdots, \tau_{N-1}\}$ in Algorithm 1 with a greedy algorithm, where the time points are pinpointed one at a time using ternary search to optimize the FID of samples obtained from Algorithm 1. This assumes that given prior time points, the FID is a unimodal function of the next time point. We find this assumption to hold empirically in our experiments, and leave the exploration of better strategies as future work.

Zero-Shot Data Editing. Similar to diffusion models, consistency models enable various data editing and manipulation applications in zero shot; they do not require explicit training to perform these tasks. For example, consistency models define a one-to-one mapping from a Gaussian noise vector to a data sample. Similar to latent variable models like GANs, VAEs, and normalizing flows, consistency models can easily interpolate between samples by traversing the latent space (Fig. 11). As consistency models are trained to recover $x_\epsilon$ from any noisy input $x_t$ where $t \in [\epsilon, T]$, they can perform denoising for various noise levels (Fig. 12). Moreover, the multistep generation procedure in Algorithm 1 is useful for solving certain inverse problems in zero shot by using an iterative replacement procedure similar to that of diffusion models (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2022b). This enables many applications in the context of image editing, including inpainting (Fig. 10), colorization (Fig. 8), super-resolution (Fig. 6b) and stroke-guided image editing (Fig. 13) as in SDEdit (Meng et al., 2021). In Section 6.3, we empirically demonstrate the power of consistency models on many zero-shot image editing tasks.

4. Training Consistency Models via Distillation

We present our first method for training consistency models based on distilling a pre-trained score model $s_\phi(x, t)$. Our discussion revolves around the empirical PF ODE in Eq. (3), obtained by plugging the score model $s_\phi(x, t)$ into the PF ODE. Consider discretizing the time horizon $[\epsilon, T]$ into $N - 1$ sub-intervals, with boundaries $t_1 = \epsilon < t_2 < \cdots < t_N = T$. In practice, we follow Karras et al. (2022) to determine the boundaries with the formula $t_i = \left(\epsilon^{1/\rho} + \frac{i-1}{N-1}\left(T^{1/\rho} - \epsilon^{1/\rho}\right)\right)^\rho$, where $\rho = 7$. When $N$ is sufficiently large, we can obtain an accurate estimate of $x_{t_n}$ from $x_{t_{n+1}}$ by running one discretization step of a numerical ODE solver. This estimate, which we denote as $\hat{x}^\phi_{t_n}$, is defined by

$$\hat{x}^\phi_{t_n} := x_{t_{n+1}} + (t_n - t_{n+1})\, \Phi(x_{t_{n+1}}, t_{n+1}; \phi), \tag{6}$$

where $\Phi(\cdots; \phi)$ represents the update function of a one-step ODE solver applied to the empirical PF ODE. For example, when using the Euler solver, we have $\Phi(x, t; \phi) = -t\, s_\phi(x, t)$, which corresponds to the following update rule

$$\hat{x}^\phi_{t_n} = x_{t_{n+1}} - (t_n - t_{n+1})\, t_{n+1}\, s_\phi(x_{t_{n+1}}, t_{n+1}).$$

For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.

Due to the connection between the PF ODE in Eq. (2) and the SDE in Eq. (1) (see Section 2), one can sample along the distribution of ODE trajectories by first sampling $x \sim p_{\mathrm{data}}$, then adding Gaussian noise to $x$. Specifically, given a data point $x$, we can generate a pair of adjacent data points $(\hat{x}^\phi_{t_n}, x_{t_{n+1}})$ on the PF ODE trajectory efficiently by sampling $x$ from the dataset, followed by sampling $x_{t_{n+1}}$ from the transition density of the SDE $\mathcal{N}(x, t_{n+1}^2 I)$, and then computing $\hat{x}^\phi_{t_n}$ using one discretization step of the numerical ODE solver according to Eq. (6). Afterwards, we train the consistency model by minimizing its output differences on the pair $(\hat{x}^\phi_{t_n}, x_{t_{n+1}})$. This motivates our following consistency distillation loss for training consistency models.

Definition 1. The consistency distillation loss is defined as

$$\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) := \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)\right)\right], \tag{7}$$

where the expectation is taken with respect to $x \sim p_{\mathrm{data}}$, $n \sim \mathcal{U}[\![1, N-1]\!]$, and $x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I)$. Here $\mathcal{U}[\![1, N-1]\!]$ denotes the uniform distribution over $\{1, 2, \cdots, N-1\}$, $\lambda(\cdot) \in \mathbb{R}^+$ is a positive weighting function, $\hat{x}^\phi_{t_n}$ is given by Eq. (6), $\theta^-$ denotes a running average of the past values of $\theta$ during the course of optimization, and $d(\cdot, \cdot)$ is a metric function that satisfies $\forall x, y: d(x, y) \ge 0$ and $d(x, y) = 0$ if and only if $x = y$.

Unless otherwise stated, we adopt the notations in Definition 1 throughout this paper, and use $\mathbb{E}[\cdot]$ to denote the expectation over all random variables. In our experiments, we consider the squared $\ell_2$ distance $d(x, y) = \|x - y\|_2^2$, $\ell_1$ distance $d(x, y) = \|x - y\|_1$, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)). We find $\lambda(t_n) \equiv 1$ performs well across all tasks and datasets. In practice, we minimize the objective by stochastic gradient descent on the model parameters $\theta$, while updating $\theta^-$ with exponential moving average (EMA). That is, given a decay
rate $0 \le \mu < 1$, we perform the following update after each optimization step:

$$\theta^- \leftarrow \mathrm{stopgrad}(\mu \theta^- + (1 - \mu)\theta). \tag{8}$$

The overall training procedure is summarized in Algorithm 2. In alignment with the convention in deep reinforcement learning (Mnih et al., 2013; 2015; Lillicrap et al., 2015) and momentum based contrastive learning (Grill et al., 2020; He et al., 2020), we refer to $f_{\theta^-}$ as the "target network", and $f_\theta$ as the "online network". We find that compared to simply setting $\theta^- = \theta$, the EMA update and "stopgrad" operator in Eq. (8) can greatly stabilize the training process and improve the final performance of the consistency model.

Below we provide a theoretical justification for consistency distillation based on asymptotic analysis.

Theorem 1. Let $\Delta t := \max_{n \in [\![1, N-1]\!]} \{|t_{n+1} - t_n|\}$, and $f(\cdot, \cdot; \phi)$ be the consistency function of the empirical PF ODE in Eq. (3). Assume $f_\theta$ satisfies the Lipschitz condition: there exists $L > 0$ such that for all $t \in [\epsilon, T]$, $x$, and $y$, we have $\|f_\theta(x, t) - f_\theta(y, t)\|_2 \le L \|x - y\|_2$. Assume further that for all $n \in [\![1, N-1]\!]$, the ODE solver called at $t_{n+1}$ has local error uniformly bounded by $O((t_{n+1} - t_n)^{p+1})$ with $p \ge 1$. Then, if $\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = 0$, we have

$$\sup_{n, x} \|f_\theta(x, t_n) - f(x, t_n; \phi)\|_2 = O((\Delta t)^p).$$

Proof. The proof is based on induction and parallels the classic proof of global error bounds for numerical ODE solvers (Süli & Mayers, 2003). We provide the full proof in Appendix A.2.

Theorem 1 implies that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate, as long as the step size of the ODE solver is sufficiently small. Importantly, our boundary condition $f_\theta(x, \epsilon) \equiv x$ precludes the trivial solution $f_\theta(x, t) \equiv 0$ from arising in consistency model training.

The consistency distillation loss $\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi)$ can be extended to hold for infinitely many time steps ($N \to \infty$) if $\theta^- = \theta$ or $\theta^- = \mathrm{stopgrad}(\theta)$. The resulting continuous-time loss functions do not require specifying $N$ or the time steps $\{t_1, t_2, \cdots, t_N\}$. Nonetheless, they involve Jacobian-vector products and require forward-mode automatic differentiation for efficient implementation, which may not be well-supported in some deep learning frameworks. We provide these continuous-time distillation loss functions in Theorems 3 to 5, and relegate details to Appendix B.1.

5. Training Consistency Models in Isolation

Consistency models can be trained without relying on any pre-trained diffusion models. This differs from existing diffusion distillation techniques, making consistency models a new independent family of generative models.

Recall that in consistency distillation, we rely on a pre-trained score model $s_\phi(x, t)$ to approximate the ground truth score function $\nabla \log p_t(x)$. It turns out that we can avoid this pre-trained score model altogether by leveraging the following unbiased estimator (Lemma 1 in Appendix A):

$$\nabla \log p_t(x_t) = -\mathbb{E}\left[\frac{x_t - x}{t^2} \,\middle|\, x_t\right],$$
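The identity above can be sanity-checked numerically. The sketch below is not from the paper: it assumes a toy one-dimensional data distribution that is uniform over three hand-picked points (the array `data`), so that $p_t$ is a Gaussian mixture whose log-density is available in closed form. It compares a central finite-difference gradient of $\log p_t$ with the posterior average $-\mathbb{E}[(x_t - x)/t^2 \mid x_t]$. The names `log_pt` and `score_via_estimator` are illustrative.

```python
import numpy as np

def log_pt(xt, data, t):
    """Log-density of x_t = x + t*z, with x uniform over `data` and z ~ N(0, 1)."""
    sq = -0.5 * ((xt - data) / t) ** 2
    m = sq.max()  # log-sum-exp shift for numerical stability
    return m + np.log(np.mean(np.exp(sq - m))) - np.log(t * np.sqrt(2.0 * np.pi))

def score_via_estimator(xt, data, t):
    """-E[(x_t - x) / t^2 | x_t], using exact posterior weights over the data points."""
    logw = -0.5 * ((xt - data) / t) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()  # p(x_i | x_t) is proportional to N(x_t; x_i, t^2)
    return -np.sum(w * (xt - data)) / t**2

# Toy setup (assumed values, chosen only for illustration).
data = np.array([-1.0, 0.5, 2.0])
xt, t, h = 0.3, 0.8, 1e-5

finite_diff = (log_pt(xt + h, data, t) - log_pt(xt - h, data, t)) / (2.0 * h)
estimator = score_via_estimator(xt, data, t)
print(finite_diff, estimator)
```

The two printed numbers should agree up to finite-difference error. In consistency training the conditional expectation is never computed explicitly; the point of the check is only that averaging $-(x_t - x)/t^2$ over data points consistent with $x_t$ recovers the score.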
$$\mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x + t_{n+1} z, t_{n+1}),\, f_{\theta^-}(x + t_n z, t_n)\right)\right], \tag{10}$$

where $z \sim \mathcal{N}(0, I)$. Moreover, $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) \ge O(\Delta t)$ if $\inf_N \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) > 0$.

Proof. The proof is based on Taylor series expansion and properties of score functions (Lemma 1). A complete proof is provided in Appendix A.3.

We refer to Eq. (10) as the consistency training (CT) loss. Crucially, $\mathcal{L}(\theta, \theta^-)$ only depends on the online network $f_\theta$, and the target network $f_{\theta^-}$, while being completely agnostic to diffusion model parameters $\phi$. The loss function $\mathcal{L}(\theta, \theta^-) \ge O(\Delta t)$ decreases at a slower rate than the remainder $o(\Delta t)$ and thus will dominate the loss in Eq. (9) as $N \to \infty$ and $\Delta t \to 0$.

For improved practical performance, we propose to progressively increase $N$ during training according to a schedule function $N(\cdot)$. The intuition (cf. Fig. 3d) is that the consistency training loss has less "variance" but more "bias" with respect to the underlying consistency distillation loss (i.e., the left-hand side of Eq. (9)) when $N$ is small (i.e., $\Delta t$ is large), which facilitates faster convergence at the beginning of training. On the contrary, it has more "variance" but less "bias" when $N$ is large (i.e., $\Delta t$ is small), which is desirable when closer to the end of training. For best performance, we also find that $\mu$ should change along with $N$, according to a schedule function $\mu(\cdot)$. The full algorithm of consistency training is provided in Algorithm 3, and the schedule functions used in our experiments are given in Appendix C.

Similar to consistency distillation, the consistency training loss $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-)$ can be extended to hold in continuous time (i.e., $N \to \infty$) if $\theta^- = \mathrm{stopgrad}(\theta)$, as shown in Theorem 6. This continuous-time loss function does not require schedule functions for $N$ or $\mu$, but requires forward-mode automatic differentiation for efficient implementation. Unlike the discrete-time CT loss, there is no undesirable "bias" associated with the continuous-time objective, as we effectively take $\Delta t \to 0$ in Theorem 2. We relegate more details to Appendix B.2.

6.1. Training Consistency Models

We perform a series of experiments on CIFAR-10 to understand the effect of various hyperparameters on the performance of consistency models trained by consistency distillation (CD) and consistency training (CT). We first focus on the effect of the metric function $d(\cdot, \cdot)$, the ODE solver, and the number of discretization steps $N$ in CD, then investigate the effect of the schedule functions $N(\cdot)$ and $\mu(\cdot)$ in CT.

To set up our experiments for CD, we consider the squared $\ell_2$ distance $d(x, y) = \|x - y\|_2^2$, $\ell_1$ distance $d(x, y) = \|x - y\|_1$, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)) as the metric function. For the ODE solver, we compare Euler's forward method and Heun's second order method as detailed in Karras et al. (2022). For the number of discretization steps $N$, we compare $N \in \{9, 12, 18, 36, 50, 60, 80, 120\}$. All consistency models trained by CD in our experiments are initialized with the corresponding pre-trained diffusion models, whereas models trained by CT are randomly initialized.

(a) Metric functions in CD. (b) Solvers and $N$ in CD. (c) $N$ with Heun solver in CD. (d) Adaptive $N$ and $\mu$ in CT.
Figure 3: Various factors that affect consistency distillation (CD) and consistency training (CT) on CIFAR-10. The best configuration for CD is LPIPS, Heun ODE solver, and $N = 18$. Our adaptive schedule functions for $N$ and $\mu$ make CT converge significantly faster than fixing them to be constants during the course of optimization.

As visualized in Fig. 3a, the optimal metric for CD is LPIPS, which outperforms both $\ell_1$ and $\ell_2$ by a large margin over all training iterations. This is expected as the outputs of consistency models are images on CIFAR-10, and LPIPS is specifically designed for measuring the similarity between natural images. Next, we investigate which ODE solver and which number of discretization steps $N$ work best for CD. As shown in Figs. 3b and 3c, the Heun ODE solver and $N = 18$ are the best choices. Both are in line with the recommendation of Karras et al. (2022) despite the fact that we are training consistency models, not diffusion models. Moreover, Fig. 3b shows that with the same $N$, Heun's second order solver uniformly outperforms Euler's first order solver. This corroborates Theorem 1, which states that the optimal consistency models trained by higher order ODE solvers have smaller estimation errors with the same $N$. The results of Fig. 3c also indicate that once $N$ is sufficiently large, the performance of CD becomes insensitive to $N$. Given these insights, we hereafter use LPIPS and the Heun ODE solver for CD unless otherwise stated. For $N$ in CD, we follow the suggestions in Karras et al. (2022) on CIFAR-10 and ImageNet 64 × 64. We tune $N$ separately on other datasets (details in Appendix C).

Due to the strong connection between CD and CT, we adopt LPIPS for our CT experiments throughout this paper. Unlike CD, there is no need for using Heun's second order solver in CT as the loss function does not rely on any particular numerical ODE solver. As demonstrated in Fig. 3d, the convergence of CT is highly sensitive to $N$: smaller $N$ leads to faster convergence but worse samples, whereas larger $N$ leads to slower convergence but better samples upon convergence. This matches our analysis in Section 5, and motivates our practical choice of progressively growing $N$ and $\mu$ for CT to balance the trade-off between convergence speed and sample quality. As shown in Fig. 3d, adaptive schedules of $N$ and $\mu$ significantly improve the convergence speed and sample quality of CT. In our experiments, we tune the schedules $N(\cdot)$ and $\mu(\cdot)$ separately for images of different resolutions, with more details in Appendix C.

6.2. Few-Step Image Generation

(a) CIFAR-10 (b) ImageNet 64 × 64 (c) Bedroom 256 × 256 (d) Cat 256 × 256
Figure 4: Multistep image generation with consistency distillation (CD). CD outperforms progressive distillation (PD) across all datasets and sampling steps. The only exception is single-step generation on Bedroom 256 × 256.

Distillation. In current literature, the most directly comparable approach to our consistency distillation (CD) is progressive distillation (PD, Salimans & Ho (2022)); both are thus far the only distillation approaches that do not construct synthetic data before distillation. In stark contrast, other distillation techniques, such as knowledge distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022), have to prepare a large synthetic dataset by generating numerous samples from the diffusion model with expensive numerical ODE/SDE solvers. We perform a comprehensive comparison of PD and CD on CIFAR-10, ImageNet 64 × 64, and LSUN 256 × 256, with all results reported in Fig. 4. All methods distill from an EDM (Karras et al., 2022) model that we pre-trained in-house. We note that across all sampling iterations, using the LPIPS metric uniformly improves PD compared to the squared $\ell_2$ distance in the original paper of Salimans & Ho (2022). Both PD and CD improve as we take more sampling steps. We find that CD uniformly outperforms PD across all datasets, sampling steps, and metric functions considered, except for single-step generation on Bedroom 256 × 256, where CD with $\ell_2$ slightly underperforms PD with $\ell_2$. As shown in Table 1, CD even outperforms distillation approaches that require synthetic dataset construction, such as Knowledge Distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022).

Direct Generation. In Tables 1 and 2, we compare the sample quality of consistency training (CT) with other generative models using one-step and two-step generation. We also include PD and CD results for reference. Both tables report PD results obtained from the $\ell_2$ metric function, as this is the default setting used in the original paper of Salimans
Table 1: Sample quality on CIFAR-10. * Methods that require synthetic data construction for distillation.

METHOD | NFE (↓) | FID (↓) | IS (↑)
--- | --- | --- | ---
Diffusion + Samplers | | |
DDIM (Song et al., 2020) | 50 | 4.67 |
DDIM (Song et al., 2020) | 20 | 6.84 |
DDIM (Song et al., 2020) | 10 | 8.23 |
DPM-solver-2 (Lu et al., 2022) | 10 | 5.94 |
DPM-solver-fast (Lu et al., 2022) | 10 | 4.70 |
3-DEIS (Zhang & Chen, 2022) | 10 | 4.17 |
Diffusion + Distillation | | |
Knowledge Distillation* (Luhman & Luhman, 2021) | 1 | 9.36 |
DFNO* (Zheng et al., 2022) | 1 | 4.12 |
1-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 6.18 | 9.08
2-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 4.85 | 9.01
3-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 5.21 | 8.79
PD (Salimans & Ho, 2022) | 1 | 8.34 | 8.69
CD | 1 | 3.55 | 9.48
PD (Salimans & Ho, 2022) | 2 | 5.58 | 9.05
CD | 2 | 2.93 | 9.75
Direct Generation | | |
BigGAN (Brock et al., 2019) | 1 | 14.7 | 9.22
Diffusion GAN (Xiao et al., 2022) | 1 | 14.6 | 8.93
AutoGAN (Gong et al., 2019) | 1 | 12.4 | 8.55
E2GAN (Tian et al., 2020) | 1 | 11.3 | 8.51
ViTGAN (Lee et al., 2021) | 1 | 6.66 | 9.30
TransGAN (Jiang et al., 2021) | 1 | 9.26 | 9.05
StyleGAN2-ADA (Karras et al., 2020) | 1 | 2.92 | 9.83
StyleGAN-XL (Sauer et al., 2022) | 1 | 1.85 |
Score SDE (Song et al., 2021) | 2000 | 2.20 | 9.89
DDPM (Ho et al., 2020) | 1000 | 3.17 | 9.46
LSGM (Vahdat et al., 2021) | 147 | 2.10 |
PFGM (Xu et al., 2022) | 110 | 2.35 | 9.68
EDM (Karras et al., 2022) | 35 | 2.04 | 9.84
1-Rectified Flow (Liu et al., 2022) | 1 | 378 | 1.13
Glow (Kingma & Dhariwal, 2018) | 1 | 48.9 | 3.92
Residual Flow (Chen et al., 2019) | 1 | 46.4 |
GLFlow (Xiao et al., 2019) | 1 | 44.6 |
DenseFlow (Grcić et al., 2021) | 1 | 34.9 |
DC-VAE (Parmar et al., 2021) | 1 | 17.9 | 8.20
CT | 1 | 8.70 | 8.49
CT | 2 | 5.83 | 8.85

Table 2: Sample quality on ImageNet 64 × 64, and LSUN Bedroom & Cat 256 × 256. † Distillation techniques.

METHOD | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑)
--- | --- | --- | --- | ---
ImageNet 64 × 64 | | | |
PD† (Salimans & Ho, 2022) | 1 | 15.39 | 0.59 | 0.62
DFNO† (Zheng et al., 2022) | 1 | 8.35 | |
CD† | 1 | 6.20 | 0.68 | 0.63
PD† (Salimans & Ho, 2022) | 2 | 8.95 | 0.63 | 0.65
CD† | 2 | 4.70 | 0.69 | 0.64
ADM (Dhariwal & Nichol, 2021) | 250 | 2.07 | 0.74 | 0.63
EDM (Karras et al., 2022) | 79 | 2.44 | 0.71 | 0.67
BigGAN-deep (Brock et al., 2019) | 1 | 4.06 | 0.79 | 0.48
CT | 1 | 13.0 | 0.71 | 0.47
CT | 2 | 11.1 | 0.69 | 0.56
LSUN Bedroom 256 × 256 | | | |
PD† (Salimans & Ho, 2022) | 1 | 16.92 | 0.47 | 0.27
PD† (Salimans & Ho, 2022) | 2 | 8.47 | 0.56 | 0.39
CD† | 1 | 7.80 | 0.66 | 0.34
CD† | 2 | 5.22 | 0.68 | 0.39
DDPM (Ho et al., 2020) | 1000 | 4.89 | 0.60 | 0.45
ADM (Dhariwal & Nichol, 2021) | 1000 | 1.90 | 0.66 | 0.51
EDM (Karras et al., 2022) | 79 | 3.57 | 0.66 | 0.45
PGGAN (Karras et al., 2018) | 1 | 8.34 | |
PG-SWGAN (Wu et al., 2019) | 1 | 8.0 | |
TDPM (GAN) (Zheng et al., 2023) | 1 | 5.24 | |
StyleGAN2 (Karras et al., 2020) | 1 | 2.35 | 0.59 | 0.48
CT | 1 | 16.0 | 0.60 | 0.17
CT | 2 | 7.85 | 0.68 | 0.33
LSUN Cat 256 × 256 | | | |
PD† (Salimans & Ho, 2022) | 1 | 29.6 | 0.51 | 0.25
PD† (Salimans & Ho, 2022) | 2 | 15.5 | 0.59 | 0.36
CD† | 1 | 11.0 | 0.65 | 0.36
CD† | 2 | 8.84 | 0.66 | 0.40
DDPM (Ho et al., 2020) | 1000 | 17.1 | 0.53 | 0.48
ADM (Dhariwal & Nichol, 2021) | 1000 | 5.57 | 0.63 | 0.52
EDM (Karras et al., 2022) | 79 | 6.69 | 0.70 | 0.43
PGGAN (Karras et al., 2018) | 1 | 37.5 | |
StyleGAN2 (Karras et al., 2020) | 1 | 7.25 | 0.58 | 0.43
CT | 1 | 20.7 | 0.56 | 0.23
CT | 2 | 11.7 | 0.63 | 0.36
Figure 5: Samples generated by EDM (top), CT + single-step generation (middle), and CT + 2-step generation (bottom). All corresponding images are generated from the same initial noise.
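Single-step and two-step generation from a shared initial noise, as compared in Fig. 5, both follow Algorithm 1. The sketch below is a minimal NumPy rendering of that loop under stated assumptions, not the authors' implementation. `toy_f` is a hypothetical closed-form stand-in for a trained $f_\theta$: it is the exact consistency function when the data are $\mathcal{N}(0, \sigma^2 I)$ and the PF ODE drift is $-t\, \nabla \log p_t(x)$. The values $\epsilon = 0.002$, $T = 80$, $\sigma = 0.5$ and the intermediate time point 0.8 are illustrative choices rather than settings taken from this section.

```python
import numpy as np

# Illustrative constants (assumptions, see the lead-in).
SIGMA, EPS, T_MAX = 0.5, 0.002, 80.0

def toy_f(x, t):
    """Hypothetical stand-in for f_theta: maps x_t back to x_eps along the PF ODE
    for N(0, SIGMA^2 I) data. Satisfies the boundary condition toy_f(x, EPS) == x."""
    return x * np.sqrt(SIGMA**2 + EPS**2) / np.sqrt(SIGMA**2 + t**2)

def multistep_consistency_sampling(f, x_T, taus, rng):
    """Algorithm 1: denoise once from T, then alternate re-noising to tau_n and denoising."""
    x = f(x_T, T_MAX)                                 # single-step sample
    for tau in taus:                                  # tau_1 > tau_2 > ... > tau_{N-1}
        z = rng.standard_normal(x.shape)              # fresh Gaussian noise
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z      # re-noise to level tau
        x = f(x_tau, tau)                             # denoise again
    return x

rng = np.random.default_rng(0)
x_T = T_MAX * rng.standard_normal(1000)               # shared initial noise
one_step = multistep_consistency_sampling(toy_f, x_T, [], rng)
two_step = multistep_consistency_sampling(toy_f, x_T, [0.8], rng)
print(one_step.std(), two_step.std())
```

With an empty list of time points the loop reduces to one network evaluation, which is the single-step setting; each extra time point costs one more evaluation. With a trained network the extra step is meant to improve sample quality, whereas with this exact toy function both outputs already follow the data distribution.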
(a) Left: The gray-scale image. Middle: Colorized images. Right: The ground-truth image.
(b) Left: The downsampled image (32 × 32). Middle: Full resolution images (256 × 256). Right: The ground-truth image (256 × 256).
(c) Left: A stroke input provided by users. Right: Stroke-guided image generation.
Figure 6: Zero-shot image editing with a consistency model trained by consistency distillation on LSUN Bedroom 256 × 256.
& Ho (2022). For fair comparison, we ensure PD and CD distill the same EDM models. In Tables 1 and 2, we observe that CT outperforms existing single-step, non-adversarial generative models, i.e., VAEs and normalizing flows, by a significant margin on CIFAR-10. Moreover, CT achieves comparable quality to one-step samples from PD without relying on distillation. In Fig. 5, we provide EDM samples (top), single-step CT samples (middle), and two-step CT samples (bottom). In Appendix E, we show additional samples for both CD and CT in Figs. 14 to 21. Importantly, all samples obtained from the same initial noise vector share significant structural similarity, even though CT and EDM models are trained independently from one another. This indicates that CT is less likely to suffer from mode collapse, as EDMs do not.

6.3. Zero-Shot Image Editing

Similar to diffusion models, consistency models allow zero-shot image editing by modifying the multistep sampling process in Algorithm 1. We demonstrate this capability with a consistency model trained on the LSUN bedroom dataset using consistency distillation. In Fig. 6a, we show such a consistency model can colorize gray-scale bedroom images at test time, even though it has never been trained on colorization tasks. In Fig. 6b, we show the same consistency model can generate high-resolution images from low-resolution inputs. In Fig. 6c, we additionally demonstrate that it can generate images based on stroke inputs created by humans, as in SDEdit for diffusion models (Meng et al., 2021). Again, this editing capability is zero-shot, as the model has not been trained on stroke inputs. In Appendix D, we additionally demonstrate the zero-shot capability of consistency models on inpainting (Fig. 10), interpolation (Fig. 11) and denoising (Fig. 12), with more examples on colorization (Fig. 8), super-resolution (Fig. 9) and stroke-guided image generation (Fig. 13).

7. Conclusion

We have introduced consistency models, a type of generative model specifically designed to support one-step and few-step generation. We have empirically demonstrated that our consistency distillation method outshines the existing distillation techniques for diffusion models on multiple image benchmarks with few sampling iterations. Furthermore, as a standalone generative model, consistency models generate better samples than existing single-step generation models except for GANs. Similar to diffusion models, they also allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation.

In addition, consistency models share striking similarities with techniques employed in other fields, including deep Q-learning (Mnih et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020). This offers exciting prospects for cross-pollination of ideas and methods among these diverse fields.

Acknowledgements

We thank Alex Nichol for reviewing the manuscript and providing valuable feedback, Chenlin Meng for providing stroke inputs needed in our stroke-guided image generation experiments, and the OpenAI Algorithms team.
Jiang, Y., Chang, S., and Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34:14745–14758, 2021.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. 2020.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

Kawar, B., Vaksman, G., and Elad, M. Snips: Solving noisy inverse problems stochastically. arXiv preprint arXiv:2105.14951, 2021.

Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10215–10224. 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. arXiv preprint arXiv:2009.09761, 2020.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. Vitgan: Training gans with vision transformers. arXiv preprint arXiv:2107.04589, 2021.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.

Luhman, E. and Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

Meng, C., Gao, R., Kingma, D. P., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Parmar, G., Li, D., Lee, K., and Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 823–832, 2021.

Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad-TTS: A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286, 2014.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242, 2016.

Sauer, A., Schwarz, K., and Geiger, A. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pp. 1–10, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning, pp. 2256–2265, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ.

Song, Y. and Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.

Song, Y. and Ermon, S. Improved Techniques for Training Score-Based Generative Models. Advances in Neural Information Processing Systems, 33, 2020.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, pp. 204, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.

Song, Y., Shen, L., Xing, L., and Ermon, S. Solving inverse problems in medical imaging with score-based generative models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vaRCHVj0uGI.

Süli, E. and Mayers, D. F. An introduction to numerical analysis. Cambridge university press, 2003.

Tian, Y., Wang, Q., Huang, Z., Li, W., Dai, D., Yang, M., Wang, J., and Fink, O. Off-policy reinforcement learning for efficient and effective gan architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pp. 175–192. Springer, 2020.

Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.

Vincent, P. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel, D. P., and Gool, L. V. Sliced wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3713–3722, 2019.

Xiao, Z., Yan, Q., and Amit, Y. Generative latent flow. arXiv preprint arXiv:1905.10485, 2019.

Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JprM0p-q0Co.

Xu, Y., Liu, Z., Tegmark, M., and Jaakkola, T. S. Poisson flow generative models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=voV_TRqcWh.

Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Zhang, Q. and Chen, Y. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.
12
Consistency Models
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. The unreasonable effectiveness of deep features as a
perceptual metric. In CVPR, 2018.
Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., and
Anandkumar, A. Fast sampling of diffusion models
via operator learning. arXiv preprint arXiv:2211.13449,
2022.
Zheng, H., He, P., Chen, W., and Zhou, M. Truncated diffu-
sion probabilistic models and diffusion-based adversarial
auto-encoders. In The Eleventh International Confer-
ence on Learning Representations, 2023. URL https:
//openreview.net/forum?id=HDxgaKk956l.
Contents
1 Introduction
2 Diffusion Models
3 Consistency Models
6 Experiments
6.1 Training Consistency Models
6.2 Few-Step Image Generation
6.3 Zero-Shot Image Editing
7 Conclusion
Appendices
Appendix A Proofs
A.1 Notations
A.2 Consistency Distillation
A.3 Consistency Training
Appendices
A. Proofs
A.1. Notations
We use $f_\theta(x, t)$ to denote a consistency model parameterized by $\theta$, and $f(x, t; \phi)$ the consistency function of the empirical PF ODE in Eq. (3). Here $\phi$ symbolizes its dependency on the pre-trained score model $s_\phi(x, t)$. For the consistency function of the PF ODE in Eq. (2), we denote it as $f(x, t)$. Given a multi-variate function $h(x, y)$, we let $\partial_1 h(x, y)$ denote the Jacobian of $h$ over $x$, and analogously $\partial_2 h(x, y)$ denote the Jacobian of $h$ over $y$. Unless otherwise stated, $x$ is supposed to be a random variable sampled from the data distribution $p_{\mathrm{data}}(x)$, $n$ is sampled uniformly at random from $\llbracket 1, N-1 \rrbracket$, and $x_{t_n}$ is sampled from $\mathcal{N}(x; t_n^2 I)$. Here $\llbracket 1, N-1 \rrbracket$ represents the set of integers $\{1, 2, \cdots, N-1\}$. Furthermore, recall that we define
$$\hat{x}^\phi_{t_n} := x_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(x_{t_{n+1}}, t_{n+1}; \phi),$$
where $\Phi(\cdots; \phi)$ denotes the update function of a one-step ODE solver for the empirical PF ODE defined by the score model $s_\phi(x, t)$. By default, $\mathbb{E}[\cdot]$ denotes the expectation over all relevant random variables in the expression.
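As a concrete illustration of the one-step update recalled above, the following sketch applies the Euler choice $\Phi(x, t; \phi) = -t\, s_\phi(x, t)$ from the main text to a toy case. The point-mass data distribution, the closed-form score `toy_score`, and all function names are assumptions made for this example and are not part of the paper.

```python
def toy_score(x_t, t, x0=1.5):
    """Score of N(x0, t^2): the perturbed density of a point-mass dataset (assumed toy case)."""
    return -(x_t - x0) / (t * t)


def euler_update(x_next, t_n, t_next, score=toy_score):
    """One Euler step of the empirical PF ODE from t_{n+1} (t_next) down to t_n.

    Computes x_hat = x_{t_{n+1}} + (t_n - t_{n+1}) * Phi, with Phi(x, t) = -t * score(x, t).
    """
    phi = -t_next * score(x_next, t_next)
    return x_next + (t_n - t_next) * phi


x0, z = 1.5, 0.7          # data point and a fixed noise draw (illustrative values)
t_n, t_next = 2.0, 2.5    # adjacent time points with t_n < t_{n+1}
x_next = x0 + t_next * z  # x_{t_{n+1}} on the trajectory x_t = x0 + t * z
x_hat = euler_update(x_next, t_n, t_next)
```

In this toy case the trajectories are straight lines, so the Euler step lands on `x0 + t_n * z`, the point of the same trajectory at time $t_n$; for a learned score model the step is only an approximation.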
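Proof. From $\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = 0$, we have
$$\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}^\phi_{t_n}, t_n))] = 0. \tag{11}$$
According to the definition, we have $p_{t_n}(x_{t_n}) = p_{\mathrm{data}}(x) \otimes \mathcal{N}(0, t_n^2 I)$ where $t_n \geq \epsilon > 0$. It follows that $p_{t_n}(x_{t_n}) > 0$ for every $x_{t_n}$ and $1 \leq n \leq N$. Therefore, Eq. (11) entails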
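$$
\begin{aligned}
&\overset{(i)}{=} f_\theta(\hat{x}^\phi_{t_n}, t_n) - f(x_{t_n}, t_n; \phi) \\
&= f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_n}, t_n) + f_\theta(x_{t_n}, t_n) - f(x_{t_n}, t_n; \phi) \\
&= f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_n}, t_n) + e_n,
\end{aligned}
\tag{14}
$$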
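where (i) is due to Eq. (13) and $f(x_{t_{n+1}}, t_{n+1}; \phi) = f(x_{t_n}, t_n; \phi)$. Because $f_\theta(\cdot, t_n)$ has Lipschitz constant $L$, we have
$$
\begin{aligned}
\|e_{n+1}\|_2 &\leq \|e_n\|_2 + L\,\big\|\hat{x}^\phi_{t_n} - x_{t_n}\big\|_2 \\
&\overset{(i)}{=} \|e_n\|_2 + L \cdot O((t_{n+1} - t_n)^{p+1}) \\
&= \|e_n\|_2 + O((t_{n+1} - t_n)^{p+1}),
\end{aligned}
$$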
where (i) holds because the ODE solver has local error bounded by $O((t_{n+1} - t_n)^{p+1})$. In addition, we observe that $e_1 = 0$, because
$$
\begin{aligned}
e_1 &= f_\theta(x_{t_1}, t_1) - f(x_{t_1}, t_1; \phi) \\
&\overset{(i)}{=} x_{t_1} - f(x_{t_1}, t_1; \phi) \\
&\overset{(ii)}{=} x_{t_1} - x_{t_1} \\
&= 0.
\end{aligned}
$$
Here (i) is true because the consistency model is parameterized such that $f_\theta(x_{t_1}, t_1) = x_{t_1}$ and (ii) is entailed by the definition of $f(\cdot, \cdot; \phi)$. This allows us to perform induction on the recursion formula Eq. (14) to obtain
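$$
\begin{aligned}
\|e_n\|_2 &\leq \|e_1\|_2 + \sum_{k=1}^{n-1} O((t_{k+1} - t_k)^{p+1}) \\
&= \sum_{k=1}^{n-1} O((t_{k+1} - t_k)^{p+1}) \\
&= \sum_{k=1}^{n-1} (t_{k+1} - t_k)\, O((t_{k+1} - t_k)^{p}) \\
&\leq \sum_{k=1}^{n-1} (t_{k+1} - t_k)\, O((\Delta t)^{p}) \\
&= O((\Delta t)^{p}) \sum_{k=1}^{n-1} (t_{k+1} - t_k) \\
&= O((\Delta t)^{p})\,(t_n - t_1) \\
&\leq O((\Delta t)^{p})\,(T - \epsilon) \\
&= O((\Delta t)^{p}),
\end{aligned}
$$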
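$$
\begin{aligned}
&= \frac{\int p_{\mathrm{data}}(x)\, p(x_t \mid x)\, \nabla_{x_t} \log p(x_t \mid x)\, dx}{\int p_{\mathrm{data}}(x)\, p(x_t \mid x)\, dx} \\
&= \frac{\int p_{\mathrm{data}}(x)\, p(x_t \mid x)\, \nabla_{x_t} \log p(x_t \mid x)\, dx}{p_t(x_t)} \\
&= \int \frac{p_{\mathrm{data}}(x)\, p(x_t \mid x)}{p_t(x_t)}\, \nabla_{x_t} \log p(x_t \mid x)\, dx \\
&\overset{(i)}{=} \int p(x \mid x_t)\, \nabla_{x_t} \log p(x_t \mid x)\, dx
\end{aligned}
$$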
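$$\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) + o(\Delta t),$$
where the expectation is taken with respect to $x \sim p_{\mathrm{data}}$, $n \sim \mathcal{U}\llbracket 1, N-1 \rrbracket$, and $x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I)$. The consistency training objective, denoted by $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-)$, is defined as
$$\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) := \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))],$$
where $z \sim \mathcal{N}(0, I)$.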
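Then, we apply Lemma 1 to Eq. (15) and use Taylor expansion in the reverse direction to obtain
$$
\begin{aligned}
&\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))] \\
&\quad + \mathbb{E}\left\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1})) \left[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})\, t_{n+1}\, \mathbb{E}\left[\frac{x_{t_{n+1}} - x}{t_{n+1}^2} \,\Big|\, x_{t_{n+1}}\right]\right]\right\} \\
&\quad + \mathbb{E}\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&\overset{(i)}{=} \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))] \\
&\quad + \mathbb{E}\left\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1})) \left[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})\, t_{n+1} \left(\frac{x_{t_{n+1}} - x}{t_{n+1}^2}\right)\right]\right\} \\
&\quad + \mathbb{E}\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}\bigg[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1})) \\
&\qquad + \lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1})) \left[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})\, t_{n+1} \left(\frac{x_{t_{n+1}} - x}{t_{n+1}^2}\right)\right] \\
&\qquad + \lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})] + o(|t_{n+1} - t_n|)\bigg] \\
&\quad + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}\left(x_{t_{n+1}} + (t_n - t_{n+1})\, t_{n+1}\, \frac{x_{t_{n+1}} - x}{t_{n+1}^2},\, t_n\right)\right)\right] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}\left(x_{t_{n+1}} + (t_n - t_{n+1})\, \frac{x_{t_{n+1}} - x}{t_{n+1}},\, t_n\right)\right)\right] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_{n+1} z + (t_n - t_{n+1}) z, t_n))] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] + \mathbb{E}[o(\Delta t)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] + o(\Delta t) \\
&= \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) + o(\Delta t),
\end{aligned}
\tag{16}
$$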
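where (i) is due to the law of total expectation, and $z := \frac{x_{t_{n+1}} - x}{t_{n+1}} \sim \mathcal{N}(0, I)$. This implies $\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) + o(\Delta t)$ and thus completes the proof for Eq. (9). Moreover, we have $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) \geq O(\Delta t)$ whenever $\inf_N \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) > 0$. Otherwise, $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) < O(\Delta t)$ and thus $\lim_{\Delta t \to 0} \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = 0$, which is a clear contradiction to $\inf_N \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) > 0$.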
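The last lines of Eq. (16) show that the consistency training loss compares the model at two points built from the same data sample and the same noise draw. Below is a minimal sketch of forming one such pair and evaluating the integrand. It assumes a squared $\ell_2$ metric, $\lambda \equiv 1$, and a hypothetical stand-in model `f`; none of these choices are prescribed by the proof.

```python
import random


def ct_pair(x, t_n, t_next, rng):
    """Return (x + t_{n+1} z, x + t_n z) using one shared z ~ N(0, I)."""
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_next = [xi + t_next * zi for xi, zi in zip(x, z)]
    x_cur = [xi + t_n * zi for xi, zi in zip(x, z)]
    return x_next, x_cur


def sq_l2(a, b):
    """Squared l2 distance, one possible choice of the metric d."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def f(x, t, eps=0.002):
    """Hypothetical stand-in for a consistency model: it shrinks x by eps / t."""
    return [eps * xi / t for xi in x]


rng = random.Random(0)
x = [0.0, 0.0]  # toy data point at the origin
x_next, x_cur = ct_pair(x, t_n=1.0, t_next=1.5, rng=rng)
loss = sq_l2(f(x_next, 1.5), f(x_cur, 1.0))  # lambda(t_n) taken to be 1
```

For a dataset consisting of the single point at the origin, this `f` is the exact consistency function, so `loss` vanishes up to rounding. No score model appears anywhere in the computation, which is the point of the consistency training objective.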
Remark 1. When the condition $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) \geq O(\Delta t)$ is not satisfied, such as in the case where $\theta^- = \operatorname{stopgrad}(\theta)$, the validity of $\mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-)$ as a training objective for consistency models can still be justified by referencing the result provided in Theorem 6.
B. Continuous-Time Extensions
The consistency distillation and consistency training objectives can be generalized to hold for infinite time steps ($N \to \infty$) under suitable conditions.
The matrices $G$ and $H$ play a crucial role in forming continuous-time objectives for consistency distillation. Additionally, we denote the Jacobian of $f_\theta(x, t)$ with respect to $x$ as $\frac{\partial f_\theta(x, t)}{\partial x}$.
When $\theta^- = \theta$ (with no stopgrad operator), we have the following theoretical result.
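Theorem 3. Let $t_n = \tau\left(\frac{n-1}{N-1}\right)$, where $n \in \llbracket 1, N \rrbracket$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t \in [\epsilon, T]} \|s_\phi(x, t)\|_2 < \infty$. Then with the Euler solver in consistency distillation, we have
$$\lim_{N \to \infty} (N - 1)^2\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi), \tag{17}$$
where $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi)$ is defined as
$$\frac{1}{2}\, \mathbb{E}\left[\frac{\lambda(t)}{[(\tau^{-1})'(t)]^2} \left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\, \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)^{T} G(f_\theta(x_t, t)) \left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\, \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{18}$$
Here the expectation above is taken over $x \sim p_{\mathrm{data}}$, $u \sim \mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t \sim \mathcal{N}(x, t^2 I)$.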
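Proof. Let $\Delta u = \frac{1}{N-1}$ and $u_n = \frac{n-1}{N-1}$. First, we can derive the following equation with Taylor expansion:
$$
\begin{aligned}
&f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1}) = f_\theta(x_{t_{n+1}} + t_{n+1}\, s_\phi(x_{t_{n+1}}, t_{n+1})\, \tau'(u_n) \Delta u,\, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1}) \\
&= t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\, \tau'(u_n) \Delta u - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\, \tau'(u_n) \Delta u + O((\Delta u)^2).
\end{aligned}
\tag{19}
$$
Note that $\tau'(u_n) = \frac{1}{(\tau^{-1})'(t_n)}$ by the inverse function theorem. Then, we apply Taylor expansion to the consistency distillation loss, which gives
$$
\begin{aligned}
&(N - 1)^2\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \frac{1}{(\Delta u)^2}\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \frac{1}{(\Delta u)^2}\, \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}^\phi_{t_n}, t_n))] \\
&\overset{(i)}{=} \frac{1}{2 (\Delta u)^2} \Big( \mathbb{E}\{\lambda(t_n)\, [f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})]^{T}\, G(f_\theta(x_{t_{n+1}}, t_{n+1})) \\
&\qquad\qquad \cdot [f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})]\} + \mathbb{E}[O(|\Delta u|^3)] \Big) \\
&\overset{(ii)}{=} \frac{1}{2}\, \mathbb{E}\bigg[\lambda(t_n)\, \tau'(u_n)^2 \left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right)^{T} G(f_\theta(x_{t_{n+1}}, t_{n+1})) \\
&\qquad\qquad \cdot \left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right)\bigg] + \mathbb{E}[O(|\Delta u|)] \\
&= \frac{1}{2}\, \mathbb{E}\bigg[\frac{\lambda(t_n)}{[(\tau^{-1})'(t_n)]^2} \left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right)^{T} G(f_\theta(x_{t_{n+1}}, t_{n+1})) \\
&\qquad\qquad \cdot \left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right)\bigg] + \mathbb{E}[O(|\Delta u|)]
\end{aligned}
\tag{20}
$$
where we obtain (i) by expanding $d(f_\theta(x_{t_{n+1}}, t_{n+1}), \cdot)$ to second order and observing $d(x, x) \equiv 0$ and $\nabla_y d(x, y)|_{y = x} \equiv 0$. We obtain (ii) using Eq. (19), which contributes the factor $\tau'(u_n)^2 (\Delta u)^2$. By taking the limit for both sides of Eq. (20) as $\Delta u \to 0$ or equivalently $N \to \infty$, we arrive at Eq. (17), which completes the proof.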
Remark 2. Although Theorem 3 assumes the Euler ODE solver for technical simplicity, we believe an analogous result can be derived for more general solvers, since all ODE solvers should perform similarly as $N \to \infty$. We leave a more general version of Theorem 3 as future work.
Remark 3. Theorem 3 implies that consistency models can be trained by minimizing $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi)$. In particular, when $d(x, y) = \|x - y\|_2^2$, we have
$$\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi) = \mathbb{E}\left[\frac{\lambda(t)}{[(\tau^{-1})'(t)]^2} \left\|\frac{\partial f_\theta(x_t, t)}{\partial t} - t\, \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right\|_2^2\right]. \tag{21}$$
However, this continuous-time objective requires computing Jacobian-vector products as a subroutine to evaluate the loss function, which can be slow and laborious to implement in deep learning frameworks that do not support forward-mode automatic differentiation.
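The term inside the norm in Eq. (21) is a single directional derivative of $f_\theta$ along the direction $(dx, dt) = (-t\, s_\phi, 1)$, which is what forward-mode automatic differentiation computes in one pass. The sketch below shows the idea with a hand-rolled dual-number class for scalar inputs. The `Dual` class, the toy model `f`, and the toy score are illustrative assumptions; a practical implementation would rely on a framework's own Jacobian-vector product.

```python
class Dual:
    """Minimal dual number a + b*e (with e^2 = 0) for forward-mode differentiation."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(v):
        return v if isinstance(v, Dual) else Dual(float(v))

    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, o):
        o = Dual.lift(o)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __rsub__(self, o):
        return Dual.lift(o) - self

    def __mul__(self, o):
        o = Dual.lift(o)
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

    __rmul__ = __mul__

    def __truediv__(self, o):
        o = Dual.lift(o)
        return Dual(self.val / o.val,
                    (self.tan * o.val - self.val * o.tan) / (o.val * o.val))


def jvp(f, x, t, dx, dt):
    """Directional derivative of f at (x, t) along (dx, dt)."""
    return f(Dual(x, dx), Dual(t, dt)).tan


X0, EPS = 1.5, 0.002  # toy data point and smallest time (assumed values)


def f(x, t):
    """Exact consistency function when the data is a point mass at X0."""
    return X0 + EPS * (x - X0) / t


def score(x, t):
    """Score of N(X0, t^2)."""
    return -(x - X0) / (t * t)


x_t, t = 3.0, 2.0
# df/dt - t * df/dx * s equals the JVP of f along (dx, dt) = (-t * s, 1)
residual = jvp(f, x_t, t, dx=-t * score(x_t, t), dt=1.0)
```

Because `f` is the exact consistency function for this toy score, `residual` is zero up to rounding, so the integrand of Eq. (21) vanishes at this point. With a vector-valued network, the same single forward-mode pass replaces materializing the full Jacobian.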
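Remark 4. If $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE of $s_\phi(x, t)$, then
$$\frac{\partial f_\theta(x, t)}{\partial t} - t\, \frac{\partial f_\theta(x, t)}{\partial x}\, s_\phi(x, t) \equiv 0$$
and therefore $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi) = 0$. This can be proved by noting that $f_\theta(x_t, t) \equiv x_\epsilon$ for all $t \in [\epsilon, T]$, and then taking the time-derivative of this identity:
$$
\begin{aligned}
& f_\theta(x_t, t) \equiv x_\epsilon \\
\iff\ & \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, \frac{dx_t}{dt} + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0 \\
\iff\ & \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, [-t\, s_\phi(x_t, t)] + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0 \\
\iff\ & \frac{\partial f_\theta(x_t, t)}{\partial t} - t\, \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t) \equiv 0.
\end{aligned}
$$
The above observation provides another motivation for $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi)$, as it is minimized if and only if the consistency model matches the ground truth consistency function.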
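As a sanity check of the identity in Remark 4, consider a toy case that is not taken from the paper: the data distribution is a point mass at $x_0$, so the perturbed density is $\mathcal{N}(x_0, t^2 I)$ and the score is available in closed form. Under that assumption the identity can be verified by hand:

```latex
\begin{aligned}
s(x_t, t) &= -\frac{x_t - x_0}{t^2}, \qquad
\frac{dx_t}{dt} = -t\, s(x_t, t) = \frac{x_t - x_0}{t}
\;\Longrightarrow\; x_t = x_0 + t z \ \text{for a constant } z, \\
f(x_t, t) &= x_\epsilon = x_0 + \frac{\epsilon}{t}\,(x_t - x_0), \\
\frac{\partial f}{\partial t} &= -\frac{\epsilon}{t^2}\,(x_t - x_0), \qquad
\frac{\partial f}{\partial x_t} = \frac{\epsilon}{t}\, I, \\
\frac{\partial f}{\partial t} - t\, \frac{\partial f}{\partial x_t}\, s(x_t, t)
&= -\frac{\epsilon}{t^2}\,(x_t - x_0) + \frac{\epsilon}{t^2}\,(x_t - x_0) = 0 .
\end{aligned}
```

The two terms cancel exactly, which is the pointwise statement that makes the continuous-time distillation loss vanish at the ground truth consistency function.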
For some metric functions, such as the ℓ1 norm, the Hessian Gpxq is zero so Theorem 3 is vacuous. Below we show that a
non-vacuous statement holds for the ℓ1 norm with just a small modification of the proof for Theorem 3.
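Theorem 4. Let $t_n = \tau\left(\frac{n-1}{N-1}\right)$, where $n \in \llbracket 1, N \rrbracket$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t \in [\epsilon, T]} \|s_\phi(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and set $d(x, y) = \|x - y\|_1$ in consistency distillation. Then we have
$$\lim_{N \to \infty} (N - 1)\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \mathcal{L}^\infty_{\mathrm{CD}, \ell_1}(\theta, \theta; \phi), \tag{22}$$
where
$$\mathcal{L}^\infty_{\mathrm{CD}, \ell_1}(\theta, \theta; \phi) := \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)} \left\|t\, \frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t) - \frac{\partial f_\theta(x_t, t)}{\partial t}\right\|_1\right]$$
where the expectation above is taken over $x \sim p_{\mathrm{data}}$, $u \sim \mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t \sim \mathcal{N}(x, t^2 I)$.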
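Proof. Let $\Delta u = \frac{1}{N-1}$ and $u_n = \frac{n-1}{N-1}$. We have
$$
\begin{aligned}
&(N - 1)\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \frac{1}{\Delta u}\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta; \phi) = \frac{1}{\Delta u}\, \mathbb{E}[\lambda(t_n)\, \|f_\theta(x_{t_{n+1}}, t_{n+1}) - f_\theta(\hat{x}^\phi_{t_n}, t_n)\|_1] \\
&\overset{(i)}{=} \frac{1}{\Delta u}\, \mathbb{E}\left[\lambda(t_n) \left\|t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\, \tau'(u_n) \Delta u - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\, \tau'(u_n) \Delta u + O((\Delta u)^2)\right\|_1\right] \\
&= \mathbb{E}\left[\lambda(t_n)\, \tau'(u_n) \left\|t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1}) - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} + O(\Delta u)\right\|_1\right] \\
&= \mathbb{E}\left[\frac{\lambda(t_n)}{(\tau^{-1})'(t_n)} \left\|t_{n+1}\, \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1}) - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} + O(\Delta u)\right\|_1\right]
\end{aligned}
\tag{23}
$$
where (i) is obtained by plugging Eq. (19) into the previous equation. Taking the limit for both sides of Eq. (23) as $\Delta u \to 0$ or equivalently $N \to \infty$ leads to Eq. (22), which completes the proof.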
Remark 5. According to Theorem 4, consistency models can be trained by minimizing $\mathcal{L}^\infty_{\mathrm{CD}, \ell_1}(\theta, \theta; \phi)$. Moreover, the same reasoning in Remark 4 can be applied to show that $\mathcal{L}^\infty_{\mathrm{CD}, \ell_1}(\theta, \theta; \phi) = 0$ if and only if $f_\theta(x_t, t) = x_\epsilon$ for all $x_t \in \mathbb{R}^d$ and $t \in [\epsilon, T]$.
In the second case where $\theta^- = \operatorname{stopgrad}(\theta)$, we can derive a so-called “pseudo-objective” whose gradient matches the gradient of $\mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi)$ in the limit of $N \to \infty$. Minimizing this pseudo-objective with gradient descent gives another way to train consistency models via distillation. This pseudo-objective is provided by the theorem below.
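Theorem 5. Let $t_n = \tau\left(\frac{n-1}{N-1}\right)$, where $n \in \llbracket 1, N \rrbracket$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\sup_{x, t \in [\epsilon, T]} \|s_\phi(x, t)\|_2 < \infty$, and $\sup_{x, t \in [\epsilon, T]} \|\nabla_\theta f_\theta(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and $\theta^- = \operatorname{stopgrad}(\theta)$ in consistency distillation. Then,
$$\lim_{N \to \infty} (N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \nabla_\theta \mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi), \tag{24}$$
where
$$\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi) := \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}\, f_\theta(x_t, t)^{T}\, H(f_{\theta^-}(x_t, t)) \left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\, \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{25}$$
Here the expectation above is taken over $x \sim p_{\mathrm{data}}$, $u \sim \mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t \sim \mathcal{N}(x, t^2 I)$.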
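Proof. We denote $\Delta u = \frac{1}{N-1}$ and $u_n = \frac{n-1}{N-1}$. First, we leverage Taylor series expansion to obtain
$$
\begin{aligned}
&(N - 1)\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \frac{1}{\Delta u}\, \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \frac{1}{\Delta u}\, \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))] \\
&\overset{(i)}{=} \frac{1}{2 \Delta u} \Big( \mathbb{E}\{\lambda(t_n)\, [f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]^{T}\, H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)) \\
&\qquad\qquad \cdot [f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]\} + \mathbb{E}[O(|\Delta u|^3)] \Big) \\
&= \frac{1}{2 \Delta u}\, \mathbb{E}\{\lambda(t_n)\, [f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]^{T}\, H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))\, [f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)]
\end{aligned}
\tag{26}
$$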
Here (i) results from the chain rule, and (ii) follows from Eq. (19) and $f_\theta(x, t) \equiv f_{\theta^-}(x, t)$, since $\theta^- = \operatorname{stopgrad}(\theta)$. Taking the limit for both sides of Eq. (28) as $\Delta u \to 0$ (or $N \to \infty$) yields Eq. (24), which completes the proof.
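Remark 6. When $d(x, y) = \|x - y\|_2^2$, the pseudo-objective $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi)$ can be simplified to
$$\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi) = 2\, \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}\, f_\theta(x_t, t)^{T} \left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\, \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{28}$$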
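The factor of 2 in Eq. (28) can be traced to the Hessian of the metric. The worked step below assumes, in line with the second-order expansions used in the proofs, that $H(y)$ denotes the Hessian of $d(\cdot, y)$ evaluated at $y$; the formal definition is not restated in this section, so this reading is an assumption.

```latex
\begin{aligned}
d(x, y) = \|x - y\|_2^2
&\;\Longrightarrow\;
\nabla_x d(x, y) = 2\,(x - y), \qquad
H(y) = \nabla_x^2 d(x, y)\big|_{x = y} = 2I, \\
f_\theta(x_t, t)^{T}\, H(f_{\theta^-}(x_t, t))\, v
&= 2\, f_\theta(x_t, t)^{T} v,
\qquad
v := \frac{\partial f_{\theta^-}(x_t, t)}{\partial t}
   - t\, \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\, s_\phi(x_t, t).
\end{aligned}
```

Substituting the second line into Eq. (25) gives Eq. (28). The same Hessian, $2I$, cancels the prefactor $\frac{1}{2}$ in the non-stopgrad objective and yields the squared-norm form of Eq. (21).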
Remark 7. The objective $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi)$ defined in Theorem 5 is only meaningful in terms of its gradient: one cannot measure the progress of training by tracking the value of $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi)$, but can still apply gradient descent to this objective to distill consistency models from pre-trained diffusion models. Because this objective is not a typical loss function, we refer to it as the “pseudo-objective” for consistency distillation.
Remark 8. Following the same reasoning in Remark 4, we can easily derive that $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi) = 0$ and $\nabla_\theta \mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi) = 0$ if $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE that involves $s_\phi(x, t)$. However, the converse does not hold true in general. This distinguishes $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta^-; \phi)$ from $\mathcal{L}^\infty_{\mathrm{CD}}(\theta, \theta; \phi)$, the latter of which is a true loss function.
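$$\lim_{N \to \infty} (N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \lim_{N \to \infty} (N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) = \nabla_\theta \mathcal{L}^\infty_{\mathrm{CT}}(\theta, \theta^-), \tag{29}$$
where $\mathcal{L}^N_{\mathrm{CD}}$ uses the Euler ODE solver, and
$$\mathcal{L}^\infty_{\mathrm{CT}}(\theta, \theta^-) := \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}\, f_\theta(x_t, t)^{T}\, H(f_{\theta^-}(x_t, t)) \left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t} \cdot \frac{x_t - x}{t}\right)\right]. \tag{30}$$
Here the expectation above is taken over $x \sim p_{\mathrm{data}}$, $u \sim \mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t \sim \mathcal{N}(x, t^2 I)$.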
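Proof. The proof mostly follows that of Theorem 5. First, we leverage Taylor series expansion to obtain
$$
\begin{aligned}
&(N - 1)\, \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) = \frac{1}{\Delta u}\, \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) = \frac{1}{\Delta u}\, \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] \\
&\overset{(i)}{=} \frac{1}{2 \Delta u} \Big( \mathbb{E}\{\lambda(t_n)\, [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \\
&\qquad\qquad \cdot [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^3)] \Big) \\
&= \frac{1}{2 \Delta u}\, \mathbb{E}\{\lambda(t_n)\, [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \\
&\qquad\qquad \cdot [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)]
\end{aligned}
\tag{31}
$$
where $z \sim \mathcal{N}(0, I)$, (i) is derived by first expanding $d(\cdot, f_{\theta^-}(x + t_n z, t_n))$ to second order, and then noting that $d(x, x) \equiv 0$ and $\nabla_y d(y, x)|_{y = x} \equiv 0$. Next, we compute the gradient of Eq. (31) with respect to $\theta$ and simplify the result to obtain
$$
\begin{aligned}
&(N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) = \frac{1}{\Delta u}\, \nabla_\theta \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-) \\
&= \frac{1}{2 \Delta u}\, \nabla_\theta\, \mathbb{E}\{\lambda(t_n)\, [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \\
&\qquad\qquad \cdot [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)] \\
&\overset{(i)}{=} \frac{1}{\Delta u}\, \mathbb{E}\{\lambda(t_n)\, [\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \\
&\qquad\qquad \cdot [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)]
\end{aligned}
\tag{32}
$$
$$
\begin{aligned}
&\overset{(ii)}{=} \frac{1}{\Delta u}\, \mathbb{E}\Big\{\lambda(t_n)\, [\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \Big[\tau'(u_n) \Delta u\, \partial_1 f_{\theta^-}(x + t_n z, t_n)\, z \\
&\qquad\qquad + \partial_2 f_{\theta^-}(x + t_n z, t_n)\, \tau'(u_n) \Delta u\Big]\Big\} + \mathbb{E}[O(|\Delta u|)] \\
&= \mathbb{E}\Big\{\lambda(t_n)\, \tau'(u_n)\, [\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \Big[\partial_1 f_{\theta^-}(x + t_n z, t_n)\, z \\
&\qquad\qquad + \partial_2 f_{\theta^-}(x + t_n z, t_n)\Big]\Big\} + \mathbb{E}[O(|\Delta u|)] \\
&= \nabla_\theta\, \mathbb{E}\Big\{\lambda(t_n)\, \tau'(u_n)\, [f_\theta(x + t_{n+1} z, t_{n+1})]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \Big[\partial_1 f_{\theta^-}(x + t_n z, t_n)\, z \\
&\qquad\qquad + \partial_2 f_{\theta^-}(x + t_n z, t_n)\Big]\Big\} + \mathbb{E}[O(|\Delta u|)] \\
&= \nabla_\theta\, \mathbb{E}\left\{\lambda(t_n)\, \tau'(u_n)\, [f_\theta(x_{t_{n+1}}, t_{n+1})]^{T}\, H(f_{\theta^-}(x_{t_n}, t_n)) \left[\partial_1 f_{\theta^-}(x_{t_n}, t_n)\, \frac{x_{t_n} - x}{t_n} + \partial_2 f_{\theta^-}(x_{t_n}, t_n)\right]\right\} + \mathbb{E}[O(|\Delta u|)] \\
&= \nabla_\theta\, \mathbb{E}\left\{\frac{\lambda(t_n)}{(\tau^{-1})'(t_n)}\, [f_\theta(x_{t_{n+1}}, t_{n+1})]^{T}\, H(f_{\theta^-}(x_{t_n}, t_n)) \left[\partial_1 f_{\theta^-}(x_{t_n}, t_n)\, \frac{x_{t_n} - x}{t_n} + \partial_2 f_{\theta^-}(x_{t_n}, t_n)\right]\right\} + \mathbb{E}[O(|\Delta u|)]
\end{aligned}
\tag{33}
$$
Here (i) results from the chain rule, and (ii) follows from Taylor expansion. Taking the limit for both sides of Eq. (33) as $\Delta u \to 0$ or $N \to \infty$ yields the second equality in Eq. (29).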
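Now we prove the first equality. Applying Taylor expansion again, we obtain
$$
\begin{aligned}
&(N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \frac{1}{\Delta u}\, \nabla_\theta \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \frac{1}{\Delta u}\, \nabla_\theta\, \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))] \\
&= \frac{1}{\Delta u}\, \mathbb{E}[\lambda(t_n)\, \nabla_\theta d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))] \\
&= \frac{1}{\Delta u}\, \mathbb{E}[\lambda(t_n)\, \nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})^{T}\, \partial_1 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))] \\
&= \frac{1}{\Delta u}\, \mathbb{E}\Big\{\lambda(t_n)\, \nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})^{T} \Big[\partial_1 d(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)) \\
&\qquad\qquad + H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))\, (f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)) + O(|\Delta u|^2)\Big]\Big\} \\
&= \frac{1}{\Delta u}\, \mathbb{E}\{\lambda(t_n)\, \nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})^{T}\, [H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))\, (f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))] + O(|\Delta u|^2)\} \\
&= \frac{1}{\Delta u}\, \mathbb{E}\{\lambda(t_n)\, \nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})^{T}\, [H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))\, (f_{\theta^-}(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))] + O(|\Delta u|^2)\} \\
&\overset{(i)}{=} \frac{1}{\Delta u}\, \mathbb{E}\{\lambda(t_n)\, [\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{T}\, H(f_{\theta^-}(x + t_n z, t_n)) \\
&\qquad\qquad \cdot [f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)]
\end{aligned}
$$
where (i) holds because $x_{t_{n+1}} = x + t_{n+1} z$ and $\hat{x}^\phi_{t_n} = x_{t_{n+1}} - (t_n - t_{n+1})\, t_{n+1}\, \frac{-(x_{t_{n+1}} - x)}{t_{n+1}^2} = x_{t_{n+1}} + (t_n - t_{n+1}) z = x + t_n z$. Because (i) matches Eq. (32), we can use the same reasoning procedure from Eq. (32) to Eq. (33) to conclude $\lim_{N \to \infty} (N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CD}}(\theta, \theta^-; \phi) = \lim_{N \to \infty} (N - 1)\, \nabla_\theta \mathcal{L}^N_{\mathrm{CT}}(\theta, \theta^-)$, completing the proof.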
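Remark 10. When $d(x, y) = \|x - y\|_2^2$, the continuous-time consistency training objective becomes
$$\mathcal{L}^\infty_{\mathrm{CT}}(\theta, \theta^-) = 2\, \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}\, f_\theta(x_t, t)^{T} \left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t} \cdot \frac{x_t - x}{t}\right)\right]. \tag{34}$$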
We did not investigate using the LPIPS metric in Theorem 3 because minimizing the resulting objective would require
back-propagating through second order derivatives of the VGG network used in LPIPS, which is computationally expensive
and prone to numerical instability. As revealed by Fig. 7a, the stopgrad version of continuous-time distillation (Theorem 5)
works better than the non-stopgrad version (Theorem 3) for both the LPIPS and ℓ2 metrics, and the LPIPS metric works
the best for all distillation approaches. Additionally, discrete-time consistency distillation outperforms continuous-time
consistency distillation, possibly due to the larger variance in continuous-time objectives, and the fact that one can use
effective higher-order ODE solvers in discrete-time objectives.
For consistency training (CT), we find it important to initialize consistency models from a pre-trained EDM model in order
to stabilize training when using continuous-time objectives. We hypothesize that this is caused by the large variance in our
continuous-time loss functions. For fair comparison, we thus initialize all consistency models from the same pre-trained
EDM model on CIFAR-10 for both discrete-time and continuous-time CT, even though the former works well with random
initialization. We leave variance reduction techniques for continuous-time CT to future research.
We empirically compare the following objectives:
• CT (LPIPS): Consistency training $\mathcal{L}^N_{\mathrm{CT}}$ with $N = 120$ and the LPIPS metric. We set the learning rate to 4e-4, and the EMA decay rate for the target network to 0.99. We do not use the schedule functions for $N$ and $\mu$ here because they cause slower learning when the consistency model is initialized from a pre-trained EDM model.
• CT$^\infty$ ($\ell_2$): Consistency training $\mathcal{L}^\infty_{\mathrm{CT}}$ with the $\ell_2$ metric. We set the learning rate to 5e-6.
As shown in Fig. 7b, the LPIPS metric leads to improved performance for continuous-time CT. We also find that continuous-time CT outperforms discrete-time CT with the same LPIPS metric. This is likely due to the bias in discrete-time CT, as $\Delta t > 0$ in Theorem 2 for discrete-time objectives, whereas continuous-time CT has no bias since it implicitly drives $\Delta t$ to 0.
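Parameterization for Consistency Models We use the same architectures for consistency models as those used for EDMs. The only difference is that we slightly modify the skip connections in EDM to ensure the boundary condition holds for consistency models. Recall that in Section 3 we propose to parameterize a consistency model in the following form:
$$f_\theta(x, t) = c_{\mathrm{skip}}(t)\, x + c_{\mathrm{out}}(t)\, F_\theta(x, t).$$
In EDM (Karras et al., 2022), the authors choose
$$c_{\mathrm{skip}}(t) = \frac{\sigma_{\mathrm{data}}^2}{t^2 + \sigma_{\mathrm{data}}^2}, \qquad c_{\mathrm{out}}(t) = \frac{\sigma_{\mathrm{data}}\, t}{\sqrt{\sigma_{\mathrm{data}}^2 + t^2}},$$
where $\sigma_{\mathrm{data}} = 0.5$. However, this choice of $c_{\mathrm{skip}}$ and $c_{\mathrm{out}}$ does not satisfy the boundary condition when the smallest time instant $\epsilon \neq 0$. To remedy this issue, we modify them to
$$c_{\mathrm{skip}}(t) = \frac{\sigma_{\mathrm{data}}^2}{(t - \epsilon)^2 + \sigma_{\mathrm{data}}^2}, \qquad c_{\mathrm{out}}(t) = \frac{\sigma_{\mathrm{data}}\, (t - \epsilon)}{\sqrt{\sigma_{\mathrm{data}}^2 + t^2}},$$
which satisfy $c_{\mathrm{skip}}(\epsilon) = 1$ and $c_{\mathrm{out}}(\epsilon) = 0$.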
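The effect of the modified coefficients can be checked numerically. The sketch below implements the two modified formulas for a scalar input and confirms that the model reduces to the identity at $t = \epsilon$ regardless of the network output. The value `EPS = 0.002` and the stand-in network `arbitrary_network` are assumptions for this example only.

```python
import math

SIGMA_DATA = 0.5   # value stated in the text
EPS = 0.002        # smallest time instant; illustrative value, an assumption here


def c_skip(t, eps=EPS, sigma=SIGMA_DATA):
    """Modified skip coefficient: sigma^2 / ((t - eps)^2 + sigma^2)."""
    return sigma ** 2 / ((t - eps) ** 2 + sigma ** 2)


def c_out(t, eps=EPS, sigma=SIGMA_DATA):
    """Modified output coefficient: sigma * (t - eps) / sqrt(sigma^2 + t^2)."""
    return sigma * (t - eps) / math.sqrt(sigma ** 2 + t ** 2)


def consistency_model(F, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F(x, t) for a scalar x."""
    return c_skip(t) * x + c_out(t) * F(x, t)


def arbitrary_network(x, t):
    """Hypothetical stand-in for the free-form network F_theta."""
    return 10.0 * x - 3.0 * t


x = 0.37
out_at_eps = consistency_model(arbitrary_network, x, EPS)  # identity at t = eps
```

Passing `eps=0.0` to `c_skip` and `c_out` recovers the original EDM coefficients, for which `c_skip(EPS, eps=0.0)` is slightly below 1 and `c_out(EPS, eps=0.0)` is nonzero, so the boundary condition fails at $t = \epsilon$.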
Schedule Functions for Consistency Training As discussed in Section 5, consistency training requires specifying schedule functions $N(\cdot)$ and $\mu(\cdot)$ for best performance. Throughout our experiments, we use schedule functions that take the form below:
$$
\begin{aligned}
N(k) &= \left\lceil \sqrt{\frac{k}{K}\left((s_1 + 1)^2 - s_0^2\right) + s_0^2} - 1 \right\rceil + 1 \\
\mu(k) &= \exp\left(\frac{s_0 \log \mu_0}{N(k)}\right),
\end{aligned}
$$
where $K$ denotes the total number of training iterations, $s_0$ denotes the initial discretization steps, $s_1 > s_0$ denotes the target discretization steps at the end of training, and $\mu_0 > 0$ denotes the EMA decay rate at the beginning of model training.
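The two schedules translate directly into code. The sketch below evaluates them at the start, middle, and end of training. The specific values of `K`, `s0`, `s1`, and `mu0` are illustrative assumptions and are not taken from the text above.

```python
import math


def n_schedule(k, K, s0, s1):
    """N(k) = ceil( sqrt( k/K * ((s1 + 1)^2 - s0^2) + s0^2 ) - 1 ) + 1."""
    return math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2) - 1) + 1


def mu_schedule(k, K, s0, s1, mu0):
    """mu(k) = exp( s0 * log(mu0) / N(k) )."""
    return math.exp(s0 * math.log(mu0) / n_schedule(k, K, s0, s1))


# Illustrative settings (assumed for this example).
K, s0, s1, mu0 = 1000, 2, 150, 0.9
steps = [n_schedule(k, K, s0, s1) for k in (0, K // 2, K)]
decays = [mu_schedule(k, K, s0, s1, mu0) for k in (0, K // 2, K)]
```

By construction $N(0) = s_0$ and $N(K) = s_1 + 1$, so the discretization is refined as training proceeds, while $\mu(0) = \mu_0$ and $\mu(k)$ moves toward 1 as $N(k)$ grows.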
Training Details In both consistency distillation and progressive distillation, we distill EDMs (Karras et al., 2022). We
trained these EDMs ourselves according to the specifications given in Karras et al. (2022). The original EDM paper did
not provide hyperparameters for the LSUN Bedroom 256 × 256 and Cat 256 × 256 datasets, so we mostly used the same
hyperparameters as those for the ImageNet 64 × 64 dataset. The difference is that we trained for 600k and 300k iterations
for the LSUN Bedroom and Cat datasets respectively, and reduced the batch size from 4096 to 2048.
We used the same EMA decay rate for LSUN 256 × 256 datasets as for the ImageNet 64 × 64 dataset. For progressive
distillation, we used the same training settings as those described in Salimans & Ho (2022) for CIFAR-10 and ImageNet
64 × 64. Although the original paper did not test on LSUN 256 × 256 datasets, we used the same settings as for ImageNet
64 × 64 and found them to work well.
In all distillation experiments, we initialized the consistency model with pre-trained EDM weights. For consistency training,
we initialized the model randomly, just as we did for training the EDMs. We trained all consistency models with the
Rectified Adam optimizer (Liu et al., 2019), with no learning rate decay or warm-up, and no weight decay. We also applied
EMA to the weights of the online consistency models in both consistency distillation and consistency training, as well as
to the weights of the training online consistency models according to Karras et al. (2022). For LSUN 256 × 256 datasets,
we chose the EMA decay rate to be the same as that for ImageNet 64 × 64, except for consistency distillation on LSUN
Bedroom 256 × 256, where we found that using zero EMA worked better.
When using the LPIPS metric on CIFAR-10 and ImageNet 64 × 64, we rescale images to resolution 224 × 224 with bilinear
upsampling before feeding them to the LPIPS network. For LSUN 256 × 256, we evaluated LPIPS without rescaling inputs.
In addition, we performed horizontal flips for data augmentation for all models and on all datasets. We trained all models on
a cluster of Nvidia A100 GPUs. Additional hyperparameters for consistency training and distillation are listed in Table 3.
Inpainting When using Algorithm 4 for inpainting, we let y be an image where missing pixels are masked out, Ω be a
binary mask where 1 indicates the missing pixels, and A be the identity transformation.
Colorization The algorithm for image colorization is similar, as colorization becomes a special case of inpainting once we transform data into a decoupled space. Specifically, let $y \in \mathbb{R}^{h \times w \times 3}$ be a gray-scale image that we aim to colorize, where all channels of $y$ are assumed to be the same, i.e., y[:, :, 0] = y[:, :, 1] = y[:, :, 2] in NumPy notation. In our experiments, each channel of this gray-scale image is obtained from a color image as the weighted average of the RGB channels
$$0.2989R + 0.5870G + 0.1140B.$$
Let $Q \in \mathbb{R}^{3 \times 3}$ be an orthogonal matrix whose first column is proportional to the vector $(0.2989, 0.5870, 0.1140)$. This orthogonal matrix can be obtained easily via QR decomposition, and we use the following in our experiments
$$Q = \begin{pmatrix} 0.4471 & -0.8204 & 0.3563 \\ 0.8780 & 0.4785 & 0 \\ 0.1705 & -0.3129 & -0.9343 \end{pmatrix}.$$
Because $Q$ is orthogonal, the inversion $A^{-1}: y \in \mathbb{R}^{h \times w \times 3} \mapsto x \in \mathbb{R}^{h \times w \times 3}$ is easy to compute, where
$$x[i, j, k] = \sum_{l=0}^{2} y[i, j, l]\, Q[k, l].$$
With A and Ω defined as above, we can now use Algorithm 4 for image colorization.
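The properties claimed for $Q$ can be checked on a single pixel. In the sketch below, the inverse map follows the formula in the text, while the forward map `to_decoupled` is an assumption inferred from it (the transpose of the inverse, which is valid for an orthogonal $Q$). The helper names and the sample pixel are illustrative.

```python
W = (0.2989, 0.5870, 0.1140)  # luma weights from the text
Q = [
    [0.4471, -0.8204, 0.3563],
    [0.8780, 0.4785, 0.0],
    [0.1705, -0.3129, -0.9343],
]


def to_decoupled(x):
    """Assumed forward map A for one pixel: y[k] = sum_l x[l] * Q[l][k]."""
    return [sum(x[l] * Q[l][k] for l in range(3)) for k in range(3)]


def from_decoupled(y):
    """Inverse map from the text: x[k] = sum_l y[l] * Q[k][l]."""
    return [sum(y[l] * Q[k][l] for l in range(3)) for k in range(3)]


def gram(M):
    """M^T M, which should be close to the identity for an orthogonal M."""
    return [[sum(M[m][i] * M[m][j] for m in range(3)) for j in range(3)]
            for i in range(3)]


rgb = [0.2, 0.6, 0.9]                      # an arbitrary RGB pixel
y = to_decoupled(rgb)                      # y[0] is proportional to the gray value
gray = sum(w * c for w, c in zip(W, rgb))  # weighted average from the text
w_norm = sum(w * w for w in W) ** 0.5
rgb_back = from_decoupled(y)
```

Since the entries of $Q$ are rounded to four decimals, `gram(Q)` matches the identity and `rgb_back` matches `rgb` only to about three decimal places. The first decoupled channel `y[0]` equals `gray / w_norm` to the same precision, which is why fixing that channel and generating the other two amounts to inpainting in the decoupled space.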
Super-resolution With a similar strategy, we employ Algorithm 4 for image super-resolution. For simplicity, we assume that the down-sampled image is obtained by averaging non-overlapping patches of size $p \times p$. Suppose the shape of full resolution images is $h \times w \times 3$. Let $y \in \mathbb{R}^{h \times w \times 3}$ denote a low-resolution image naively up-sampled to full resolution, where pixels in each non-overlapping patch share the same value. Additionally, let $\Omega \in \{0, 1\}^{h/p \times w/p \times p^2 \times 3}$ be a binary mask such that
$$\Omega[i, j, k, l] = \begin{cases} 1, & k \geq 1 \\ 0, & k = 0 \end{cases}.$$
Similar to image colorization, super-resolution requires an orthogonal matrix $Q \in \mathbb{R}^{p^2 \times p^2}$ whose first column is $(1/p, 1/p, \cdots, 1/p)$. This orthogonal matrix can be obtained with QR decomposition. To perform super-resolution, we define the linear transformation $A: x \in \mathbb{R}^{h \times w \times 3} \mapsto y \in \mathbb{R}^{h/p \times w/p \times p^2 \times 3}$, where
$$y[i, j, k, l] = \sum_{m=0}^{p^2 - 1} x[i \times p + (m - m \bmod p)/p,\; j \times p + m \bmod p,\; l]\, Q[m, k].$$
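The inverse transformation $A^{-1}: y \in \mathbb{R}^{h/p \times w/p \times p^2 \times 3} \mapsto x \in \mathbb{R}^{h \times w \times 3}$ is easy to derive, with
$$x[i \times p + (m - m \bmod p)/p,\; j \times p + m \bmod p,\; l] = \sum_{k=0}^{p^2 - 1} y[i, j, k, l]\, Q[m, k].$$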
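The patch-wise transform can be illustrated on one single-channel patch with $p = 2$. The sketch below uses a scaled $4 \times 4$ Hadamard matrix as $Q$, which is one possible orthogonal matrix with first column $(1/p, \cdots, 1/p)$; the text obtains $Q$ via QR decomposition instead, so this choice and the helper names are assumptions.

```python
P = 2  # patch side length (illustrative)

# One possible orthogonal Q for p = 2: a scaled 4 x 4 Hadamard matrix whose
# first column is (1/p, ..., 1/p). This is an assumed choice for the example.
Q = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def patch_to_coeffs(patch):
    """A restricted to one single-channel patch: y[k] = sum_m patch[m] * Q[m][k]."""
    n = P * P
    return [sum(patch[m] * Q[m][k] for m in range(n)) for k in range(n)]


def coeffs_to_patch(y):
    """Inverse on one patch: patch[m] = sum_k y[k] * Q[m][k]."""
    n = P * P
    return [sum(y[k] * Q[m][k] for k in range(n)) for m in range(n)]


patch = [0.1, 0.4, 0.7, 0.2]   # pixels flattened in row-major order, index m
y = patch_to_coeffs(patch)
mean = sum(patch) / len(patch)
restored = coeffs_to_patch(y)
```

The coefficient `y[0]` equals $p$ times the patch mean, so it carries exactly the information in the down-sampled image, while the coefficients with $k \geq 1$ are the ones marked by $\Omega$.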
Stroke-guided image generation We can also use Algorithm 4 for stroke-guided image generation as introduced in SDEdit (Meng et al., 2021). Specifically, we let $y \in \mathbb{R}^{h \times w \times 3}$ be a stroke painting. We set $A = I$, and define $\Omega \in \mathbb{R}^{h \times w \times 3}$ as a matrix of ones. In our experiments, we set $t_1 = 5.38$ and $t_2 = 2.24$, with $N = 2$.
Denoising It is possible to denoise images perturbed with various scales of Gaussian noise using a single consistency model. Suppose the input image $x$ is perturbed with $\mathcal{N}(0; \sigma^2 I)$. As long as $\sigma \in [\epsilon, T]$, we can evaluate $f_\theta(x, \sigma)$ to produce the denoised image.
Interpolation We can interpolate between two images generated by consistency models. Suppose the first sample $x_1$ is produced by noise vector $z_1$, and the second sample $x_2$ is produced by noise vector $z_2$. In other words, $x_1 = f_\theta(z_1, T)$ and $x_2 = f_\theta(z_2, T)$. To interpolate between $x_1$ and $x_2$, we first use spherical linear interpolation to get
Figure 8: Gray-scale images (left), colorized images by a consistency model (middle), and ground truth (right).
Figure 9: Downsampled images of resolution 32 × 32 (left), full resolution (256 × 256) images generated by a consistency model (middle), and ground truth images of resolution 256 × 256 (right).
Figure 10: Masked images (left), imputed images by a consistency model (middle), and ground truth (right).
Figure 11: Interpolating between leftmost and rightmost images with spherical linear interpolation. All samples are generated by a consistency model trained on LSUN Bedroom 256 × 256.
Figure 12: Single-step denoising with a consistency model. The leftmost images are ground truth. For every two rows, the top row shows noisy images with different noise levels, while the bottom row gives denoised images.
Figure 13: SDEdit with a consistency model. The leftmost images are stroke painting inputs. Images on the right side are the results of stroke-guided image generation (SDEdit).
Figure 14: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise.
Figure 15: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise.
Figure 16: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise.
Figure 17: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise.
Figure 18: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise.
Figure 19: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise.
Figure 20: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise.
Figure 21: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise.