
Consistency Models

Yang Song¹  Prafulla Dhariwal¹  Mark Chen¹  Ilya Sutskever¹

Abstract
Diffusion models have made significant breakthroughs in image, audio, and video generation, but they depend on an iterative generation process that causes slow sampling speed and caps their potential for real-time applications. To overcome this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either as a way to distill pre-trained diffusion models, or as standalone generative models. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step generation. For example, we achieve the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64×64 for one-step generation. When trained as standalone generative models, consistency models also outperform single-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64×64 and LSUN 256×256.

Figure 1: Given a Probability Flow (PF) ODE that smoothly converts data to noise, we learn to map any point (e.g., x_t, x_{t'}, and x_T) on the ODE trajectory to its origin (e.g., x_0) for generative modeling. Models of these mappings are called consistency models, as their outputs are trained to be consistent for points on the same trajectory.

¹OpenAI, San Francisco, CA 94110, USA. Correspondence to: Yang Song <songyang@openai.com>.

Preprint. Work in progress.

1. Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; 2020; Ho et al., 2020; Song et al., 2021), also known as score-based generative models, have achieved unprecedented success across multiple fields, including image generation (Dhariwal & Nichol, 2021; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022), audio synthesis (Kong et al., 2020; Chen et al., 2021; Popov et al., 2021), and video generation (Ho et al., 2022b;a). Unlike Generative Adversarial Networks (GANs, Goodfellow et al. (2014)), these models do not rely on adversarial training and are thus less prone to issues such as unstable training and mode collapse. Additionally, diffusion models do not impose the same strict constraints on model architectures as in autoregressive models (Bengio & Bengio, 1999; Uria et al., 2013; 2016; Van Den Oord et al., 2016), variational autoencoders (VAEs, Kingma & Welling (2014); Rezende et al. (2014)), or normalizing flows (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018).

Key to the success of diffusion models is their iterative sampling process which progressively removes noise from a random noise vector. This iterative refinement procedure repetitively evaluates the diffusion model, allowing for the trade-off of compute for sample quality: by using extra compute for more iterations, a small-sized model can unroll into a larger computational graph and generate higher quality samples. Iterative generation is also crucial for the zero-shot data editing capabilities of diffusion models, enabling them to solve challenging inverse problems ranging from image inpainting, colorization, and stroke-guided image editing, to Computed Tomography and Magnetic Resonance Imaging (Song & Ermon, 2019; Song et al., 2021; 2022; 2023; Kawar et al., 2021; 2022; Chung et al., 2023; Meng et al., 2021). However, compared to single-step generative models like GANs, VAEs, and normalizing flows, the iterative generation procedure of diffusion models typically requires 10–2000 times more compute (Song & Ermon, 2020; Ho et al., 2020; Song et al., 2021; Zhang & Chen, 2022; Lu et al., 2022), causing slow inference and limiting their real-time applications.
Our objective is to create generative models that facilitate efficient, single-step generation without sacrificing important advantages of iterative refinement. These advantages include the ability to trade-off compute for sample quality when necessary, as well as the capability to perform zero-shot data editing tasks. As illustrated in Fig. 1, we build on top of the probability flow (PF) ordinary differential equation (ODE) in continuous-time diffusion models (Song et al., 2021), whose trajectories smoothly transition the data distribution into a tractable noise distribution. We propose to learn a model that maps any point at any time step to the trajectory's starting point. A notable property of our model is self-consistency: points on the same trajectory map to the same initial point. We therefore refer to such models as consistency models. Consistency models allow us to generate data samples (initial points of ODE trajectories, e.g., x_0 in Fig. 1) by converting random noise vectors (endpoints of ODE trajectories, e.g., x_T in Fig. 1) with only one network evaluation. Importantly, by chaining the outputs of consistency models at multiple time steps, we can improve sample quality and perform zero-shot data editing at the cost of more compute, similar to what iterative refinement enables for diffusion models.

To train a consistency model, we offer two methods based on enforcing the self-consistency property. The first method relies on using numerical ODE solvers and a pre-trained diffusion model to generate pairs of adjacent points on a PF ODE trajectory. By minimizing the difference between model outputs for these pairs, we can effectively distill a diffusion model into a consistency model, which allows generating high-quality samples with one network evaluation. By contrast, our second method eliminates the need for a pre-trained diffusion model altogether, allowing us to train a consistency model in isolation. This approach situates consistency models as an independent family of generative models. Crucially, neither approach requires adversarial training, and both training methods permit flexible neural network architectures for consistency models.

We demonstrate the efficacy of consistency models on several challenging image benchmarks, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet 64×64 (Deng et al., 2009), and LSUN 256×256 (Yu et al., 2015). Empirically, we observe that as a distillation approach, consistency models outperform progressive distillation (Salimans & Ho, 2022) across a variety of datasets and numbers of sampling steps. On CIFAR-10, consistency models reach new state-of-the-art FIDs of 3.55 and 2.93 for one-step and two-step generation. On ImageNet 64×64, they achieve record-breaking FIDs of 6.20 and 4.70 with one and two network evaluations respectively. When trained as standalone generative models, consistency models achieve comparable performance to progressive distillation for single-step generation, despite having no access to pre-trained diffusion models. They are able to outperform many GANs, and all other non-adversarial, single-step generative models across multiple datasets. We also show that consistency models can be used to perform zero-shot data editing tasks, including image denoising, interpolation, inpainting, colorization, super-resolution, and stroke-guided image editing (SDEdit, Meng et al. (2021)).

2. Diffusion Models

Consistency models are heavily inspired by the theory of (continuous-time) diffusion models (Song et al., 2021). Diffusion models generate data by progressively perturbing data to noise via Gaussian perturbations, then creating samples from noise via sequential denoising steps. Let p_data(x) denote the data distribution. Diffusion models start by diffusing p_data(x) with a stochastic differential equation (SDE) (Song et al., 2021; Karras et al., 2022)

    dx_t = μ(x_t, t) dt + σ(t) dw_t,    (1)

where t ∈ [0, T], T > 0 is a fixed constant, μ(·,·) and σ(·) are the drift and diffusion coefficients respectively, and {w_t}_{t∈[0,T]} denotes the standard Brownian motion. We denote the distribution of x_t as p_t(x), and as a result p_0(x) ≡ p_data(x). A remarkable property of this SDE is the existence of an ordinary differential equation (ODE), dubbed the Probability Flow (PF) ODE by Song et al. (2021), whose solution trajectories sampled at t are distributed according to p_t(x):

    dx_t = [ μ(x_t, t) − (1/2) σ(t)² ∇log p_t(x_t) ] dt.    (2)

Here ∇log p_t(x) is the score function of p_t(x); hence diffusion models are also known as score-based generative models (Song & Ermon, 2019; 2020; Song et al., 2021).

Typically, the SDE in Eq. (1) is designed such that p_T(x) is close to a tractable Gaussian distribution π(x). We hereafter adopt the configurations in Karras et al. (2022), who set μ(x, t) = 0 and σ(t) = √(2t). In this case, we have p_t(x) = p_data(x) ⊗ N(0, t²I), where ⊗ denotes the convolution operation, and π(x) = N(0, T²I). For sampling, we first train a score model s_φ(x, t) ≈ ∇log p_t(x) via score matching (Hyvärinen & Dayan, 2005; Vincent, 2011; Song et al., 2019; Song & Ermon, 2019; Ho et al., 2020), then plug it into Eq. (2) to obtain an empirical estimate of the PF ODE, which takes the form of

    dx_t / dt = −t s_φ(x_t, t).    (3)

We call Eq. (3) the empirical PF ODE. Next, we sample x̂_T ∼ π = N(0, T²I) to initialize the empirical PF ODE and solve it backwards in time with any numerical ODE solver, such as Euler (Song et al., 2020; 2021) and Heun solvers (Karras et al., 2022), to obtain the solution trajectory {x̂_t}_{t∈[0,T]}. The resulting x̂_0 can then be viewed as an approximate sample from the data distribution p_data(x). To avoid numerical instability, one typically stops the solver at t = ε, where ε is a fixed small positive number, and instead accepts x̂_ε as the approximate sample. Following Karras et al. (2022), we rescale pixel values in images to [−1, 1], and set T = 80, ε = 0.002.
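To make this sampling procedure concrete, the following is a minimal sketch (not the authors' released code) of solving the empirical PF ODE in Eq. (3) backwards from T to ε with Heun's method. The callable score(x, t) stands for a trained score model s_φ(x, t), and the geometric time grid and step count N = 40 are illustrative assumptions rather than the paper's exact settings.

import math
import torch

@torch.no_grad()
def pf_ode_sample(score, shape, T=80.0, eps=0.002, N=40, device="cpu"):
    # Geometric grid from T down to eps (an illustrative choice, not the paper's exact grid).
    ts = [math.exp(math.log(T) + (math.log(eps) - math.log(T)) * i / N) for i in range(N + 1)]
    x = torch.randn(shape, device=device) * T        # x_T ~ N(0, T^2 I)
    for i in range(N):
        t, t_next = ts[i], ts[i + 1]
        d = -t * score(x, t)                          # right-hand side of Eq. (3)
        x_euler = x + (t_next - t) * d                # Euler predictor
        d_next = -t_next * score(x_euler, t_next)     # Heun corrector
        x = x + (t_next - t) * 0.5 * (d + d_next)
    return x                                          # approximate sample at t = eps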
Figure 2: Consistency models are trained to map points on any trajectory of the PF ODE to the trajectory's origin.

Diffusion models are bottlenecked by their slow sampling speed. Clearly, using ODE solvers for sampling requires many evaluations of the score model s_φ(x, t), which is computationally costly. Existing methods for fast sampling include faster numerical ODE solvers (Song et al., 2020; Zhang & Chen, 2022; Lu et al., 2022; Dockhorn et al., 2022), and distillation techniques (Luhman & Luhman, 2021; Salimans & Ho, 2022; Meng et al., 2022; Zheng et al., 2022). However, ODE solvers still need more than 10 evaluation steps to generate competitive samples. Most distillation methods like Luhman & Luhman (2021) and Zheng et al. (2022) rely on collecting a large dataset of samples from the diffusion model prior to distillation, which itself is computationally expensive. To our best knowledge, the only distillation approach that does not suffer from this drawback is progressive distillation (PD, Salimans & Ho (2022)), with which we compare consistency models extensively in our experiments.

3. Consistency Models

We propose consistency models, a new type of generative models that support single-step generation at the core of their design, while still allowing iterative generation for zero-shot data editing and trade-offs between sample quality and compute. Consistency models can be trained in either the distillation mode or the isolation mode. In the former case, consistency models distill the knowledge of pre-trained diffusion models into a single-step sampler, significantly improving other distillation approaches in sample quality, while allowing zero-shot image editing applications. In the latter case, consistency models are trained in isolation, with no dependence on pre-trained diffusion models. This makes them an independent new class of generative models.

Below we introduce the definition, parameterization, and sampling of consistency models, plus a brief discussion on their applications to zero-shot data editing.

Definition  Given a solution trajectory {x_t}_{t∈[ε,T]} of the PF ODE in Eq. (2), we define the consistency function as f : (x_t, t) ↦ x_ε. A consistency function has the property of self-consistency: its outputs are consistent for arbitrary pairs of (x_t, t) that belong to the same PF ODE trajectory, i.e., f(x_t, t) = f(x_{t'}, t') for all t, t' ∈ [ε, T]. With a fixed time argument, f(·, t) is always an invertible function. As illustrated in Fig. 2, the goal of a consistency model, symbolized as f_θ, is to estimate this consistency function f from data by learning to enforce the self-consistency property (details in Sections 4 and 5).

Parameterization  For any consistency function f(·,·), we have f(x_ε, ε) = x_ε, i.e., f(·, ε) is an identity function. We call this constraint the boundary condition. A valid consistency model has to respect this boundary condition. For consistency models based on deep neural networks, we discuss two ways to implement this boundary condition almost for free. Suppose we have a free-form deep neural network F_θ(x, t) whose output has the same dimensionality as x. The first way is to simply parameterize the consistency model as

    f_θ(x, t) = x if t = ε,  and  f_θ(x, t) = F_θ(x, t) if t ∈ (ε, T].    (4)

The second method is to parameterize the consistency model using skip connections, that is,

    f_θ(x, t) = c_skip(t) x + c_out(t) F_θ(x, t),    (5)

where c_skip(t) and c_out(t) are differentiable functions such that c_skip(ε) = 1, and c_out(ε) = 0. This way, the consistency model is differentiable at t = ε if F_θ(x, t) and the scaling coefficients are differentiable, which is critical for training continuous-time consistency models (Appendices B.1 and B.2). The parameterization in Eq. (5) bears strong resemblance to many successful diffusion models (Karras et al., 2022; Balaji et al., 2022), making it easier to borrow powerful diffusion model architectures for constructing consistency models. We therefore follow the second parameterization in all experiments.
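As a rough illustration of the parameterization in Eq. (5), the sketch below wraps a free-form network F_θ(x, t) with scaling functions satisfying c_skip(ε) = 1 and c_out(ε) = 0. The specific forms of c_skip and c_out and the value σ_data = 0.5 follow the style of Karras et al. (2022) and are assumptions here; the coefficients actually used for the experiments are specified in Appendix C. The time argument is passed as a Python float for simplicity; in practice it would be a per-sample tensor broadcast against x.

import torch
import torch.nn as nn

class ConsistencyModel(nn.Module):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t), Eq. (5)."""
    def __init__(self, net: nn.Module, eps: float = 0.002, sigma_data: float = 0.5):
        super().__init__()
        self.net = net                 # free-form network F_theta(x, t)
        self.eps = eps
        self.sigma_data = sigma_data   # assumed data standard deviation

    def c_skip(self, t: float) -> float:
        return self.sigma_data ** 2 / ((t - self.eps) ** 2 + self.sigma_data ** 2)

    def c_out(self, t: float) -> float:
        return self.sigma_data * (t - self.eps) / (self.sigma_data ** 2 + t ** 2) ** 0.5

    def forward(self, x: torch.Tensor, t: float) -> torch.Tensor:
        # At t = eps, c_skip = 1 and c_out = 0, so f_theta(x, eps) = x (boundary condition).
        return self.c_skip(t) * x + self.c_out(t) * self.net(x, t)

Because the boundary condition is enforced by the scaling functions rather than by the network, any architecture that works well for diffusion models can be reused unchanged.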
Sampling  With a well-trained consistency model f_θ(·,·), we can generate samples by sampling from the initial distribution x̂_T ∼ N(0, T²I) and then evaluating the consistency model for x̂_ε = f_θ(x̂_T, T). This involves only one forward pass through the consistency model and therefore generates samples in a single step. Importantly, one can also evaluate the consistency model multiple times by alternating denoising and noise injection steps for improved sample quality. Summarized in Algorithm 1, this multistep sampling procedure provides the flexibility to trade compute for sample quality. It also has important applications in zero-shot data editing. In practice, we find time points in Algorithm 1 with a greedy algorithm, where the time points are pinpointed one at a time using ternary search to optimize the FID of samples obtained from Algorithm 1.

Algorithm 1 Multistep Consistency Sampling
  Input: Consistency model f_θ(·,·), sequence of time points τ_1 > τ_2 > ··· > τ_{N−1}, initial noise x̂_T
  x ← f_θ(x̂_T, T)
  for n = 1 to N − 1 do
    Sample z ∼ N(0, I)
    x̂_{τ_n} ← x + sqrt(τ_n² − ε²) z
    x ← f_θ(x̂_{τ_n}, τ_n)
  end for
  Output: x
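Algorithm 1 translates almost line for line into code. The sketch below assumes a trained consistency model f(x, t); the intermediate time points taus are free parameters (the paper selects them by ternary search on FID), and the single value in the usage comment is only a placeholder.

import torch

@torch.no_grad()
def multistep_consistency_sample(f, taus, shape, T=80.0, eps=0.002, device="cpu"):
    x = f(torch.randn(shape, device=device) * T, T)     # one-step sample from x_T
    for tau in taus:                                     # tau_1 > tau_2 > ... > tau_{N-1}
        z = torch.randn_like(x)
        x_tau = x + (tau ** 2 - eps ** 2) ** 0.5 * z     # re-noise to time tau
        x = f(x_tau, tau)                                # map back to the trajectory origin
    return x

# Two-step generation with one (placeholder) intermediate time point:
# sample = multistep_consistency_sample(model, taus=[2.0], shape=(4, 3, 64, 64))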
Zero-Shot Data Editing  Consistency models enable various data editing and manipulation applications in zero shot; they do not require explicit training to perform these tasks. For example, consistency models define a one-to-one mapping from a Gaussian noise vector to a data sample. Similar to latent variable models like GANs, VAEs, and normalizing flows, consistency models can easily interpolate between samples by traversing the latent space (Fig. 11). As consistency models are trained to recover x_ε from any noisy input x_t where t ∈ [ε, T], they can perform denoising for various noise levels (Fig. 12). Moreover, the multistep generation procedure in Algorithm 1 is useful for solving certain inverse problems in zero shot by using an iterative replacement procedure similar to that of diffusion models (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2022b). This enables many applications in the context of image editing, including inpainting (Fig. 10), colorization (Fig. 8), super-resolution (Fig. 6b) and stroke-guided image editing (Fig. 13) as in SDEdit (Meng et al., 2021). In Section 6.3, we empirically demonstrate the power of consistency models on many zero-shot image editing tasks.

4. Training Consistency Models via Distillation

We present our first method for training consistency models based on distilling a pre-trained score model s_φ(x, t). Our discussion revolves around the empirical PF ODE in Eq. (3), obtained by plugging the score model s_φ(x, t) into the PF ODE. Consider discretizing the time horizon [ε, T] into N − 1 sub-intervals, with boundaries t_1 = ε < t_2 < ··· < t_N = T. In practice, we follow Karras et al. (2022) to determine the boundaries with the formula t_i = (ε^{1/ρ} + (i−1)/(N−1) (T^{1/ρ} − ε^{1/ρ}))^ρ, where ρ = 7. When N is sufficiently large, we can obtain an accurate estimate of x_{t_n} from x_{t_{n+1}} by running one discretization step of a numerical ODE solver. This estimate, which we denote as x̂^φ_{t_n}, is defined by

    x̂^φ_{t_n} := x_{t_{n+1}} + (t_n − t_{n+1}) Φ(x_{t_{n+1}}, t_{n+1}; φ),    (6)

where Φ(···; φ) represents the update function of a one-step ODE solver applied to the empirical PF ODE. For example, when using the Euler solver, we have Φ(x, t; φ) = −t s_φ(x, t), which corresponds to the following update rule

    x̂^φ_{t_n} = x_{t_{n+1}} − (t_n − t_{n+1}) t_{n+1} s_φ(x_{t_{n+1}}, t_{n+1}).

For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.

Due to the connection between the PF ODE in Eq. (2) and the SDE in Eq. (1) (see Section 2), one can sample along the distribution of ODE trajectories by first sampling x ∼ p_data, then adding Gaussian noise to x. Specifically, given a data point x, we can generate a pair of adjacent data points (x̂^φ_{t_n}, x_{t_{n+1}}) on the PF ODE trajectory efficiently by sampling x from the dataset, followed by sampling x_{t_{n+1}} from the transition density of the SDE N(x, t_{n+1}² I), and then computing x̂^φ_{t_n} using one discretization step of the numerical ODE solver according to Eq. (6). Afterwards, we train the consistency model by minimizing its output differences on the pair (x̂^φ_{t_n}, x_{t_{n+1}}). This motivates our following consistency distillation loss for training consistency models.

Definition 1. The consistency distillation loss is defined as

    L^N_CD(θ, θ⁻; φ) := E[ λ(t_n) d(f_θ(x_{t_{n+1}}, t_{n+1}), f_{θ⁻}(x̂^φ_{t_n}, t_n)) ],    (7)

where the expectation is taken with respect to x ∼ p_data, n ∼ U⟦1, N−1⟧, and x_{t_{n+1}} ∼ N(x; t_{n+1}² I). Here U⟦1, N−1⟧ denotes the uniform distribution over {1, 2, ···, N − 1}, λ(·) ∈ R⁺ is a positive weighting function, x̂^φ_{t_n} is given by Eq. (6), θ⁻ denotes a running average of the past values of θ during the course of optimization, and d(·,·) is a metric function that satisfies ∀x, y : d(x, y) ≥ 0 and d(x, y) = 0 if and only if x = y.

Unless otherwise stated, we adopt the notations in Definition 1 throughout this paper, and use E[·] to denote the expectation over all relevant random variables. In our experiments, we consider the squared ℓ2 distance d(x, y) = ‖x − y‖_2², ℓ1 distance d(x, y) = ‖x − y‖_1, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)). We find λ(t_n) ≡ 1 performs well across all tasks and datasets.
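The two ingredients of this loss — the Karras et al. (2022) boundaries t_i and the one-step ODE estimate of Eq. (6) — can be sketched as follows. Here score(x, t) again denotes the pre-trained score model, and treating t as a Python float is a simplification.

import torch

def karras_boundaries(N, eps=0.002, T=80.0, rho=7.0):
    # t_i = (eps^(1/rho) + (i-1)/(N-1) * (T^(1/rho) - eps^(1/rho)))^rho,  i = 1, ..., N
    return [(eps ** (1 / rho) + (i - 1) / (N - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho
            for i in range(1, N + 1)]

def adjacent_pair(x, n, ts, score):
    # Builds (x_hat_{t_n}, x_{t_{n+1}}) with one Euler step of the empirical PF ODE, Eq. (6).
    t_n, t_np1 = ts[n - 1], ts[n]                        # boundaries are 1-indexed in the paper
    x_tnp1 = x + t_np1 * torch.randn_like(x)             # x_{t_{n+1}} ~ N(x; t_{n+1}^2 I)
    x_hat_tn = x_tnp1 - (t_n - t_np1) * t_np1 * score(x_tnp1, t_np1)
    return x_hat_tn, x_tnp1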
In practice, we minimize the objective by stochastic gradient descent on the model parameters θ, while updating θ⁻ with exponential moving average (EMA). That is, given a decay rate 0 ≤ μ < 1, we perform the following update after each optimization step:

    θ⁻ ← stopgrad(μθ⁻ + (1 − μ)θ).    (8)

The overall training procedure is summarized in Algorithm 2. In alignment with the convention in deep reinforcement learning (Mnih et al., 2013; 2015; Lillicrap et al., 2015) and momentum based contrastive learning (Grill et al., 2020; He et al., 2020), we refer to f_{θ⁻} as the "target network", and f_θ as the "online network". We find that compared to simply setting θ⁻ = θ, the EMA update and "stopgrad" operator in Eq. (8) can greatly stabilize the training process and improve the final performance of the consistency model.

Algorithm 2 Consistency Distillation (CD)
  Input: dataset D, initial model parameter θ, learning rate η, ODE solver Φ(·,·; φ), d(·,·), λ(·), and μ
  θ⁻ ← θ
  repeat
    Sample x ∼ D and n ∼ U⟦1, N − 1⟧
    Sample x_{t_{n+1}} ∼ N(x; t_{n+1}² I)
    x̂^φ_{t_n} ← x_{t_{n+1}} + (t_n − t_{n+1}) Φ(x_{t_{n+1}}, t_{n+1}; φ)
    L(θ, θ⁻; φ) ← λ(t_n) d(f_θ(x_{t_{n+1}}, t_{n+1}), f_{θ⁻}(x̂^φ_{t_n}, t_n))
    θ ← θ − η ∇_θ L(θ, θ⁻; φ)
    θ⁻ ← stopgrad(μθ⁻ + (1 − μ)θ)
  until convergence
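As a rough, PyTorch-style sketch of one optimization step of Algorithm 2 (not the authors' implementation): the squared ℓ2 metric and a constant λ stand in for the metric and weighting functions, the Euler solver is used for Φ, and ts is the list of boundaries t_1, ..., t_N. The target network would be initialized as a copy of the online network with gradients disabled (e.g., via copy.deepcopy).

import torch

def consistency_distillation_step(model, target, opt, x, ts, score, mu=0.95, lam=1.0):
    N = len(ts)
    n = torch.randint(1, N, (1,)).item()                  # n ~ U[1, N-1]
    t_n, t_np1 = ts[n - 1], ts[n]
    x_tnp1 = x + t_np1 * torch.randn_like(x)              # x_{t_{n+1}} ~ N(x; t_{n+1}^2 I)
    with torch.no_grad():                                  # the target branch receives no gradient
        x_hat_tn = x_tnp1 - (t_n - t_np1) * t_np1 * score(x_tnp1, t_np1)   # Eq. (6), Euler solver
        target_out = target(x_hat_tn, t_n)
    loss = lam * ((model(x_tnp1, t_np1) - target_out) ** 2).mean()         # squared L2 metric here
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                  # EMA update of Eq. (8)
        for p_tgt, p in zip(target.parameters(), model.parameters()):
            p_tgt.mul_(mu).add_(p, alpha=1 - mu)
    return loss.item()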
Below we provide a theoretical justification for consistency distillation based on asymptotic analysis.

Theorem 1. Let Δt := max_{n∈⟦1,N−1⟧} {|t_{n+1} − t_n|}, and f(·,·; φ) be the consistency function of the empirical PF ODE in Eq. (3). Assume f_θ satisfies the Lipschitz condition: there exists L > 0 such that for all t ∈ [ε, T], x, and y, we have ‖f_θ(x, t) − f_θ(y, t)‖_2 ≤ L ‖x − y‖_2. Assume further that for all n ∈ ⟦1, N−1⟧, the ODE solver called at t_{n+1} has local error uniformly bounded by O((t_{n+1} − t_n)^{p+1}) with p ≥ 1. Then, if L^N_CD(θ, θ; φ) = 0, we have

    sup_{n,x} ‖f_θ(x, t_n) − f(x, t_n; φ)‖_2 = O((Δt)^p).

Proof. The proof is based on induction and parallels the classic proof of global error bounds for numerical ODE solvers (Süli & Mayers, 2003). We provide the full proof in Appendix A.2.

Since θ⁻ is a running average of the history of θ, we have θ⁻ = θ when the optimization of Algorithm 2 converges. That is, the target and online consistency models will eventually match each other. If the consistency model additionally achieves zero consistency distillation loss, then Theorem 1 implies that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate, as long as the step size of the ODE solver is sufficiently small.

The consistency distillation loss L^N_CD(θ, θ⁻; φ) can be extended to hold for infinitely many time steps (N → ∞) if θ⁻ = θ or θ⁻ = stopgrad(θ). The resulting continuous-time loss functions do not require specifying N nor the time steps {t_1, t_2, ···, t_N}. Nonetheless, they involve Jacobian-vector products and require forward-mode automatic differentiation for efficient implementation, which may not be well-supported in some deep learning frameworks. We provide these continuous-time distillation loss functions in Theorems 3 to 5, and relegate details to Appendix B.1.

5. Training Consistency Models in Isolation

Consistency models can be trained without relying on any pre-trained diffusion models. This differs from diffusion distillation techniques, making consistency models a new independent family of generative models.

In consistency distillation, we use a pre-trained score model s_φ(x, t) to approximate the ground truth score function ∇log p_t(x). To get rid of this dependency, we need to seek other ways to estimate the score function. In fact, there exists an unbiased estimator of ∇log p_t(x_t) due to the following identity (Lemma 1 in Appendix A):

    ∇log p_t(x_t) = −E[ (x_t − x) / t² | x_t ],

where x ∼ p_data and x_t ∼ N(x; t²I). That is, given x and x_t, we can form a Monte Carlo estimate of ∇log p_t(x_t) with −(x_t − x)/t². We now show that this estimate actually suffices to replace the pre-trained diffusion model in consistency distillation, when using the Euler method (or any higher order method) as the ODE solver in the limit of N → ∞.

More precisely, we have the following theorem.
Theorem 2. Let Δt := max_{n∈⟦1,N−1⟧} {|t_{n+1} − t_n|}. Assume d and f_{θ⁻} are both twice continuously differentiable with bounded second derivatives, the weighting function λ(·) is bounded, and E[‖∇log p_{t_n}(x_{t_n})‖_2²] < ∞. Assume further that we use the Euler ODE solver, and the pre-trained score model matches the ground truth, i.e., ∀t ∈ [ε, T] : s_φ(x, t) ≡ ∇log p_t(x). Then,

    L^N_CD(θ, θ⁻; φ) = L^N_CT(θ, θ⁻) + o(Δt),    (9)

where the expectation is taken with respect to x ∼ p_data, n ∼ U⟦1, N−1⟧, and x_{t_{n+1}} ∼ N(x; t_{n+1}² I). The consistency training objective, denoted by L^N_CT(θ, θ⁻), is defined as

    E[ λ(t_n) d(f_θ(x + t_{n+1} z, t_{n+1}), f_{θ⁻}(x + t_n z, t_n)) ],    (10)

where z ∼ N(0, I). Moreover, L^N_CT(θ, θ⁻) ≥ O(Δt) if inf_N L^N_CD(θ, θ⁻; φ) > 0.

Proof. The proof is based on Taylor series expansion and properties of score functions (Lemma 1). A complete proof is provided in Appendix A.3.

We refer to Eq. (10) as the consistency training (CT) loss. Crucially, L(θ, θ⁻) only depends on the online network f_θ and the target network f_{θ⁻}, while being completely agnostic to diffusion model parameters φ. The loss function L(θ, θ⁻) ≥ O(Δt) decreases at a slower rate than the remainder o(Δt) and thus will dominate the loss in Eq. (9) as N → ∞ and Δt → 0.

For improved practical performance, we propose to progressively increase N during training according to a schedule function N(·). The intuition (cf. Fig. 3d) is that the consistency training loss has less "variance" but more "bias" with respect to the underlying consistency distillation loss (i.e., the left-hand side of Eq. (9)) when N is small (i.e., Δt is large), which facilitates faster convergence at the beginning of training. On the contrary, it has more "variance" but less "bias" when N is large (i.e., Δt is small), which is desirable when closer to the end of training. For best performance, we also find that μ should change along with N, according to a schedule function μ(·). The full algorithm of consistency training is provided in Algorithm 3, and the schedule functions used in our experiments are given in Appendix C.

Algorithm 3 Consistency Training (CT)
  Input: dataset D, initial model parameter θ, learning rate η, step schedule N(·), EMA decay rate schedule μ(·), d(·,·), and λ(·)
  θ⁻ ← θ and k ← 0
  repeat
    Sample x ∼ D, and n ∼ U⟦1, N(k) − 1⟧
    Sample z ∼ N(0, I)
    L(θ, θ⁻) ← λ(t_n) d(f_θ(x + t_{n+1} z, t_{n+1}), f_{θ⁻}(x + t_n z, t_n))
    θ ← θ − η ∇_θ L(θ, θ⁻)
    θ⁻ ← stopgrad(μ(k)θ⁻ + (1 − μ(k))θ)
    k ← k + 1
  until convergence
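A matching sketch of one step of Algorithm 3, again with the squared ℓ2 metric and a constant λ as stand-ins. Here N_sched(k) and mu_sched(k) play the role of the schedule functions N(·) and μ(·); their concrete forms (given in Appendix C) are not reproduced here, so treat them as assumptions.

import torch

def consistency_training_step(model, target, opt, x, k, N_sched, mu_sched,
                              eps=0.002, T=80.0, rho=7.0, lam=1.0):
    N = N_sched(k)
    ts = [(eps ** (1 / rho) + (i - 1) / (N - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho
          for i in range(1, N + 1)]                        # boundaries t_1, ..., t_N
    n = torch.randint(1, N, (1,)).item()
    t_n, t_np1 = ts[n - 1], ts[n]
    z = torch.randn_like(x)                                # the same noise is shared by both points
    with torch.no_grad():
        target_out = target(x + t_n * z, t_n)
    loss = lam * ((model(x + t_np1 * z, t_np1) - target_out) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    mu = mu_sched(k)
    with torch.no_grad():                                  # EMA update with decay mu(k)
        for p_tgt, p in zip(target.parameters(), model.parameters()):
            p_tgt.mul_(mu).add_(p, alpha=1 - mu)
    return loss.item()

Note that no score model appears anywhere in this step; the pair (x + t_n z, x + t_{n+1} z) plays the role that (x̂^φ_{t_n}, x_{t_{n+1}}) plays in consistency distillation.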
Similar to consistency distillation, the consistency training loss L^N_CT(θ, θ⁻) can be extended to hold in continuous time (i.e., N → ∞) if θ⁻ = stopgrad(θ), as shown in Theorem 6. This continuous-time loss function does not require schedule functions for N or μ, but requires forward-mode automatic differentiation for efficient implementation. Unlike the discrete-time CT loss, there is no undesirable "bias" associated with the continuous-time objective, as we effectively take Δt → 0 in Theorem 2. We relegate more details to Appendix B.2.

6. Experiments

We employ consistency distillation and consistency training to learn consistency models on real image datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet 64×64 (Deng et al., 2009), LSUN Bedroom 256×256, and LSUN Cat 256×256 (Yu et al., 2015). Results are compared according to Fréchet Inception Distance (FID, Heusel et al. (2017), lower is better), Inception Score (IS, Salimans et al. (2016), higher is better), Precision (Prec., Kynkäänniemi et al. (2019), higher is better), and Recall (Rec., Kynkäänniemi et al. (2019), higher is better). Additional experimental details are provided in Appendix C.

6.1. Training Consistency Models

We perform a series of experiments on CIFAR-10 to understand the effect of various hyperparameters on the performance of consistency models trained by consistency distillation (CD) and consistency training (CT). We first focus on the effect of the metric function d(·,·), the ODE solver, and the number of discretization steps N in CD, then investigate the effect of the schedule functions N(·) and μ(·) in CT.

To set up our experiments for CD, we consider the squared ℓ2 distance d(x, y) = ‖x − y‖_2², ℓ1 distance d(x, y) = ‖x − y‖_1, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)) as the metric function. For the ODE solver, we compare Euler's forward method and Heun's second order method as detailed in Karras et al. (2022). For the number of discretization steps N, we compare N ∈ {9, 12, 18, 36, 50, 60, 80, 120}. All consistency models trained by CD in our experiments are initialized with the corresponding pre-trained diffusion models, whereas models trained by CT are randomly initialized.
Figure 3: Various factors that affect consistency distillation (CD) and consistency training (CT) on CIFAR-10. The best configuration for CD is LPIPS, Heun ODE solver, and N = 18. Our adaptive schedule functions for N and μ make CT converge significantly faster than fixing them to be constants during the course of optimization. (a) Metric functions in CD. (b) Solvers and N in CD. (c) N with Heun solver in CD. (d) Adaptive N and μ in CT.

Figure 4: Multistep image generation with consistency distillation (CD). CD outperforms progressive distillation (PD) across all datasets and sampling steps. The only exception is single-step generation on Bedroom 256×256. (a) CIFAR-10. (b) ImageNet 64×64. (c) Bedroom 256×256. (d) Cat 256×256.

As visualized in Fig. 3a, the optimal metric for CD is LPIPS, which outperforms both ℓ1 and ℓ2 by a large margin over all training iterations. This is expected as the outputs of consistency models are images on CIFAR-10, and LPIPS is specifically designed for measuring the similarity between natural images. Next, we investigate which ODE solver and which discretization step N work the best for CD. As shown in Figs. 3b and 3c, the Heun ODE solver and N = 18 are the best choices. Both are in line with the recommendation of Karras et al. (2022) despite the fact that we are training consistency models, not diffusion models. Moreover, Fig. 3b shows that with the same N, Heun's second order solver uniformly outperforms Euler's first order solver. This corroborates Theorem 1, which states that the optimal consistency models trained by higher order ODE solvers have smaller estimation errors with the same N. The results of Fig. 3c also indicate that once N is sufficiently large, the performance of CD becomes insensitive to N. Given these insights, we hereafter use LPIPS and the Heun ODE solver for CD unless otherwise stated. For N in CD, we follow the suggestions in Karras et al. (2022) on CIFAR-10 and ImageNet 64×64. We tune N separately on other datasets (details in Appendix C).

Due to the strong connection between CD and CT, we adopt LPIPS for our CT experiments throughout this paper. Unlike CD, there is no need for using Heun's second order solver in CT as the loss function does not rely on any particular numerical ODE solver. As demonstrated in Fig. 3d, the convergence of CT is highly sensitive to N: smaller N leads to faster convergence but worse samples, whereas larger N leads to slower convergence but better samples upon convergence. This matches our analysis in Section 5, and motivates our practical choice of progressively growing N and μ for CT to balance the trade-off between convergence speed and sample quality. As shown in Fig. 3d, adaptive schedules of N and μ significantly improve the convergence speed and sample quality of CT. In our experiments, we tune the schedules N(·) and μ(·) separately for images of different resolutions, with more details in Appendix C.

6.2. Few-Step Image Generation

Distillation  In current literature, the most directly comparable approach to our consistency distillation (CD) is progressive distillation (PD, Salimans & Ho (2022)); both are thus far the only distillation approaches that do not construct synthetic data before distillation. In stark contrast, other distillation techniques, such as knowledge distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022), have to prepare a large synthetic dataset by generating numerous samples from the diffusion model with expensive numerical ODE solvers. We perform comprehensive comparison between PD and CD on CIFAR-10, ImageNet 64×64, and LSUN 256×256, with all results reported in Fig. 4. All methods distill from an EDM (Karras et al., 2022) model that we pre-trained in-house. We note that across all sampling iterations, using the LPIPS metric uniformly improves PD compared to the squared ℓ2 distance in the original paper of Salimans & Ho (2022). Both PD and CD improve as we take more sampling steps. We find that CD uniformly outperforms PD across all datasets, sampling steps, and metric functions considered, except for single-step generation on Bedroom 256×256, where CD with ℓ2 slightly underperforms PD with ℓ2. As shown in Table 1, CD even outperforms distillation approaches that require synthetic dataset construction, such as Knowledge Distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022).
Table 1: Sample quality on CIFAR-10. *Methods that require synthetic data construction for distillation.

METHOD                                            NFE (↓)  FID (↓)  IS (↑)
Diffusion + Samplers
DDIM (Song et al., 2020)                          50       4.67
DDIM (Song et al., 2020)                          20       6.84
DDIM (Song et al., 2020)                          10       8.23
DPM-solver-2 (Lu et al., 2022)                    12       5.28
DPM-solver-3 (Lu et al., 2022)                    12       6.03
3-DEIS (Zhang & Chen, 2022)                       10       4.17
Diffusion + Distillation
Knowledge Distillation* (Luhman & Luhman, 2021)   1        9.36
DFNO* (Zheng et al., 2022)                        1        4.12
1-Rectified Flow (+distill)* (Liu et al., 2022)   1        6.18     9.08
2-Rectified Flow (+distill)* (Liu et al., 2022)   1        4.85     9.01
3-Rectified Flow (+distill)* (Liu et al., 2022)   1        5.21     8.79
PD (Salimans & Ho, 2022)                          1        8.34     8.69
CD                                                1        3.55     9.48
PD (Salimans & Ho, 2022)                          2        5.58     9.05
CD                                                2        2.93     9.75
Direct Generation
BigGAN (Brock et al., 2019)                       1        14.7     9.22
CR-GAN (Zhang et al., 2019)                       1        14.6     8.40
AutoGAN (Gong et al., 2019)                       1        12.4     8.55
E2GAN (Tian et al., 2020)                         1        11.3     8.51
ViTGAN (Lee et al., 2021)                         1        6.66     9.30
TransGAN (Jiang et al., 2021)                     1        9.26     9.05
StyleGAN2-ADA (Karras et al., 2020)               1        2.92     9.83
StyleGAN-XL (Sauer et al., 2022)                  1        1.85
Score SDE (Song et al., 2021)                     2000     2.20     9.89
DDPM (Ho et al., 2020)                            1000     3.17     9.46
LSGM (Vahdat et al., 2021)                        147      2.10
PFGM (Xu et al., 2022)                            110      2.35     9.68
EDM (Karras et al., 2022)                         36       2.04     9.84
1-Rectified Flow (Liu et al., 2022)               1        378      1.13
Glow (Kingma & Dhariwal, 2018)                    1        48.9     3.92
Residual Flow (Chen et al., 2019a)                1        46.4
GLFlow (Xiao et al., 2019)                        1        44.6
DenseFlow (Grcić et al., 2021)                    1        34.9
DC-VAE (Parmar et al., 2021)                      1        17.9     8.20
CT                                                1        8.70     8.49
CT                                                2        5.83     8.85

Table 2: Sample quality on ImageNet 64×64, and LSUN Bedroom & Cat 256×256. †Distillation techniques.

METHOD                               NFE (↓)  FID (↓)  Prec. (↑)  Rec. (↑)
ImageNet 64×64
PD† (Salimans & Ho, 2022)            1        15.39    0.59       0.62
DFNO†* (Zheng et al., 2022)          1        8.35
CD†                                  1        6.20     0.68       0.63
PD† (Salimans & Ho, 2022)            2        8.95     0.63       0.65
CD†                                  2        4.70     0.69       0.64
ADM (Dhariwal & Nichol, 2021)        250      2.07     0.74       0.63
EDM (Karras et al., 2022)            79       2.44     0.71       0.67
BigGAN-deep (Brock et al., 2019)     1        4.06     0.79       0.48
CT                                   1        13.0     0.71       0.47
CT                                   2        11.1     0.69       0.56
LSUN Bedroom 256×256
PD† (Salimans & Ho, 2022)            1        16.92    0.47       0.27
PD† (Salimans & Ho, 2022)            2        8.47     0.56       0.39
CD†                                  1        7.80     0.66       0.34
CD†                                  2        5.22     0.68       0.39
DDPM (Ho et al., 2020)               1000     4.89     0.60       0.45
ADM (Dhariwal & Nichol, 2021)        1000     1.90     0.66       0.51
EDM (Karras et al., 2022)            79       3.57     0.66       0.45
SS-GAN (Chen et al., 2019b)          1        13.3
PGGAN (Karras et al., 2018)          1        8.34
PG-SWGAN (Wu et al., 2019)           1        8.0
StyleGAN2 (Karras et al., 2020)      1        2.35     0.59       0.48
CT                                   1        16.0     0.60       0.17
CT                                   2        7.85     0.68       0.33
LSUN Cat 256×256
PD† (Salimans & Ho, 2022)            1        29.6     0.51       0.25
PD† (Salimans & Ho, 2022)            2        15.5     0.59       0.36
CD†                                  1        11.0     0.65       0.36
CD†                                  2        8.84     0.66       0.40
DDPM (Ho et al., 2020)               1000     17.1     0.53       0.48
ADM (Dhariwal & Nichol, 2021)        1000     5.57     0.63       0.52
EDM (Karras et al., 2022)            79       6.69     0.70       0.43
PGGAN (Karras et al., 2018)          1        37.5
StyleGAN2 (Karras et al., 2020)      1        7.25     0.58       0.43
CT                                   1        20.7     0.56       0.23
CT                                   2        11.7     0.63       0.36

Figure 5: Samples generated by EDM (top), CT + single-step generation (middle), and CT + 2-step generation (bottom). All corresponding images are generated from the same initial noise.
Direct Generation  In Tables 1 and 2, we compare the sample quality of consistency training (CT) with other generative models using one-step and two-step generation. We also include PD and CD results for reference. Both tables report PD results obtained from the ℓ2 metric function, as this is the default setting used in the original paper of Salimans & Ho (2022). For fair comparison, we train both PD and CD to distill the same EDM models. In Tables 1 and 2, we observe that CT outperforms all single-step, non-adversarial generative models, i.e., VAEs and normalizing flows, by a significant margin on CIFAR-10. Moreover, CT obtains comparable quality to PD for single-step generation without relying on distillation. In Fig. 5, we provide EDM samples (top), single-step CT samples (middle), and two-step CT samples (bottom). In Appendix E, we show additional samples for both CD and CT in Figs. 14 to 21. Importantly, all samples obtained from the same initial noise vector share significant structural similarity, even though CT and EDM models are trained independently from one another. This indicates that CT is unlikely to suffer from mode collapse, as EDMs do not.

Figure 6: Zero-shot image editing with a consistency model trained by consistency distillation on LSUN Bedroom 256×256. (a) Left: The gray-scale image. Middle: Colorized images. Right: The ground-truth image. (b) Left: The downsampled image (32×32). Middle: Full resolution images (256×256). Right: The ground-truth image (256×256). (c) Left: A stroke input provided by users. Right: Stroke-guided image generation.

6.3. Zero-Shot Image Editing

Similar to diffusion models, consistency models allow zero-shot image editing by modifying the multistep sampling process in Algorithm 1. We demonstrate this capability with a consistency model trained on the LSUN bedroom dataset using consistency distillation. In Fig. 6a, we show such a consistency model can colorize gray-scale bedroom images at test time, even though it has never been trained on colorization tasks. In Fig. 6b, we show the same consistency model can generate high-resolution images from low-resolution inputs. In Fig. 6c, we additionally demonstrate that it can generate images based on stroke inputs created by humans, as in SDEdit for diffusion models (Meng et al., 2021). Again, this editing capability is zero-shot, as the model has not been trained on stroke inputs. In Appendix D, we additionally demonstrate the zero-shot capability of consistency models on inpainting (Fig. 10), interpolation (Fig. 11) and denoising (Fig. 12), with more examples on colorization (Fig. 8), super-resolution (Fig. 9) and stroke-guided image generation (Fig. 13).
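The concrete editing procedures are given in Appendix D; purely as an illustration of the iterative-replacement idea mentioned in Section 3, the sketch below inpaints a masked image by running the multistep sampler of Algorithm 1 and overwriting the known pixels after every denoising step. The function name, mask convention (1 = observed pixel), and time points are assumptions, not the paper's exact algorithm.

import torch

@torch.no_grad()
def zero_shot_inpaint(f, y, mask, taus, T=80.0, eps=0.002):
    # y: reference image with known pixels; mask: 1 where pixels are observed, 0 where missing.
    x = f(torch.randn_like(y) * T, T)
    x = mask * y + (1 - mask) * x                  # impose the observed region
    for tau in taus:                               # decreasing time points, chosen by the user
        z = torch.randn_like(x)
        x_tau = x + (tau ** 2 - eps ** 2) ** 0.5 * z
        x = f(x_tau, tau)
        x = mask * y + (1 - mask) * x              # re-impose the observed region
    return x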
7. Conclusion

We have introduced consistency models, a type of generative models that are specifically designed to support one-step and few-step generation. We have empirically demonstrated that our consistency distillation method outshines the existing distillation techniques for diffusion models on multiple image benchmarks and various sampling iterations. Furthermore, as a standalone generative model, consistency models outdo other available models that permit single-step generation, barring GANs. Similar to diffusion models, they also allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation.

In addition, consistency models share striking similarities with techniques employed in other fields, including deep Q-learning (Mnih et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020). This offers exciting prospects for cross-pollination of ideas and methods among these diverse fields.
Acknowledgements

We thank Alex Nichol for reviewing the manuscript and providing valuable feedback, Chenlin Meng for providing stroke inputs needed in our stroke-guided image generation experiments, and all members of the OpenAI Algorithms team for helpful suggestions during the project.

References

Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., Karras, T., and Liu, M.-Y. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.

Bengio, Y. and Bengio, S. Modeling high-dimensional discrete data with multi-layer neural networks. Advances in Neural Information Processing Systems, 12, 1999.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations (ICLR), 2021.

Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pp. 9916–9926, 2019a.

Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12154–12163, 2019b.

Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 2021.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. International Conference in Learning Representations Workshop Track, 2015.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.

Dockhorn, T., Vahdat, A., and Kreis, K. Genie: Higher-order denoising diffusion solvers. arXiv preprint arXiv:2210.05475, 2022.

Gong, X., Chang, S., Jiang, Y., and Wang, Z. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3224–3234, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Grcić, M., Grubišić, I., and Šegvić, S. Densely connected normalizing flows. Advances in Neural Information Processing Systems, 34:23968–23982, 2021.

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent - a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 2020.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho, J., Salimans, T., Gritsenko, A. A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022b. URL https://openreview.net/forum?id=BBelR2NdDZ5.

Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research (JMLR), 6(4), 2005.

Jiang, Y., Chang, S., and Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34:14745–14758, 2021.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. 2020.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

Kawar, B., Vaksman, G., and Elad, M. Snips: Solving noisy inverse problems stochastically. arXiv preprint arXiv:2105.14951, 2021.

Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10215–10224. 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. Vitgan: Training gans with vision transformers. arXiv preprint arXiv:2107.04589, 2021.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.

Luhman, E. and Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

Meng, C., Gao, R., Kingma, D. P., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Parmar, G., Li, D., Lee, K., and Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 823–832, 2021.
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad-TTS: A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286, 2014.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Sauer, A., Schwarz, K., and Geiger, A. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33, 2020.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, pp. 204, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.

Song, Y., Shen, L., Xing, L., and Ermon, S. Solving inverse problems in medical imaging with score-based generative models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vaRCHVj0uGI.

Süli, E. and Mayers, D. F. An introduction to numerical analysis. Cambridge University Press, 2003.

Tian, Y., Wang, Q., Huang, Z., Li, W., Dai, D., Yang, M., Wang, J., and Fink, O. Off-policy reinforcement learning for efficient and effective gan architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pp. 175–192. Springer, 2020.

Uria, B., Murray, I., and Larochelle, H. Rnade: the real-valued neural autoregressive density-estimator. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, pp. 2175–2183, 2013.

Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.

Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.

Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pp. 1747–1756. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045575.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel,
D. P., and Gool, L. V. Sliced wasserstein generative
models. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 3713–
3722, 2019.
Xiao, Z., Yan, Q., and Amit, Y. Generative latent flow. arXiv
preprint arXiv:1905.10485, 2019.

Xu, Y., Liu, Z., Tegmark, M., and Jaakkola, T. S. Poisson flow generative models. In Oh, A. H., Agarwal, A.,
Belgrave, D., and Cho, K. (eds.), Advances in Neural
Information Processing Systems, 2022. URL https:
//openreview.net/forum?id=voV_TRqcWh.

Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and
Xiao, J. Lsun: Construction of a large-scale image dataset
using deep learning with humans in the loop. arXiv
preprint arXiv:1506.03365, 2015.
Zhang, H., Zhang, Z., Odena, A., and Lee, H. Consistency
regularization for generative adversarial networks. arXiv
preprint arXiv:1910.12027, 2019.
Zhang, Q. and Chen, Y. Fast sampling of diffusion
models with exponential integrator. arXiv preprint
arXiv:2204.13902, 2022.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. The unreasonable effectiveness of deep features as a
perceptual metric. In CVPR, 2018.
Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., and
Anandkumar, A. Fast sampling of diffusion models
via operator learning. arXiv preprint arXiv:2211.13449,
2022.
Consistency Models

Contents

1 Introduction 1

2 Diffusion Models 2

3 Consistency Models 3

4 Training Consistency Models via Distillation 4

5 Training Consistency Models in Isolation 5

6 Experiments 6
6.1 Training Consistency Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.2 Few-Step Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
6.3 Zero-Shot Image Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

7 Conclusion 9

Appendices 15

Appendix A Proofs 15
A.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
A.2 Consistency Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
A.3 Consistency Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Appendix B Continuous-Time Extensions 18


B.1 Consistency Distillation in Continuous Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
B.2 Consistency Training in Continuous Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
B.3 Experimental Verifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Appendix C Additional Experimental Details 24


Model Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Parameterization for Consistency Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Schedule Functions for Consistency Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

Appendix D Additional Results on Zero-Shot Image Editing 25


Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Colorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Stroke-guided image generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Appendix E Additional Samples from Consistency Models 27

Appendices
A. Proofs
A.1. Notations
We use $f_\theta(x, t)$ to denote a consistency model parameterized by $\theta$, and $f(x, t; \phi)$ the consistency function of the empirical PF ODE in Eq. (3). Here $\phi$ symbolizes its dependency on the pre-trained score model $s_\phi(x, t)$. For the consistency function of the PF ODE in Eq. (2), we denote it as $f(x, t)$. Given a multivariate function $h(x, y)$, we let $\partial_1 h(x, y)$ denote the Jacobian of $h$ over $x$, and analogously $\partial_2 h(x, y)$ denote the Jacobian of $h$ over $y$. Unless otherwise stated, $x$ is a random variable sampled from the data distribution $p_{\text{data}}(x)$, $n$ is sampled uniformly at random from $[\![1, N-1]\!]$, and $x_{t_n}$ is sampled from $\mathcal{N}(x; t_n^2 I)$. Here $[\![1, N-1]\!]$ represents the set of integers $\{1, 2, \cdots, N-1\}$. Furthermore, recall that we define
$$\hat{x}_{t_n}^{\phi} := x_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(x_{t_{n+1}}, t_{n+1}; \phi),$$
where $\Phi(\cdots; \phi)$ denotes the update function of a one-step ODE solver for the empirical PF ODE defined by the score model $s_\phi(x, t)$. By default, $\mathbb{E}[\cdot]$ denotes the expectation over all relevant random variables in the expression.
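For concreteness, the sketch below instantiates $\Phi$ for the Euler solver applied to the empirical PF ODE $\mathrm{d}x_t/\mathrm{d}t = -t\, s_\phi(x_t, t)$; the callable `score_fn` standing in for $s_\phi$ is an illustrative placeholder, not the authors' implementation.

```python
def euler_step(x_next, t_next, t_n, score_fn):
    """One Euler update of the empirical PF ODE dx/dt = -t * s_phi(x, t),
    moving from t_{n+1} down to t_n; this realizes x_hat^phi_{t_n} defined above.
    `score_fn(x, t)` is a stand-in for the pre-trained score model s_phi."""
    phi = -t_next * score_fn(x_next, t_next)   # Phi(x, t; phi) for the Euler rule
    return x_next + (t_n - t_next) * phi
```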

A.2. Consistency Distillation


Theorem 1. Let $\Delta t := \max_{n \in [\![1,N-1]\!]} \{|t_{n+1} - t_n|\}$, and $f(\cdot, \cdot; \phi)$ be the consistency function of the empirical PF ODE in Eq. (3). Assume $f_\theta$ satisfies the Lipschitz condition: there exists $L > 0$ such that for all $t \in [\epsilon, T]$, $x$, and $y$, we have $\|f_\theta(x, t) - f_\theta(y, t)\|_2 \le L\|x - y\|_2$. Assume further that for all $n \in [\![1, N-1]\!]$, the ODE solver called at $t_{n+1}$ has local error uniformly bounded by $O((t_{n+1} - t_n)^{p+1})$ with $p \ge 1$. Then, if $\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi) = 0$, we have
$$\sup_{n, x}\; \|f_\theta(x, t_n) - f(x, t_n; \phi)\|_2 = O((\Delta t)^{p}).$$

Proof. From $\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi) = 0$, we have
$$\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi) = \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}_{t_n}^{\phi}, t_n))] = 0. \tag{11}$$
According to the definition, we have $p_{t_n}(x_{t_n}) = p_{\text{data}}(x) \otimes \mathcal{N}(0, t_n^2 I)$ where $t_n \ge \epsilon > 0$. It follows that $p_{t_n}(x_{t_n}) > 0$ for every $x_{t_n}$ and $1 \le n \le N$. Therefore, Eq. (11) entails
$$\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}_{t_n}^{\phi}, t_n)) \equiv 0. \tag{12}$$
Because $\lambda(\cdot) > 0$ and $d(x, y) = 0 \Leftrightarrow x = y$, this further implies that
$$f_\theta(x_{t_{n+1}}, t_{n+1}) \equiv f_\theta(\hat{x}_{t_n}^{\phi}, t_n). \tag{13}$$
Now let $e_n$ represent the error vector at $t_n$, which is defined as
$$e_n := f_\theta(x_{t_n}, t_n) - f(x_{t_n}, t_n; \phi).$$
We can easily derive the following recursion relation
$$\begin{aligned}
e_{n+1} &= f_\theta(x_{t_{n+1}}, t_{n+1}) - f(x_{t_{n+1}}, t_{n+1}; \phi) \\
&\overset{(i)}{=} f_\theta(\hat{x}_{t_n}^{\phi}, t_n) - f(x_{t_n}, t_n; \phi) \\
&= f_\theta(\hat{x}_{t_n}^{\phi}, t_n) - f_\theta(x_{t_n}, t_n) + f_\theta(x_{t_n}, t_n) - f(x_{t_n}, t_n; \phi) \\
&= f_\theta(\hat{x}_{t_n}^{\phi}, t_n) - f_\theta(x_{t_n}, t_n) + e_n,
\end{aligned}\tag{14}$$
where (i) is due to Eq. (13) and $f(x_{t_{n+1}}, t_{n+1}; \phi) = f(x_{t_n}, t_n; \phi)$. Because $f_\theta(\cdot, t_n)$ has Lipschitz constant $L$, we have
$$\begin{aligned}
\|e_{n+1}\|_2 &\le \|e_n\|_2 + L\,\|\hat{x}_{t_n}^{\phi} - x_{t_n}\|_2 \\
&\overset{(i)}{=} \|e_n\|_2 + L \cdot O((t_{n+1} - t_n)^{p+1}) \\
&= \|e_n\|_2 + O((t_{n+1} - t_n)^{p+1}),
\end{aligned}$$
where (i) holds because the ODE solver has local error bounded by $O((t_{n+1} - t_n)^{p+1})$. In addition, we observe that $e_1 = 0$, because
$$e_1 = f_\theta(x_{t_1}, t_1) - f(x_{t_1}, t_1; \phi) \overset{(i)}{=} x_{t_1} - f(x_{t_1}, t_1; \phi) \overset{(ii)}{=} x_{t_1} - x_{t_1} = 0.$$
Here (i) is true because the consistency model is parameterized such that $f_\theta(x_{t_1}, t_1) = x_{t_1}$, and (ii) is entailed by the definition of $f(\cdot, \cdot; \phi)$. This allows us to perform induction on the recursion formula Eq. (14) to obtain
$$\begin{aligned}
\|e_n\|_2 &\le \|e_1\|_2 + \sum_{k=1}^{n-1} O((t_{k+1} - t_k)^{p+1}) = \sum_{k=1}^{n-1} O((t_{k+1} - t_k)^{p+1}) \\
&= \sum_{k=1}^{n-1} (t_{k+1} - t_k)\, O((t_{k+1} - t_k)^{p}) \le \sum_{k=1}^{n-1} (t_{k+1} - t_k)\, O((\Delta t)^{p}) \\
&= O((\Delta t)^{p}) \sum_{k=1}^{n-1} (t_{k+1} - t_k) = O((\Delta t)^{p})(t_n - t_1) \le O((\Delta t)^{p})(T - \epsilon) = O((\Delta t)^{p}),
\end{aligned}$$
which completes the proof.

A.3. Consistency Training


The following lemma provides an unbiased estimator for the score function, which is crucial to our proof for Theorem 2.
Lemma 1. Let $x \sim p_{\text{data}}(x)$, $x_t \sim \mathcal{N}(x; t^2 I)$, and $p_t(x_t) = p_{\text{data}}(x) \otimes \mathcal{N}(0, t^2 I)$. We have $\nabla \log p_t(x_t) = -\mathbb{E}\big[\frac{x_t - x}{t^2} \mid x_t\big]$.
Proof. According to the definition of $p_t(x_t)$, we have $\nabla \log p_t(x_t) = \nabla_{x_t} \log \int p_{\text{data}}(x)\, p(x_t \mid x)\, \mathrm{d}x$, where $p(x_t \mid x) = \mathcal{N}(x_t; x, t^2 I)$. This expression can be further simplified to yield
$$\begin{aligned}
\nabla \log p_t(x_t) &= \frac{\int p_{\text{data}}(x)\, \nabla_{x_t} p(x_t \mid x)\, \mathrm{d}x}{\int p_{\text{data}}(x)\, p(x_t \mid x)\, \mathrm{d}x} \\
&= \frac{\int p_{\text{data}}(x)\, p(x_t \mid x)\, \nabla_{x_t}\log p(x_t \mid x)\, \mathrm{d}x}{\int p_{\text{data}}(x)\, p(x_t \mid x)\, \mathrm{d}x} \\
&= \frac{\int p_{\text{data}}(x)\, p(x_t \mid x)\, \nabla_{x_t}\log p(x_t \mid x)\, \mathrm{d}x}{p_t(x_t)} \\
&= \int \frac{p_{\text{data}}(x)\, p(x_t \mid x)}{p_t(x_t)}\, \nabla_{x_t}\log p(x_t \mid x)\, \mathrm{d}x \\
&\overset{(i)}{=} \int p(x \mid x_t)\, \nabla_{x_t}\log p(x_t \mid x)\, \mathrm{d}x \\
&= \mathbb{E}[\nabla_{x_t}\log p(x_t \mid x) \mid x_t] \\
&= -\mathbb{E}\left[\frac{x_t - x}{t^2} \,\middle|\, x_t\right],
\end{aligned}$$
where (i) is due to Bayes' rule.
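As a quick sanity check of Lemma 1, the sketch below compares the posterior-expectation estimator against the closed-form score for a one-dimensional Gaussian data distribution; all numbers are illustrative toy values, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, t = 1.0, 2.0, 0.7                       # toy 1-D Gaussian data and noise scale
x = rng.normal(mu0, sigma0, size=200_000)            # x ~ p_data
x_t = 0.3                                             # a fixed query point

# Self-normalized importance weights proportional to p(x_t | x) = N(x_t; x, t^2)
w = np.exp(-(x_t - x) ** 2 / (2 * t ** 2))
w /= w.sum()

lemma_estimate = -np.sum(w * (x_t - x) / t ** 2)      # -E[(x_t - x)/t^2 | x_t]
closed_form = -(x_t - mu0) / (sigma0 ** 2 + t ** 2)   # score of p_t = N(mu0, sigma0^2 + t^2)
print(lemma_estimate, closed_form)                    # agree up to Monte-Carlo error
```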
Theorem 2. Let $\Delta t := \max_{n \in [\![1,N-1]\!]}\{|t_{n+1} - t_n|\}$. Assume $d$ and $f_{\theta^-}$ are both twice continuously differentiable with bounded second derivatives, the weighting function $\lambda(\cdot)$ is bounded, and $\mathbb{E}[\|\nabla \log p_{t_n}(x_{t_n})\|_2^2] < \infty$. Assume further that we use the Euler ODE solver, and the pre-trained score model matches the ground truth, i.e., $\forall t \in [\epsilon, T]: s_\phi(x, t) \equiv \nabla \log p_t(x)$. Then,
$$\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = \mathcal{L}_{\text{CT}}^N(\theta, \theta^-) + o(\Delta t),$$
where the expectation is taken with respect to $x \sim p_{\text{data}}$, $n \sim \mathcal{U}[\![1, N-1]\!]$, and $x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I)$. The consistency training objective, denoted by $\mathcal{L}_{\text{CT}}^N(\theta, \theta^-)$, is defined as
$$\mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))],$$
where $z \sim \mathcal{N}(0, I)$. Moreover, $\mathcal{L}_{\text{CT}}^N(\theta, \theta^-) \ge O(\Delta t)$ if $\inf_N \mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) > 0$.

Proof. With Taylor expansion, we have
$$\begin{aligned}
\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) &= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}} + (t_{n+1} - t_n) t_{n+1} \nabla \log p_{t_{n+1}}(x_{t_{n+1}}), t_n))] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}) + \partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_{n+1} - t_n) t_{n+1} \nabla \log p_{t_{n+1}}(x_{t_{n+1}}) \\
&\qquad + \partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1}) + o(|t_{n+1} - t_n|))] \\
&= \mathbb{E}\{\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1})) + \lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[ \\
&\qquad \partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_{n+1} - t_n) t_{n+1} \nabla \log p_{t_{n+1}}(x_{t_{n+1}}) + \partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1}) + o(|t_{n+1} - t_n|)]\} \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))] \\
&\quad + \mathbb{E}\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_{n+1} - t_n) t_{n+1} \nabla \log p_{t_{n+1}}(x_{t_{n+1}})]\} \\
&\quad + \mathbb{E}\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1} - t_n|)].
\end{aligned}\tag{15}$$
Then, we apply Lemma 1 to Eq. (15) and use Taylor expansion in the reverse direction to obtain
$$\begin{aligned}
\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) &= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))] \\
&\quad + \mathbb{E}\left\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))\left[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1}) t_{n+1} \mathbb{E}\left[\frac{x_{t_{n+1}} - x}{t_{n+1}^2}\,\middle|\, x_{t_{n+1}}\right]\right]\right\} \\
&\quad + \mathbb{E}\{\lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&\overset{(i)}{=} \mathbb{E}\bigg[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1})) \\
&\qquad + \lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))\left[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1}) t_{n+1}\, \frac{x_{t_{n+1}} - x}{t_{n+1}^2}\right] \\
&\qquad + \lambda(t_n)\, \partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})] + o(|t_{n+1} - t_n|)\bigg] \\
&= \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}\left(x_{t_{n+1}} + (t_n - t_{n+1}) t_{n+1}\, \frac{x_{t_{n+1}} - x}{t_{n+1}^2}, t_n\right)\right)\right] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}\left(x_{t_{n+1}} + (t_n - t_{n+1})\, \frac{x_{t_{n+1}} - x}{t_{n+1}}, t_n\right)\right)\right] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_{n+1} z + (t_n - t_{n+1}) z, t_n))] + \mathbb{E}[o(|t_{n+1} - t_n|)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] + \mathbb{E}[o(\Delta t)] \\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] + o(\Delta t) \\
&= \mathcal{L}_{\text{CT}}^N(\theta, \theta^-) + o(\Delta t),
\end{aligned}\tag{16}$$
where (i) is due to the law of total expectation, and $z := \frac{x_{t_{n+1}} - x}{t_{n+1}}$ is distributed according to the standard Gaussian distribution. This implies $\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = \mathcal{L}_{\text{CT}}^N(\theta, \theta^-) + o(\Delta t)$ and thus completes the proof for Eq. (9). Moreover, we have $\mathcal{L}_{\text{CT}}^N(\theta, \theta^-) \ge O(\Delta t)$ whenever $\inf_N \mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) > 0$. Otherwise, $\mathcal{L}_{\text{CT}}^N(\theta, \theta^-) < O(\Delta t)$ and thus $\lim_{\Delta t \to 0} \mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = 0$, a clear contradiction to $\inf_N \mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) > 0$.
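The consistency training objective $\mathcal{L}_{\text{CT}}^N(\theta, \theta^-)$ from Theorem 2 can be estimated by Monte Carlo without any score model. Below is a minimal sketch using the squared $\ell_2$ metric (the experiments in the paper mostly use LPIPS); `f_theta`, `f_theta_minus`, and `timesteps` are illustrative names for the online network, the EMA target network, and the grid $t_1 < \cdots < t_N$.

```python
import torch

def consistency_training_loss(f_theta, f_theta_minus, x, timesteps, lam=None):
    """One Monte-Carlo estimate of the discrete-time CT objective in Theorem 2
    with the squared l2 metric.  All argument names are illustrative stand-ins."""
    N = timesteps.shape[0]
    n = torch.randint(0, N - 1, (x.shape[0],))          # n ~ U{1, ..., N-1} (0-indexed here)
    t_n, t_np1 = timesteps[n], timesteps[n + 1]
    shape = (-1,) + (1,) * (x.dim() - 1)
    z = torch.randn_like(x)                              # z ~ N(0, I)
    pred = f_theta(x + t_np1.view(shape) * z, t_np1)     # f_theta(x + t_{n+1} z, t_{n+1})
    with torch.no_grad():                                # theta^- = stopgrad(theta)
        target = f_theta_minus(x + t_n.view(shape) * z, t_n)
    d = (pred - target).pow(2).flatten(1).sum(dim=1)     # squared l2 distance
    w = torch.ones_like(d) if lam is None else lam(t_n)  # weighting lambda(t_n)
    return (w * d).mean()
```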

B. Continuous-Time Extensions
The consistency distillation and consistency training objectives can be generalized to hold for infinitely many time steps ($N \to \infty$) under suitable conditions.

B.1. Consistency Distillation in Continuous Time

Depending on whether $\theta^- = \theta$ or $\theta^- = \operatorname{stopgrad}(\theta)$ (same as setting $\mu = 0$), there are two possible continuous-time extensions for the consistency distillation objective $\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi)$. Given a twice continuously differentiable metric function $d(x, y)$, we define $G(x)$ as a matrix, whose $(i, j)$-th entry is given by
$$[G(x)]_{ij} := \left.\frac{\partial^2 d(x, y)}{\partial y_i \partial y_j}\right|_{y = x}.$$
Similarly, we define $H(x)$ as
$$[H(x)]_{ij} := \left.\frac{\partial^2 d(y, x)}{\partial y_i \partial y_j}\right|_{y = x}.$$
The matrices $G$ and $H$ play a crucial role in forming continuous-time objectives for consistency distillation. Additionally, we denote the Jacobian of $f_\theta(x, t)$ with respect to $x$ as $\frac{\partial f_\theta(x, t)}{\partial x}$.

When $\theta^- = \theta$ (with no stopgrad operator), we have the following theoretical result.
Theorem 3. Let $t_n = \frac{n - 1}{N - 1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N - 1}$, where $n \in [\![1, N]\!]$. Assume $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t \in [\epsilon, T]} \|s_\phi(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver. Then,
$$\lim_{\Delta t \to 0} \frac{\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi)}{(\Delta t)^2} = \mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi), \tag{17}$$
where
$$\mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi) := \frac{1}{2}\,\mathbb{E}\left[\lambda(t)\left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)^{\!\mathsf{T}} G(f_\theta(x_t, t)) \left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{18}$$
Here the expectation above is taken over $x \sim p_{\text{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.

Proof. First, we can derive the following equation with Taylor expansion:
$$\begin{aligned}
f_\theta(\hat{x}_{t_n}^{\phi}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1}) &= f_\theta(x_{t_{n+1}} + t_{n+1}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1}) \\
&= t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\Delta t + O((\Delta t)^2).
\end{aligned}\tag{19}$$
Applying Taylor expansion again, we get
$$\begin{aligned}
\frac{1}{(\Delta t)^2}\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi) &= \frac{1}{(\Delta t)^2}\,\mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}_{t_n}^{\phi}, t_n))] \\
&\overset{(i)}{=} \frac{1}{2(\Delta t)^2}\,\mathbb{E}\{\lambda(t_n)[f_\theta(\hat{x}_{t_n}^{\phi}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} G(f_\theta(x_{t_{n+1}}, t_{n+1}))[f_\theta(\hat{x}_{t_n}^{\phi}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})]\} + \mathbb{E}[O(|\Delta t|^3)] \\
&\overset{(ii)}{=} \frac{1}{2}\,\mathbb{E}\left[\lambda(t_n)\left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right)^{\!\mathsf{T}} G(f_\theta(x_{t_{n+1}}, t_{n+1}))\right. \\
&\qquad\qquad \left.\cdot\left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right)\right] + \mathbb{E}[O(|\Delta t|)],
\end{aligned}\tag{20}$$
where we obtain (i) by first expanding $d(f_\theta(x_{t_{n+1}}, t_{n+1}), \cdot)$ to second order, and then noting that $d(x, x) \equiv 0$ and $\nabla_y d(x, y)|_{y=x} \equiv 0$. We obtain (ii) using Eq. (19). By taking the limit for both sides of Eq. (20) as $\Delta t \to 0$, we arrive at Eq. (17), which completes the proof.
Remark 1. Although Theorem 3 assumes uniform timesteps and the Euler ODE solver for technical simplicity, we believe an analogous result can be derived without those assumptions, as all time boundaries and ODE solvers should perform similarly as $N \to \infty$. We leave a more general version of Theorem 3 as future work.
Remark 2. Theorem 3 implies that consistency models can be trained by minimizing $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi)$. In particular, when $d(x, y) = \|x - y\|_2^2$, we have
$$\mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi) = \mathbb{E}\left[\lambda(t)\left\|\frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right\|_2^2\right]. \tag{21}$$
However, this continuous-time objective requires computing Jacobian-vector products as a subroutine to evaluate the loss function, which can be slow and laborious to implement in deep learning frameworks that do not support forward-mode automatic differentiation.
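In frameworks that do expose Jacobian-vector products, a single forward-mode call suffices, since the bracket in Eq. (21) is exactly the JVP of $f_\theta$ at $(x_t, t)$ with tangent $(-t\, s_\phi(x_t, t), 1)$. The sketch below uses PyTorch's `torch.autograd.functional.jvp`; names such as `f_theta` and `score_fn` are illustrative stand-ins, and this is not the authors' implementation.

```python
import torch
from torch.autograd.functional import jvp

def continuous_cd_loss(f_theta, score_fn, x_t, t, lam=1.0):
    """Monte-Carlo sketch of the continuous-time CD objective in Eq. (21), squared l2 metric.
    `f_theta(x, t)` is the consistency model, `score_fn(x, t)` a frozen pre-trained score model."""
    tb = t.view(-1, *([1] * (x_t.dim() - 1)))     # broadcast t over image dims
    with torch.no_grad():
        v_x = -tb * score_fn(x_t, t)              # tangent along the PF ODE: dx/dt = -t s_phi
    # The JVP below equals  df/dt - t (df/dx) s_phi(x_t, t).
    _, deriv = jvp(lambda xx, tt: f_theta(xx, tt), (x_t, t),
                   (v_x, torch.ones_like(t)), create_graph=True)
    return (lam * deriv.pow(2).flatten(1).sum(dim=1)).mean()
```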
Remark 3. If $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE of $s_\phi(x, t)$, then
$$\frac{\partial f_\theta(x, t)}{\partial t} - t\,\frac{\partial f_\theta(x, t)}{\partial x}\, s_\phi(x, t) \equiv 0$$
and therefore $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi) = 0$. This can be proved by noting that $f_\theta(x_t, t) \equiv x$ for all $t \in [\epsilon, T]$, and then taking the time-derivative of this identity:
$$\begin{aligned}
& f_\theta(x_t, t) \equiv x \\
\iff\quad & \frac{\partial f_\theta(x_t, t)}{\partial x_t}\frac{\mathrm{d}x_t}{\mathrm{d}t} + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0 \\
\iff\quad & \frac{\partial f_\theta(x_t, t)}{\partial x_t}[-t\, s_\phi(x_t, t)] + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0 \\
\iff\quad & \frac{\partial f_\theta(x_t, t)}{\partial t} - t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t) \equiv 0.
\end{aligned}$$
The above observation provides another motivation for $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi)$, as it is minimized if and only if the consistency model matches the ground truth consistency function.

For some metric functions, such as the $\ell_1$ norm, the Hessian $G(x)$ is zero so Theorem 3 is vacuous. Below we show that a non-vacuous statement holds for the $\ell_1$ norm with just a small modification of the proof for Theorem 3.
Theorem 4. Let $t_n = \frac{n - 1}{N - 1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N - 1}$, where $n \in [\![1, N]\!]$. Assume $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t \in [\epsilon, T]} \|s_\phi(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and set $d(x, y) = \|x - y\|_1$. Then,
$$\lim_{\Delta t \to 0} \frac{\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi)}{\Delta t} = \mathcal{L}_{\text{CD}, \ell_1}^\infty(\theta, \theta; \phi), \tag{22}$$
where
$$\mathcal{L}_{\text{CD}, \ell_1}^\infty(\theta, \theta; \phi) := \mathbb{E}\left[\lambda(t)\left\|t\,\frac{\partial f_\theta(x_t, t)}{\partial x_t}\, s_\phi(x_t, t) - \frac{\partial f_\theta(x_t, t)}{\partial t}\right\|_1\right],$$
where the expectation above is taken over $x \sim p_{\text{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.

Proof. We have
$$\begin{aligned}
\frac{1}{\Delta t}\mathcal{L}_{\text{CD}}^N(\theta, \theta; \phi) &= \frac{1}{\Delta t}\,\mathbb{E}[\lambda(t_n)\|f_\theta(x_{t_{n+1}}, t_{n+1}) - f_\theta(\hat{x}_{t_n}^{\phi}, t_n)\|_1] \\
&\overset{(i)}{=} \frac{1}{\Delta t}\,\mathbb{E}\left[\lambda(t_n)\left\|t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\Delta t + O((\Delta t)^2)\right\|_1\right] \\
&= \mathbb{E}\left[\lambda(t_n)\left\|t_{n+1}\,\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1}) - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} + O(\Delta t)\right\|_1\right],
\end{aligned}\tag{23}$$
where (i) is obtained by plugging Eq. (19) into the previous equation. Taking the limit for both sides of Eq. (23) as $\Delta t \to 0$ leads to Eq. (22), which completes the proof.
Remark 4. According to Theorem 4, consistency models can be trained by minimizing $\mathcal{L}_{\text{CD}, \ell_1}^\infty(\theta, \theta; \phi)$. Moreover, the same reasoning in Remark 3 can be applied to show that $\mathcal{L}_{\text{CD}, \ell_1}^\infty(\theta, \theta; \phi) = 0$ if and only if $f_\theta(x_t, t) = x$ for all $x_t \in \mathbb{R}^d$ and $t \in [\epsilon, T]$.

In the second case where $\theta^- = \operatorname{stopgrad}(\theta)$, we can derive a pseudo-objective whose gradient matches the gradient of $\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi)$ in the limit of $N \to \infty$. Minimizing this pseudo-objective with gradient descent gives another way to train consistency models via distillation. This pseudo-objective is provided by the theorem below.
Theorem 5. Let $t_n = \frac{n - 1}{N - 1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N - 1}$, where $n \in [\![1, N]\!]$. Assume $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\sup_{x, t \in [\epsilon, T]}\|s_\phi(x, t)\|_2 < \infty$, and $\sup_{x, t \in [\epsilon, T]}\|\nabla_\theta f_\theta(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and $\theta^- = \operatorname{stopgrad}(\theta)$. Then,
$$\lim_{\Delta t \to 0}\frac{\nabla_\theta \mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi)}{\Delta t} = \nabla_\theta \mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi), \tag{24}$$
where
$$\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi) := \mathbb{E}\left[\lambda(t)\, f_\theta(x_t, t)^{\mathsf{T}} H(f_{\theta^-}(x_t, t))\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\,\frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{25}$$
Here the expectation above is taken over $x \sim p_{\text{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.

Proof. First, we leverage Taylor series expansion to obtain
$$\begin{aligned}
\frac{1}{\Delta t}\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) &= \frac{1}{\Delta t}\,\mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))] \\
&\overset{(i)}{=} \frac{1}{2\Delta t}\,\mathbb{E}\{\lambda(t_n)[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n)]^{\mathsf{T}} H(f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n)]\} + \mathbb{E}[O(|\Delta t|^2)],
\end{aligned}\tag{26}$$
where (i) is derived by first expanding $d(\cdot, f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))$ to second order, and then noting that $d(x, x) \equiv 0$ and $\nabla_y d(y, x)|_{y=x} \equiv 0$. Next, we compute the gradient of Eq. (26) with respect to $\theta$ and simplify the result to obtain
$$\begin{aligned}
\frac{1}{\Delta t}\nabla_\theta\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) &\overset{(i)}{=} \frac{1}{\Delta t}\,\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n)]\} + \mathbb{E}[O(|\Delta t|^2)] \\
&\overset{(ii)}{=} \frac{1}{\Delta t}\,\mathbb{E}\left\{\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))\left[\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\Delta t - t_{n+1}\,\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\Delta t\right]\right\} + \mathbb{E}[O(|\Delta t|)] \\
&= \mathbb{E}\left\{\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))\left[\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\,\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right]\right\} + \mathbb{E}[O(|\Delta t|)] \\
&= \nabla_\theta\,\mathbb{E}\left\{\lambda(t_n)[f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(\hat{x}_{t_n}^{\phi}, t_n))\left[\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\,\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}\, s_\phi(x_{t_{n+1}}, t_{n+1})\right]\right\} + \mathbb{E}[O(|\Delta t|)].
\end{aligned}\tag{27}$$
Here (i) results from the chain rule, and (ii) follows from Eq. (19) and $f_\theta(x, t) \equiv f_{\theta^-}(x, t)$, since $\theta^- = \operatorname{stopgrad}(\theta)$. Taking the limit for both sides of Eq. (27) as $\Delta t \to 0$ yields Eq. (24), which completes the proof.
Remark 5. When $d(x, y) = \|x - y\|_2^2$, the pseudo-objective $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi)$ can be simplified to
$$\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi) = 2\,\mathbb{E}\left[\lambda(t)\, f_\theta(x_t, t)^{\mathsf{T}}\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\,\frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\, s_\phi(x_t, t)\right)\right]. \tag{28}$$
Remark 6. The objective $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi)$ defined in Theorem 5 is only meaningful in terms of its gradient; one cannot measure the progress of training by tracking the value of $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi)$, but can still apply gradient descent to this objective to distill consistency models from pre-trained diffusion models. Because this objective is not a typical loss function, we refer to it as the "pseudo-objective" for consistency distillation.
Remark 7. Following the same reasoning in Remark 3, we can easily derive that $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi) = 0$ and $\nabla_\theta\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi) = 0$ if $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE that involves $s_\phi(x, t)$. However, the converse does not hold true in general. This distinguishes $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi)$ from $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta; \phi)$, the latter of which is a true loss function.

B.2. Consistency Training in Continuous Time


A remarkable observation is that the pseudo-objective in Theorem 5 can be estimated without any pre-trained diffusion models, which enables direct consistency training of consistency models. More precisely, we have the following result.
Theorem 6. Let $t_n = \frac{n - 1}{N - 1}(T - \epsilon) + \epsilon$ and $\Delta t = \frac{T - \epsilon}{N - 1}$, where $n \in [\![1, N]\!]$. Assume $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\mathbb{E}[\|\nabla \log p_{t_n}(x_{t_n})\|_2^2] < \infty$, and $\sup_{x, t \in [\epsilon, T]}\|\nabla_\theta f_\theta(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver and $\theta^- = \operatorname{stopgrad}(\theta)$. Then,
$$\lim_{\Delta t \to 0}\frac{\nabla_\theta\mathcal{L}_{\text{CT}}^N(\theta, \theta^-)}{\Delta t} = \nabla_\theta\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-), \tag{29}$$
where
$$\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-) := \mathbb{E}\left[\lambda(t)\, f_\theta(x_t, t)^{\mathsf{T}} H(f_{\theta^-}(x_t, t))\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\cdot\frac{x_t - x}{t}\right)\right]. \tag{30}$$
Here the expectation above is taken over $x \sim p_{\text{data}}$, $t \sim \mathcal{U}[\epsilon, T]$, and $x_t \sim \mathcal{N}(x, t^2 I)$.

Proof. The proof mostly follows that of Theorem 5. First, we leverage Taylor series expansion to obtain
$$\begin{aligned}
\frac{1}{\Delta t}\mathcal{L}_{\text{CT}}^N(\theta, \theta^-) &= \frac{1}{\Delta t}\,\mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1} z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] \\
&\overset{(i)}{=} \frac{1}{2\Delta t}\,\mathbb{E}\{\lambda(t_n)[f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))[f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta t|^2)],
\end{aligned}\tag{31}$$
where (i) is derived by first expanding $d(\cdot, f_{\theta^-}(x + t_n z, t_n))$ to second order, and then noting that $d(x, x) \equiv 0$ and $\nabla_y d(y, x)|_{y=x} \equiv 0$. Next, we compute the gradient of Eq. (31) with respect to $\theta$ and simplify the result to obtain
$$\begin{aligned}
\frac{1}{\Delta t}\nabla_\theta\mathcal{L}_{\text{CT}}^N(\theta, \theta^-) &\overset{(i)}{=} \frac{1}{\Delta t}\,\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))[f_\theta(x + t_{n+1} z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta t|^2)] \\
&\overset{(ii)}{=} \frac{1}{\Delta t}\,\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))[\Delta t\,\partial_1 f_{\theta^-}(x + t_n z, t_n) z + \Delta t\,\partial_2 f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta t|)] \\
&= \mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1} z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))[\partial_1 f_{\theta^-}(x + t_n z, t_n) z + \partial_2 f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta t|)] \\
&= \nabla_\theta\,\mathbb{E}\{\lambda(t_n)[f_\theta(x + t_{n+1} z, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x + t_n z, t_n))[\partial_1 f_{\theta^-}(x + t_n z, t_n) z + \partial_2 f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta t|)] \\
&= \nabla_\theta\,\mathbb{E}\left\{\lambda(t_n)[f_\theta(x_{t_{n+1}}, t_{n+1})]^{\mathsf{T}} H(f_{\theta^-}(x_{t_n}, t_n))\left[\partial_1 f_{\theta^-}(x_{t_n}, t_n)\,\frac{x_{t_n} - x}{t_n} + \partial_2 f_{\theta^-}(x_{t_n}, t_n)\right]\right\} + \mathbb{E}[O(|\Delta t|)].
\end{aligned}\tag{32}$$
Here (i) results from the chain rule, and (ii) follows from Taylor expansion. Taking the limit for both sides of Eq. (32) as $\Delta t \to 0$ yields Eq. (29), which completes the proof.
Figure 7: Comparing discrete consistency distillation/training algorithms with continuous counterparts. (a) Consistency Distillation. (b) Consistency Training.

Remark 8. Note that $\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-)$ does not depend on the diffusion model parameter $\phi$ and hence can be optimized without any pre-trained diffusion models.
Remark 9. When $d(x, y) = \|x - y\|_2^2$, the continuous-time consistency training objective becomes
$$\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-) = 2\,\mathbb{E}\left[\lambda(t)\, f_\theta(x_t, t)^{\mathsf{T}}\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\cdot\frac{x_t - x}{t}\right)\right]. \tag{33}$$
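For the squared $\ell_2$ metric, the bracket in Eq. (33) is the JVP of the frozen target network with tangent $((x_t - x)/t, 1) = (z, 1)$, so the pseudo-objective can be estimated without any diffusion model. Below is a minimal sketch; only its gradient with respect to $\theta$ is meaningful, and `f_theta` / `f_target` are illustrative names for the online model and its stop-gradient copy.

```python
import torch
from torch.autograd.functional import jvp

def continuous_ct_pseudo_loss(f_theta, f_target, x, t, lam=1.0):
    """Sketch of the continuous-time CT pseudo-objective (Eq. 33, squared l2 metric)."""
    tb = t.view(-1, *([1] * (x.dim() - 1)))          # broadcast t over image dims
    z = torch.randn_like(x)
    x_t = x + tb * z                                  # x_t ~ N(x, t^2 I), so (x_t - x)/t = z
    # d/dt f_target(x + t z, t) = (df/dx) z + df/dt, obtained with a single JVP.
    _, total_deriv = jvp(lambda xx, tt: f_target(xx, tt), (x_t, t),
                         (z, torch.ones_like(t)))
    online = f_theta(x_t, t)
    # The JVP term plays the role of a constant target (theta^- = stopgrad(theta)).
    return 2.0 * (lam * (online * total_deriv.detach()).flatten(1).sum(dim=1)).mean()
```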

Remark 10. Under suitable regularity conditions, we can combine the insights of Theorems 2 and 6 to derive
$$\lim_{\Delta t \to 0}\frac{\nabla_\theta\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi)}{\Delta t} = \nabla_\theta\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-).$$
As in Theorems 2 and 6, here $\phi$ represents diffusion model parameters that satisfy $s_\phi(x, t) \equiv \nabla \log p_t(x)$, and $\theta^- = \operatorname{stopgrad}(\theta)$.
Remark 11. Similar to $\mathcal{L}_{\text{CD}}^\infty(\theta, \theta^-; \phi)$ in Theorem 5, $\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-)$ is a pseudo-objective; one cannot track training by monitoring the value of $\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-)$, but can still apply gradient descent on this loss function to train a consistency model $f_\theta(x, t)$ directly from data. Moreover, the same observation in Remark 7 holds true: $\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-) = 0$ and $\nabla_\theta\mathcal{L}_{\text{CT}}^\infty(\theta, \theta^-) = 0$ if $f_\theta(x, t)$ matches the ground truth consistency function for the PF ODE.

B.3. Experimental Verifications


To experimentally verify the efficacy of our continuous-time CD and CT objectives, we train consistency models with a variety of loss functions on CIFAR-10. All results are provided in Fig. 7. Unless otherwise noted, models are trained with hyperparameters given in Table 3. We occasionally modify some hyperparameters for improved performance. For distillation, we compare the following objectives:

• CD ($\ell_2$): Consistency distillation $\mathcal{L}_{\text{CD}}^N$ with $N = 18$ and the $\ell_2$ metric.

• CD ($\ell_1$): Consistency distillation $\mathcal{L}_{\text{CD}}^N$ with $N = 18$ and the $\ell_1$ metric. We set the learning rate to 2e-4.

• CD (LPIPS): Consistency distillation $\mathcal{L}_{\text{CD}}^N$ with $N = 18$ and the LPIPS metric.

• CD$^\infty$ ($\ell_2$): Consistency distillation $\mathcal{L}_{\text{CD}}^\infty$ in Theorem 3 with the $\ell_2$ metric. We set the learning rate to 1e-3 and dropout to 0.13.

• CD$^\infty$ ($\ell_1$): Consistency distillation $\mathcal{L}_{\text{CD}}^\infty$ in Theorem 4 with the $\ell_1$ metric. We set the learning rate to 1e-3 and dropout to 0.3.

• CD$^\infty$ (stopgrad, $\ell_2$): Consistency distillation $\mathcal{L}_{\text{CD}}^\infty$ in Theorem 5 with the $\ell_2$ metric. We set the learning rate to 5e-6.

• CD$^\infty$ (stopgrad, LPIPS): Consistency distillation $\mathcal{L}_{\text{CD}}^\infty$ in Theorem 5 with the LPIPS metric. We set the learning rate to 5e-6.
Table 3: Hyperparameters used for training CD and CT models

Hyperparameter             CIFAR-10 (CD / CT)    ImageNet 64 × 64 (CD / CT)   LSUN 256 × 256 (CD / CT)
Learning rate              4e-4 / 4e-4           8e-6 / 8e-6                  1e-5 / 1e-5
Batch size                 512 / 512             2048 / 2048                  2048 / 2048
µ (CD only)                0                     0.95                         0.95
µ0 (CT only)               0.9                   0.95                         0.95
s0 (CT only)               2                     2                            2
s1 (CT only)               150                   200                          150
EMA decay rate             0.9999 / 0.9999       0.999943 / 0.999943          0.999943 / 0.999943
Training iterations        800k / 800k           600k / 800k                  600k / 1000k
Mixed-precision (FP16)     No / No               Yes / Yes                    Yes / Yes
Dropout probability        0.0 / 0.0             0.0 / 0.0                    0.0 / 0.0
Number of GPUs             8 / 8                 64 / 64                      64 / 64

We did not investigate using the LPIPS metric in Theorem 3 because minimizing the resulting objective would require back-propagating through second order derivatives of the VGG network used in LPIPS, which is computationally expensive and prone to numerical instability. As revealed by Fig. 7a, the stopgrad version of continuous-time distillation (Theorem 5) works better than the non-stopgrad version (Theorem 3) for both the LPIPS and $\ell_2$ metrics, and the LPIPS metric works the best for all distillation approaches. Additionally, discrete-time consistency distillation outperforms continuous-time consistency distillation, possibly due to the larger variance in continuous-time objectives, and the fact that one can use effective higher-order ODE solvers in discrete-time objectives.
For consistency training (CT), we find it important to initialize consistency models from a pre-trained EDM model in order
to stabilize training when using continuous-time objectives. We hypothesize that this is caused by the large variance in our
continuous-time loss functions. For fair comparison, we thus initialize all consistency models from the same pre-trained
EDM model on CIFAR-10 for both discrete-time and continuous-time CT, even though the former works well with random
initialization. We leave variance reduction techniques for continuous-time CT to future research.
We empirically compare the following objectives:

• CT (LPIPS): Consistency training $\mathcal{L}_{\text{CT}}^N$ with $N = 120$ and the LPIPS metric. We set the learning rate to 4e-4, and the EMA decay rate for the target network to 0.99. We do not use the schedule functions for $N$ and $\mu$ here because they cause slower learning when the consistency model is initialized from a pre-trained EDM model.

• CT$^\infty$ ($\ell_2$): Consistency training $\mathcal{L}_{\text{CT}}^\infty$ with the $\ell_2$ metric. We set the learning rate to 5e-6.

• CT$^\infty$ (LPIPS): Consistency training $\mathcal{L}_{\text{CT}}^\infty$ with the LPIPS metric. We set the learning rate to 5e-6.

As shown in Fig. 7b, the LPIPS metric leads to improved performance for continuous-time CT. We also find that continuous-time CT outperforms discrete-time CT with the same LPIPS metric. This is likely due to the bias in discrete-time CT, as $\Delta t > 0$ in Theorem 2 for discrete-time objectives, whereas continuous-time CT has no bias since it implicitly drives $\Delta t$ to 0.

C. Additional Experimental Details


Model Architectures We follow Song et al. (2021); Dhariwal & Nichol (2021) for model architectures. Specifically, we use the NCSN++ architecture in Song et al. (2021) for all CIFAR-10 experiments, and take the corresponding network architectures from Dhariwal & Nichol (2021) when performing experiments on ImageNet 64 × 64, LSUN Bedroom 256 × 256 and LSUN Cat 256 × 256.

Parameterization for Consistency Models We use the same architectures for consistency models as those used for EDMs. The only difference is we slightly modify the skip connections in EDM to ensure the boundary condition holds for consistency models. Recall that in Section 3 we propose to parameterize a consistency model in the following form:
$$f_\theta(x, t) = c_{\text{skip}}(t)\, x + c_{\text{out}}(t)\, F_\theta(x, t).$$
In EDM (Karras et al., 2022), authors choose
$$c_{\text{skip}}(t) = \frac{\sigma_{\text{data}}^2}{t^2 + \sigma_{\text{data}}^2}, \qquad c_{\text{out}}(t) = \frac{\sigma_{\text{data}}\, t}{\sqrt{\sigma_{\text{data}}^2 + t^2}},$$
where $\sigma_{\text{data}} = 0.5$. However, this choice of $c_{\text{skip}}$ and $c_{\text{out}}$ does not satisfy the boundary condition when the smallest time instant $\epsilon \neq 0$. To remedy this issue, we modify them to
$$c_{\text{skip}}(t) = \frac{\sigma_{\text{data}}^2}{(t - \epsilon)^2 + \sigma_{\text{data}}^2}, \qquad c_{\text{out}}(t) = \frac{\sigma_{\text{data}}(t - \epsilon)}{\sqrt{\sigma_{\text{data}}^2 + t^2}},$$
which clearly satisfies $c_{\text{skip}}(\epsilon) = 1$ and $c_{\text{out}}(\epsilon) = 0$.
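A minimal sketch of this parameterization is given below; the free-form network `F_theta` is a hypothetical callable, and $\sigma_{\text{data}} = 0.5$ together with a small $\epsilon$ (we assume an EDM-style value such as 0.002, which is not restated in this appendix) are the constants used.

```python
def c_skip(t, sigma_data=0.5, eps=0.002):
    """Modified skip coefficient: equals 1 at t = eps."""
    return sigma_data ** 2 / ((t - eps) ** 2 + sigma_data ** 2)

def c_out(t, sigma_data=0.5, eps=0.002):
    """Modified output coefficient: equals 0 at t = eps."""
    return sigma_data * (t - eps) / (sigma_data ** 2 + t ** 2) ** 0.5

def consistency_model(F_theta, x, t):
    """f_theta(x, t) = c_skip(t) x + c_out(t) F_theta(x, t);
    the boundary condition f_theta(x, eps) = x holds by construction."""
    return c_skip(t) * x + c_out(t) * F_theta(x, t)
```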

Schedule Functions for Consistency Training As discussed in Section 5, consistency training requires specifying schedule functions $N(\cdot)$ and $\mu(\cdot)$ for best performance. Throughout our experiments, we use schedule functions that take the form below:
$$N(k) = \left\lceil\sqrt{\frac{k}{K}\big((s_1 + 1)^2 - s_0^2\big) + s_0^2} - 1\right\rceil + 1, \qquad \mu(k) = \exp\left(\frac{s_0 \log \mu_0}{N(k)}\right),$$
where $K$ denotes the total number of training iterations, $s_0$ denotes the initial discretization steps, $s_1 > s_0$ denotes the target discretization steps at the end of training, and $\mu_0 > 0$ denotes the EMA decay rate at the beginning of model training.
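In code, the two schedules amount to a few lines; the defaults below mirror the CIFAR-10 CT settings in Table 3 ($s_0 = 2$, $s_1 = 150$, $\mu_0 = 0.9$) and are a sketch rather than the authors' implementation.

```python
import math

def N_schedule(k, K, s0=2, s1=150):
    """Discretization schedule N(k), ramping from s0 at k = 0 towards s1 over K iterations."""
    return math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2) - 1) + 1

def mu_schedule(k, K, s0=2, s1=150, mu0=0.9):
    """EMA decay schedule mu(k) = exp(s0 * log(mu0) / N(k))."""
    return math.exp(s0 * math.log(mu0) / N_schedule(k, K, s0, s1))
```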

Training Details In both consistency distillation and progressive distillation, we distill EDMs (Karras et al., 2022). We trained these EDMs ourselves according to the specifications given in Karras et al. (2022). The original EDM paper did not provide hyperparameters for the LSUN Bedroom 256 × 256 and Cat 256 × 256 datasets, so we mostly used the same hyperparameters as those for the ImageNet 64 × 64 dataset. The difference is that we trained for 600k and 300k iterations for the LSUN Bedroom and Cat datasets respectively, and reduced the batch size from 4096 to 2048.

We used the same EMA decay rate for LSUN 256 × 256 datasets as for the ImageNet 64 × 64 dataset. For progressive distillation, we used the same training settings as those described in Salimans & Ho (2022) for CIFAR-10 and ImageNet 64 × 64. Although the original paper did not test on LSUN 256 × 256 datasets, we used the same settings for ImageNet 64 × 64 and found them to work well.

In all distillation experiments, we initialized the consistency model with pre-trained EDM weights. For consistency training, we initialized the model randomly, just as we did for training the EDMs. We trained all consistency models with the Rectified Adam optimizer (Liu et al., 2019), with no learning rate decay or warm-up, and no weight decay. We also applied EMA to the weights of the online consistency models in both consistency distillation and consistency training, as well as to the weights of the EDMs during their training, following Karras et al. (2022). For LSUN 256 × 256 datasets, we chose the EMA decay rate to be the same as that for ImageNet 64 × 64, except for consistency distillation on LSUN Bedroom 256 × 256, where we found that using zero EMA worked better.

When using the LPIPS metric on CIFAR-10 and ImageNet 64 × 64, we rescale images to resolution 224 × 224 with bilinear upsampling before feeding them to the LPIPS network. For LSUN 256 × 256, we evaluated LPIPS without rescaling inputs. In addition, we performed horizontal flips for data augmentation for all models and on all datasets. We trained all models on a cluster of Nvidia A100 GPUs. Additional hyperparameters for consistency training and distillation are listed in Table 3.
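A sketch of the LPIPS evaluation with the bilinear rescaling described above is shown below; it assumes the `lpips` PyPI package of Zhang et al. (2018) and inputs already scaled to $[-1, 1]$, and is illustrative rather than the exact evaluation code.

```python
import torch
import torch.nn.functional as F
import lpips  # pip package implementing the metric of Zhang et al. (2018)

lpips_fn = lpips.LPIPS(net='vgg')  # LPIPS distance with VGG features

def lpips_distance(x, y, resize=True):
    """LPIPS between image batches x, y in [-1, 1]; bilinearly upsampled to 224 x 224
    first when used on CIFAR-10 or ImageNet 64 x 64, as described above."""
    if resize:
        x = F.interpolate(x, size=(224, 224), mode='bilinear', align_corners=False)
        y = F.interpolate(y, size=(224, 224), mode='bilinear', align_corners=False)
    return lpips_fn(x, y)
```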

D. Additional Results on Zero-Shot Image Editing


With consistency models, we can perform a variety of zero-shot image editing tasks. As an example, we present additional
results on colorization (Fig. 8), super-resolution (Fig. 9), inpainting (Fig. 10), interpolation (Fig. 11), denoising (Fig. 12),
and stroke-guided image generation (SDEdit, Meng et al. (2021), Fig. 13). The consistency model used here is trained via consistency distillation on LSUN Bedroom 256 × 256.

All these image editing tasks, except for image interpolation and denoising, can be performed via a small modification to the multistep sampling algorithm in Algorithm 1. The resulting pseudocode is provided in Algorithm 4. Here $y$ is a reference image that guides sample generation, $\Omega$ is a binary mask, $\odot$ computes element-wise products, and $A$ is an invertible linear transformation that maps images into a latent space where the conditional information in $y$ is infused into the iterative generation procedure by masking with $\Omega$. Unless otherwise stated, we choose
$$t_i = \left(T^{1/\rho} + \frac{i - 1}{N - 1}\left(\epsilon^{1/\rho} - T^{1/\rho}\right)\right)^{\rho}$$
in our experiments, where $N = 40$ for LSUN Bedroom 256 × 256.

Algorithm 4 Zero-Shot Image Editing
1: Input: Consistency model $f_\theta(\cdot, \cdot)$, sequence of time points $t_1 > t_2 > \cdots > t_N$, reference image $y$, invertible linear transformation $A$, and binary image mask $\Omega$
2: $y \leftarrow A^{-1}[(Ay) \odot (1 - \Omega) + 0 \odot \Omega]$
3: Sample $x \sim \mathcal{N}(y, t_1^2 I)$
4: $x \leftarrow f_\theta(x, t_1)$
5: $x \leftarrow A^{-1}[(Ay) \odot (1 - \Omega) + (Ax) \odot \Omega]$
6: for $n = 2$ to $N$ do
7:    Sample $x \sim \mathcal{N}(x, (t_n^2 - \epsilon^2) I)$
8:    $x \leftarrow f_\theta(x, t_n)$
9:    $x \leftarrow A^{-1}[(Ay) \odot (1 - \Omega) + (Ax) \odot \Omega]$
10: end for
11: Output: $x$
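A compact sketch of Algorithm 4, together with the time schedule above, might look as follows; `f_theta`, `A`, `A_inv`, and the constants $T$, $\epsilon$, $\rho$ are illustrative placeholders (EDM-style values such as $T = 80$, $\epsilon = 0.002$, $\rho = 7$ are assumed here, since they are not restated in this appendix).

```python
import torch

def edit_timesteps(N=40, T=80.0, eps=0.002, rho=7.0):
    """Decreasing time points t_1 > ... > t_N from the schedule above."""
    i = torch.arange(1, N + 1, dtype=torch.float32)
    return (T ** (1 / rho) + (i - 1) / (N - 1) * (eps ** (1 / rho) - T ** (1 / rho))) ** rho

@torch.no_grad()
def zero_shot_edit(f_theta, y, mask, A, A_inv, ts, eps=0.002):
    """Sketch of Algorithm 4.  `f_theta(x, t)` is a trained consistency model, `y` the
    reference image, `mask` the binary mask Omega (1 = entries to regenerate), and
    `A`/`A_inv` the invertible linear map and its inverse.  All names are illustrative."""
    y = A_inv(A(y) * (1 - mask))                       # line 2: zero out the unknown entries
    x = y + ts[0] * torch.randn_like(y)                # line 3: x ~ N(y, t_1^2 I)
    x = f_theta(x, ts[0])                              # line 4
    x = A_inv(A(y) * (1 - mask) + A(x) * mask)         # line 5: re-impose known information
    for t_n in ts[1:]:                                 # lines 6-10
        x = x + (t_n ** 2 - eps ** 2) ** 0.5 * torch.randn_like(x)
        x = f_theta(x, t_n)
        x = A_inv(A(y) * (1 - mask) + A(x) * mask)
    return x
```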
Below we describe how to perform each task using Algorithm 4.

Inpainting When using Algorithm 4 for inpainting, we let y be an image where missing pixels are masked out, Ω be a
binary mask where 1 indicates the missing pixels, and A be the identity transformation.

Colorization The algorithm for image colorization is similar, as colorization becomes a special case of inpainting once we transform data into a decoupled space. Specifically, let $y \in \mathbb{R}^{h \times w \times 3}$ be a gray-scale image that we aim to colorize, where all channels of $y$ are assumed to be the same, i.e., y[:, :, 0] = y[:, :, 1] = y[:, :, 2] in NumPy notation. In our experiments, each channel of this gray-scale image is obtained from a colorful image by averaging the RGB channels with
$$0.2989 R + 0.5870 G + 0.1140 B.$$
We define $\Omega \in \{0, 1\}^{h \times w \times 3}$ to be a binary mask such that
$$\Omega[i, j, k] = \begin{cases} 1, & k = 1 \text{ or } 2 \\ 0, & k = 0 \end{cases}.$$
Let $Q \in \mathbb{R}^{3 \times 3}$ be an orthogonal matrix whose first column is proportional to the vector $(0.2989, 0.5870, 0.1140)$. This orthogonal matrix can be obtained easily via QR decomposition, and we use the following in our experiments
$$Q = \begin{pmatrix} 0.4471 & -0.8204 & 0.3563 \\ 0.8780 & 0.4785 & 0 \\ 0.1705 & -0.3129 & -0.9343 \end{pmatrix}.$$
We then define the linear transformation $A: x \in \mathbb{R}^{h \times w \times 3} \mapsto y \in \mathbb{R}^{h \times w \times 3}$, where
$$y[i, j, k] = \sum_{l=0}^{2} x[i, j, l]\, Q[l, k].$$
Because $Q$ is orthogonal, the inversion $A^{-1}: y \in \mathbb{R}^{h \times w \times 3} \mapsto x \in \mathbb{R}^{h \times w \times 3}$ is easy to compute, where
$$x[i, j, k] = \sum_{l=0}^{2} y[i, j, l]\, Q[k, l].$$
With $A$ and $\Omega$ defined as above, we can now use Algorithm 4 for image colorization.
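The sketch below builds such a $Q$ by QR decomposition and implements $A$, $A^{-1}$, and $\Omega$ in NumPy; the resulting $Q$ may differ from the specific matrix above in its last two columns, which is immaterial since only the first column is pinned down.

```python
import numpy as np

# Orthogonal Q whose first column is proportional to the RGB -> gray weights.
w = np.array([0.2989, 0.5870, 0.1140])
M = np.eye(3)
M[:, 0] = w
Q, _ = np.linalg.qr(M)
if Q[0, 0] < 0:                 # fix the sign so the first column points along +w
    Q = -Q

def A(x):
    """Map an (h, w, 3) image into the decoupled space; channel 0 carries the gray content."""
    return np.einsum('ijl,lk->ijk', x, Q)

def A_inv(y):
    """Inverse map; Q is orthogonal, so A^{-1} multiplies by Q^T."""
    return np.einsum('ijl,kl->ijk', y, Q)

def colorization_mask(h, w):
    """Binary mask Omega: regenerate the two chroma channels, keep channel 0 fixed."""
    mask = np.zeros((h, w, 3))
    mask[..., 1:] = 1.0
    return mask
```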

Super-resolution With a similar strategy, we employ Algorithm 4 for image super-resolution. For simplicity, we assume that the down-sampled image is obtained by averaging non-overlapping patches of size $p \times p$. Suppose the shape of full resolution images is $h \times w \times 3$. Let $y \in \mathbb{R}^{h \times w \times 3}$ denote a low-resolution image naively up-sampled to full resolution, where pixels in each non-overlapping patch share the same value. Additionally, let $\Omega \in \{0, 1\}^{h/p \times w/p \times p^2 \times 3}$ be a binary mask such that
$$\Omega[i, j, k, l] = \begin{cases} 1, & k \ge 1 \\ 0, & k = 0 \end{cases}.$$
Similar to image colorization, super-resolution requires an orthogonal matrix $Q \in \mathbb{R}^{p^2 \times p^2}$ whose first column is $(1/p, 1/p, \cdots, 1/p)$. This orthogonal matrix can be obtained with QR decomposition. To perform super-resolution, we define the linear transformation $A: x \in \mathbb{R}^{h \times w \times 3} \mapsto y \in \mathbb{R}^{h/p \times w/p \times p^2 \times 3}$, where
$$y[i, j, k, l] = \sum_{m=0}^{p^2 - 1} x[i \times p + (m - m \bmod p)/p,\; j \times p + m \bmod p,\; l]\, Q[m, k].$$
The inverse transformation $A^{-1}: y \in \mathbb{R}^{h/p \times w/p \times p^2 \times 3} \mapsto x \in \mathbb{R}^{h \times w \times 3}$ is easy to derive, with
$$x[i \times p + (m - m \bmod p)/p,\; j \times p + m \bmod p,\; l] = \sum_{k=0}^{p^2 - 1} y[i, j, k, l]\, Q[m, k].$$
Above definitions of $A$ and $\Omega$ allow us to use Algorithm 4 for image super-resolution.
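An equivalent patch-reshape implementation is sketched below; the within-patch ordering may differ from the index formula above, but any fixed ordering works as long as $A$ and $A^{-1}$ use the same one.

```python
import numpy as np

def make_Q(p):
    """Orthogonal Q in R^{p^2 x p^2} whose first column is (1/p, ..., 1/p), via QR."""
    M = np.eye(p * p)
    M[:, 0] = 1.0 / p
    Q, _ = np.linalg.qr(M)
    return -Q if Q[0, 0] < 0 else Q

def A(x, Q, p):
    """(h, w, 3) image -> (h/p, w/p, p^2, 3) latent: flatten each p x p patch, rotate by Q."""
    h, w, _ = x.shape
    patches = (x.reshape(h // p, p, w // p, p, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(h // p, w // p, p * p, 3))
    return np.einsum('ijml,mk->ijkl', patches, Q)

def A_inv(y, Q, p):
    """Inverse map: rotate back by Q^T and reassemble the patches."""
    patches = np.einsum('ijkl,mk->ijml', y, Q)
    hp, wp = y.shape[0], y.shape[1]
    return (patches.reshape(hp, wp, p, p, 3)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(hp * p, wp * p, 3))
```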

Stroke-guided image generation We can also use Algorithm 4 for stroke-guided image generation as introduced in SDEdit (Meng et al., 2021). Specifically, we let $y \in \mathbb{R}^{h \times w \times 3}$ be a stroke painting. We set $A = I$, and define $\Omega \in \mathbb{R}^{h \times w \times 3}$ as a matrix of ones. In our experiments, we set $t_1 = 5.38$ and $t_2 = 2.24$, with $N = 2$.

Denoising It is possible to denoise images perturbed with various scales of Gaussian noise using a single consistency model. Suppose the input image $x$ is perturbed with $\mathcal{N}(0; \sigma^2 I)$. As long as $\sigma \in [\epsilon, T]$, we can evaluate $f_\theta(x, \sigma)$ to produce the denoised image.

Interpolation We can interpolate between two images generated by consistency models. Suppose the first sample $x_1$ is produced by noise vector $z_1$, and the second sample $x_2$ is produced by noise vector $z_2$. In other words, $x_1 = f_\theta(z_1, T)$ and $x_2 = f_\theta(z_2, T)$. To interpolate between $x_1$ and $x_2$, we first use spherical linear interpolation to get
$$z = \frac{\sin[(1 - \alpha)\psi]}{\sin(\psi)}\, z_1 + \frac{\sin(\alpha\psi)}{\sin(\psi)}\, z_2,$$
where $\alpha \in [0, 1]$ and $\psi = \arccos\left(\frac{z_1^{\mathsf{T}} z_2}{\|z_1\|_2 \|z_2\|_2}\right)$, then evaluate $f_\theta(z, T)$ to produce the interpolated image.
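A NumPy sketch of the spherical linear interpolation step is given below; the interpolated latent is then passed to $f_\theta(\cdot, T)$.

```python
import numpy as np

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two flattened-or-imagelike noise vectors."""
    a, b = z1.ravel(), z2.ravel()
    cos_psi = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    psi = np.arccos(np.clip(cos_psi, -1.0, 1.0))
    return (np.sin((1 - alpha) * psi) * z1 + np.sin(alpha * psi) * z2) / np.sin(psi)
```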
2 2

E. Additional Samples from Consistency Models


We provide additional samples from consistency distillation (CD) and consistency training (CT) on CIFAR-10 (Figs. 14 and 18), ImageNet 64 × 64 (Figs. 15 and 19), LSUN Bedroom 256 × 256 (Figs. 16 and 20) and LSUN Cat 256 × 256 (Figs. 17 and 21).

Figure 8: Gray-scale images (left), colorized images by a consistency model (middle), and ground truth (right).

Figure 9: Downsampled images of resolution 32 × 32 (left), full resolution (256 × 256) images generated by a consistency model (middle), and ground truth images of resolution 256 × 256 (right).

Figure 10: Masked images (left), imputed images by a consistency model (middle), and ground truth (right).

Figure 11: Interpolating between leftmost and rightmost images with spherical linear interpolation. All samples are generated by a consistency model trained on LSUN Bedroom 256 × 256.

Figure 12: Single-step denoising with a consistency model. The leftmost images are ground truth. For every two rows, the
top row shows noisy images with different noise levels, while the bottom row gives denoised images.

Figure 13: SDEdit with a consistency model. The leftmost images are stroke painting inputs. Images on the right side are
the results of stroke-guided image generation (SDEdit).

(a) EDM (FID=2.04)

(b) CD with single-step generation (FID=3.55)

(c) CD with two-step generation (FID=2.93)

Figure 14: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise.

(a) EDM (FID=2.44)

(b) CD with single-step generation (FID=6.20)

(c) CD with two-step generation (FID=4.70)

Figure 15: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise.

(a) EDM (FID=3.57)

(b) CD with single-step generation (FID=7.80)

(c) CD with two-step generation (FID=5.22)

Figure 16: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise.

(a) EDM (FID=6.69)

(b) CD with single-step generation (FID=10.99)

(c) CD with two-step generation (FID=8.84)

Figure 17: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise.

(a) EDM (FID=2.04)

(b) CT with single-step generation (FID=8.73)

(c) CT with two-step generation (FID=5.83)

Figure 18: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise.

(a) EDM (FID=2.44)

(b) CT with single-step generation (FID=12.96)

(c) CT with two-step generation (FID=11.12)

Figure 19: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise.

(a) EDM (FID=3.57)

(b) CT with single-step generation (FID=16.00)

(c) CT with two-step generation (FID=7.80)

Figure 20: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise.

(a) EDM (FID=6.69)

(b) CT with single-step generation (FID=20.70)

(c) CT with two-step generation (FID=11.76)

Figure 21: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise.
