
Masked Diffusion Models are Fast Distribution Learners

Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba*, Zhan Qin, Zhibo Wang
Zhenguang Liu, Kui Ren
Zhejiang University, Hangzhou, China
{jiachenlei, qinglong.wang, peng_cheng, zhongjieba, qinzhan, zhibowang,
liuzhenguang, kuiren}@zju.edu.cn

* Corresponding author

Abstract

Diffusion models have emerged as the de-facto models for image generation, yet their heavy training overhead hinders broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs and hence requires in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training it to learn some primer distribution that loosely characterizes the unknown real image distribution. The pre-trained model can then be fine-tuned for various generation tasks efficiently. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90%) of each input image to approximately represent the primer distribution, and introduce a masked denoising score matching objective to train the model to denoise the visible areas. In the subsequent fine-tuning stage, we efficiently train the diffusion model without masking. Utilizing this two-stage training framework, we achieve significant training acceleration and a new FID record of 6.27 on CelebA-HQ 256 × 256 for ViT-based diffusion models. The generalizability of a pre-trained model further helps build models that perform better than ones trained from scratch on different downstream datasets. For instance, a diffusion model pre-trained on VGGFace2 attains a 46% quality improvement when fine-tuned on a different dataset that contains only 3000 images. Our code is available at https://github.com/jiachenlei/maskdm.

1. Introduction

Diffusion models [17, 40, 42, 43] have demonstrated exceptional performance in image generation and emerged as the state-of-the-art learning models for this task. The core denoising training approach has also been quickly adopted in various tasks such as image editing [1, 14, 37, 39] and controllable image generation [2, 7, 12, 31–33, 36, 38]. However, this approach is commonly adopted by training a model to simultaneously learn all fine-grained visual details presented in images throughout the entire training process, demanding intensive computational resources, especially for generating high-resolution images. In this work, we investigate whether the current denoising training paradigm can be improved to accelerate the overall training process by avoiding modeling complete images in the early training stage, and whether the improved paradigm can be applied in tandem with previous studies.

We start by describing our approach with painting as an intuitive example. Rather than directly traversing all fine-grained details, a painter usually starts with the more distinguishing features, such as the global structure or locally prominent texture. We anticipate that this natural task decomposition can also be applied to training diffusion models, making training comparably easier by first approximating some "primer" distributions that preserve salient image features. This eases training, since inspecting all intricate details from the start usually causes considerable training difficulty. With a well pre-trained model that has learned the salient image features, the subsequent learning of complete, detailed image information can be effectively accelerated.

However, it is non-trivial to learn such primer distributions from the real distribution, which is itself unknown. To address this challenge, we first define a primer distribution as one that shares with the target distribution the same group of marginals, which contain diverse important features, and hence can be further transformed into the target distribution. There exist many distributions that satisfy this definition. In our approach, we propose a simple yet effective method to implicitly approximate such a primer distribution by modeling various marginals of the target distribution. Specifically, we apply random masking to every image input to a diffusion model. Each masked image can be regarded as a sample drawn from some arbitrary marginal distribution. We also provide the model with the positional information of the visible pixels as clues to distinguish different marginals.
We consider this approach as sharing a close spirit with Dropout [44], which essentially learns a distribution over models. By performing denoising training on the visible parts, we try to approximately learn a joint distribution composed of various marginal distributions, which can be aggregated to preserve meaningful local or global features.

Consequently, the prevalent end-to-end process for training a diffusion model can be decomposed into a two-stage path: a first masked pre-training stage, which provides a preferable initialization point for modeling the target distribution by performing masked denoising score matching (MDSM) on the visible parts, followed by denoising fine-tuning equipped with the conventional weighted denoising score matching (DSM) objective [17, 46] in the second stage. The masking strategy and rate are chosen empirically as hyper-parameters and remain fixed throughout the training process. We name the models yielded by this training framework Masked Diffusion Models (MaskDM). It is important to note that sufficient masked pre-training accelerates training across various datasets, even when a model can only be fine-tuned with limited data. The contributions can be summarized as follows:

(i) We design a two-stage training framework that integrates various masking strategies into the pre-training stage to improve the efficiency of diffusion model training. Through thorough experiments, we examine the effects of different masking configurations on both model performance and efficiency improvement, offering practical guidance for implementing our proposed framework. In particular, we apply the proposed framework to set a new record FID score of 6.27 on the CelebA-HQ 256 × 256 dataset.

(ii) We demonstrate that masked pre-training reduces the computational expense of adapting pre-trained models to various datasets through fine-tuning. Specifically, we show that fine-tuning models pre-trained on different datasets yields better performance than training models from scratch on the given dataset, under the same training time and data conditions. This performance gap is even larger when the available data is limited.

(iii) We employ masked pre-training to significantly mitigate the training complexity of ViT-based diffusion models. Through extensive experiments, we compare the performance of models trained with and without our proposed framework. We show that the improvement in both training efficiency and generation performance becomes increasingly evident as image resolution increases. These results help pave the way for broader adoption of the ViT architecture for constructing more powerful diffusion models.

2. Preliminary on diffusion models

Training a diffusion model [17, 40] involves a forward and a reverse process. The forward process is defined as a discrete Markov chain of length T: q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t-1}). For each step t ∈ [1, T] in the forward process, a diffusion model adds noise ε_t sampled from the Gaussian distribution N(0, I) to the data x_{t-1} and obtains the disturbed data x_t from q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) x_{t-1}, β_t I). β determines the scale of the added noise at each step and can be prescribed in different ways [17, 29] such that p(x_T) ≈ N(0, I). Noticeably, instead of sampling sequentially along the Markov chain, we can sample x_t at any time step t in closed form via q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1 − ᾱ_t) I), where ᾱ_t = ∏_{s=1}^{t} (1 − β_s). The reverse process is also defined as a Markov chain: p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t-1} | x_t). In DDPM [17], p_θ(x_{t-1} | x_t) is parameterized as N(x_{t-1}; μ_θ(x_t, t), σ_t² I), where μ_θ(x_t, t) = (1/√α_t) (x_t − (β_t/√(1 − ᾱ_t)) ε_θ(x_t, t)) and σ_t is a time-dependent constant. Given x_t and the time step t, ε_θ is a neural network that aims to predict the noise ε ∼ N(0, I) used to construct x_t from x_0. Using this parameterization, the variational objective in [40] is ultimately simplified to Eq. 1, which can be seen as a variant of DSM [46] over multiple noise scales.

L(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \right]   (1)

In our two-stage training framework, we mainly adopt the above vanilla training procedure of DDPM in the fine-tuning stage. The pre-trained model can also be fine-tuned with other training frameworks, such as VPSDE [43], the continuous variant of DDPM (investigated in Sec. 4.4). Before fine-tuning starts, the model is loaded with the weights obtained via masked pre-training, which we introduce in Sec. 3.1.
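For illustration, the snippet below gives a minimal PyTorch sketch of this standard training objective: the closed-form noising of x_0 and the ε-prediction loss of Eq. 1. It assumes a linear β schedule and a placeholder ε-prediction network `eps_model`; it is not the exact implementation in our repository.

```python
import torch

def make_linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    # beta_t for t = 1..T and the cumulative products alpha_bar_t
    betas = torch.linspace(beta_1, beta_T, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    return betas, alpha_bars

def dsm_loss(eps_model, x0, alpha_bars):
    """Vanilla denoising score matching loss of Eq. (1).

    eps_model(x_t, t) is expected to predict the noise eps added to x0.
    """
    B = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (B,), device=x0.device)  # uniform time steps
    eps = torch.randn_like(x0)                                         # eps ~ N(0, I)
    a_bar = alpha_bars.to(x0.device)[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps               # closed-form q(x_t | x_0)
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```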

3. Masked diffusion models

We present an intuitive explanation of our design in Fig. 2. Assume that we are approximating a 2D Swiss roll distribution p(z) (marked by the red spiral line), where z = (x, y). There is another distribution pϕ(z) (represented as a blue heatmap) that fully covers p(z) by traversing all of its modes. Rather than directly approximating p(z), it is expected to be comparably easier to gradually transform an initial distribution pϕ(z), which shares with p(z) the same marginal distributions, i.e., p(x) and p(y), into p(z). For image data, as the dimensionality increases, the data space expands significantly faster than the space spanned by real image samples. As such, approximating a high-dimensional p(z) with pϕ(z), which partially preserves the sophisticated relations between different marginal distributions, may bring even more computational benefits; a toy 2D illustration is sketched after the figure captions below.

Figure 1. (a) Illustration of the pre-training process. The masked input can be viewed as a sample from pϕ(x_0^τ), while red boxes denote the parameters of pθ(x_0^τ). (b) An explanation of the process of approximating the primer distribution pϕ: the diffusion model pθ approximates pϕ by modeling its marginals pϕ(x_0^τ) via pθ(x_0^τ). Noticeably, pϕ and the true data distribution p(x_0) share the same set of marginal distributions according to our definition.

Figure 2. 2D Swiss roll example. The red spiral line denotes the true data distribution while the blue heatmap is a learned primer distribution.
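The toy sketch below constructs one concrete example of a primer distribution for the 2D Swiss roll: the product of its two marginals p(x)p(y), obtained simply by shuffling each coordinate independently. This only illustrates the definition (same marginals, different joint); it is not the construction we use for images, and scikit-learn's make_swiss_roll is assumed to be available.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# Samples from the 2D Swiss roll joint p(z), z = (x, y).
points, _ = make_swiss_roll(n_samples=10_000, noise=0.3)
z = points[:, [0, 2]]                      # keep the 2D spiral coordinates

# One primer distribution sharing the same marginals: the product p(x)p(y).
# Shuffling each coordinate independently breaks the joint structure while
# leaving every 1D marginal untouched.
rng = np.random.default_rng(0)
primer = np.stack([rng.permutation(z[:, 0]), rng.permutation(z[:, 1])], axis=1)

# Sanity check: marginal statistics match exactly, joint statistics do not.
print(np.allclose(z.mean(0), primer.mean(0)), np.allclose(z.std(0), primer.std(0)))
print(np.corrcoef(z.T)[0, 1], np.corrcoef(primer.T)[0, 1])   # correlations differ
```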

3.1. Masked pre-training

We denote an image x_0 (following convention, the subscript 0 is the time step of a clean image) by a vector (x_0^1, x_0^2, x_0^3, ..., x_0^N), where N represents the number of pixels. The data distribution p(x_0) can then be expressed as the joint distribution of the N pixels. Let τ represent a randomly selected subsequence of [1, ..., N] with length S. We denote the subset of selected pixels as {x_0^{τ_i}}, i = 1, ..., S, and the resulting marginal distribution as p(x̂_0^τ) = p(x_0^{τ_1}, x_0^{τ_2}, ..., x_0^{τ_S}). For simplicity, with S being fixed, we use x̂_0 to represent any combination of marginal variables {x̂_0^τ | τ ⊆ [1, ..., N], |τ| = S}, and p(x̂_0) to represent the corresponding marginal distribution. It is evident that p(x_0) belongs to a family Q of distributions that share the same set of marginals p(x̂_0). We introduce the term primer distribution to refer to any distribution in Q other than p(x_0) that satisfies this condition. We represent such distributions using the notation pϕ(x_0), where ϕ represents the unknown true distribution parameters.

It is non-trivial to approximate pϕ(x_0), particularly when samples from pϕ(x_0) are not available. We initialize the task of approximating pϕ(x_0) with a diffusion model pθ(x_0), defined as introduced in Sec. 2. In each training iteration, by training with a batch of images sampled from some arbitrary marginal distributions, which can be further viewed as sampled from pθ(x_0), we implicitly approximate pϕ(x_0) by modeling its various marginals.

To achieve this, we mask each image input x_0 with a vector M ∈ {0, 1}^N, and incorporate the positional information H ∈ R^N of the visible pixels into the model input as additional clues to distinguish different marginal distributions. The model input therefore becomes x̂_0 = M ⊙ (x_0 + H) and the noise is ε̂ = M ⊙ (ε + H). In practice, we observe that this simple masking approach suffices to preserve meaningful visual details while enabling much faster pre-training convergence. Furthermore, it facilitates the subsequent fine-tuning, hence reducing the overall training time. The masked image x̂_0 and noise ε̂ are then integrated to construct x̂_t such that x̂_t = √(ᾱ_t) x̂_0 + √(1 − ᾱ_t) ε̂. We then substitute x_0 and ε in Eq. 1 with x̂_0 and ε̂, respectively, to optimize the model parameters. For notational clarity, we name the updated objective masked denoising score matching (MDSM), as presented in Eq. 2:

L(\theta) = \mathbb{E}_{t, \hat{x}_0, \hat{\epsilon}} \left[ \left\| \hat{\epsilon} - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon},\; t\right) \right\|^2 \right]   (2)

In the working pipeline overview, as illustrated in Fig. 1, we present an example use case where a face image is masked with a set of grey patches. The masked image can be seen as a sample drawn from a marginal distribution identified by the selected square blocks, which marginalize out all covered pixels. Considering the positional information H as some fixed or learnable parameters of the model, pθ(x_0) is also "marginalized" by applying masking to subsample H. As such, given sufficient training time, pθ(x_0) converges to a certain primer distribution pϕ(x_0) in Q, based on which we further approximate the true data distribution p(x_0) via fine-grained denoising training, as discussed in Sec. 2. Additionally, following the conventional sampling procedure [17, 41] of diffusion models, we can draw samples from pθ(x_0) or its marginal distributions by customizing M.
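A minimal sketch of the MDSM objective of Eq. 2 is given below. It mirrors the written equations by zeroing out masked pixels; in practice the masked patches may instead be dropped from the ViT input for efficiency, which this sketch does not capture. The names `eps_model` and `pos` (for the positional information H) are illustrative placeholders carried over from the earlier sketch.

```python
import torch

def mdsm_loss(eps_model, x0, mask, pos, alpha_bars):
    """Masked denoising score matching (Eq. 2).

    x0   : (B, C, H, W) clean images
    mask : (B, 1, H, W) binary visibility mask M (1 = visible, 0 = masked out)
    pos  : (1, C, H, W) positional information H for the visible pixels
    """
    B = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (B,), device=x0.device)
    eps = torch.randn_like(x0)

    x0_hat = mask * (x0 + pos)          # masked image  x̂0 = M ⊙ (x0 + H)
    eps_hat = mask * (eps + pos)        # masked noise  ε̂  = M ⊙ (ε + H)

    a_bar = alpha_bars.to(x0.device)[t].view(B, 1, 1, 1)
    x_t_hat = a_bar.sqrt() * x0_hat + (1.0 - a_bar).sqrt() * eps_hat

    # Denoise the visible region only; the loss is the plain MSE of Eq. (2).
    return ((eps_hat - eps_model(x_t_hat, t)) ** 2).mean()
```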
3.2. Model architecture and masking configuration

Recent studies have shown growing interest in adopting ViT to build diffusion models, largely because this architecture allows easy scaling up [30] and is compatible with different data modalities [4]. Note that there are different ViT variants, e.g., U-ViT [3] and DiT [30]. In our implementation, we choose U-ViT as the backbone considering its simpler architecture design (in our early experiments, we observe that DiT and U-ViT achieve similar results). Nevertheless, the substantial computational burden, including high CUDA memory usage and lengthy training times, remains a predominant challenge in training ViT-based diffusion models. In our experiments, we find that our masked pre-training significantly boosts the training efficiency of ViT-based diffusion models (Fig. 5).

In practice, it is crucial to carefully choose the masking configuration, including both S (or the mask rate m = 1 − S/N) and the strategy for sampling the mask vector M. Specifically, the mask rate m determines the average degree of similarity between the true data distribution and the primer distributions, such that a lower value of m indicates a greater resemblance. Besides, given U-ViT as the backbone, a mask is sampled as a group of neighbouring pixels instead of individual, independent pixels. As such, the sampled masks essentially determine the range of primer distributions that could possibly be learned. As illustrations, Fig. 3b and Fig. 3c display samples from two different primer distributions, which are implicitly learned via different mask sampling strategies.

In this work, we design three different masking strategies, namely patch-wise masking, block-wise masking, and cropping. Examples of each masking type are shown in Fig. 3a. Patch-wise masking entails the random occlusion of a predefined number of image patches. Block-wise masking involves randomly selecting image blocks for masking, where each block comprises a fixed number of image patches. Lastly, cropping entails randomly selecting a top-left coordinate and the corresponding fixed-size square region, then masking the area outside the chosen square. We explore and compare a range of configurations in Sec. 4.2; a sketch of the three strategies is given below.
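The following is a minimal sketch of how the three mask types could be sampled on the patch grid (1 = visible patch); the helper names and arguments are illustrative assumptions rather than the repository's API, and the block variant assumes the patch grid is divisible by the block size.

```python
import torch

def patch_mask(grid_h, grid_w, mask_rate):
    # Randomly hide `mask_rate` of the patches, independently of their location.
    n = grid_h * grid_w
    keep = torch.zeros(n)
    keep[torch.randperm(n)[int(n * mask_rate):]] = 1.0   # 1 = visible patch
    return keep.view(grid_h, grid_w)

def block_mask(grid_h, grid_w, mask_rate, block=2):
    # Sample the mask on a coarser grid of block x block patch groups, then upsample.
    coarse = patch_mask(grid_h // block, grid_w // block, mask_rate)
    return coarse.repeat_interleave(block, 0).repeat_interleave(block, 1)

def crop_mask(grid_h, grid_w, mask_rate):
    # Keep a random square of patches and mask everything outside it.
    side = max(1, int(round((grid_h * grid_w * (1 - mask_rate)) ** 0.5)))
    top = torch.randint(0, grid_h - side + 1, (1,)).item()
    left = torch.randint(0, grid_w - side + 1, (1,)).item()
    m = torch.zeros(grid_h, grid_w)
    m[top:top + side, left:left + side] = 1.0
    return m
```

For a 64 × 64 image with a 4 × 4 patch size, grid_h = grid_w = 16, and block_mask(16, 16, 0.5, block=2) corresponds to the default 2 × 2 block-wise masking used for 64 × 64 inputs.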
4. Experiments

4.1. Experimental setup

Implementation details. We compare with existing methods on three datasets: CelebA [27], LSUN Church [50], and CelebA-HQ [20]. We resize images from CelebA to 64 × 64 and 128 × 128, images from LSUN Church to 64 × 64, and images from CelebA-HQ to 256 × 256. We implement MaskDM models with the U-ViT architecture introduced in [3] with certain modifications. Specifically, we utilize the U-ViT-Small setup from [3] to build MaskDM-S models and remove five transformer blocks from U-ViT-Mid to build MaskDM-B models. This allows a MaskDM-B model to fit on one Tesla V100 GPU given a single 256 × 256 image as input with a 4 × 4 patch size. In all MaskDMs, we discard the appended convolutional blocks that originally appear in the U-ViT model and find the performance to be only trivially affected. Unless specified otherwise, we adopt 2 × 2 block-wise masking when pre-training on 64 × 64 images, and 4 × 4 block-wise masking on images with resolution equal to or greater than 128 × 128. Further details are provided in Appendix 10.

Evaluation settings. During evaluation, we follow convention and use the Fréchet Inception Distance (FID) [16] to measure the quality of generated images. We mainly employ two different samplers, namely the Euler-Maruyama SDE sampler [43] and DDIM [41], to generate samples. When comparing with current methods, we compute FID scores on 50k generated samples, and we apply the Euler-Maruyama SDE sampler with 1k sampling steps on CelebA 64 × 64 and the DDIM sampler with 500 sampling steps on LSUN Church 64 × 64, CelebA 128 × 128, and CelebA-HQ 256 × 256.
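For reference, the sketch below shows a deterministic DDIM sampling loop (η = 0) over an evenly strided subset of the T training steps. It follows the standard DDIM update rather than our exact evaluation script; `eps_model` and `alpha_bars` follow the earlier sketches.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, alpha_bars, shape, n_steps=500, device="cuda"):
    """Deterministic DDIM sampling (eta = 0) over an evenly strided time grid."""
    alpha_bars = alpha_bars.to(device)
    T = alpha_bars.shape[0]
    times = torch.linspace(T - 1, 0, n_steps).long()      # t_n > ... > t_1 = 0
    x = torch.randn(shape, device=device)                 # start from x_T ~ N(0, I)

    for i in range(len(times) - 1):
        t, t_prev = times[i].item(), times[i + 1].item()
        a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        eps = eps_model(x, t_batch)
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean image
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps   # jump to the earlier step

    return x0_pred
```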

4.2. Investigating mask configurations

To investigate the impact of different masking strategies, we adjust the masking granularity and experiment with patch-wise masking, block-wise masking, and cropping, as demonstrated in Fig. 3a. Specifically, we pre-train models with different configurations on CelebA 64 × 64, using mask rates of 10%, 50%, and 90%, respectively. During pre-training, the GPU memory usage is fixed across experiments and the default number of pre-training iterations is set to 50k in all experiments, at which point we observe that the pre-training curves are saturated. Subsequently, given a pre-trained model, we fine-tune it for 200k steps to optimize the objective delineated in Eq. 1. To demonstrate the effect of adopting masked pre-training, we set up a baseline model that is trained on the same data from scratch (as detailed in Sec. 2) for 250k steps. This ensures comparable total training costs between the baseline model and its pre-trained counterparts.

Figure 3. (a) Three masking strategies examined in our experiments. From left to right, the strategies are patch-wise masking, block-wise masking, and cropping. (b) and (c) Samples from primer distributions captured with patch-wise and block-wise masking respectively, given a mask rate of 90%. Notably, the model pre-trained with cropping at a 90% mask rate exhibits limited capability in generating plausible samples; therefore, we do not illustrate those results here (please refer to Appendix 10).

Table 1. Mask configuration investigation on CelebA 64 × 64, where pre-trained weights are acquired from different masking configurations. The baseline model, trained from scratch without loading pre-trained weights, is marked in gray.

(a) Impact of mask configuration (FID↓ per mask rate)
Mask          10%    50%    90%
patch         6.85   6.58   7.34
2x2 block     6.77   6.51   8.99
4x4 block     6.92   6.88   6.91
cropping      6.92   6.82   8.62
from scratch  7.55

(b) Impact of computational budget (FID↓)
Mask       Rate  Steps  bs=128  bs=256
patch      10%   50k    6.85    6.31
2x2 block  10%   50k    6.77    6.71
2x2 block  50%   50k    -       6.51
2x2 block  50%   100k   -       6.27
2x2 block  50%   150k   -       6.05

(c) Impact of block size (FID↓, 50% mask rate)
Mask       FID↓
patch      6.58
2x2 block  6.51
4x4 block  6.88
8x8 block  7.43

Comparing different masking types and mask rates. As shown in Tab. 1(a), we first observe that models trained with cropping generally obtain the worst FID scores as we vary the mask rate. A possible explanation is that randomly cropped images retain limited global structural information, which constrains the model from building long-range connections among different variables. As a result, for a fixed batch size and number of training steps, cropping makes it more challenging for a diffusion model to capture consistent critical visual features. On the other hand, block-wise masking (including both 2x2 and 4x4 blocks) achieves the best results across all settings, while patch-wise masking achieves the second best FID scores.

By comparing the FID scores obtained with different mask rates, we observe that pre-trained models paired with the 50% mask rate outperform pre-trained models that adopt other mask rates in most cases. In particular, the model pre-trained with 2x2 block-wise masking and a 50% mask rate achieves an FID score of 6.51, which is significantly better than the baseline. We also notice that models pre-trained with a 90% mask rate exhibit a rapid divergence in FID scores after 50k training steps. We delve deeper into the causes of this problem by studying different influential factors (detailed findings are presented in Appendix 8) and find that the adopted linear noise schedule contributes significantly to the training instability. This can be effectively mitigated by utilizing the cosine noise schedule [29].

Moreover, the results presented in Tab. 1(a) show that different block sizes also impact the generation performance. Treating patch-wise masking as 1x1 block-wise masking, we compare different block sizes and present the results in Tab. 1(c). We observe that using a mask with a larger block size generally leads to performance degradation. Therefore, in the following, we mainly focus on evaluating patch-wise and block-wise masking approaches with different mask rates.

Delving into mask rate, batch size, and training steps. In our experiments, we are particularly interested in the case where the overall computation resource is limited, which is common in academic research. More concretely, we assume there is a fixed GPU resource budget. Given this resource constraint, we confront a trade-off between mask rate and batch size. For instance, to maintain a constant GPU usage, when applying a lower mask rate, which consumes more CUDA memory per image, we are limited to pre-training a model with a smaller batch of data. Indeed, in our experiments, models using a mask rate of 10% consume 1.5× more GPUs than those using a mask rate of 50%. This raises the question of whether the less competitive performance obtained with block-wise and patch-wise masking at a rate of 10% results from an inadequate batch size of 128. To investigate this question, we enlarge the batch size for both aforementioned settings to 256. This setting corresponds to the case where the computation resources are sufficient to support larger-batch pre-training. As presented in Tab. 1(b), both models trained with block-wise and patch-wise masking with a rate of 10% and a batch size of 256 exhibit improved performance, as expected.

We also find that a higher mask rate often requires fewer computing resources (i.e., GPUs) but slightly more training steps to achieve performance comparable to its lower mask rate counterparts. As such, we return to the case where the GPU memory capacity constraint still holds and confine ourselves to the previous best setting with 2x2 block-wise masking and a mask rate of 50%. We continue the above investigation by employing a batch size of 256 and exploring the effect of extending the resource constraint in terms of pre-training steps. Specifically, we increase the number of pre-training steps from 50k to 100k and 150k, respectively, and present the results in Tab. 1(b). We observe a clear trend of performance improvement as the number of pre-training steps increases. The results are in alignment with our expectation and indicate that a longer pre-training time is generally helpful for improving the overall training performance.

The above investigation indicates the importance of properly configuring the mask rate, batch size, and training steps for optimizing model performance while aligning with affordable computing resources. These empirical findings open the opportunity for designing an automated dynamic training schedule, similar to Successive Halving [19], that balances the trade-off between these intertwined hyper-parameters under a constant training budget. In fact, we have explored manually adjusting the training schedule and obtained the best generation performance in Sec. 4.3. We leave a more systematic study of training schedule automation to future work.
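As a rough back-of-envelope for the mask-rate versus batch-size trade-off: if, as the memory footprints above suggest, only the visible patches are processed by the backbone, then activation memory grows roughly linearly in the kept tokens and self-attention cost roughly quadratically. The numbers below are purely illustrative, not measured.

```python
# Visible tokens per image for a 64x64 input with 4x4 patches (16x16 = 256 patches).
patches = 16 * 16
for mask_rate in (0.10, 0.50, 0.90):
    kept = int(patches * (1 - mask_rate))
    # Rough relative costs against the 50% setting (128 kept tokens).
    rel_linear = kept / 128            # ~ token/activation memory
    rel_attn = (kept / 128) ** 2       # ~ self-attention FLOPs
    print(f"m={mask_rate:.0%}: {kept} tokens, ~{rel_linear:.1f}x memory, ~{rel_attn:.1f}x attention")
```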

4.3. Image Generation

To thoroughly evaluate and demonstrate the efficacy of our proposed training paradigm, we conduct multiple experiments on image generation tasks at different resolutions, including CelebA 64 × 64 and 128 × 128, LSUN Church 64 × 64, and CelebA-HQ 256 × 256.

Table 2. FID results on CelebA. Expense is measured in V100 days. †: converted into V100 days. ‡: estimated in V100 days.

64 × 64
Method        FID↓   Params  Expense
DDIM [41]     3.26   79M     -
PNDM [26]     2.71   79M     -
U-ViT [3]     2.87   44M     2.08‡
MaskDM-S      2.27   44M     2.09

128 × 128
Method        FID↓   Params  Expense
Gen-ViT [49]  22.07  12.9M   -
U-ViT         12.96  102M    11.67
MaskDM-B      6.83   102M    8.7†

Table 3. FID results on LSUN Church 64 × 64. Expense is measured in V100 days.

Method     FID↓  Params  Expense
U-ViT      6.58  44M     1.87
MaskDM-S   5.04  44M     1.80

Table 4. FID results on CelebA-HQ 256 × 256. Results of latent diffusion models are listed in gray color. Expense is measured in A100 days. ‡: estimated in V100 days. ∗: utilizes an extra model.

Method                FID↓   Params  Expense
VQ-GAN [13]           10.2   355M    -
PGGAN [20]            8.03   -       -
DDGAN [48]            7.64   -       -
Latent diffusion models
LSGM∗ [45]            7.22   -       -
LDM-4∗ [34]           5.11   274M    14.4‡
CNN-based diffusion models
VESDE [43]            7.23   66M     -
Soft Truncation [23]  7.16   66M     -
P2 Weighting [8]      6.91   94M     18.75‡
ViT-based diffusion models
U-ViT                 24.83  102M    18.28
MaskDM-B              6.27   102M    12.19

Figure 4. Qualitative results on CelebA-HQ 256 × 256 and LSUN Church 256 × 256.

Generation quality. On the CelebA 64 × 64 dataset, after loading the best pre-trained weights reported in Tab. 1(b), we fine-tune MaskDM-S on 2 V100 GPUs and achieve an FID score of 2.2. The overall training takes approximately 2.09 V100 days. Our result significantly surpasses the FID scores reported by the other works shown in Tab. 2.

On the CelebA 128 × 128 dataset, GenViT [49] is, to the best of our knowledge, the only ViT-based diffusion model built for this task. For comparison, we train a baseline U-ViT model with settings identical to our MaskDM-B model (architecture, fine-tuning hyper-parameters, and computational cost) using the objective detailed in Eq. 1. As previously mentioned in Sec. 4.2, we find that manually adjusting the mask rates and training steps during the pre-training stage leads to further performance improvement. Specifically, we pre-train a MaskDM-B with a 70% mask rate and then a 30% mask rate. We then fine-tune the model and achieve the lowest FID score of 6.83 among all compared models. The training takes approximately 3.96 A100 days on 2 A100 GPUs, comprising 3.26 A100 days of pre-training and 0.69 A100 days of fine-tuning.

On LSUN Church 64 × 64, we construct a MaskDM-S model with a 50% mask rate and spend 1.63 V100 days on pre-training, followed by 0.17 V100 days of fine-tuning. Comparing the trained MaskDM-S model with the baseline U-ViT model, we demonstrate a significant improvement of 23% in the FID score. Furthermore, given the considerable difference between the LSUN Church images and face images, this outcome convincingly showcases the generalizability of our proposed training methodology.
On the CelebA-HQ 256 × 256 dataset, we also manually adjust the mask rate and pre-train a MaskDM-B with a 90% mask rate and a 50% mask rate in sequence. Thereafter, we fine-tune the model and achieve an FID score of 6.27. Training this MaskDM-B model takes 12.19 A100 days on 2 A100 GPUs, comprising 9.86 A100 days of pre-training and 2.33 A100 days of fine-tuning. We build a baseline model as previously done for CelebA 128 × 128, and it only achieves an FID score of 24.83 at a training cost of 18.28 A100 days. Compared with other methods that either optimize models with adversarial training [13, 20, 48] or train UNet-based models [35] in raw pixel space [8, 23, 43], our MaskDM-B model achieves the best FID score using the U-ViT architecture with a comparable number of model parameters, while maintaining a reasonable training cost.

Comparison with LDMs. There are also studies [34, 45] on improving the training efficiency of diffusion models by focusing on the latent space. In comparison with these two methods, our MaskDM-B model significantly outperforms LSGM [45] but obtains worse performance than LDM [34]. It is important to note that LDM [34] utilizes an extra VAE [24] as the feature extraction model, which is pre-trained on ImageNet 256 × 256 for hundreds of thousands of steps. In contrast, we only train one 102M-parameter ViT-based model consistently on CelebA-HQ 256 × 256 without any extra data. We anticipate that our performance could be further enhanced by scaling up model parameters and incorporating advanced training techniques [11, 22].

Training efficiency. We further present comparisons of the training efficiency between baseline and MaskDM models on the CelebA and CelebA-HQ datasets in Fig. 5. To reach similar FID scores (shown on the vertical axes), MaskDM models save 60% (∼5/12 in Fig. 5(a)) to 80% (∼30/165 in Fig. 5(c)) of the training hours required by the baseline models. Moreover, the reduction in training time increases as the data dimensionality increases. We further select pairs of example human face images, as shown in Fig. 6, to provide a qualitative comparison between the synthesis results obtained by the pre-trained and baseline models at the same FID score. The more realistic synthesized images sampled from MaskDM evidently indicate better generation quality. We note that the pre-training curve is saturated for several steps. In practice, however, we find that more pre-training time eventually leads to better FID scores, consistent with the findings in Sec. 4.2.

Figure 5. Comparison of training efficiency between the baseline model and MaskDM on data of varying resolutions: (a) CelebA 64 × 64, (b) CelebA 128 × 128, and (c) CelebA-HQ 256 × 256 plot FID against training hours, while (d) plots the improvement in FID against data resolution. Noticeably, the reduction in training time increases as the resolution of the data increases.

Figure 6. Qualitative comparison between our pre-trained model (top row) and the U-ViT baseline (bottom row) for reaching the same FID score, as indicated by the gray line in Fig. 5(c). The more realistic synthesized images sampled from our model evidently indicate better generation quality.

Figure 7. (a) Comparison between the performance of scratch-trained and adapted models. (b) Comparison of models trained with no masking (yellow), masked pre-training (green), and cross-data fine-tuning (blue) on limited data.
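Putting the pieces together, the sketch below illustrates the kind of progressive schedule used in Sec. 4.3: pre-train at a high mask rate, continue at a lower one, then fine-tune on full images without masking. It reuses the hypothetical helpers from the earlier sketches (dsm_loss, mdsm_loss, and a mask generator), assumes the loader yields image batches, and uses placeholder rates, learning rates, and step counts rather than the exact published schedules.

```python
import itertools
import torch

def train_stage(model, loader, loss_fn, steps, lr=2e-4, device="cuda"):
    # Minimal inner loop shared by masked pre-training and fine-tuning.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for x0 in itertools.islice(itertools.cycle(loader), steps):
        loss = loss_fn(x0.to(device))
        opt.zero_grad(); loss.backward(); opt.step()

def two_stage_training(model, loader, alpha_bars, pos, make_mask):
    # Stage 1: masked pre-training, coarse to fine (e.g., 90% then 50% mask rate).
    for mask_rate, steps in ((0.9, 200_000), (0.5, 100_000)):
        train_stage(model, loader,
                    lambda x0, m=mask_rate: mdsm_loss(model, x0, make_mask(x0, m), pos, alpha_bars),
                    steps)
    # Stage 2: keep the pre-trained weights in `model` and fine-tune on full
    # images with the vanilla DSM objective of Eq. (1), without any masking.
    train_stage(model, loader, lambda x0: dsm_loss(model, x0, alpha_bars),
                steps=100_000, lr=1e-4)
```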

4.4. Generalizability of MaskDM

As previously mentioned in Section 1, we expect a MaskDM model to have desirable generalizability that facilitates fine-tuning for various downstream image generation tasks. In the following experiments, we assess the generalizability of MaskDM models by fine-tuning them with different datasets and training frameworks (Fig. 7(a)), and we also verify the benefit of a pre-trained model for tasks confronted with data scarcity by fine-tuning with limited data (Fig. 7(b)). Training details are presented in Appendix 8 and 9.

We first choose CelebA 64 × 64 as the source pre-training dataset and consider the scenario where the diffusion model training paradigms are not aligned across pre-training and fine-tuning (left bars in Fig. 7(a)). Specifically, we use DDPM in masked pre-training and adopt VPSDE [43] or VPCosine [29] in fine-tuning. We also collect the pre-trained models from Sec. 4.2 and fine-tune them on two other datasets, i.e., FFHQ [21] and AFHQ [9], with all images resized to a resolution of 64 × 64. This creates data distribution shifts between the pre-training and fine-tuning datasets. As shown in Fig. 7(a), when compared with models trained from scratch for each setting, pre-trained models demonstrate clearly stronger generalizability under both training paradigm and data distribution shifts.

We then construct two small training datasets, one with 3000 images (10%) and the other with 300 images (1%), from the CelebA-HQ 256 × 256 dataset. For each dataset, we maintain similar computational expenses for training models from scratch and training MaskDM models. As illustrated in Fig. 7(b), the final FID scores evaluated on the complete CelebA-HQ 256 × 256 dataset demonstrate that our proposed training framework significantly improves the quality of generated images.

Moreover, we leverage another model, which is pre-trained on the VGGFace2 256 × 256 dataset [6] (containing approximately 3M training images) with a mask rate of 90%, for comparison. As shown in Fig. 7(b), after loading the pre-trained weights, the FID score is substantially improved, by 46% and 42% in the experiments that contain 3000 and 300 images respectively. These results underscore the potential of utilizing a model that is sufficiently pre-trained for one generation task to tackle similar tasks that confront data scarcity issues.

5. Related Work

Efficient training for diffusion models. There have been various research efforts to improve the training efficiency of diffusion models. One rapidly evolving research thread involves employing masking strategies during the training process, as exemplified by the masked autoencoder (MAE) [15] approach. Specifically, two recent studies [34, 45] propose latent diffusion models (LDM). These models apply the diffusion process to low-dimensional features extracted by a pre-trained VAE [24], reducing the cost of approximating extraneous details in the raw data. Based on LDM, subsequent studies propose MDT and MaskDiT, which incorporate masking into the latent space during training. However, the training complexity of building the feature extraction model (which usually requires an extra large-scale training dataset such as ImageNet [10] or OpenImage [25]) is often overlooked. Moreover, these methods often require heavy computation costs for adapting to new tasks. In contrast, our method avoids dependencies on external models and maintains effectiveness even with limited data from downstream tasks. Also, our pre-trained MaskDM models are considerably easier and more efficient to generalize to various image generation tasks.

ViT-based diffusion models. Early research on diffusion models primarily utilized CNN architectures such as UNet. However, more recent studies [3, 5, 30, 49] have shown increasing interest in employing vision transformers (ViT) as the backbone for diffusion models. In comparison with CNN-based architectures, ViT-based architectures are readily scalable [30] and compatible [4] with different data modalities. Notably, recent works [18, 51] have demonstrated the superiority of ViT-based diffusion models, such as MaskDiT, which achieved an FID score of 2.28 on ImageNet in 200 V100 days, while LDM (based on UNet) cost 271 V100 days to achieve an FID score of 3.60 on the same dataset. In a study [47] parallel to our research, it was suggested to alternately train CNN-based diffusion models on both entire images and cropped patches. However, previous studies [3, 5, 30, 49] have analyzed the factors that affect the quality of generated samples and suggest that smaller patch sizes correlate with better sample quality. This finding exacerbates ViT's computational drawbacks, including high CUDA memory usage and extended training times. In this work, since we aim at reducing the training costs spent on the cumbersome approximation of full-resolution data, our proposed training framework can effectively mitigate this computational challenge for training ViT-based diffusion models.

6. Conclusion

In this work, a masked pre-training approach is proposed to improve the training efficiency of diffusion models in the context of image synthesis. We design a masked denoising score matching objective to guide the model toward learning a primer distribution that shares diverse and important features, conveyed in a group of marginals, with the target data distribution. We empirically investigate various masking configurations for their impact on model performance and training efficiency, and evaluate our approach using U-ViT for image synthesis in pixel space on several different datasets. The evaluation results show that our approach substantially reduces the training cost while maintaining high generation quality, outperforming the standard DDPM training method by a significant margin. We also conduct experiments to evaluate the generalizability of models pre-trained with our approach in the cases of training paradigm mismatch, data distribution shift, and limited training data. We demonstrate that our approach suffices to produce a pre-trained model with strong generalization capabilities.

References

[1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] Fan Bao, Chongxuan Li, Yue Cao, and Jun Zhu. All are worth words: a vit backbone for score-based diffusion models. arXiv preprint arXiv:2209.12152, 2022.
[4] Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. arXiv preprint arXiv:2303.06555, 2023.
[5] He Cao, Jianan Wang, Tianhe Ren, Xianbiao Qi, Yihao Chen, Yuan Yao, and Lei Zhang. Exploring vision transformers as diffusion learners. arXiv preprint arXiv:2212.13771, 2022.
[6] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018.
[7] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023.
[8] Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11472–11481, 2022.
[9] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[11] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[12] Dave Epstein, Allan Jabri, Ben Poole, Alexei A Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.
[13] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
[14] Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, and Humphrey Shi. Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models. arXiv preprint arXiv:2303.17546, 2023.
[15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[18] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
[19] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial intelligence and statistics, pages 240–248. PMLR, 2016.
[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
[22] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
[23] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation. arXiv preprint arXiv:2106.05527, 2021.
[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[25] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[26] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[28] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
[29] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[30] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
[31] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. arXiv preprint arXiv:2306.07280, 2023.
[32] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[36] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[37] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
[38] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[39] Jaskirat Singh, Stephen Gould, and Liang Zheng. High-fidelity guided image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5997–6006, 2023.
[40] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[42] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
[43] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[44] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[45] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
[46] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
[47] Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, and Mingyuan Zhou. Patch diffusion: Faster and more data-efficient training of diffusion models. arXiv preprint arXiv:2304.12526, 2023.
[48] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
[49] Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, and Shihao Ji. Your vit is secretly a hybrid discriminative-generative diffusion model. arXiv preprint arXiv:2208.07791, 2022.
[50] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[51] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
Masked Diffusion Models are Fast Distribution Learners
Supplementary Material
7. Pre-training schedule

As illustrated in Fig. 8, during pre-training, the MaskDM model converges rapidly after merely 200k training steps. Besides, we compare the performance of models pre-trained under two distinct schedules: one is directly pre-trained at a 50% mask rate, and the other is pre-trained at a 90% mask rate and subsequently at a 50% mask rate. We fine-tune both models using the same hyperparameters and keep the overall training expense consistent across schedules. As a result, the model that is pre-trained progressively with two different mask rates achieves the best performance. More qualitative comparisons are presented in Fig. 13, Fig. 14, and Fig. 15.

Figure 8. Comparison of different pre-training schedules (FID-3k against training steps, in units of 10k, for 90% pre-training, 50% pre-training, and fine-tuning).

8. The pre-training instability at 90% mask rate

In early experiments, we consistently observe that the model hardly converges when trained at a 90% mask ratio. Therefore, we investigate the impact of various factors on the pre-training process, including batch size, learning rate, noise schedule, gradient clipping, and learning rate schedule (warmup). The default experimental setting is: lr=2e-4, batch size=256, linear noise schedule, and no gradient clipping or warmup.

As shown in Fig. 9, increasing the batch size from 128 to 2048 still results in divergence. It should be noted that the model demonstrates more stability when the batch size is set to 128, which is smaller than 256. A similar phenomenon is also observed in [48].

Figure 9. Investigation into training instability. We maintain a fixed mask ratio of 90% and adopt the parameters used in the pre-training experiments at the 50% mask ratio (see Tab. 6) as the default setting. The experiments are conducted on the CelebA 64 × 64 dataset.

Besides, when we reduce the learning rate from 2e-4 to 1e-4 and the batch size from 256 to 128, we observe a stable and gradual plateau in the FID score of the model after 200k training steps. However, this more conservative hyperparameter setting leads to a relatively slower convergence speed, necessitating a longer training time to achieve a similar level of performance.

Subsequently, we adopt a cosine schedule in place of the linear schedule, and the resulting FID curve demonstrates the superiority of the cosine schedule in improving the training stability of the model: it effectively mitigates the convergence issues observed with the linear schedule. Additionally, we implement extra optimization strategies, such as warmup and gradient clipping, which also contribute to a more stable convergence of the model. A sketch of the two noise schedules is given below.
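For reference, a minimal sketch of the two β schedules discussed above: the linear schedule of DDPM [17] and the cosine schedule of [29], which keeps ᾱ_t from decaying too quickly at high noise levels. The constants follow the cited papers, but this is an illustrative reimplementation rather than our training code.

```python
import torch

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    # Linear schedule from DDPM.
    return torch.linspace(beta_1, beta_T, T)

def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    # Cosine schedule from Nichol & Dhariwal: define alpha_bar(t) directly,
    # then recover per-step betas and clip them for numerical stability.
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=max_beta).float()

# The cosine schedule preserves more signal at large t, which is consistent
# with the stability improvement observed at the 90% mask rate.
```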

9. Model Configurations

We present the configurations of the MaskDM models used in our experiments in Tab. 5. The patch size is set to 4 in all experiments.

Table 5. Details of MaskDM models

Model     Depth  Dim  MLP Dim  Heads  Params
MaskDM-S  13     512  2048     8      44M
MaskDM-B  12     768  3172     12     102M

10. Implementation Details

In this section, we present the training hyperparameters of our main experiments. Specifically, in Tab. 6, we present the hyperparameters used in the experiments of Tab. 1, Tab. 2, Tab. 3, and Tab. 4. In Tab. 7, we present the hyperparameters used in Fig. 5. Finally, in Tab. 8 and Tab. 9, we present the hyperparameters used in Fig. 7(a) and Fig. 7(b), respectively.

Lower LR used in 10% masked pre-training in Tab. 1. In early experiments, we observe that the model yields poor performance when the learning rate is set to 2e-4 with a batch size of 128. Therefore, we scale the learning rate linearly according to the batch size and use 1e-4 in these experiments.

11. Additional Visualization Results

We present more qualitative results in this section. We display samples generated by MaskDM models that are: pre-trained with 90% cropping on CelebA-HQ 256 × 256 in Fig. 10, pre-trained with 90% 4x4 block-wise masking on CelebA-HQ 256 × 256 in Fig. 11, pre-trained with 90% 4x4 block-wise masking on LSUN Church 256 × 256 in Fig. 12, pre-trained with 50% 4x4 block-wise masking on CelebA-HQ 256 × 256 in Fig. 13, pre-trained subsequently with 90% and 50% 4x4 block-wise masking on CelebA-HQ 256 × 256 in Fig. 14 and Fig. 15, fine-tuned on CelebA-HQ 256 × 256 in Fig. 16, and fine-tuned on other datasets in Fig. 17.

Table 6. Hyperparameters for the experiments in Tab. 1, Tab. 2, Tab. 3, and Tab. 4. Noticeably, on LSUN Church 64 × 64, the baseline model is trained for 550k steps with hyperparameters identical to those used when fine-tuning the MaskDM model. Comma-separated values denote successive pre-training phases; in the Tab. 1 column, slash-separated values correspond to the 10%, 50%, and 90% mask-rate settings and the from-scratch baseline, respectively.

CelebA 64 × 64 (Tab. 1)
pre-train — Masking: any / any / any / -; Mask rate: 10% / 50% / 90% / -; Lr: 1e-4 / 2e-4 / 2e-4 / -; Batch size: 128 / 256 / 512 / -; Steps: 50k / any / 50k / -; Gradient clip: -; Warmup: -; Noise schedule: Linear
fine-tune — Lr: 1e-4; Batch size: 128; Steps: 200k / 200k / 200k / 250k; EMA: 0.999, update every 1; Gradient clip: -; Warmup: -
shared — Model: MaskDM-S; Noise schedule: Linear; Horizontal flip: -
sampling — Sampler: DDIM; Sampling steps: 500; Num of samples: 10k

CelebA 64 × 64 (Tab. 2)
pre-train — Masking: 2x2 block-wise; Mask rate: 50%; Lr: 2e-4; Batch size: 256; Steps: 150k; Gradient clip: -; Warmup: 5k steps; Noise schedule: Linear
fine-tune — Lr: 2e-4; Batch size: 128; Steps: 350k; EMA: 0.9999, update every 1; Gradient clip: -; Warmup: 5k steps
shared — Model: MaskDM-S; Noise schedule: VPSDE; Horizontal flip: 0.5
sampling — Sampler: Euler-Maruyama; Sampling steps: 1000; Num of samples: 50k

CelebA 128 × 128 (Tab. 2)
pre-train — Masking: 4x4 block-wise; Mask rate: 70%, 50%; Lr: 2e-4, 1e-4; Batch size: 256, 128; Steps: 50k, 350k; Gradient clip: 1.0, 1.0; Warmup: 5k steps, 5k steps; Noise schedule: Cosine
fine-tune — Lr: 5e-5; Batch size: 64; Steps: 100k; EMA: 0.999, update every 1; Gradient clip: 1.0; Warmup: 5k steps
shared — Model: MaskDM-B; Noise schedule: Cosine; Horizontal flip: -
sampling — Sampler: DDIM; Sampling steps: 500; Num of samples: 50k

CelebA-HQ 256 × 256 (Tab. 4)
pre-train — Masking: 4x4 block-wise; Mask rate: 90%, 50%; Lr: 2e-4; Batch size: 128, 64; Steps: 200k, 500k; Gradient clip: -, 1.0; Warmup: -, 5k steps; Noise schedule: Cosine
fine-tune — Lr: 1e-5; Batch size: 32; Steps: 100k; EMA: 0.999, update every 1; Gradient clip: 1.0; Warmup: 5k steps
shared — Model: MaskDM-B; Noise schedule: Cosine; Horizontal flip: -
sampling — Sampler: DDIM; Sampling steps: 500; Num of samples: 50k

LSUN Church 64 × 64 (Tab. 3)
pre-train — Masking: 4x4 block-wise; Mask rate: 50%; Lr: 2e-4; Batch size: 256; Steps: 500k; Gradient clip: -; Warmup: 5k steps; Noise schedule: Linear
fine-tune — Lr: 2e-4; Batch size: 128; Steps: 50k; EMA: 0.999, update every 1; Gradient clip: -; Warmup: 5k steps
shared — Model: MaskDM-S; Noise schedule: Linear; Horizontal flip: -
sampling — Sampler: DDIM; Sampling steps: 500; Num of samples: 50k

Table 7. Hyperparameters for the MaskDM and baseline models in Fig. 5 and for the baseline models in Tab. 2 and Tab. 4. For the baseline models, we use exactly the same hyperparameter settings as in the fine-tuning of MaskDM. Additionally, the training step count is set to 250k, 550k, and 800k for the baseline models on CelebA 64 × 64, CelebA 128 × 128, and CelebA-HQ 256 × 256, respectively, maintaining a computational cost consistent with the MaskDM counterparts.

Dataset CelebA 64 × 64 CelebA 128 × 128 CelebA-HQ 256 × 256


pre-train
Masking 2x2 block-wise 4x4 block-wise 4x4 block-wise
Mask rate 50% 70%, 50% 90%, 50%
Lr 2e-4 2e-4, 1e-4 2e-4
Batch size 256 256, 128 128, 64
Steps 50k 50k, 200k 200k, 500k
Gradient clip - 1.0, 1.0 -, 1.0
Warmup - 5k steps, 5k steps -, 5k steps
fine-tune
Lr 1e-4 5e-5 1e-5
Batch size 128 64 32
Steps 200k 100k 100k
EMA setting 0.9999 update every 1 0.999 update every 1 0.999 update every 1
Gradient clip - 1.0 1.0
Warmup - 5k steps 5k steps
shared parameters
Model MaskDM-S MaskDM-B MaskDM-B
Noise schedule Linear Cosine Cosine
Horizontal flip - - -
sampling
Sampler DDIM DDIM DDIM
Sampling steps 500 steps 250 steps 250 steps
Num of samples 10k 10k 3k

Table 8. Hyperparameters of the experiments in Fig. 7(a). Following the parameters used during the fine-tuning of MaskDM, the baseline models are trained for 250k steps. We employ DPM-Solver [28] to generate samples on the CelebA 64 × 64 dataset.

Dataset CelebA 64 × 64 FFHQ 64 × 64 AFHQ 64 × 64


pre-train
Masking 2x2 block-wise
Mask rate 50%
Lr 2e-4
Batch size 256
Steps 50k
Noise schedule Linear
fine-tune
Lr 1e-4
Batch size 128
Steps 200k
EMA setting 0.999 update every
shared parameters
Model MaskDM-S
Noise schedule VPSDE, VPCosine Linear Linear
Horizontal flip 0.5 - -
sampling
Sampler DPM-Solver DDIM DDIM
Sampling steps 50 steps 500 steps 500 steps
Num of samples 10k 10k 10k

Table 9. Hyperparameters of the experiments in Fig. 7(b). Following the parameters used when fine-tuning MaskDM, the baseline models are trained for 200k steps.

Dataset CelebA-HQ 256 × 256 (10% or 1%) VGGFace2 256 × 256


pre-train
Masking 4x4 block-wise 4x4 block-wise
Mask rate 50% 90%
Lr 2e-4 2e-4
Batch size 64 256
Steps 200k 200k
Noise schedule Cosine Cosine
fine-tune
Lr 5e-4
Batch size 64
Steps 50k
EMA setting 0.999 update every 1
shared parameters
Model MaskDM-S
Noise schedule Cosine
Horizontal flip 0.5
sampling
Sampler DDIM
Sampling steps 250 steps
Num of samples 3k

Figure 10. Samples generated by a MaskDM model pre-trained on CelebA-HQ 256 × 256 given a masking strategy of cropping and a 90% mask rate.

Figure 11. Uncurated samples generated by a MaskDM model pre-trained on CelebA-HQ 256 × 256 (4x4 block-wise masking and 90% mask rate).

Figure 12. Uncurated samples generated by a MaskDM model pre-trained on LSUN Church 256 × 256 (4x4 block-wise masking and 90% mask rate).

Figure 13. Uncurated samples generated by a MaskDM model pre-trained on CelebA-HQ 256 × 256 (4x4 block-wise masking and 50% mask rate).

Figure 14. Uncurated samples generated by a MaskDM model pre-trained on CelebA-HQ 256 × 256, given the configuration of 4x4 block-wise masking and a 50% mask rate, after loading weights that are pre-trained at a 90% mask rate. The pre-training at the 50% mask rate takes 100k steps.

Figure 15. Same setting as Fig. 14. The pre-training at the 50% mask rate takes 500k steps.

Figure 16. Uncurated samples generated by our MaskDM-B model in Tab.4.

Figure 17. Uncurated samples generated by our MaskDM-S models.

