DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models

Zhengfu He∗ Tianxiang Sun∗ Kuanning Wang Xuanjing Huang Xipeng Qiu
School of Computer Science, Fudan University
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
{zfhe19,txsun19,wangkn20,xjhuang,xpqiu}@fudan.edu.cn

Abstract

We present DiffusionBERT, a new generative masked language model based on discrete diffusion models. Diffusion models and many pre-trained language models have a shared training objective, i.e., denoising, making it possible to combine the two powerful models and enjoy the best of both worlds. On the one hand, diffusion models offer a promising training strategy that helps improve the generation quality. On the other hand, pre-trained denoising language models (e.g., BERT) can be used as a good initialization that accelerates convergence. We explore training BERT to learn the reverse process of a discrete diffusion process with an absorbing state and elucidate several designs to improve it. First, we propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step based on the information of each token. Second, we investigate several designs for incorporating the time step into BERT. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous generative masked language models in terms of perplexity and BLEU score.¹

∗ Equal contribution.
¹ Our code is publicly available at https://github.com/Hzfinfdu/Diffusion-BERT

Figure 1: (a) Diffusion models for discrete data: the reverse model p_θ(x_{t-1} | x_t, t) maps x_T ("[MASK] [MASK] [MASK]") through x_t and x_{t-1} ("[MASK] world !") to x_0 ("Hello world !"), while the forward process q(x_t | x_{t-1}) adds noise. (b) Non-Markovian DiffusionBERT: the reverse model is p_θ(x_{t-1} | x_t) and the forward process is q(x_t | x_{t-1}, x_0), e.g., "[MASK] [MASK] [MASK]" → "Hello [MASK] !" → "Hello world !". In contrast to conventional discrete diffusion models, DiffusionBERT uses BERT as its backbone to perform text generation. The main differences are highlighted in color: (1) DiffusionBERT performs decoding without knowing the current time step, while canonical diffusion models are conditioned on the time step. (2) The diffusion process of DiffusionBERT is non-Markovian in that it generates noise samples x_t conditioned not only on x_{t-1} but also on x_0. Such a non-Markovian process is due to our proposed noise schedule.

1 Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) have recently emerged as a new class of state-of-the-art generative models, achieving high-quality synthesis results on image data (Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022). Though these models have captured widespread attention from not only the research community but also the public, applying diffusion models to text data is still challenging and under-explored due to the discrete nature of the text. A few prior works that explored using diffusion models on text data can be divided into two lines. The first is to extend diffusion models to discrete state spaces (Hoogeboom et al., 2021; Austin et al., 2021). The second is to perform the diffusion process and its reverse process in the continuous domain and bridge the continuous and the discrete domain through embedding and rounding (Li et al., 2022; Gong et al., 2022). However, none of these works leveraged pre-trained language models (PLMs, Devlin et al. (2019); Lewis et al. (2020); Raffel et al. (2020); Brown et al. (2020); Qiu et al. (2020)), which are an unmissable treasure in the NLP community.

This work, to our knowledge, is the first attempt to combine diffusion models with PLMs. Such a combination is built upon a shared training objective between diffusion models and PLMs, i.e., denoising. Diffusion models consist of a forward process (data to noise) and a reverse process (noise to data).
In the forward process, a small amount of noise is gradually added to the data. Then, a neural network (p_θ in Figure 1) is employed to learn the reverse process step by step, i.e., to learn to denoise. Such a denoising neural network is naturally related to a wide class of PLMs that are pre-trained with denoising objectives, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020). Hence, pre-trained denoising language models can serve as a good starting point for learning the reverse diffusion process. On the other hand, diffusion models also offer a promising training strategy for generative PLMs. In contrast to commonly used generative PLMs (e.g., GPT (Brown et al., 2020)) that rely on an autoregressive factorization of the joint probability, diffusion models provide another way of factorization, along the dimension of time, and therefore allow the model to be not necessarily autoregressive. Thus, diffusion models can be combined with a variety of PLMs that may not be pre-trained for generation.

In the discrete domain, the forward diffusion process can be implemented by a chain of transition matrices that gradually corrupt the clean text. As shown in Figure 1, the clean text "Hello world !" is gradually corrupted into "[MASK] [MASK] [MASK]" during the diffusion process. In this work, we explore using pre-trained denoising language models (e.g., BERT) to learn the reverse diffusion process and demonstrate their advantages in accelerating convergence and improving generation quality. Further, we propose a new noise schedule for the forward process based on the principle of distributing the corrupted information uniformly across the forward process. The noise schedule, called the spindle schedule, generates noise for x_t conditioned not only on x_{t-1} but also on x_0, making the forward process non-Markovian without changing the original training objective. Note that the denoising model takes as input x_t and time step t to predict x_{t-1}, where t is unseen during the pre-training of language models, so we investigate several ways of incorporating the time step into PLMs. As a result, we find that the best result is achieved by throwing away the time information, which we call time-agnostic decoding (TAD).

Experimental results on unconditional text generation demonstrate the benefit of combining diffusion models with PLMs: the proposed DiffusionBERT significantly improves the generation quality over existing diffusion models for text generation (e.g., D3PM (Austin et al., 2021) and Diffusion-LM (Li et al., 2022)) and previous generative masked language models (e.g., BERT-Mouth (Wang and Cho, 2019)). The effectiveness of the proposed spindle schedule and time-agnostic decoding is confirmed by ablation studies. In a nutshell, DiffusionBERT enjoys the best of both worlds.

2 Background

2.1 Diffusion Models

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a class of latent variable models originally designed for continuous domains. A diffusion model consists of a forward diffusion process and a reverse diffusion process. Given a sample x_0 ~ q(x_0), a Markov chain of latent variables x_1, ..., x_T is produced in the forward process by progressively adding a small amount of Gaussian noise to the sample:

    q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I),    (1)

where {β_t ∈ (0, 1)}_{t=1}^{T} is a noise schedule controlling the step size of adding noise. Eventually x_T becomes an isotropic Gaussian distribution. If β_t is small enough, the reverse process q(x_{t-1} | x_t) is also Gaussian, and it is learned by a parameterized model

    p_θ(x_{t-1} | x_t, t) = \mathcal{N}(x_{t-1}; \mu_θ(x_t, t), \Sigma_θ(x_t, t)),    (2)

where μ_θ(·) and Σ_θ(·) can be implemented by a U-Net or a Transformer. When conditioning also on x_0, q(x_{t-1} | x_t, x_0) has a closed form, so we can minimize the variational lower bound to optimize log p_θ(x_0):

    L_vlb = E_q[ D_KL(q(x_T | x_0) || p_θ(x_T)) ] + E_q[ \sum_{t=2}^{T} D_KL(q(x_{t-1} | x_t, x_0) || p_θ(x_{t-1} | x_t, t)) ] − log p_θ(x_0 | x_1),    (3)

where E_q(·) denotes the expectation over the joint distribution q(x_{0:T}).
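To make Eq. (1) concrete, the minimal NumPy sketch below simulates the forward Markov chain by repeatedly applying q(x_t | x_{t-1}). It is our own illustration, not code from the paper; the linear β schedule and the toy data point are assumptions made only for the example.

```python
import numpy as np

def gaussian_forward_chain(x0, betas, seed=0):
    """Simulate x_1, ..., x_T from Eq. (1): q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    rng = np.random.default_rng(seed)
    xs = [x0]
    for beta_t in betas:
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(1.0 - beta_t) * xs[-1] + np.sqrt(beta_t) * noise)
    return xs  # xs[t] is a sample of x_t

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # an illustrative linear schedule (an assumption)
x0 = np.random.randn(16)                 # a toy continuous "data point"
xs = gaussian_forward_chain(x0, betas)
print(round(float(np.std(xs[-1])), 2))   # x_T ends up close to an isotropic standard Gaussian
```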
2.2 Diffusion Models in Discrete Domain

For discrete domains, each element of x_t is a discrete random variable with K categories. For text data, K = |V| is the size of the vocabulary. Denote x_t as a stack of one-hot vectors.
BERT-Mouth
  t = 0    [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
  t = 8    [MASK] of [MASK] five [MASK] remain [MASK] in [MASK] .
  t = 16   two of [MASK] five structures remain [MASK] this location .
  t = 24   five of [MASK] the windows remain at this location .
  t = 32   most of even the windows stand still this day .

D3PM
  t = 0    [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
  t = 8    [MASK] [MASK] [MASK] [MASK] been [MASK] [MASK] [MASK] [MASK] .
  t = 16   [MASK] [MASK] [MASK] [MASK] been [MASK] [MASK] the [MASK] .
  t = 24   [MASK] [MASK] [MASK] also been [MASK] by the [MASK] .
  t = 32   the man has also been arrested by the police .

DiffusionBERT
  t = 0    [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
  t = 8    [MASK] , [MASK] [MASK] [MASK] [MASK] [MASK] that [MASK] .
  t = 16   today , [MASK] will be [MASK] [MASK] that [MASK] .
  t = 24   today , [MASK] will be remembered for that mistake .
  t = 32   today , he will be remembered for that mistake .

Table 1: Examples generated by three generative masked language models, showing the differences in noise schedule and generation quality.

The process of adding noise can then be written as

    q(x_t | x_{t-1}) = Cat(x_t; p = x_{t-1} Q_t),    (4)

where Cat(·) is a categorical distribution and Q_t is a transition matrix that is applied to each token in the sequence independently: [Q_t]_{i,j} = q(x_t = j | x_{t-1} = i). It is easy to obtain that

    q(x_{t-1} | x_t, x_0) = q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0) / q(x_t | x_0)
                          = Cat( x_{t-1}; p = (x_t Q_t^⊤ ⊙ x_0 \bar{Q}_{t-1}) / (x_0 \bar{Q}_t x_t^⊤) ),    (5)

where \bar{Q}_t = Q_1 Q_2 ··· Q_t. Note that ⊙ is element-wise multiplication and the division is row-wise. With q(x_{t-1} | x_t, x_0) at hand, according to Eq. (3), we can use a parameterized model p_θ(x_{t-1} | x_t, t) to learn the reverse diffusion process.

3 DiffusionBERT

In contrast to recently proposed diffusion models for text, e.g., Diffusion-LM (Li et al., 2022) and DiffuSeq (Gong et al., 2022), which are based on continuous diffusion models, we instead explore discrete diffusion models to integrate PLMs as the backbone. We first introduce a specific instance of discrete diffusion models (Austin et al., 2021), which considers a transition matrix with an absorbing state for the sake of using PLMs (§ 3.1). Secondly, we introduce a new noise schedule for the forward diffusion process, called the spindle schedule, which is based on the principle of distributing the corrupted information uniformly across the forward process (§ 3.2). Then, we investigate several alternatives for incorporating the time step into PLMs for predicting x_{t-1} given x_t and t (§ 3.3).

3.1 Diffusion Models with a Discrete Absorbing State

To be combined with pre-trained denoising language models, we incorporate an absorbing state, e.g., [MASK] for BERT, in the Markov process. In particular, each token in the sequence either stays the same or transitions to [MASK] with some probability. Formally, each entry of the transition matrix at step t is as follows,

    [Q_t]_{i,j} = 1          if i = j = [M],
                  β_t        if j = [M], i ≠ [M],
                  1 − β_t    if i = j ≠ [M],    (6)

where [M] is the abbreviation of [MASK]. Such a Markov process converges to a stationary distribution q(x_T), which places all probability mass on a sequence of all [MASK] tokens.

The t-step marginal q(x_t^i | x_0^i) can be easily obtained in closed form,

    q(x_t^i | x_0^i) = α_t        if x_t^i = x_0^i,
                       1 − α_t    if x_t^i = [M],    (7)

where α_t = \prod_{i=1}^{t} (1 − β_i) and x_t^i denotes the i-th token in the sequence at step t. Combining with Eq. (3) and (5), we can derive a training objective to optimize p_θ(x_{t-1} | x_t, t) and generate a sample by performing the reverse diffusion process:

    p_θ(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_θ(x_{t-1} | x_t, t).    (8)
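To illustrate Eqs. (6)-(8), the sketch below (our own, not the authors' released code) samples the t-step marginal of the absorbing-state process and shows the shape of the reverse loop in Eq. (8). The toy vocabulary, the β_t = (T − t + 1)^{-1} schedule, and the stub `denoise_fn` are assumptions made only for the example.

```python
import numpy as np

MASK = 0  # index reserved for the absorbing [MASK] state in a toy vocabulary

def alpha(t, T):
    """alpha_t = prod_{i=1..t} (1 - beta_i) with beta_i = (T - i + 1)^{-1}, which telescopes to (T - t) / T."""
    return np.prod([1.0 - 1.0 / (T - i + 1) for i in range(1, t + 1)])

def q_sample(x0, t, T, rng):
    """Draw x_t ~ q(x_t | x_0) per Eq. (7): each token is kept with prob alpha_t, otherwise absorbed into [MASK]."""
    keep = rng.random(x0.shape) < alpha(t, T)
    return np.where(keep, x0, MASK)

def reverse_generate(denoise_fn, length, T):
    """Eq. (8): start from the all-[MASK] x_T and apply p_theta(x_{t-1} | x_t, t) step by step."""
    x = np.full(length, MASK)
    for t in range(T, 0, -1):
        x = denoise_fn(x, t)  # a trained model would predict x_{t-1}; here it is only a placeholder
    return x

rng = np.random.default_rng(0)
x0 = rng.integers(1, 100, size=8)           # toy token ids (0 is reserved for [MASK])
print(q_sample(x0, t=16, T=32, rng=rng))    # about half of the tokens are masked at t = T/2
```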
3.2 Spindle Noise Schedule

The noise schedule in the continuous domain, such as the linear schedule (Ho et al., 2020) and the cosine schedule (Nichol and Dhariwal, 2021a), has been shown to be important to the performance of diffusion models.

In contrast to the continuous domain, where the noise can be easily controlled by the variance of the Gaussian, (1) it is less obvious how to control the degree of noise added at each step in the discrete domain. For the discrete domain, the noise schedule β_t = (T − t + 1)^{-1} has been explored for the case of the uniform transition matrix (Sohl-Dickstein et al., 2015; Hoogeboom et al., 2021) and the absorbing-state transition matrix (Austin et al., 2021). However, (2) such a schedule assumes all tokens carry the same amount of information and does not consider the linguistic differences among the tokens in a sequence. Besides, (3) it violates the easy-first-generation nature of denoising language models. That is, the model tends to first generate tokens that appear most frequently (and are least surprising) in the training corpus to achieve a higher likelihood. As the context becomes richer, more details come up in the sequence.

To address the above issues, we consider a noise schedule that (1) measures the noise added at each step by the corrupted information and encourages the corrupted information to be uniformly distributed across the diffusion steps. Since the information is measured independently for each token, (2) different tokens in a sequence are assigned different probabilities of transitioning to the [MASK] token. Moreover, inspired by the easy-first-generation phenomenon, (3) we put the tokens in a sequence in descending order of their information and divide them into T buckets, each of which is ensured to contain the same amount of information. That is, we mask the most informative tokens at the start of the forward process and the least informative tokens at the end of the forward process, such that the learnable reverse process follows an easy-first generative behavior.

In particular, distributing the corrupted information uniformly across the forward steps can be formally described by

    1 − t/T = \sum_{i=1}^{n} H(x_t^i) / \sum_{i=1}^{n} H(x_0^i) = \sum_{i=1}^{n} α_t^i H(x_0^i) / \sum_{i=1}^{n} H(x_0^i),    (9)

where H denotes the entropy, which measures the amount of information of a random variable, x^i denotes the i-th token in the sequence, and n denotes the length of the sequence. According to Eq. (7), α_t^i = \prod_{j=1}^{t} (1 − β_j^i) denotes the probability that the i-th token remains the same at step t, i.e., x_t^i = x_0^i. We expect that α_t^i > α_t^j if H(x_0^i) < H(x_0^j), such that easy (low-information) tokens emerge earlier than hard (high-information) tokens during the reverse process.

Considering the aforementioned properties, we construct α_t^i as follows,

    α_t^i = 1 − t/T − S(t) · H̃(x_0^i),    (10)

    S(t) = λ sin(tπ/T),    (11)

    H̃(x_0^i) = 1 − \sum_{j=1}^{n} H(x_0^j) / (n · H(x_0^i)),    (12)

where S(t) is introduced to control the effect of the informativeness at time step t. It is designed to be sinusoidal to ensure that S(0) = S(T) = 0, such that x_t retains all (zero) information when t = 0 (t = T). The effect of S(t) is controlled by a hyperparameter λ. When λ = 0, the noise schedule degrades to β_t = (T − t + 1)^{-1} as in Sohl-Dickstein et al. (2015); Hoogeboom et al. (2021); Austin et al. (2021). Figure 2 shows how α progresses during the forward process. The schedule is named spindle due to the shape of the probability curves.

Figure 2: Each token in a sequence has a specific noise schedule depending on how much information is lost when it is masked. For instance, in the sentence "Bella is sitting over there.", "Bella" is the most informative word. Thus it is encouraged to be masked at an early stage so that our model learns to recover it in the last place.

In our proposed schedule, the transition probability at time step t depends not only on the current state but also on the original text, making the forward diffusion process non-Markovian. Nevertheless, as revealed by Eq. (5), this does not change the original training objective.
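A direct transcription of Eqs. (10)-(12) is given below. This is our own illustration: estimating H(x_0^i) from unigram frequencies, the value λ = 0.5, and the clipping of α_t^i to [0, 1] are assumptions for the example rather than details taken from the paper.

```python
import numpy as np

def spindle_alpha(H, t, T, lam=0.5):
    """Per-token keep-probability alpha_t^i from Eqs. (10)-(12).

    H: array with the information H(x_0^i) of each token in the sequence.
    """
    H_tilde = 1.0 - H.sum() / (len(H) * H)       # Eq. (12)
    S_t = lam * np.sin(t * np.pi / T)            # Eq. (11)
    alpha = 1.0 - t / T - S_t * H_tilde          # Eq. (10)
    return np.clip(alpha, 0.0, 1.0)              # clipping is our addition, so alpha can act as a probability

# Toy 5-token sentence: token 0 is rare (high information), token 4 is very frequent.
freq = np.array([1e-6, 1e-2, 1e-2, 1e-3, 5e-2])  # assumed unigram frequencies
H = -np.log(freq)
for t in (8, 16, 24):
    print(t, np.round(spindle_alpha(H, t, T=32), 3))
# The rare, informative token gets the smallest alpha_t^i: it is masked earliest in the
# forward process and therefore recovered last in the reverse process (easy-first generation).
```

Before clipping, the weighted sum of these α_t^i against H(x_0^i) reduces exactly to (1 − t/T) of the total sequence information, which is the sense in which Eq. (9) spreads the corrupted information uniformly over the T steps.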
3.3 The Design Space of Feeding Time Steps

Typically, a diffusion model takes as input a noised sample and the time step to predict the denoised sample during the reverse process, i.e., p_θ(x_{t-1} | x_t, t). However, t is an additional variable that is unseen during the pre-training of language models, and it is therefore not trivial to decide how to feed the time information into the PLM. Here we explore three design choices for feeding time steps.

Layer-wise Time Embedding A straightforward choice is to include the time step in the same way as the positional encoding, i.e., using the Transformer sinusoidal embedding or a learnable MLP in each Transformer layer. Note that this approach is commonly adopted in previous work (Ho et al., 2020; Austin et al., 2021; Li et al., 2022).
Prefix Time Embedding Prompting language models by prepending trainable soft tokens to the input sequence has shown promising results recently (Lester et al., 2021; Sun et al., 2022). Hence, we also explore including a time step token embedding v(t) as a prefix of the input token embeddings ⟨v(x_t^1), v(x_t^2), ..., v(x_t^n)⟩. In particular, the time step token is inserted between the [CLS] token and the input sequence. These added time step token embeddings are trained along with the PLM.
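A minimal PyTorch sketch of this prefix design is shown below. It is our own illustration of one reasonable implementation, not the authors' code; the hidden size, the way the [CLS] row is split off, and the module name are assumptions.

```python
import torch
import torch.nn as nn

class PrefixTimeEmbedding(nn.Module):
    """Prepend a learnable time-step embedding v(t) between [CLS] and the token embeddings."""

    def __init__(self, num_steps: int, hidden_size: int = 768):
        super().__init__()
        self.time_embed = nn.Embedding(num_steps + 1, hidden_size)

    def forward(self, token_embeds: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden), where position 0 holds the [CLS] embedding
        v_t = self.time_embed(t).unsqueeze(1)                   # (batch, 1, hidden)
        cls, rest = token_embeds[:, :1], token_embeds[:, 1:]
        return torch.cat([cls, v_t, rest], dim=1)               # (batch, seq_len + 1, hidden)

# Usage with dummy tensors: a batch of 2 sequences of length 8, hidden size 768, T = 2048.
embeds = torch.randn(2, 8, 768)
t = torch.tensor([17, 512])
out = PrefixTimeEmbedding(num_steps=2048)(embeds, t)
print(out.shape)  # torch.Size([2, 9, 768])
```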
Time-Agnostic Decoding Another alternative is not to explicitly incorporate the time step t at all, because it can be inferred from the noised sample x_t. In contrast to image data, it is easy to implicitly infer the diffusion time step by counting the number of corrupted tokens (i.e., [MASK]) in the noised sequence. In this way, the PLM performs iterative decoding while being ignorant of the current time step, i.e., p_θ(x_{t-1} | x_t).
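Put together, time-agnostic decoding amounts to repeatedly calling a masked language model on the partially masked sequence and re-masking according to the schedule. The sketch below is a simplified illustration: the `masked_lm` callable, the greedy argmax choice, and the keep-the-most-confident re-masking heuristic (which only approximates sampling from q(x_{t-1} | x_t, x̃_0)) are our assumptions, and the paper's released code may differ.

```python
import torch

MASK_ID = 103  # [MASK] id in the bert-base-uncased vocabulary

@torch.no_grad()
def time_agnostic_decode(masked_lm, seq_len, num_iters=64, device="cpu"):
    """Simplified reverse process p_theta(x_{t-1} | x_t): the model never sees the step index.

    masked_lm(input_ids) must return logits of shape (1, seq_len, vocab_size).
    """
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)  # x_T: all [MASK]
    for step in range(num_iters, 0, -1):
        logits = masked_lm(x)
        conf, pred = logits.softmax(-1).max(-1)              # greedy per-position prediction of x_0
        is_masked = x.eq(MASK_ID)
        # Commit the most confident predictions and keep the rest masked, so the number of
        # remaining [MASK] tokens shrinks roughly linearly over the decoding budget.
        n_keep_masked = int(seq_len * (step - 1) / num_iters)
        conf = conf.masked_fill(~is_masked, float("inf"))    # committed tokens are never re-masked here
        remask = conf.argsort(dim=-1)[:, :n_keep_masked]     # least confident masked positions
        x = torch.where(is_masked, pred, x)
        x.scatter_(1, remask, MASK_ID)
    return x
```

In practice `masked_lm` could wrap a Hugging Face `BertForMaskedLM` (e.g., `lambda ids: model(input_ids=ids).logits`), and the paper additionally applies top-K truncation with K = 30 when sampling, which the greedy argmax above omits.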
4 Experiments

4.1 Experimental Setup

We mainly focus on unconditional text generation in complex scenarios where the training data covers a wide range of topics and has a large vocabulary. Experiments are conducted on the One Billion Word dataset (LM1B) (Chelba et al., 2014). LM1B is a language corpus with about 30 million sentences and a vocabulary of about 793K. We use the standard train-test split and take 1% of the training set for validation. All text data are lower-cased to align with the settings of Austin et al. (2021).

Our DiffusionBERT is based on BERT-BASE-UNCASED with about 110M parameters. We train DiffusionBERT using the AdamW optimizer (Loshchilov and Hutter, 2019) for 1.9 million steps with a learning rate of 3e-6, a dropout probability of 0.1, and a batch size of 32. For the first 10K steps, we use a linear warmup schedule starting from a learning rate of 1e-8. All experiments are conducted on NVIDIA A100 Tensor Core GPUs. We use 4 GPUs for training and a single GPU for sampling.

4.2 Baselines

We conduct a comparison on unconditional text generation against several non-autoregressive (NAR) baselines: D3PM (Austin et al., 2021), Diffusion-LM (Li et al., 2022), and BERT-Mouth (Wang and Cho, 2019).²

² Another strong baseline for NAR text generation is SUNDAE (Savinov et al., 2022), but unfortunately there is no public implementation available. We will include a comparison with SUNDAE in later versions by directly using the results reported in the original paper and using the same settings to train DiffusionBERT for a fair comparison.

D3PM D3PM is a general framework of discrete diffusion models. We implement an instance of D3PM with the absorbing state and a layer-wise time embedding. Both DiffusionBERT and D3PM are implemented with a sequence length n = 128 and T = 2048 diffusion steps. During inference, we perform decoding with 16 time steps in each iteration, so the total inference cost is 128 iterations, which is smaller than that chosen in existing diffusion or diffusion-like models for unconditional generation (Hoogeboom et al., 2021; Savinov et al., 2022). This has no impact on our conclusions since increasing the number of diffusion steps does not bring substantial improvement.

Diffusion-LM Diffusion-LM learns an embedding to map discrete text into a continuous space, where it performs the Gaussian diffusion process. A rounding step is required to map the continuous embeddings back to discrete text. We re-implemented Diffusion-LM with the model architecture of BERT and T = 2000 diffusion steps. Since the performance drop of Diffusion-LM is larger than that of D3PM and DiffusionBERT when we sample fewer steps during generation, we do not skip steps, so its number of inference steps is about 4 times that of DiffusionBERT; the exact generation time comparison is reported in § 4.5.
Method                             Pretrained   Schedule          Time Step   PPL ↓    BLEU ↑   Self-BLEU ↓
D3PM (Austin et al., 2021)         ✗            (T − t + 1)^{-1}  LTE         82.34    0.3897   0.2347
                                   ✗            (T − t + 1)^{-1}  TAD         125.15   0.3390   0.2720
                                   ✗            Spindle           LTE         77.50    0.4241   0.2288
Diffusion-LM (Li et al., 2022)     ✗            Cosine            LTE         118.62   0.3553   0.2668
                                   ✓            Cosine            LTE         132.12   0.3562   0.2798
BERT-Mouth (Wang and Cho, 2019)    ✓            -                 -           142.89   0.2867   0.1240
DiffusionBERT                      ✓            (T − t + 1)^{-1}  LTE         92.53    0.3995   0.2118
                                   ✓            (T − t + 1)^{-1}  PTE         79.95    0.3886   0.2156
                                   ✓            (T − t + 1)^{-1}  TAD         78.76    0.4213   0.2116
                                   ✓            Spindle           TAD         63.78    0.4358   0.2151

Table 2: Main results on LM1B. The methods proposed in this work are marked with wavy lines. The best results are in bold and the second-best results are underlined. LTE: layer-wise time embedding. PTE: prefix time embedding. TAD: time-agnostic decoding.

BERT-Mouth BERT-Mouth samples text from BERT via order-agnostic autoregressive masked language modeling. Starting from a sequence of [MASK] tokens, BERT samples one token at each time step in random order. Another option is to decode from left to right, like autoregressive models. In our preliminary experiments, we find that random-position sampling performs better. We continue pre-training BERT on LM1B to adapt it to the downstream training corpus.

4.3 Main Results

Our main results are included in Table 2. We choose BLEU-4 as the metric for generation quality and diversity. For each method, we sample 1K texts for evaluating the BLEU score and another 1K for self-BLEU. Note that with different sampling strategies, the BLEU/self-BLEU results may vary. For a fair comparison, the sentences sampled by D3PM and DiffusionBERT have a fixed length n = 64 and are sampled with a top-K filter where K = 30. Diffusion-LM and BERT-Mouth are trained and sampled following their original implementations. Overall, DiffusionBERT achieves the best generation quality and diversity trade-off among the considered NAR methods. Besides, the perplexity of DiffusionBERT with the spindle noise schedule is substantially lower. The evidence lower bound is used as a proxy for the perplexities of DiffusionBERT and D3PM since the exact likelihood of diffusion models is intractable.
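For reference, one common way to compute the BLEU and self-BLEU numbers reported above is sketched below with NLTK; the smoothing choice and the 4-gram weights are our assumptions, since the paper does not spell out its exact BLEU implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def corpus_quality_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)):
    """Average BLEU-4 of each generated sentence against a set of tokenized reference sentences."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu(references, hyp, weights=weights, smoothing_function=smooth)
              for hyp in hypotheses]
    return sum(scores) / len(scores)

def self_bleu(samples, weights=(0.25, 0.25, 0.25, 0.25)):
    """Self-BLEU: score each sample against all other samples; lower means more diverse output."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(samples):
        refs = samples[:i] + samples[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores)

# Tiny illustrative usage with whitespace-tokenized generations (sentences taken from Table 1).
gen = ["today , he will be remembered for that mistake .".split(),
       "the man has also been arrested by the police .".split()]
print(round(self_bleu(gen), 4))
```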
DiffusionBERT vs. Other Generative BERT Models We compare DiffusionBERT with another representative generative masked language model, BERT-Mouth (Wang and Cho, 2019). Experimental results show that DiffusionBERT achieves better performance in terms of both perplexity and BLEU score. We attribute the superior performance of DiffusionBERT to its one-time sampling of all tokens, which helps DiffusionBERT generate more coherent text, especially over a long range. Although such decoding may face the problem of multimodality (Gu et al., 2018), inappropriate phrases can be fixed in the upcoming diffusion steps. The probabilistic modeling offers more flexibility in that generated tokens with low probability are more likely to be masked and re-sampled. In BERT-Mouth, however, the tokens are fixed once sampled. Wang and Cho (2019) also proposed to continue masking and predicting tokens after the whole sequence is complete, revising the sentence for higher quality. But such randomness in the selection and replacement of tokens results in low inference speed.

Discrete vs. Continuous Diffusion Models We then focus on the comparison of discrete and continuous diffusion models for text generation. To achieve this, we mainly compare DiffusionBERT with the recently proposed Diffusion-LM, which is based on continuous diffusion models.
Figure 3: BLEU scores on the LM1B test set. Left is better, lower is better. For GPT and the Transformer decoder, we control the quality-variation trade-off with the sampling temperature. D3PM and DiffusionBERT are controlled by the truncation sampling hyperparameter K.

Figure 4: Curve of the validation ELBO during training.

As a result, despite its outstanding controllability, we show that the texts generated by Diffusion-LM are of lower quality than those of DiffusionBERT. Though both DiffusionBERT and Diffusion-LM adopt the same Transformer configuration, it is worth noting that the superior performance of DiffusionBERT may be contributed not only by the discrete diffusion model but also by the use of pre-trained models. To disentangle the effect of pre-training from that of discrete/continuous diffusion models, we also explore initializing Diffusion-LM with BERT. As shown in Table 2, training Diffusion-LM from the BERT initialization performs even worse than training from scratch. We conjecture that the continuous nature of Diffusion-LM is not compatible with the initialization from BERT, since the embedding learned by BERT may not be suitable for the Gaussian diffusion process. In contrast, the comparison of D3PM and DiffusionBERT shows that DiffusionBERT benefits greatly from the BERT initialization due to its discrete diffusion process.

Effect of Time Step In terms of both likelihood and generation quality, the layer-wise time embedding (LTE) lags far behind the other two time step designs for DiffusionBERT, while time-agnostic decoding (TAD) achieves the best result. By contrast, D3PM without a time step embedding performs significantly worse. In a nutshell, simplifying the time step design has a positive effect on DiffusionBERT but is quite harmful for D3PM. This suggests that initializing p_θ with PLMs enables DiffusionBERT to perform generation without explicitly providing time information, yet achieve better generation results. The resemblance between the BERT pre-training objective and absorbing diffusion models makes it easier for DiffusionBERT to generalize to noisier scenarios, while a Transformer encoder trained from scratch needs a specific time-aware module to model the reverse process.

Effect of the Spindle Noise Schedule We apply our proposed spindle noise schedule to both DiffusionBERT and D3PM. The perplexity is improved by 18% and 19% for D3PM and DiffusionBERT, respectively. Besides, D3PM with the spindle schedule outperforms D3PM with the standard (T − t + 1)^{-1} schedule in generation quality. The same trend holds for DiffusionBERT, but with a smaller margin.

4.4 Quality-Diversity Trade-off

As shown in Figure 3, DiffusionBERT exhibits generation ability comparable to a Transformer decoder trained from scratch and pushes the Pareto front of the NAR generation quality/diversity trade-off by a large margin. However, it still falls behind pretrained AR models of the same size.

4.5 Efficiency of Training and Generation

One important feature of DiffusionBERT is that, with time-agnostic decoding, all parameters are initialized from pretrained models. Consequently, DiffusionBERT includes fewer parameters and is free from adapting newly added parameters, improving training and decoding efficiency.

Faster Convergence DiffusionBERT converges remarkably faster than D3PM. Figure 4 shows the curve of the validation ELBO during the training process.
Method          Steps   Inference Time (secs)   PPL
DiffusionBERT   2       0.66                    313.57
                8       1.39                    91.01
                16      1.80                    75.66
                64      4.25                    65.83
                128     7.53                    63.78
                512     27.48                   54.63
Diffusion-LM    2000    83.67                   112.12
BERT-Mouth      64      2.18                    142.89
                512     14.39                   86.78
GPT             64      1.55                    38.7

Table 3: Comparison of inference time and perplexity among baselines and DiffusionBERT.
Even if the training budget is cut to 30% (i.e., 0.5 million steps), DiffusionBERT is still able to match the performance reported in Table 2.

Sampling Speed With the x_0-parameterization proposed in Song et al. (2021) and Austin et al. (2021), DiffusionBERT is able to perform inference under any given budget by controlling the step size of the reverse process. We also control the sampling time of BERT-Mouth by adjusting the maximum iteration count of its mask-predict process. We list the decoding speed and the corresponding perplexity on the LM1B test set in Table 3. Overall, DiffusionBERT exhibits competitive performance even when it reaches a speed comparable to GPT, and it outperforms BERT-Mouth in the efficiency-performance trade-off.

5 Related Work

5.1 BERT for Text Generation

It has been shown by Wang and Cho (2019) that the transfer-learning ability of BERT not only helps to achieve impressive results in natural language understanding but also benefits sequential sampling for text generation. However, its bidirectional nature keeps BERT from matching its decoder-only counterparts (Radford et al., 2018) in modeling text from left to right.

5.2 Diffusion Models for Text

This work lies in the line of diffusion models, a latent variable generative framework proposed by Sohl-Dickstein et al. (2015). It has been architecturally improved by Ho et al. (2020) and has gained broad attention for its impressive generation ability in continuous domains (e.g., image and audio) (Ramesh et al., 2022; Kong et al., 2021; Nichol and Dhariwal, 2021b). Despite their great success and state-of-the-art sample quality in the above domains, diffusion models for text still struggle to match autoregressive models in various generation tasks. Since the Gaussian noise proposed in Sohl-Dickstein et al. (2015) cannot be directly applied to discrete data, they also introduced a discrete forward process with a Bernoulli transition kernel. Hoogeboom et al. (2021) made a step forward from Bernoulli to categorical distributions. A more general family of discrete diffusion processes was introduced in Austin et al. (2021) and Hoogeboom et al. (2022), including absorbing kernels and combinations of absorbing and uniform transition kernels. Li et al. (2022) model text in the continuous embedding space, which is closer to the settings of earlier works on diffusion models and shows impressive performance in classifier-controlled text generation. However, the decoding and convergence speed are substantially slower, and the generated text lacks coherence. Moreover, in scenarios where the vocabulary is large, the k-nearest-neighbor algorithm used in decoding holds up decoding even more severely.

5.3 Non-Autoregressive Text Generation

Absorbing discrete diffusion models resemble conditional masked language models (CMLMs) (Ghazvininejad et al., 2019) in that both methods predict the whole sequence simultaneously and follow a construct-destruct pattern to iteratively refine the generated text. The main difference lies in the training objective: DiffusionBERT models a stochastic process and drives BERT to learn a group of distributions to gradually recover the training data, while CMLMs force the neural network to deterministically recover the whole sequence in every iteration and thus fail to explicitly model the denoising process. Savinov et al. (2022) proposed to approach non-autoregressive text modeling by unrolling the generation path to prepare the model for the partially corrupted sequences it will encounter during generation, which resembles the idea of diffusion models for unconditional text generation. Non-autoregressive models are also considered in translation but are implemented in various ways, e.g., insertion/deletion (Gu et al., 2019) and iterative sequence alignment (Saharia et al., 2020).
6 Conclusion

This work aims to approach the problem of unconditional text generation with non-autoregressive models. To achieve this, we combine pretrained language models with absorbing-state discrete diffusion models for text. The training procedure of our proposed DiffusionBERT includes two main deviations from current discrete diffusion models, i.e., a new family of time step designs and the spindle noise schedule. The novel spindle noise schedule assigns a schedule to each token according to its frequency in the training corpus. Experimental results demonstrate the success of DiffusionBERT in terms of perplexity. It also pushes the Pareto front of the quality-variance trade-off of NAR methods by a large margin, comparable to a Transformer decoder trained from scratch.

References

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 17981-17993.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Ciprian Chelba, Tomás Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 2635-2639. ISCA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171-4186. Association for Computational Linguistics.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 6111-6120. Association for Computational Linguistics.

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. 2022. DiffuSeq: Sequence to sequence text generation with diffusion models. CoRR, abs/2210.08933.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein transformer. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11179-11189.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. 2022. Autoregressive diffusion models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. 2021. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. CoRR, abs/2102.05379.

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 3045-3059. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871-7880. Association for Computational Linguistics.

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. 2022. Diffusion-LM improves controllable text generation. CoRR, abs/2205.14217.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021a. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8162-8171. PMLR.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021b. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8162-8171. PMLR.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. SCIENCE CHINA Technological Sciences.

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674-10685. IEEE.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic text-to-image diffusion models with deep language understanding. CoRR, abs/2205.11487.

Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 1098-1108. Association for Computational Linguistics.

Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aäron van den Oord. 2022. Step-unrolled denoising autoencoders for text generation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2256-2265. JMLR.org.

Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising diffusion implicit models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. 2022. Black-box tuning for language-model-as-a-service. In Proceedings of the 39th International Conference on Machine Learning, ICML 2022, Baltimore, Maryland, USA.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR, abs/1902.04094.