
Deep generative modeling:

ARMs and Normalizing Flows


Jakub M. Tomczak
Deep Learning

TYPES OF GENERATIVE MODELS

Generative models:
• Autoregressive models (e.g., PixelCNN)
• Flow-based models (e.g., RealNVP, GLOW)
• Latent variable models:
  - Implicit models (e.g., GANs)
  - Prescribed models (e.g., VAE)
GENERATIVE MODELS

                                         Training    Likelihood     Sampling     Compression
Autoregressive models (e.g., PixelCNN)   Stable      Exact          Slow         No
Flow-based models (e.g., RealNVP)        Stable      Exact          Fast/Slow    No
Implicit models (e.g., GANs)             Unstable    No             Fast         No
Prescribed models (e.g., VAEs)           Stable      Approximate    Fast         Yes
ARMS: AUTOREGRESSIVE MODELS

REPRESENTING A JOINT DISTRIBUTION

There are two main rules in probability theory:

• Sum rule: p(x) = ∑_y p(x, y)

• Product rule: p(x, y) = p(y | x) p(x)

Before, we used these two rules for latent-variable models:

p(x) = ∫ p(x, z) dz = ∫ p(x | z) p(z) dz

Now, we will use the product rule to express the distribution of x ∈ ℝD:

p(x) = p(x1) ∏_{d=2}^{D} p(xd | x<d)

where x<d = [x1, x2, …, xd−1]⊤.

REPRESENTING A JOINT DISTRIBUTION

We use the product rule to express the distribution of x ∈ ℝD:

p(x) = p(x1) ∏_{d=2}^{D} p(xd | x<d)

where x<d = [x1, x2, …, xd−1]⊤.

The order of the variables isn't important.

However, modeling all conditionals separately is infeasible…

Can we do better than that?

REPRESENTING A JOINT DISTRIBUTION

We can assume a finite dependency.

For instance, conditioning only on the last two variables:

p(x) = p(x1) p(x2 | x1) ∏_{d=3}^{D} p(xd | xd−2, xd−1)

Now we can model p(xd | xd−2, xd−1) with a single, shared model.

➡ E.g., we can take a neural network (see the sketch below).

However, this is still quite limiting, because we need to decide on the length of the dependency up front.
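As an illustration of the idea above, here is a minimal sketch of a single shared network for all conditionals p(xd | xd−2, xd−1), assuming PyTorch and binary variables xd ∈ {0, 1}; the network architecture and names are made up for this example, not taken from the slides:

```python
import torch
import torch.nn as nn

# One shared model for every conditional p(x_d | x_{d-2}, x_{d-1}),
# here for binary variables x_d in {0, 1} (hypothetical minimal example).
cond_net = nn.Sequential(
    nn.Linear(2, 64),
    nn.ReLU(),
    nn.Linear(64, 1),   # logit of p(x_d = 1 | x_{d-2}, x_{d-1})
)

def log_prob(x):
    """log p(x) for x of shape (batch, D), ignoring the first two factors
    p(x1) p(x2 | x1) for brevity."""
    logits = cond_net(torch.stack([x[:, :-2], x[:, 1:-1]], dim=-1)).squeeze(-1)
    targets = x[:, 2:]
    return -nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").sum(dim=1)
```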

AUTOREGRESSIVE MODELS (ARM)

Instead, we can use RNNs to model the conditionals:

p(xd | x<d) = p(xd | RNN(xd−1, hd−1))

where hd = RNN(xd−1, hd−1).

Advantages:

➡ We don't need to define the dependencies by hand.

➡ A single parameterization.

However, RNNs are slow, because they are sequential. Can we do better?
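A minimal sketch of such an RNN-based ARM, assuming PyTorch, a GRU, and binary variables; all names here are illustrative rather than from the slides:

```python
import torch
import torch.nn as nn

class RNNARM(nn.Module):
    """Autoregressive model: p(x) = prod_d p(x_d | h_d), with h_d = GRU(x_{d-1}, h_{d-1})."""
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)   # logit of p(x_d = 1 | x_{<d})

    def log_prob(self, x):                # x: (batch, D), values in {0, 1}
        # Shift the input right so that position d only sees x_{<d}.
        x_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(x_in.unsqueeze(-1))
        logits = self.out(h).squeeze(-1)
        return -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
```

Training evaluates all D conditionals in one pass, but sampling still requires D sequential steps through the recurrent cell, which is why ARMs are slow to sample from.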

USING CONVOLUTIONAL NEURAL NETWORKS FOR MODELING SEQUENCES

Let us consider a sequence x = [x1, x2, …, xD]⊤.

We assume all observed data are D-dimensional.

We can use 1D convolutional layers to process the whole signal at once.
Moreover, we can use dilation to learn long-range dependencies.

Notice:
• Causal convolution (i.e., looking only at the "past").
• Very efficient with current DL frameworks.
• With a deep stack of (dilated) layers, the network learns long-range dependencies.

(A sketch of a causal, dilated 1D convolution is given below.)

Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
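A minimal sketch of a WaveNet-style causal, dilated 1D convolution, assuming PyTorch; left-padding followed by a plain convolution is one common way to enforce causality, not necessarily the exact implementation used in the paper:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that only looks at the past: the output at position d
    depends only on inputs at positions <= d."""
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, D)
        x = nn.functional.pad(x, (self.pad, 0))          # pad on the left, not the right
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially.
net = nn.Sequential(*[CausalConv1d(16, 16, dilation=2 ** i) for i in range(6)])
```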




PIXELCNN

We can utilize the very same idea for images, but using 2D convolutions.

We need to keep the convolutions causal!

Moreover, we should think in 2 or even 3 dimensions (C×H×W).

We can accomplish that by composing two Conv2D layers:
• The first Conv2D layer covers the upper part.
• The second Conv2D layer covers the left part.

In practice, causality is enforced by masking the kernel weights (see the sketch below).

Van den Oord, Aaron, et al. "Conditional image generation with PixelCNN decoders." NIPS 2016.
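A minimal sketch of the kernel-weight masking mentioned above, assuming PyTorch; this is the standard mask-'A'/'B' construction used in PixelCNN-style models, written as an illustration rather than copied from the paper:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so that each output pixel depends only on
    pixels above it and to its left. Mask 'A' (first layer) also hides the
    current pixel; mask 'B' (later layers) keeps it."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0.0  # right of the centre
        mask[:, :, kH // 2 + 1:, :] = 0.0                         # below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask        # re-apply the mask before every call
        return super().forward(x)

# Hypothetical usage: a 7x7 mask-'A' convolution over an RGB image.
first_layer = MaskedConv2d("A", in_channels=3, out_channels=64, kernel_size=7, padding=3)
```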

PIXELCNN

Originally, PixelCNN used a softmax non-linearity at the output to predict integers between 0 and 255 (i.e., pixel values).

Currently, a mixture of K discretized logistic distributions is used instead:

P(x ∣ π, μ, s) = ∑_{i=1}^{K} πi [ σ((x + 0.5 − μi)/si) − σ((x − 0.5 − μi)/si) ]

where σ(⋅) is the sigmoid function, and π, μ, s are learnable parameters.

Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications." arXiv (2017).
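A sketch of this likelihood, assuming PyTorch, pixel values in {0, …, 255}, and ignoring the edge cases at 0 and 255 that PixelCNN++ handles separately; the function name and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def discretized_logistic_mixture(x, logit_pi, mu, log_s):
    """P(x | pi, mu, s) for integer pixel values x in {0, ..., 255}.
    x: (...,); logit_pi, mu, log_s: (..., K) mixture parameters."""
    x = x.unsqueeze(-1).float()                        # broadcast over the K components
    inv_s = torch.exp(-log_s)
    cdf_plus = torch.sigmoid((x + 0.5 - mu) * inv_s)   # sigma((x + 0.5 - mu_i) / s_i)
    cdf_minus = torch.sigmoid((x - 0.5 - mu) * inv_s)  # sigma((x - 0.5 - mu_i) / s_i)
    pi = F.softmax(logit_pi, dim=-1)                   # mixture weights sum to 1
    return torch.sum(pi * (cdf_plus - cdf_minus), dim=-1)
```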

PIXELCNN

[Figure: images sampled from PixelCNN++ next to real images.]

Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications." arXiv (2017).
AUTOREGRESSIVE MODELS

Advantages:
✓ Exact likelihood.
✓ Stable training.

Disadvantages:
- Very slow sampling.
- No compression.
- Sometimes, low visual quality.

FLOW-BASED MODELS

CHANGE OF VARIABLES

Let us consider a random variable v ∈ ℝ with p(v) = 𝒩(v | 0, 1).

Then, we take the following transformation: u = 0.75 ⋅ v + 1.

Q: What is the pdf of u?

A: The pdf is p(u) = 𝒩(u | 1, 0.75²).

In general, for an invertible transformation u = f(v), we have:

p(u) = p(f⁻¹(u)) |∂f⁻¹(u)/∂u|

For our example:

f⁻¹(u) = (u − 1)/0.75,   ∂f⁻¹(u)/∂u = 1/0.75 = 4/3

p(u) = p((u − 1)/0.75) ⋅ 4/3 = (1/√(2π ⋅ 0.75²)) exp{−(u − 1)²/(2 ⋅ 0.75²)}
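A quick numeric sanity check of this example; a minimal sketch assuming PyTorch (note that torch.distributions.Normal is parameterized by the standard deviation):

```python
import torch
from torch.distributions import Normal

u = torch.tensor([0.0, 1.0, 2.5])

# Change-of-variables formula: p(u) = N(f^{-1}(u) | 0, 1) * |df^{-1}/du|, with |df^{-1}/du| = 1/0.75.
p_u_formula = Normal(0.0, 1.0).log_prob((u - 1.0) / 0.75).exp() / 0.75

# Direct density of the transformed variable: N(u | 1, 0.75^2).
p_u_direct = Normal(1.0, 0.75).log_prob(u).exp()

print(torch.allclose(p_u_formula, p_u_direct))  # True
```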

CHANGE OF VARIABLES

Multidimensional case:

p(u) = p(f⁻¹(u)) |∂f⁻¹(u)/∂u|

where the volume-change term is the absolute value of the determinant of the Jacobian:

|∂f⁻¹(u)/∂u| = |det Jf⁻¹(u)|

        ⎡ ∂f1⁻¹/∂u1  …  ∂f1⁻¹/∂uD ⎤
Jf⁻¹ =  ⎢     ⋮      ⋱      ⋮     ⎥
        ⎣ ∂fD⁻¹/∂u1  …  ∂fD⁻¹/∂uD ⎦

How can we utilize this idea?

APPLYING CHANGE OF VARIABLES AND INVERTIBLE TRANSFORMATIONS

Let us consider a sequence of invertible transformations fk : ℝD → ℝD.

We can start with a simple distribution, e.g., π(z) = 𝒩(z | 0, I).

[Figure: z0 in the "latent" space is pushed through f1, …, fK to obtain x in pixel space.]

This results in:

p(x) = π(z0) ∏_{i=1}^{K} |det( ∂fi(zi−1)/∂zi−1 )|⁻¹

Rippel, O., & Adams, R. P. (2013). High-dimensional probability estimation with deep density models. arXiv.

2D EXAMPLE

[Figure: a 2D example in which {fk} maps a simple base density to a complex one, and {fk⁻¹} maps it back.]
DENSITY MODELING WITH INVERTIBLE NEURAL NETWORKS

The density model:

p(x) = π(z0) ∏_{i=1}^{K} |det( ∂fi(zi−1)/∂zi−1 )|⁻¹

In order to obtain flexible transformations fk, we use neural networks.

However, we need to ensure that they are invertible.

Moreover, we cannot use just any invertible neural network, because we also need its Jacobian determinant!

Calculating the Jacobian determinant is the main challenge in flow-based models. (A sketch of how the log-likelihood is accumulated across layers follows below.)
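A minimal sketch of how such a model evaluates log p(x), assuming PyTorch; each layer is assumed to map the data toward the base distribution and to return its log-Jacobian-determinant (the same convention as the coupling-layer sketch further below), and all class names are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Flow(nn.Module):
    """Composition of invertible layers. Each layer's forward(x) returns
    (z, log|det dz/dx|); log p(x) = log pi(z0) + sum_i log|det dz_i/dx_i|."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.base = Normal(0.0, 1.0)          # factorized standard Gaussian base

    def log_prob(self, x):                    # x: (batch, D)
        log_det = torch.zeros(x.shape[0], device=x.device)
        z = x
        for layer in self.layers:             # data -> latent direction
            z, ld = layer(z)
            log_det = log_det + ld
        return self.base.log_prob(z).sum(dim=1) + log_det

    def sample(self, n, dim):
        z = self.base.sample((n, dim))
        for layer in reversed(self.layers):   # latent -> data uses the inverses
            z = layer.inverse(z)
        return z
```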

REALNVP

Design the invertible transformations as follows:

y1:d = x1:d
yd+1:D = xd+1:D ⊙ exp(s(x1:d)) + t(x1:d)

where s(⋅) and t(⋅) are arbitrary neural networks. This is known as an affine coupling layer.

It is invertible by design, because:

x1:d = y1:d
xd+1:D = (yd+1:D − t(y1:d)) ⊙ exp(−s(y1:d))

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

REALNVP

The invertible transformation:

y1:d = x1:d
yd+1:D = xd+1:D ⊙ exp(s(x1:d)) + t(x1:d)

What about the Jacobian? It is block-triangular, so its determinant is easy to calculate:

det(J) = ∏_{j=1}^{D−d} exp(s(x1:d))j = exp( ∑_{j=1}^{D−d} s(x1:d)j )
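A minimal sketch of an affine coupling layer, assuming PyTorch and flattened, fully-connected inputs rather than the convolutional couplings used in the paper; names are illustrative. The forward pass returns the log-Jacobian-determinant ∑_j s(x1:d)j derived above, so it plugs directly into the Flow sketch from the previous section:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style affine coupling layer (sketch): the first d dimensions
    pass through unchanged, the remaining D - d are scaled and shifted."""
    def __init__(self, dim, d, hidden=256):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - d)),   # outputs [s, t] for x_{d+1:D}
        )

    def forward(self, x):                       # used in the data -> latent direction
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                  # log|det J| = sum_j s(x_{1:d})_j
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=1)
```

Stacking several couplings, with the split (or a permutation) changed between them, gives every dimension a chance to be transformed.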
REALNVP

We can also introduce the idea of autoregressive modeling here:

p(z1) p(z2) p(z3 | z1, z2) p(z4 | z1, z2, z3)
REALNVP

Moreover, we can use additional transformations:

1. Permutations of variables (these are invertible).
→ this helps to "mix" variables.
2. Dividing variables using a checkerboard pattern.
→ this helps to learn higher-order dependencies.
3. Squeezing, i.e., reshaping the input tensor (see the sketch below).
→ reshaping can also help to "mix" variables.
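A minimal sketch of the squeezing operation, assuming PyTorch and image tensors of shape (B, C, H, W) with even H and W; this is one common way to implement it, not necessarily the exact code from the paper:

```python
import torch

def squeeze(x):
    """Trade spatial size for channels: (B, C, H, W) -> (B, 4C, H/2, W/2)."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // 2, 2, W // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(B, 4 * C, H // 2, W // 2)

def unsqueeze(x):
    """Inverse of squeeze: (B, 4C, H, W) -> (B, C, 2H, 2W)."""
    B, C4, H, W = x.shape
    C = C4 // 4
    x = x.view(B, C, 2, 2, H, W)
    x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
    return x.view(B, C, 2 * H, 2 * W)
```

Since squeezing only reshuffles entries, it is invertible and its Jacobian determinant is 1 (log-determinant 0).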

GLOW: REALNVP WITH 1X1 CONVOLUTIONS

The model contains roughly 1000 convolutions.

A new component: an invertible 1×1 convolution instead of a fixed permutation matrix (see the sketch below).

Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. NeurIPS 2018.
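A minimal sketch of such an invertible 1×1 convolution, assuming PyTorch; GLOW additionally uses an LU parameterization of W to make the determinant cheaper, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvertibleConv1x1(nn.Module):
    """GLOW-style invertible 1x1 convolution (sketch): a learned C x C mixing
    matrix W applied across channels at every spatial location."""
    def __init__(self, channels):
        super().__init__()
        w, _ = torch.linalg.qr(torch.randn(channels, channels))  # random rotation: invertible init
        self.weight = nn.Parameter(w)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        z = F.conv2d(x, self.weight.view(C, C, 1, 1))
        log_det = H * W * torch.slogdet(self.weight)[1]    # the same det at every pixel
        return z, log_det

    def inverse(self, z):
        C = z.shape[1]
        w_inv = torch.inverse(self.weight)
        return F.conv2d(z, w_inv.view(C, C, 1, 1))
```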

GLOW: SAMPLES

[Figure: faces sampled by GLOW trained on CelebA-HQ.]

GLOW: LATENT INTERPOLATION

[Figure: interpolations between faces in GLOW's latent space (CelebA-HQ).]
VAES WITH NORMALIZING FLOWS

q(z | x) ≈ p(z | x) ∝ p(x | z) p(z)

Variational inference with normalizing flows:
• Rezende & Mohamed, "Variational inference with normalizing flows."
• v.d. Berg, Hasenclever, Tomczak, Welling, "Sylvester normalizing flows for variational inference."
• Kingma, Salimans, Jozefowicz, Chen, Sutskever, Welling, "Improved variational inference with inverse autoregressive flow."
• Tomczak, Welling, "Improving variational auto-encoders using Householder flow."

Flow-based priors:
• Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Abbeel, "Variational lossy autoencoder."
• Gatopoulos, Tomczak, "Self-Supervised Variational Auto-Encoders."

Thank you!
