
Deep generative modeling:

ARMs and Normalizing Flows


Jakub M. Tomczak
Deep Learning

TYPES OF GENERATIVE MODELS

Generative models:
• Autoregressive models (e.g., PixelCNN)
• Flow-based models (e.g., RealNVP, GLOW)
• Latent variable models:
  - Implicit models (e.g., GANs)
  - Prescribed models (e.g., VAE)
GENERATIVE MODELS

                                         Training    Likelihood     Sampling     Compression
Autoregressive models (e.g., PixelCNN)   Stable      Exact          Slow         No
Flow-based models (e.g., RealNVP)        Stable      Exact          Fast/Slow    No
Implicit models (e.g., GANs)             Unstable    No             Fast         No
Prescribed models (e.g., VAEs)           Stable      Approximate    Fast         Yes
ARMS: AUTOREGRESSIVE MODELS

REPRESENTING A JOINT DISTRIBUTION

There are two main rules in probability theory:

• Sum rule: p(x) = ∑_y p(x, y)

• Product rule: p(x, y) = p(y | x) p(x)

Before, we used these two rules for latent-variable models:

p(x) = ∫ p(x, z) dz = ∫ p(x | z) p(z) dz

Now, we will use the product rule to express the distribution of x ∈ ℝD:

p(x) = p(x1) ∏_{d=2}^{D} p(xd | x<d)

where x<d = [x1, x2, …, xd−1]⊤.

REPRESENTING A JOINT DISTRIBUTION

We use the product rule to express the distribution of x ∈ ℝD:

p(x) = p(x1) ∏_{d=2}^{D} p(xd | x<d)

where x<d = [x1, x2, …, xd−1]⊤.

The order of the variables isn't important.

However, modeling all conditionals separately is infeasible…

Can we do better than that?

REPRESENTING A JOINT DISTRIBUTION

We can assume a finite dependency.

For instance, conditioning only on the last two variables:

p(x) = p(x1) p(x2 | x1) ∏_{d=3}^{D} p(xd | xd−2, xd−1)

Now we can model p(xd | xd−2, xd−1) with a single, shared model.

➡ E.g., we can take a neural network (see the sketch below).

However, this is still quite limiting, because we need to decide on the length of the dependency up front.
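As an illustration of the idea above, here is a minimal sketch of a single shared network for all conditionals p(xd | xd−2, xd−1), assuming PyTorch and binary variables xd ∈ {0, 1}; the network architecture and names are made up for this example, not taken from the slides:

```python
import torch
import torch.nn as nn

# One shared model for every conditional p(x_d | x_{d-2}, x_{d-1}),
# here for binary variables x_d in {0, 1} (hypothetical minimal example).
cond_net = nn.Sequential(
    nn.Linear(2, 64),
    nn.ReLU(),
    nn.Linear(64, 1),   # logit of p(x_d = 1 | x_{d-2}, x_{d-1})
)

def log_prob(x):
    """log p(x) for x of shape (batch, D), ignoring the first two factors
    p(x1) p(x2 | x1) for brevity."""
    logits = cond_net(torch.stack([x[:, :-2], x[:, 1:-1]], dim=-1)).squeeze(-1)
    targets = x[:, 2:]
    return -nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").sum(dim=1)
```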

AUTOREGRESSIVE MODELS (ARM)

Instead, we can use RNNs to model the conditionals:

p(xd | x<d) = p(xd | RNN(xd−1, hd−1))

where hd = RNN(xd−1, hd−1).

Advantages:

➡ We don't need to define the dependencies by hand.

➡ A single parameterization.

However, RNNs are slow, because they are sequential. Can we do better?
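A minimal sketch of such an RNN-based ARM, assuming PyTorch, a GRU, and binary variables; all names here are illustrative rather than from the slides:

```python
import torch
import torch.nn as nn

class RNNARM(nn.Module):
    """Autoregressive model: p(x) = prod_d p(x_d | h_d), with h_d = GRU(x_{d-1}, h_{d-1})."""
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)   # logit of p(x_d = 1 | x_{<d})

    def log_prob(self, x):                # x: (batch, D), values in {0, 1}
        # Shift the input right so that position d only sees x_{<d}.
        x_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(x_in.unsqueeze(-1))
        logits = self.out(h).squeeze(-1)
        return -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
```

Training evaluates all D conditionals in one pass, but sampling still requires D sequential steps through the recurrent cell, which is why ARMs are slow to sample from.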

USING CONVOLUTIONAL NEURAL NETWORKS FOR MODELING SEQUENCES

Let us consider a sequence x = [x1, x2, …, xD]⊤.

We assume all observed data are D-dimensional.

We can use 1D convolutional layers to process the whole signal at once.
Moreover, we can use dilation to learn long-range dependencies.

Notice:
• Causal convolution (i.e., looking only at the "past").
• Very efficient with current DL frameworks.
• With a deep stack of (dilated) layers, the network learns long-range dependencies.

(A sketch of a causal, dilated 1D convolution is given below.)

Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
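A minimal sketch of a WaveNet-style causal, dilated 1D convolution, assuming PyTorch; left-padding followed by a plain convolution is one common way to enforce causality, not necessarily the exact implementation used in the paper:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that only looks at the past: the output at position d
    depends only on inputs at positions <= d."""
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, D)
        x = nn.functional.pad(x, (self.pad, 0))          # pad on the left, not the right
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially.
net = nn.Sequential(*[CausalConv1d(16, 16, dilation=2 ** i) for i in range(6)])
```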




PIXELCNN

We can utilize the very same idea for images, but using 2D convolutions.

We need to keep the convolutions causal!

Moreover, we should think in 2 or even 3 dimensions (C×H×W).

We can accomplish that by composing two Conv2D layers:
• The first Conv2D layer covers the upper part.
• The second Conv2D layer covers the left part.

In practice, causality is enforced by masking the kernel weights (see the sketch below).

Van den Oord, Aaron, et al. "Conditional image generation with PixelCNN decoders." NIPS 2016.
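A minimal sketch of the kernel-weight masking mentioned above, assuming PyTorch; this is the standard mask-'A'/'B' construction used in PixelCNN-style models, written as an illustration rather than copied from the paper:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so that each output pixel depends only on
    pixels above it and to its left. Mask 'A' (first layer) also hides the
    current pixel; mask 'B' (later layers) keeps it."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0.0  # right of the centre
        mask[:, :, kH // 2 + 1:, :] = 0.0                         # below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask        # re-apply the mask before every call
        return super().forward(x)

# Hypothetical usage: a 7x7 mask-'A' convolution over an RGB image.
first_layer = MaskedConv2d("A", in_channels=3, out_channels=64, kernel_size=7, padding=3)
```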

PIXELCNN

Originally, PixelCNN used a softmax non-linearity at the output to predict integers between 0 and 255 (i.e., pixel values).

Currently, a mixture of K discretized logistic distributions is used instead:

P(x ∣ π, μ, s) = ∑_{i=1}^{K} πi [ σ((x + 0.5 − μi)/si) − σ((x − 0.5 − μi)/si) ]

where σ(⋅) is the sigmoid function, and π, μ, s are learnable parameters.

Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications." arXiv (2017).
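A sketch of this likelihood, assuming PyTorch, pixel values in {0, …, 255}, and ignoring the edge cases at 0 and 255 that PixelCNN++ handles separately; the function name and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def discretized_logistic_mixture(x, logit_pi, mu, log_s):
    """P(x | pi, mu, s) for integer pixel values x in {0, ..., 255}.
    x: (...,); logit_pi, mu, log_s: (..., K) mixture parameters."""
    x = x.unsqueeze(-1).float()                        # broadcast over the K components
    inv_s = torch.exp(-log_s)
    cdf_plus = torch.sigmoid((x + 0.5 - mu) * inv_s)   # sigma((x + 0.5 - mu_i) / s_i)
    cdf_minus = torch.sigmoid((x - 0.5 - mu) * inv_s)  # sigma((x - 0.5 - mu_i) / s_i)
    pi = F.softmax(logit_pi, dim=-1)                   # mixture weights sum to 1
    return torch.sum(pi * (cdf_plus - cdf_minus), dim=-1)
```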

PIXELCNN

[Figure: images sampled from PixelCNN++ next to real images.]

Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications." arXiv (2017).
AUTOREGRESSIVE MODELS

Advantages:
✓ Exact likelihood.
✓ Stable training.

Disadvantages:
- Very slow sampling.
- No compression.
- Sometimes, low visual quality.

FLOW-BASED MODELS

CHANGE OF VARIABLES

Let us consider a random variable v ∈ ℝ with p(v) = 𝒩(v | 0, 1).

Then, we take the following transformation: u = 0.75 ⋅ v + 1.

Q: What is the pdf of u?

A: The pdf is p(u) = 𝒩(u | 1, 0.75²).

In general, for an invertible transformation u = f(v), we have:

p(u) = p(f⁻¹(u)) |∂f⁻¹(u)/∂u|

For our example:

f⁻¹(u) = (u − 1)/0.75,   ∂f⁻¹(u)/∂u = 1/0.75 = 4/3

p(u) = p((u − 1)/0.75) ⋅ 4/3 = (1/√(2π ⋅ 0.75²)) exp{−(u − 1)²/(2 ⋅ 0.75²)}
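A quick numeric sanity check of this example; a minimal sketch assuming PyTorch (note that torch.distributions.Normal is parameterized by the standard deviation):

```python
import torch
from torch.distributions import Normal

u = torch.tensor([0.0, 1.0, 2.5])

# Change-of-variables formula: p(u) = N(f^{-1}(u) | 0, 1) * |df^{-1}/du|, with |df^{-1}/du| = 1/0.75.
p_u_formula = Normal(0.0, 1.0).log_prob((u - 1.0) / 0.75).exp() / 0.75

# Direct density of the transformed variable: N(u | 1, 0.75^2).
p_u_direct = Normal(1.0, 0.75).log_prob(u).exp()

print(torch.allclose(p_u_formula, p_u_direct))  # True
```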

CHANGE OF VARIABLES

Multidimensional case:

p(u) = p(f⁻¹(u)) |∂f⁻¹(u)/∂u|

where the volume-change term is the absolute value of the determinant of the Jacobian:

|∂f⁻¹(u)/∂u| = |det Jf⁻¹(u)|

        ⎡ ∂f1⁻¹/∂u1  …  ∂f1⁻¹/∂uD ⎤
Jf⁻¹ =  ⎢     ⋮      ⋱      ⋮     ⎥
        ⎣ ∂fD⁻¹/∂u1  …  ∂fD⁻¹/∂uD ⎦

How can we utilize this idea?

APPLYING CHANGE OF VARIABLES AND INVERTIBLE TRANSFORMATIONS

Let us consider a sequence of invertible transformations fk : ℝD → ℝD.

We can start with a simple distribution, e.g., π(z) = 𝒩(z | 0, I).

[Figure: z0 in the "latent" space is pushed through f1, …, fK to obtain x in pixel space.]

This results in:

p(x) = π(z0) ∏_{i=1}^{K} |det( ∂fi(zi−1)/∂zi−1 )|⁻¹

Rippel, O., & Adams, R. P. (2013). High-dimensional probability estimation with deep density models. arXiv.

2D EXAMPLE

[Figure: a 2D example in which {fk} maps a simple base density to a complex one, and {fk⁻¹} maps it back.]
DENSITY MODELING WITH INVERTIBLE NEURAL NETWORKS

The density model:

p(x) = π(z0) ∏_{i=1}^{K} |det( ∂fi(zi−1)/∂zi−1 )|⁻¹

In order to obtain flexible transformations fk, we use neural networks.

However, we need to ensure that they are invertible.

Moreover, we cannot use just any invertible neural network, because we also need its Jacobian determinant!

Calculating the Jacobian determinant is the main challenge in flow-based models. (A sketch of how the log-likelihood is accumulated across layers follows below.)
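A minimal sketch of how such a model evaluates log p(x), assuming PyTorch; each layer is assumed to map the data toward the base distribution and to return its log-Jacobian-determinant (the same convention as the coupling-layer sketch further below), and all class names are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Flow(nn.Module):
    """Composition of invertible layers. Each layer's forward(x) returns
    (z, log|det dz/dx|); log p(x) = log pi(z0) + sum_i log|det dz_i/dx_i|."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.base = Normal(0.0, 1.0)          # factorized standard Gaussian base

    def log_prob(self, x):                    # x: (batch, D)
        log_det = torch.zeros(x.shape[0], device=x.device)
        z = x
        for layer in self.layers:             # data -> latent direction
            z, ld = layer(z)
            log_det = log_det + ld
        return self.base.log_prob(z).sum(dim=1) + log_det

    def sample(self, n, dim):
        z = self.base.sample((n, dim))
        for layer in reversed(self.layers):   # latent -> data uses the inverses
            z = layer.inverse(z)
        return z
```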

REALNVP

Design the invertible transformations as follows:

y1:d = x1:d
yd+1:D = xd+1:D ⊙ exp(s(x1:d)) + t(x1:d)

where s(⋅) and t(⋅) are arbitrary neural networks. This is known as an affine coupling layer.

It is invertible by design, because:

x1:d = y1:d
xd+1:D = (yd+1:D − t(y1:d)) ⊙ exp(−s(y1:d))

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

REALNVP

The invertible transformation:

y1:d = x1:d
yd+1:D = xd+1:D ⊙ exp(s(x1:d)) + t(x1:d)

What about the Jacobian? It is block-triangular, so its determinant is easy to calculate:

det(J) = ∏_{j=1}^{D−d} exp(s(x1:d))j = exp( ∑_{j=1}^{D−d} s(x1:d)j )
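A minimal sketch of an affine coupling layer, assuming PyTorch and flattened, fully-connected inputs rather than the convolutional couplings used in the paper; names are illustrative. The forward pass returns the log-Jacobian-determinant ∑_j s(x1:d)j derived above, so it plugs directly into the Flow sketch from the previous section:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style affine coupling layer (sketch): the first d dimensions
    pass through unchanged, the remaining D - d are scaled and shifted."""
    def __init__(self, dim, d, hidden=256):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - d)),   # outputs [s, t] for x_{d+1:D}
        )

    def forward(self, x):                       # used in the data -> latent direction
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                  # log|det J| = sum_j s(x_{1:d})_j
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=1)
```

Stacking several couplings, with the split (or a permutation) changed between them, gives every dimension a chance to be transformed.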
REALNVP

We can also introduce the idea of autoregressive modeling here:

p(z1) p(z2) p(z3 | z1, z2) p(z4 | z1, z2, z3)
REALNVP

Moreover, we can use additional transformations:

1. Permutations of variables (these are invertible).
→ this helps to "mix" variables.
2. Dividing variables using a checkerboard pattern.
→ this helps to learn higher-order dependencies.
3. Squeezing, i.e., reshaping the input tensor (see the sketch below).
→ reshaping can also help to "mix" variables.
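A minimal sketch of the squeezing operation, assuming PyTorch and image tensors of shape (B, C, H, W) with even H and W; this is one common way to implement it, not necessarily the exact code from the paper:

```python
import torch

def squeeze(x):
    """Trade spatial size for channels: (B, C, H, W) -> (B, 4C, H/2, W/2)."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // 2, 2, W // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(B, 4 * C, H // 2, W // 2)

def unsqueeze(x):
    """Inverse of squeeze: (B, 4C, H, W) -> (B, C, 2H, 2W)."""
    B, C4, H, W = x.shape
    C = C4 // 4
    x = x.view(B, C, 2, 2, H, W)
    x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
    return x.view(B, C, 2 * H, 2 * W)
```

Since squeezing only reshuffles entries, it is invertible and its Jacobian determinant is 1 (log-determinant 0).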

GLOW: REALNVP WITH 1X1 CONVOLUTIONS

The model contains roughly 1000 convolutions.

A new component: an invertible 1×1 convolution instead of a fixed permutation matrix (see the sketch below).

Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. NeurIPS 2018.
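A minimal sketch of such an invertible 1×1 convolution, assuming PyTorch; GLOW additionally uses an LU parameterization of W to make the determinant cheaper, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvertibleConv1x1(nn.Module):
    """GLOW-style invertible 1x1 convolution (sketch): a learned C x C mixing
    matrix W applied across channels at every spatial location."""
    def __init__(self, channels):
        super().__init__()
        w, _ = torch.linalg.qr(torch.randn(channels, channels))  # random rotation: invertible init
        self.weight = nn.Parameter(w)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        z = F.conv2d(x, self.weight.view(C, C, 1, 1))
        log_det = H * W * torch.slogdet(self.weight)[1]    # the same det at every pixel
        return z, log_det

    def inverse(self, z):
        C = z.shape[1]
        w_inv = torch.inverse(self.weight)
        return F.conv2d(z, w_inv.view(C, C, 1, 1))
```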

GLOW: SAMPLES

[Figure: faces sampled by GLOW trained on CelebA-HQ.]

GLOW: LATENT INTERPOLATION

[Figure: interpolations between faces in GLOW's latent space (CelebA-HQ).]
VAES WITH NORMALIZING FLOWS

q(z | x) ≈ p(z | x) ∝ p(x | z) p(z)

Variational inference with normalizing flows:
• Rezende & Mohamed, "Variational inference with normalizing flows."
• v.d. Berg, Hasenclever, Tomczak, Welling, "Sylvester normalizing flows for variational inference."
• Kingma, Salimans, Jozefowicz, Chen, Sutskever, Welling, "Improved variational inference with inverse autoregressive flow."
• Tomczak, Welling, "Improving variational auto-encoders using Householder flow."

Flow-based priors:
• Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Abbeel, "Variational lossy autoencoder."
• Gatopoulos, Tomczak, "Self-Supervised Variational Auto-Encoders."

Thank you!
