
2. Diffusion Models
In the previous notebook, we learned how to separate noise from an image
using a U-Net, but it was not capable of generating believable new images from
noise. Diffusion models are much better at generating images from scratch.

The good news is that our neural network model will not change much. We will be
building off of the U-Net architecture with some slight modifications.

Instead, the big difference is how we use our model. Rather than adding noise to
our images all at once, we will be adding a small amount of noise multiple times.
We can then use our neural network on a noisy image multiple times to
generate a new image like so:

Learning Objectives
The goals of this notebook are to:

Construct a forward diffusion variance schedule
Define the forward diffusion function, q
Update the U-Net architecture to accommodate a timestep, t
Train a model to detect noise added to an image based on the timestep t
Define a reverse diffusion function to emulate p
Attempt to generate articles of clothing (again)

We've moved some of the functions from the previous notebook into a utils.py
file. We can use it to reload the fashionMNIST dataset:

In [1]: import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.optim import Adam

# Visualization tools
import matplotlib.pyplot as plt
from IPython.display import Image

# User defined libraries
from utils import other_utils

IMG_SIZE = 16
IMG_CH = 1
BATCH_SIZE = 128
data, dataloader = other_utils.load_transformed_fashionMNIST(IMG_SIZE, BATCH_SIZE)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 26421880/26421880 [00:01<00:00, 13842662.92it/s]
Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 29515/29515 [00:00<00:00, 326793.37it/s]
Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 4422102/4422102 [00:02<00:00, 1688015.43it/s]
Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 5148/5148 [00:00<00:00, 13126004.25it/s]
Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

2.1 Forward Diffusion


Let T be the number of times we will add noise to an image. We can use t to
keep track of the current timestep.

In the previous notebook, we used the term beta to represent what percentage of
the new image was noise compared to the original image. The default was 50%
noise and 50% original image. This time, we will use a variance schedule,
represented as $\beta_t$, or B in code. This will describe how much noise will
be added to our image at each timestep t.

In section 4 of the paper Denoising Diffusion Probabilistic Models, the authors
discuss the art of defining a good schedule. It should be large enough for the
model to recognize that noise was added (especially since the image may already
be noisy), but still as small as possible.

In [2]: nrows = 10
ncols = 15

T = nrows * ncols
start = 0.0001
end = 0.02
B = torch.linspace(start, end, T).to(device)
B


Out[2]: tensor([1.0000e-04, 2.3356e-04, 3.6711e-04, 5.0067e-04, 6.3423e-04, 7.6779e-04,
                9.0134e-04, 1.0349e-03, 1.1685e-03, 1.3020e-03, 1.4356e-03, 1.5691e-03,
                ...
                1.9332e-02, 1.9466e-02, 1.9599e-02, 1.9733e-02, 1.9866e-02, 2.0000e-02],
               device='cuda:0')

A Normal Distribution has the following signature:

$\mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

which reads as "the normal distribution of $x$ with parameters $\mu$ (the mean)
and $\sigma^2$ (the variance)." When $\mu$ is 0 and $\sigma$ is 1, we have a
standard normal distribution $\mathcal{N}(x;0,1)$, which has the probability
density of the shape below:
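
If you would like to re-create that shape yourself, here is a minimal sketch (an illustrative extra, assuming the torch and matplotlib.pyplot imports above):

import math

x = torch.linspace(-4, 4, 200)
pdf = torch.exp(-0.5 * x ** 2) / math.sqrt(2 * math.pi)  # density of N(x; 0, 1)
plt.plot(x, pdf)
plt.title("Standard Normal Probability Density")
plt.show()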

If we are altering our image with noise multiple times across many timesteps,
let's describe $\mathbf{x}_{t}$ as our image at timestep $t$. Then,
$\mathbf{x}_{t-1}$ would be the image at the previous timestep and $\mathbf{x}_{0}$
would be the original image.

In the previous notebook, we added noise to images using the equation:

$q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};(1-\beta_{t}) \cdot \mathbf{x}_{t-1},\beta_{t}^{2} \cdot \mathbf{I})$

where $q$ represents a probability distribution for the forward diffusion process
and $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ describes the probability distribution for
a new, noisier image $\mathbf{x}_{t}$ based on $\mathbf{x}_{t-1}$.

This time, we will alter images with a similar equation:

$q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}} \cdot \mathbf{x}_{t-1},\beta_{t} \cdot \mathbf{I})$

We can sample from this probability distribution by first sampling from a
standard normal distribution $\mathcal{N}(x;0,1)$ using torch.randn_like:

noise = torch.randn_like(x_t)

We can then scale and shift this noise to sample from q:

x_t = torch.sqrt(1 - B[t]) * x_t + torch.sqrt(B[t]) * noise

Let's see all of this in practice. Run the code cell below to perform forward
diffusion T (or 150) times on the first image of our dataset.

In [3]: plt.figure(figsize=(8, 8))

x_0 = data[0][0].to(device)  # Initial image
x_t = x_0  # Set up recursion
xs = []  # Store x_t for each T to see change

for t in range(T):
    noise = torch.randn_like(x_t)
    x_t = torch.sqrt(1 - B[t]) * x_t + torch.sqrt(B[t]) * noise  # sample
    img = torch.squeeze(x_t).cpu()
    xs.append(img)
    ax = plt.subplot(nrows, ncols, t + 1)
    ax.axis("off")
    plt.imshow(img)
plt.savefig("forward_diffusion.png", bbox_inches="tight")

Or in animated form:

In [4]: gif_name = "forward_diffusion.gif"
other_utils.save_animation(xs, gif_name)

MovieWriter ffmpeg unavailable; using Pillow instead.


In [5]: Image(open(gif_name,'rb').read())

Out[5]:

2.2 Skipping Ahead

We could take each image of our dataset and add noise to them T times to
create T more new images, but do we need to?


Thanks to the power of recursion, we can estimate what $\mathbf{x}_t$ would look like
given our beta schedule $\beta_t$. A full breakdown of the math can be found in
Lilian Weng's Blog. Let's bring back alpha, which is the complement of $\beta$.
We can define $\alpha_t$ as $1 - \beta_t$, and we can define $\bar{\alpha}_t$ as
the cumulative product of $\alpha_t$.

For example, $\bar{\alpha}_3 = \alpha_0 \cdot \alpha_1 \cdot \alpha_2 \cdot \alpha_3$

Because of the bar symbol, let's call $\bar{\alpha}_t$ a_bar. Our new noisy
image distribution becomes:

$q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}} \cdot \mathbf{x}_{0},(1 - \bar{\alpha}_t) \cdot \mathbf{I})$
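
As a brief sketch of why this closed form holds (the full derivation is in Lilian Weng's blog): substituting the single-step update into itself and merging the Gaussian noise terms gives

$\mathbf{x}_t = \sqrt{\alpha_t} \cdot \mathbf{x}_{t-1} + \sqrt{1-\alpha_t} \cdot \epsilon_{t-1} = \sqrt{\alpha_t \alpha_{t-1}} \cdot \mathbf{x}_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}} \cdot \bar{\epsilon} = \dots = \sqrt{\bar{\alpha}_t} \cdot \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon$

where $\epsilon$ and $\bar{\epsilon}$ are standard normal noise.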

Which translates to code as:

x_t = sqrt_a_bar_t * x_0 + sqrt_one_minus_a_bar_t * noise

We are now no longer dependent on $\mathbf{x}_{t-1}$ and can estimate
$\mathbf{x}_t$ from $\mathbf{x}_0$. Let's define these variables in code:

In [6]: a = 1. - B
a_bar = torch.cumprod(a, dim=0)
sqrt_a_bar = torch.sqrt(a_bar) # Mean Coefficient
sqrt_one_minus_a_bar = torch.sqrt(1 - a_bar) # St. Dev. Coefficient
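
As a quick sanity check (purely illustrative, not part of the original notebook), a_bar[3] should match the product of the first four alphas:

print(a_bar[3])                                             # Cumulative product up to index 3
print(a[0] * a[1] * a[2] * a[3])                            # Manual product of the same terms
print(torch.allclose(a_bar[3], a[0] * a[1] * a[2] * a[3]))  # Expected: True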

We have all the pieces; let's code our forward diffusion sampling function q:

$q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}} \cdot \mathbf{x}_{0},(1 - \bar{\alpha}_t) \cdot \mathbf{I})$

Currently, sqrt_a_bar and sqrt_one_minus_a_bar only have one
dimension, and if we index into them with t, they will each only have one
value. If we want to multiply this value with each of the pixel values in our
images, we will need to match the number of dimensions in order to broadcast.

We can add an extra dimension by indexing with None. This is a PyTorch
shortcut to add an extra dimension to the resulting tensor. For reference, a batch
of images has the dimensions: batch dimension x image channels x
image height x image width.
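
For example, here is an illustrative snippet (using a hypothetical t_example index) showing how None indexing changes the shape:

t_example = torch.tensor([3], device=device)
print(sqrt_a_bar[t_example].shape)                    # torch.Size([1])
print(sqrt_a_bar[t_example, None, None, None].shape)  # torch.Size([1, 1, 1, 1])
# The trailing size-1 dimensions broadcast against a batch of images
# shaped (batch, channels, height, width).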

In [7]: def q(x_0, t):
    """
    Samples a new image from q at timestep t.
    Returns the noisy image x_t and the noise applied to it.
    x_0: the original image
    t: timestep
    """
    t = t.int()
    noise = torch.randn_like(x_0)
    sqrt_a_bar_t = sqrt_a_bar[t, None, None, None]
    sqrt_one_minus_a_bar_t = sqrt_one_minus_a_bar[t, None, None, None]

    x_t = sqrt_a_bar_t * x_0 + sqrt_one_minus_a_bar_t * noise
    return x_t, noise

Let's test out this new method compared to our old method of recursively
generating the images.

In [8]: plt.figure(figsize=(8, 8))

xs = []

for t in range(T):
    t_tensor = torch.Tensor([t]).type(torch.int64)
    x_t, _ = q(x_0, t_tensor)
    img = torch.squeeze(x_t).cpu()
    xs.append(img)
    ax = plt.subplot(nrows, ncols, t + 1)
    ax.axis('off')
    other_utils.show_tensor_image(x_t)
plt.savefig("forward_diffusion_skip.png", bbox_inches='tight')


In [9]: gif_name = "forward_diffusion_skip.gif"
other_utils.save_animation(xs, gif_name)

MovieWriter ffmpeg unavailable; using Pillow instead.


In [10]: Image(open(gif_name,'rb').read())

Out[10]:

Compared to the previous technique, can you see any differences? When noise
is added sequentially, there is a smaller difference between the images of
consecutive timesteps. Despite this, the neural network will do a good job
separating the noise from the original image in the reverse diffusion process.


2.3 Predicting Noise

The architecture for our neural network will mostly be the same as before.
However, because the amount of noise added changes with each timestep, we
will need a way to tell the model which timestep our input image is at.

To do that, we can create an embedding block like the one below.

input_dim is the number of dimensions of the value we'd like to embed.
We'll be embedding t, which is a one-dimensional scalar.
emb_dim is the number of dimensions we would like to convert our input
value into by using a Linear layer.
UnFlatten is used to reshape a vector into a multidimensional space. Since
we'll be adding the result of this embedding to a multidimensional feature
map, we will add a few extra dimensions, similar to how we expanded the
dimensions in the q function above.

In [11]: class EmbedBlock(nn.Module):
    def __init__(self, input_dim, emb_dim):
        super().__init__()
        self.input_dim = input_dim
        layers = [
            nn.Linear(input_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
            nn.Unflatten(1, (emb_dim, 1, 1))
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, input):
        input = input.view(-1, self.input_dim)
        return self.model(input)
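
As a quick, illustrative shape check (not part of the original notebook), a single scalar timestep becomes a (batch, emb_dim, 1, 1) tensor that can be added to a feature map:

t_emb = EmbedBlock(input_dim=1, emb_dim=16)
print(t_emb(torch.tensor([[0.5]])).shape)  # torch.Size([1, 16, 1, 1])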

We will add this time embedding block to each UpBlock of our U-Net,
resulting in the following architecture.

TODO: Our DownBlock is the same as before. Using the above image as a
reference, can you replace the FIXMEs with the correct variable? Each FIXME
can be one of:

in_chs
out_chs
kernel_size
stride
padding

Click the ... below for the correct answer.


In [12]: class DownBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        kernel_size = 3
        stride = 1
        padding = 1

        super().__init__()
        layers = [
            nn.Conv2d(FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU(),
            nn.Conv2d(FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU(),
            nn.MaxPool2d(2)
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

In [19]: class DownBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        kernel_size = 3
        stride = 1
        padding = 1

        super().__init__()
        layers = [
            nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding),
            nn.BatchNorm2d(out_chs),
            nn.ReLU(),
            nn.Conv2d(out_chs, out_chs, kernel_size, stride, padding),
            nn.BatchNorm2d(out_chs),
            nn.ReLU(),
            nn.MaxPool2d(2)
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
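
As a hypothetical shape check, a DownBlock changes the channel count and halves the spatial size:

down = DownBlock(16, 32)
print(down(torch.randn(1, 16, 16, 16)).shape)  # torch.Size([1, 32, 8, 8])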

The UpBlock follows a similar logic, but instead uses Transposed Convolution.

TODO: Can you replace the FIXMEs with the correct variable? Each FIXME can
be one of:

in_chs
out_chs
kernel_size
stride
padding


strideT
out_paddingT
x
skip

Click the ... below for the correct answer.

In [13]: class UpBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        # Convolution variables
        kernel_size = 3
        stride = 1
        padding = 1

        # Transpose variables
        strideT = 2
        out_paddingT = 1

        super().__init__()
        # 2 * in_chs for concatenated skip connection
        layers = [
            nn.ConvTranspose2d(FIXME, FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU(),
            nn.Conv2d(FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU()
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = torch.cat((FIXME, FIXME), 1)
        x = self.model(FIXME)
        return x

In [20]: class UpBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        # Convolution variables
        kernel_size = 3
        stride = 1
        padding = 1

        # Transpose variables
        strideT = 2
        out_paddingT = 1

        super().__init__()
        # 2 * in_chs for concatenated skip connection
        layers = [
            nn.ConvTranspose2d(2 * in_chs, out_chs, kernel_size, strideT, padding, out_paddingT),
            nn.BatchNorm2d(out_chs),
            nn.ReLU(),
            nn.Conv2d(out_chs, out_chs, kernel_size, stride, padding),
            nn.BatchNorm2d(out_chs),
            nn.ReLU()
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = torch.cat((x, skip), 1)
        x = self.model(x)
        return x
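
And the matching hypothetical check for an UpBlock: given a feature map and an equally shaped skip connection, it doubles the spatial size:

up = UpBlock(64, 32)
up_in = torch.randn(1, 64, 4, 4)
skip_in = torch.randn(1, 64, 4, 4)
print(up(up_in, skip_in).shape)  # torch.Size([1, 32, 8, 8])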

The final U-Net is similar to what we used in the first lab. The difference is that
we now have a time embedding connected to our UpBlocks.

TODO: While the time embeddings have been integrated into the model, there
are still a number of FIXMEs to replace. This time, the image channels, up
channels, and down channels need fixing. Can you work down and up the U-Net
to set the correct number of channels at each step?

Each FIXME could be:

img_chs
A value in down_chs
A value in up_chs

Click the ... below for the correct answer.

In [17]: class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        img_chs = IMG_CH
        down_chs = (16, 32, 64)
        up_chs = down_chs[::-1]  # Reverse of the down channels
        latent_image_size = IMG_SIZE // 4  # 2 ** (len(down_chs) - 1)
        t_dim = 1  # New

        # Initial convolution
        self.down0 = nn.Sequential(
            nn.Conv2d(FIXME, down_chs[0], 3, padding=1),
            nn.BatchNorm2d(FIXME),
            nn.ReLU()
        )

        # Downsample
        self.down1 = DownBlock(down_chs[0], down_chs[1])
        self.down2 = DownBlock(FIXME, FIXME)
        self.to_vec = nn.Sequential(nn.Flatten(), nn.ReLU())

        # Embeddings
        self.dense_emb = nn.Sequential(
            nn.Linear(FIXME*latent_image_size**2, down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], FIXME),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[2]*latent_image_size**2),
            nn.ReLU()
        )
        self.temb_1 = EmbedBlock(t_dim, up_chs[0])  # New
        self.temb_2 = EmbedBlock(t_dim, up_chs[1])  # New

        # Upsample
        self.up0 = nn.Sequential(
            nn.Unflatten(1, (FIXME, latent_image_size, latent_image_size)),
            nn.Conv2d(FIXME, up_chs[0], 3, padding=1),
            nn.BatchNorm2d(up_chs[0]),
            nn.ReLU(),
        )
        self.up1 = UpBlock(up_chs[0], up_chs[1])
        self.up2 = UpBlock(FIXME, FIXME)

        # Match output channels
        self.out = nn.Sequential(
            nn.Conv2d(FIXME, FIXME, 3, 1, 1),
            nn.BatchNorm2d(up_chs[-1]),
            nn.ReLU(),
            nn.Conv2d(up_chs[-1], img_chs, 3, 1, 1)
        )

    def forward(self, x, t):
        down0 = self.down0(x)
        down1 = self.down1(down0)
        down2 = self.down2(down1)
        latent_vec = self.to_vec(down2)

        latent_vec = self.dense_emb(latent_vec)
        # New
        t = t.float() / T  # Convert from [0, T] to [0, 1]
        temb_1 = self.temb_1(t)
        temb_2 = self.temb_2(t)

        up0 = self.up0(latent_vec)
        up1 = self.up1(up0 + temb_1, down2)
        up2 = self.up2(up1 + temb_2, down1)
        return self.out(up2)

In [18]: class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        img_chs = IMG_CH
        down_chs = (16, 32, 64)
        up_chs = down_chs[::-1]  # Reverse of the down channels
        latent_image_size = IMG_SIZE // 4  # 2 ** (len(down_chs) - 1)
        t_dim = 1  # New

        # Initial convolution
        self.down0 = nn.Sequential(
            nn.Conv2d(img_chs, down_chs[0], 3, padding=1),
            nn.BatchNorm2d(down_chs[0]),
            nn.ReLU()
        )

        # Downsample
        self.down1 = DownBlock(down_chs[0], down_chs[1])
        self.down2 = DownBlock(down_chs[1], down_chs[2])
        self.to_vec = nn.Sequential(nn.Flatten(), nn.ReLU())

        # Embeddings
        self.dense_emb = nn.Sequential(
            nn.Linear(down_chs[2]*latent_image_size**2, down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[2]*latent_image_size**2),
            nn.ReLU()
        )
        self.temb_1 = EmbedBlock(t_dim, up_chs[0])  # New
        self.temb_2 = EmbedBlock(t_dim, up_chs[1])  # New

        # Upsample
        self.up0 = nn.Sequential(
            nn.Unflatten(1, (up_chs[0], latent_image_size, latent_image_size)),
            nn.Conv2d(up_chs[0], up_chs[0], 3, padding=1),
            nn.BatchNorm2d(up_chs[0]),
            nn.ReLU(),
        )
        self.up1 = UpBlock(up_chs[0], up_chs[1])
        self.up2 = UpBlock(up_chs[1], up_chs[2])

        # Match output channels
        self.out = nn.Sequential(
            nn.Conv2d(up_chs[-1], up_chs[-1], 3, 1, 1),
            nn.BatchNorm2d(up_chs[-1]),
            nn.ReLU(),
            nn.Conv2d(up_chs[-1], img_chs, 3, 1, 1)
        )

    def forward(self, x, t):
        down0 = self.down0(x)
        down1 = self.down1(down0)
        down2 = self.down2(down1)
        latent_vec = self.to_vec(down2)

        # New
        t = t.float() / T  # Convert from [0, T] to [0, 1]
        latent_vec = self.dense_emb(latent_vec)
        temb_1 = self.temb_1(t)
        temb_2 = self.temb_2(t)

        up0 = self.up0(latent_vec)
        up1 = self.up1(up0 + temb_1, down2)
        up2 = self.up2(up1 + temb_2, down1)
        return self.out(up2)
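
As one more illustrative check (on an uncompiled, CPU copy of the model, not part of the original notebook), the U-Net maps a noisy image and a timestep to an image-shaped noise prediction:

test_model = UNet()
test_x = torch.randn(1, IMG_CH, IMG_SIZE, IMG_SIZE)
test_t = torch.tensor([5])
print(test_model(test_x, test_t).shape)  # torch.Size([1, 1, 16, 16])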


In [21]: model = UNet()
print("Num params: ", sum(p.numel() for p in model.parameters()))
model = torch.compile(UNet().to(device))

Num params: 240385

2.3.1 The Loss Function

In the first notebook, we used a Mean Squared Error loss function comparing
the original image and the predicted original image based on the noise.

This time, we'll compare the real noise that was added to the image and the
predicted noise. Lilian Weng goes into the math in this blog post. Originally, the
loss function was based on the Evidence Lower Bound (ELBO) log-likelihood,
but it was found in the Denoising Diffusion Probabilistic Models paper that the
Mean Squared Error between the predicted noise and the true noise was better in
practice. If curious, Lilian Weng walks through the derivation here.

In [22]: def get_loss(model, x_0, t):
    x_noisy, noise = q(x_0, t)
    noise_pred = model(x_noisy, t)
    return F.mse_loss(noise, noise_pred)

2.4 Reverse Diffusion

We now have a model that predicts the noise added to an image at timestep t,
but generating images is not as easy as repeatedly subtracting and adding
noise. The q function can be reversed such that we generate $\mathbf{x}_{t-1}$
from $\mathbf{x}_t$:

$q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0}) = \mathcal{N}(\mathbf{x}_{t-1};\mathbf{\tilde{\mu}}(\mathbf{x}_t,\mathbf{x}_0), \tilde{\beta}_t \cdot \mathbf{I})$

Note: $\tilde{\beta}_t$ was originally calculated to be $\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_t$, but in practice, using only $\beta_t$ is more effective.

Using Bayes' Theorem, we can derive an equation for the model mean ( u_t in code) at
timestep t.

$\mathbf{\tilde{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\mathbf{\epsilon}_t\right)$

The image $\mathbf{x}_{t-1}$ can then be sampled from $\mathcal{N}(\mathbf{\tilde{\mu}}_t,
\tilde{\beta}_t \cdot \mathbf{I})$, so we'll use this distribution to generate sample
images recursively until we reach t == 0. Let's see what this means in code.
First, we will precompute the values needed to calculate u_t.


In [23]: sqrt_a_inv = torch.sqrt(1 / a)
pred_noise_coeff = (1 - a) / torch.sqrt(1 - a_bar)

Next, we will create the reverse diffusion function, reverse_q .

In [24]: @torch.no_grad()
def reverse_q(x_t, t, e_t):
    t = torch.squeeze(t[0].int())  # All t values should be the same
    pred_noise_coeff_t = pred_noise_coeff[t]
    sqrt_a_inv_t = sqrt_a_inv[t]
    u_t = sqrt_a_inv_t * (x_t - pred_noise_coeff_t * e_t)
    if t == 0:
        return u_t  # Reverse diffusion complete!
    else:
        B_t = B[t-1]
        new_noise = torch.randn_like(x_t)
        return u_t + torch.sqrt(B_t) * new_noise

Let's create a function to iteratively remove noise from an image until it is noise
free. Let's also display these images so we can see how the model is improving.

In [25]: @torch.no_grad()
def sample_images(ncols, figsize=(8,8)):
    plt.figure(figsize=figsize)
    plt.axis("off")
    hidden_rows = T / ncols

    # Noise to generate images from
    x_t = torch.randn((1, IMG_CH, IMG_SIZE, IMG_SIZE), device=device)

    # Go from T to 0 removing and adding noise until t = 0
    plot_number = 1
    for i in range(0, T)[::-1]:
        t = torch.full((1,), i, device=device)
        e_t = model(x_t, t)  # Predicted noise
        x_t = reverse_q(x_t, t, e_t)
        if i % hidden_rows == 0:
            ax = plt.subplot(1, ncols+1, plot_number)
            ax.axis('off')
            other_utils.show_tensor_image(x_t.detach().cpu())
            plot_number += 1
    plt.show()

Time to train the model! As it runs, does it look like the model is learning?

In [26]: optimizer = Adam(model.parameters(), lr=0.001)
epochs = 3
ncols = 15  # Should evenly divide T

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()

        t = torch.randint(0, T, (BATCH_SIZE,), device=device)
        x = batch[0].to(device)
        loss = get_loss(model, x, t)
        loss.backward()
        optimizer.step()

        if epoch % 1 == 0 and step % 100 == 0:
            print(f"Epoch {epoch} | Step {step:03d} | Loss: {loss.item()}")
            sample_images(ncols)
print("Final sample:")
sample_images(ncols)

Epoch 0 | Step 000 | Loss: 1.1125600337982178

/tmp/ipykernel_41/3817596156.py:17: MatplotlibDeprecationWarning: Auto-removal of overlapping axes is deprecated since 3.6 and will be removed two minor releases later; explicitly call ax.remove() as needed.
  ax = plt.subplot(1, ncols+1, plot_number)

Epoch 0 | Step 100 | Loss: 0.40190282464027405

Epoch 0 | Step 200 | Loss: 0.3047477602958679

Epoch 0 | Step 300 | Loss: 0.23347440361976624

Epoch 0 | Step 400 | Loss: 0.2521078586578369

Epoch 0 | Step 500 | Loss: 0.21015289425849915

Epoch 1 | Step 000 | Loss: 0.22007949650287628

Epoch 1 | Step 100 | Loss: 0.22005987167358398

Epoch 1 | Step 200 | Loss: 0.22039155662059784

Epoch 1 | Step 300 | Loss: 0.17277097702026367


Epoch 1 | Step 400 | Loss: 0.1892915666103363

Epoch 1 | Step 500 | Loss: 0.16379770636558533

Epoch 2 | Step 000 | Loss: 0.1794012486934662

Epoch 2 | Step 100 | Loss: 0.16939221322536469

Epoch 2 | Step 200 | Loss: 0.15691816806793213

Epoch 2 | Step 300 | Loss: 0.17358076572418213

Epoch 2 | Step 400 | Loss: 0.19089604914188385

Epoch 2 | Step 500 | Loss: 0.18465223908424377

Final sample:

If you squint your eyes, can you make out what the model is generating?

In [27]: model.eval()
figsize = (8, 8)  # Change me
ncols = 3  # Should evenly divide T
for _ in range(10):
    sample_images(ncols, figsize=figsize)

/tmp/ipykernel_41/3817596156.py:17: MatplotlibDeprecationWarning: Auto-removal of overlapping axes is deprecated since 3.6 and will be removed two minor releases later; explicitly call ax.remove() as needed.
  ax = plt.subplot(1, ncols+1, plot_number)


2.5 Next
The model is learning ... something. It looks a little pixelated. Why would that
be? Continue to the next notebook to find out more!

In [1]: import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Out[1]: {'status': 'ok', 'restart': True}
