2. Diffusion Models
In the previous notebook, we learned how to separate noise from an image
using a U-Net, but it was not capable of generating believable new images from
noise. Diffusion models are much better at generating images from scratch.
The good news is that our neural network model will not change much. We will be
building off of the U-Net architecture with some slight modifications.
Instead, the big difference is how we use our model. Rather than adding noise to
our images all at once, we will be adding a small amount of noise multiple times.
We can then use our neural network on a noisy image multiple times to
generate a new image like so:
Learning Objectives
The goals of this notebook are to:
Build a forward diffusion process that gradually adds noise to an image
Add time embeddings to our U-Net architecture
Define a reverse diffusion process to generate new images from noise
Train the model and sample generated images from it
We've moved some of the functions from the previous notebook into a utils.py
file. We can use it to reload the fashionMNIST dataset:
In [1]: import torch

# Visualization tools
import matplotlib.pyplot as plt
from IPython.display import Image

# Functions carried over from the previous notebook; the exact import path for
# utils.py is an assumption here
from utils import other_utils

IMG_SIZE = 16
IMG_CH = 1
BATCH_SIZE = 128
data, dataloader = other_utils.load_transformed_fashionMNIST(IMG_SIZE, BATCH_SIZE)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 26421880/26421880 [00:01<00:00, 13842662.92it/s]
Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 29515/29515 [00:00<00:00, 326793.37it/s]
Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 4422102/4422102 [00:02<00:00, 1688015.43it/s]
Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 5148/5148 [00:00<00:00, 13126004.25it/s]
Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
Let T be the number of times we will add noise to an image. We can use t to
keep track of the current timestep. Below, B is our beta schedule: the amount of
noise $\beta_t$ added at each timestep, increasing linearly from start to end.
In [2]: nrows = 10
ncols = 15
T = nrows * ncols
start = 0.0001
end = 0.02
B = torch.linspace(start, end, T).to(device)
B
$\mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
which reads as "the normal distribution of $x$ with parameters $\mu$ (the mean)
and $\sigma^2$ (the variance)". When $\mu$ is 0 and $\sigma$ is 1, we have a
standard normal distribution $\mathcal{N}(x;0,1)$, which has the probability
density of the shape below:
If we are altering our image with noise multiple times across many timesteps,
let's describe $\mathbf{x}_{t}$ as our image at timestep $t$. Then,
$\mathbf{x}_{t-1}$ would be the image at the previous timestep and $\mathbf{x}_{0}$
would be the original image.
$q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};(1-\beta_{t}) \cdot \mathbf{x}_{t-1},\beta_{t}^{2} \cdot \mathbf{I})$

Where $q$ represents a probability distribution for the forward diffusion process
and $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ describes the probability distribution for
a new, noisier image $\mathbf{x}_{t}$ based on $\mathbf{x}_{t-1}$. In practice, the
mean is scaled by $\sqrt{1-\beta_{t}}$ rather than $1-\beta_{t}$, and the variance
is $\beta_{t}$ rather than $\beta_{t}^{2}$: the square root keeps the overall
variance of our images steady no matter how many times we apply the step.

$q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}} \cdot \mathbf{x}_{t-1},\beta_{t} \cdot \mathbf{I})$
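To see why the square root matters, here is a quick check (a hypothetical aside,
not a cell from this notebook): apply one forward step to unit-variance noise and
confirm the variance stays near 1.

import torch

x = torch.randn(100_000)  # a stand-in "image" with unit variance
beta = 0.02
x_next = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
print(x.var().item(), x_next.var().item())  # both approximately 1.0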
To sample noise from a standard normal distribution with the same shape as our
image, we can use noise = torch.randn_like(x_t).
Let's see all of this in practice. Run the code cell below to perform forward
diffusion T (or 150) times on the first image of our dataset.

x_t = data[0][0].to(device)  # start from the first image in the dataset
for t in range(T):
    noise = torch.randn_like(x_t)
    x_t = torch.sqrt(1 - B[t]) * x_t + torch.sqrt(B[t]) * noise  # one forward step (reconstructed from the equation above)
Or in animated form:
In [5]: Image(open(gif_name,'rb').read())
Out[5]:
Thanks to the power of recursion, we can estimate what $\mathbf{x}_t$ would look like
for any timestep, given our beta schedule $\beta_t$. A full breakdown of the math can
be found in Lilian Weng's blog. Let's bring back alpha, which is the complement of
$\beta$. We can define $\alpha_t$ as $1 - \beta_t$, and we can define $\bar{\alpha}_t$
as the cumulative product $\prod_{s=1}^{t}\alpha_s$.

Because of the bar symbol, let's call $\bar{\alpha}_t$ a_bar . Our new noisy
image distribution becomes:

$q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}} \cdot \mathbf{x}_{0},(1 - \bar{\alpha}_t) \cdot \mathbf{I})$
In [6]: a = 1. - B
a_bar = torch.cumprod(a, dim=0)
sqrt_a_bar = torch.sqrt(a_bar) # Mean Coefficient
sqrt_one_minus_a_bar = torch.sqrt(1 - a_bar) # St. Dev. Coefficient
We have all the pieces; let's code our forward diffusion sampling function q :

$q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}} \cdot \mathbf{x}_{0},(1 - \bar{\alpha}_t) \cdot \mathbf{I})$
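As a reference, here is a minimal sketch of q , assuming it returns both the
noised image and the sampled noise (matching the call x_t, _ = q(x_0, t_tensor)
below); the notebook's actual cell may differ:

def q(x_0, t):
    """Samples x_t from q(x_t | x_0) in a single jump using the closed form above."""
    t = t.int()
    noise = torch.randn_like(x_0)
    # Index the coefficients at t and reshape so they broadcast over (C, H, W)
    sqrt_a_bar_t = sqrt_a_bar[t, None, None, None]
    sqrt_one_minus_a_bar_t = sqrt_one_minus_a_bar[t, None, None, None]
    x_t = sqrt_a_bar_t * x_0 + sqrt_one_minus_a_bar_t * noise
    return x_t, noise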
Let's test out this new method compared to our old method of recursively
generating the images.
x_0 = data[0][0].to(device)  # the same first image as before
xs = []                      # keep frames for the animation below
plt.figure(figsize=(8, 8))
for t in range(T):
    t_tensor = torch.Tensor([t]).type(torch.int64)
    x_t, _ = q(x_0, t_tensor)
    img = torch.squeeze(x_t).cpu()
    xs.append(img)
    ax = plt.subplot(nrows, ncols, t + 1)
    ax.axis('off')
    other_utils.show_tensor_image(x_t)
plt.savefig("forward_diffusion_skip.png", bbox_inches='tight')
In [10]: Image(open(gif_name,'rb').read())
Out[10]:
Compared to the previous technique, can you see any differences? When noise
is added sequentially, there is a smaller difference between the images of
consecutive timesteps. Despite this, the neural network will do a good job
separating the noise from the original image in the reverse diffusion process.
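One way to convince ourselves the two procedures agree (a hypothetical check, not
part of the original notebook) is to run both out to the final timestep and compare
summary statistics; by then, both images should be close to standard normal noise.

# Sequential forward diffusion: T small steps
x_0 = data[0][0].to(device)
x_seq = x_0.clone()
for t in range(T):
    x_seq = torch.sqrt(1 - B[t]) * x_seq + torch.sqrt(B[t]) * torch.randn_like(x_seq)

# Closed-form forward diffusion: one jump to the last timestep
x_skip, _ = q(x_0, torch.tensor([T - 1], device=device))

print(x_seq.mean().item(), x_seq.std().item())  # near 0 and 1
print(x_skip.mean().item(), x_skip.std().item())  # similar scale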
We will add this time embedding block to each UpBlock of our U-Net,
resulting in the following architecture.
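The embedding block itself is not shown here. A minimal sketch of one possible
EmbedBlock , assuming it projects a (batch,) tensor of timesteps to a
(batch, emb_dim, 1, 1) tensor that can be added to a feature map:

import torch.nn as nn

class EmbedBlock(nn.Module):
    def __init__(self, input_dim, emb_dim):
        super().__init__()
        self.input_dim = input_dim
        self.model = nn.Sequential(
            nn.Linear(input_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
            nn.ReLU(),
            nn.Unflatten(1, (emb_dim, 1, 1))  # reshape so it broadcasts over H and W
        )

    def forward(self, t):
        t = t.view(-1, self.input_dim)  # (batch,) -> (batch, input_dim)
        return self.model(t)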
TODO: Our DownBlock is the same as before. Using the above image as a
reference, can you replace the FIXMEs with the correct variable? Each FIXME
can be one of:
in_chs
out_chs
kernel_size
stride
padding
class DownBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        super().__init__()
        kernel_size = 3  # values assumed; 3 / 1 / 1 gives same-size convolutions
        stride = 1
        padding = 1
        layers = [
            nn.Conv2d(FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU(),
            nn.Conv2d(FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU(),
            nn.MaxPool2d(2)
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
class DownBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        super().__init__()
        kernel_size = 3  # values assumed; 3 / 1 / 1 gives same-size convolutions
        stride = 1
        padding = 1
        layers = [
            nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding),
            nn.BatchNorm2d(out_chs),
            nn.ReLU(),
            nn.Conv2d(out_chs, out_chs, kernel_size, stride, padding),
            nn.BatchNorm2d(out_chs),
            nn.ReLU(),
            nn.MaxPool2d(2)
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
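As a quick sanity check (with hypothetical channel values), a DownBlock should
halve the spatial dimensions while changing the channel count:

block = DownBlock(16, 32).to(device)
x = torch.randn(1, 16, 16, 16, device=device)  # (batch, channels, height, width)
print(block(x).shape)                          # torch.Size([1, 32, 8, 8])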
The UpBlock follows a similar logic, but instead uses Transposed Convolution.
TODO: Can you replace the FIXMEs with the correct variable? Each FIXME can
be one of:
in_chs
out_chs
kernel_size
stride
padding
strideT
out_paddingT
x
skip
class UpBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        # Convolution variables (values assumed to match the DownBlock)
        kernel_size = 3
        stride = 1
        padding = 1
        # Transpose variables
        strideT = 2
        out_paddingT = 1

        super().__init__()
        # 2 * in_chs for concatenated skip connection
        layers = [
            nn.ConvTranspose2d(FIXME, FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU(),
            nn.Conv2d(FIXME, FIXME, FIXME, FIXME, FIXME),
            nn.BatchNorm2d(FIXME),
            nn.ReLU()
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = torch.cat((x, skip), 1)
        return self.model(x)
class UpBlock(nn.Module):
    def __init__(self, in_chs, out_chs):
        # Convolution variables (values assumed to match the DownBlock)
        kernel_size = 3
        stride = 1
        padding = 1
        # Transpose variables
        strideT = 2
        out_paddingT = 1

        super().__init__()
        # 2 * in_chs for concatenated skip connection
        layers = [
            nn.ConvTranspose2d(2 * in_chs, out_chs, kernel_size, strideT, padding, out_paddingT),
            nn.BatchNorm2d(out_chs),
            nn.ReLU(),
            nn.Conv2d(out_chs, out_chs, kernel_size, stride, padding),
            nn.BatchNorm2d(out_chs),
            nn.ReLU()
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = torch.cat((x, skip), 1)
        return self.model(x)
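Again as a hypothetical shape check: an UpBlock takes the current feature map
plus a same-shaped skip connection and doubles the spatial dimensions:

block = UpBlock(32, 16).to(device)
x = torch.randn(1, 32, 8, 8, device=device)     # current feature map
skip = torch.randn(1, 32, 8, 8, device=device)  # from the matching DownBlock
print(block(x, skip).shape)                     # torch.Size([1, 16, 16, 16])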
The final U-Net is similar to what we used in the first lab. The difference is that we
now have a time embedding connected to our UpBlocks.

TODO: While the time embeddings have been integrated into the model, there
are still a number of FIXMEs to replace. This time, the image channels, up
channels, and down channels need fixing. Can you work down and up the U-Net
to set the correct number of channels at each step? Each FIXME can be one of:
img_chs
A value in down_chs
A value in up_chs
# Inside the UNet __init__ (excerpt)
# Initial convolution
self.down0 = nn.Sequential(
    nn.Conv2d(FIXME, down_chs[0], 3, padding=1),
    nn.BatchNorm2d(FIXME),
    nn.ReLU()
)

# Downsample
self.down1 = DownBlock(down_chs[0], down_chs[1])
self.down2 = DownBlock(FIXME, FIXME)
self.to_vec = nn.Sequential(nn.Flatten(), nn.ReLU())

# Embeddings
self.dense_emb = nn.Sequential(
    nn.Linear(FIXME*latent_image_size**2, down_chs[1]),
    nn.ReLU(),
    nn.Linear(down_chs[1], FIXME),
    nn.ReLU(),
    nn.Linear(down_chs[1], down_chs[2]*latent_image_size**2),
    nn.ReLU()
)
self.temb_1 = EmbedBlock(t_dim, up_chs[0])  # New
self.temb_2 = EmbedBlock(t_dim, up_chs[1])  # New

# Upsample
self.up0 = nn.Sequential(
    nn.Unflatten(1, (FIXME, latent_image_size, latent_image_size)),
    nn.Conv2d(FIXME, up_chs[0], 3, padding=1),
    nn.BatchNorm2d(up_chs[0]),
    nn.ReLU(),
)
self.up1 = UpBlock(up_chs[0], up_chs[1])
self.up2 = UpBlock(FIXME, FIXME)

# Inside the UNet forward (excerpt)
latent_vec = self.dense_emb(latent_vec)

# New
t = t.float() / T  # Convert from [0, T] to [0, 1]
temb_1 = self.temb_1(t)
temb_2 = self.temb_2(t)

up0 = self.up0(latent_vec)
up1 = self.up1(up0+temb_1, down2)
up2 = self.up2(up1+temb_2, down1)
return self.out(up2)
# Inside the UNet __init__ (excerpt)
# Initial convolution
self.down0 = nn.Sequential(
    nn.Conv2d(img_chs, down_chs[0], 3, padding=1),
    nn.BatchNorm2d(down_chs[0]),
    nn.ReLU()
)

# Downsample
self.down1 = DownBlock(down_chs[0], down_chs[1])
self.down2 = DownBlock(down_chs[1], down_chs[2])
self.to_vec = nn.Sequential(nn.Flatten(), nn.ReLU())

# Embeddings
self.dense_emb = nn.Sequential(
    nn.Linear(down_chs[2]*latent_image_size**2, down_chs[1]),
    nn.ReLU(),
    nn.Linear(down_chs[1], down_chs[1]),
    nn.ReLU(),
    nn.Linear(down_chs[1], down_chs[2]*latent_image_size**2),
    nn.ReLU()
)
self.temb_1 = EmbedBlock(t_dim, up_chs[0])  # New
self.temb_2 = EmbedBlock(t_dim, up_chs[1])  # New

# Upsample
self.up0 = nn.Sequential(
    nn.Unflatten(1, (up_chs[0], latent_image_size, latent_image_size)),
    nn.Conv2d(up_chs[0], up_chs[0], 3, padding=1),
    nn.BatchNorm2d(up_chs[0]),
    nn.ReLU(),
)
self.up1 = UpBlock(up_chs[0], up_chs[1])
self.up2 = UpBlock(up_chs[1], up_chs[2])

# Inside the UNet forward (excerpt)
# New
t = t.float() / T  # Convert from [0, T] to [0, 1]

latent_vec = self.dense_emb(latent_vec)
temb_1 = self.temb_1(t)
temb_2 = self.temb_2(t)

up0 = self.up0(latent_vec)
up1 = self.up1(up0+temb_1, down2)
up2 = self.up2(up1+temb_2, down1)
return self.out(up2)
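With the class complete, we can instantiate the model and count its parameters.
The constructor call below is an assumption, since the full class signature is
not shown here:

model = UNet()  # assuming the constructor defaults defined earlier in the notebook
print("Num params:", sum(p.numel() for p in model.parameters()))
model = model.to(device)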
This time, we'll compare the real noise that was added to the image and the
predicted noise. Lilian Weng goes into the math in this blog post. Originally, the
loss function was based on the Evidence Lower Bound (ELBO) Log-Likelihood,
but it was found in the Denoising Diffusion Probabilistic Models Paper that the
Mean Squared Error between the predicted noise and true noise was better in
practice. If curious, Lilian Weng walks through the derivation here.
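The loss cell itself is not visible here. A minimal sketch of such an MSE
noise-prediction loss (the helper name get_loss and its signature are
assumptions):

import torch.nn.functional as F

def get_loss(model, x_0, t):
    # Noise the clean batch to timestep t, then compare predicted vs. true noise
    x_noisy, noise = q(x_0, t)
    noise_pred = model(x_noisy, t)
    return F.mse_loss(noise, noise_pred)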
For the reverse process, we want the posterior distribution of the slightly less
noisy image $\mathbf{x}_{t-1}$ given $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$:

$q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0}) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\mu}(\mathbf{x}_{t},\mathbf{x}_{0}), \tilde{\beta}_t \cdot \mathbf{I})$
Using Bayes' Theorem, we can derive an equation for the model mean u_t at
timestep t :

$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_t\right)$
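The reverse_q function below relies on two precomputed tensors that are defined
off-screen in the notebook. A sketch of how they follow from the equation above,
with names matched to their usage below:

sqrt_a_inv = torch.sqrt(1. / a)                     # 1 / sqrt(alpha_t)
pred_noise_coeff = (1 - a) / torch.sqrt(1 - a_bar)  # (1 - alpha_t) / sqrt(1 - a_bar_t)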
In [24]: @torch.no_grad()
def reverse_q(x_t, t, e_t):
    t = torch.squeeze(t[0].int())  # All t values should be the same
    pred_noise_coeff_t = pred_noise_coeff[t]
    sqrt_a_inv_t = sqrt_a_inv[t]
    u_t = sqrt_a_inv_t * (x_t - pred_noise_coeff_t * e_t)
    if t == 0:
        return u_t  # Reverse diffusion complete!
    else:
        B_t = B[t - 1]
        new_noise = torch.randn_like(x_t)
        return u_t + torch.sqrt(B_t) * new_noise
Let's create a function to iteratively remove noise from an image until it is
noise-free. Let's also display these images so we can see how the model is improving.
In [25]: @torch.no_grad()
def sample_images(ncols, figsize=(8,8)):
    plt.figure(figsize=figsize)
    plt.axis("off")
    hidden_rows = T / ncols  # show an intermediate image every hidden_rows steps
    x_t = torch.randn((1, IMG_CH, IMG_SIZE, IMG_SIZE), device=device)  # body reconstructed: start from pure noise
    for i in range(T - 1, -1, -1):
        t = torch.full((1,), i, device=device).float()
        x_t = reverse_q(x_t, t, model(x_t, t))  # one reverse step using the predicted noise
        if i % hidden_rows == 0:
            other_utils.show_tensor_image(x_t.cpu())
Time to train the model! How about it? Does it look like the model is learning?
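The optimizer and epoch count are defined in a cell not visible here; a
hypothetical setup for reference (the values are placeholders, not the
notebook's):

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # hypothetical settings
epochs = 3                                                 # hypothetical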
model.train()
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()
        x_0 = batch[0].to(device)
        t = torch.randint(0, T, (x_0.shape[0],), device=device).float()  # random timestep per image (reconstructed)
        loss = get_loss(model, x_0, t)
        loss.backward()
        optimizer.step()
Final sample:
If you squint your eyes, can you make out what the model is generating?
In [27]: model.eval()
figsize = (8, 8)  # Change me
ncols = 3  # Should evenly divide T
for _ in range(10):
    sample_images(ncols, figsize=figsize)
2.5 Next
The model is learning ... something. It looks a little pixelated. Why would that
be? Continue to the next notebook to find out more!