3. Optimizations
Currently, the model is experiencing the checkerboard problem.

Thankfully, we have a few tricks up our generated T-shirt sleeve to resolve this
and generally improve the performance of the model.

Learning Objectives
The goals of this notebook are to:

Implement Group Normalization
Implement GELU
Implement Rearrange Pooling
Implement Sinusoidal Position Embeddings
Define a reverse diffusion function to emulate p
Attempt to generate articles of clothing (again)

Like before, let's use FashionMNIST to experiment:

In [1]: import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.optim import Adam

# Visualization tools
import matplotlib.pyplot as plt
from torchview import draw_graph
import graphviz
from IPython.display import Image

# User defined libraries
from utils import other_utils
from utils import ddpm_utils

IMG_SIZE = 16
IMG_CH = 1
BATCH_SIZE = 128
data, dataloader = other_utils.load_transformed_fashionMNIST(IMG_SIZE, BATCH_SIZE)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 26421880/26421880 [00:01<00:00, 13882102.94it/s]
Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 29515/29515 [00:00<00:00, 329948.99it/s]
Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 4422102/4422102 [00:00<00:00, 6062736.92it/s]
Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 5148/5148 [00:00<00:00, 12205922.55it/s]
Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

We have created a ddpm_utils.py file with a DDPM class to group our diffusion functions. Let's use it to set up the same beta schedule we used previously.

In [2]: nrows = 10
ncols = 15

T = nrows * ncols
B_start = 0.0001
B_end = 0.02
B = torch.linspace(B_start, B_end, T).to(device)
ddpm = ddpm_utils.DDPM(B, device)
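
As a quick reminder from the previous notebook (and presumably what the DDPM class precomputes internally), the linear schedule interpolates $\beta_t$ from B_start to B_end, and the forward process can then jump directly to any timestep:

$$\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s), \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$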

3.1 Group Normalization and GELU


The first improvement we will look at is optimizing our standard convolution
process. We will be reusing this block many times throughout our neural
network, so it is an important piece to get right.

3.1.1 Group Normalization


Batch Normalization converts the output of each kernel channel to a z-score. It
does this by calculating the mean and standard deviation across a batch of
inputs. This is ineffective if the batch size is small.

On the other hand, Group Normalization normalizes the output of a group of kernels for each sample image, effectively "grouping" a set of features.

Considering color images have multiple color channels, this can have an
interesting impact on the output colors of generated images. Try experimenting
to see the effect!

Learn more about normalization techniques in this blog post by Aakash Bindal.
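
As a quick illustration (a standalone sketch, not part of the model, using the torch and nn imports from above), GroupNorm's statistics are computed per sample, so a sample's normalized output does not change when other images are added to the batch, unlike BatchNorm in training mode:

x = torch.randn(1, 8, 16, 16)                      # one 8-channel sample
group_norm = nn.GroupNorm(4, 8)                    # 4 groups of 2 channels each
batch_norm = nn.BatchNorm2d(8)                     # normalizes across the batch

alone = group_norm(x)
in_a_batch = group_norm(torch.cat([x, 5 * x + 3]))[0:1]
print(torch.allclose(alone, in_a_batch))           # True: per-sample statistics

alone = batch_norm(x)
in_a_batch = batch_norm(torch.cat([x, 5 * x + 3]))[0:1]
print(torch.allclose(alone, in_a_batch))           # False: statistics depend on the batch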

3.1.2 GELU
ReLU is a popular choice for an activation function because it is computationally
quick and easy to calculate the gradient for. Unfortunately, it isn't perfect. When
the bias term becomes largely negative, a ReLU neuron "dies" because both its
output and gradient are zero.

At a slight cost in computational power, GELU seeks to rectify the rectified linear
unit by mimicking the shape of the ReLU function while avoiding a zero gradient.

In this small example with FashionMNIST, it is unlikely we will see any dead
neurons. However, the larger a model gets, the more likely it can face the dying
ReLU phenomenon.
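
To make the "dead neuron" idea concrete, here is a small sketch (not part of the model) comparing the gradient each activation passes back for negative pre-activations:

z = torch.tensor([-3.0, -2.0, -1.0], requires_grad=True)
F.relu(z).sum().backward()
print(z.grad)   # tensor([0., 0., 0.]) -- no signal to learn from

z = torch.tensor([-3.0, -2.0, -1.0], requires_grad=True)
F.gelu(z).sum().backward()
print(z.grad)   # small but nonzero gradients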

In [3]: class GELUConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, group_size):
        super().__init__()
        layers = [
            nn.Conv2d(in_ch, out_ch, 3, 1, 1),
            nn.GroupNorm(group_size, out_ch),
            nn.GELU()
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

3.2 Rearrange Pooling

In the previous notebook, we used Max Pooling to halve the size of our latent image, but is that the best technique? There are many types of pooling layers, including Min Pooling and Average Pooling. How about we let the neural network decide what is important?

Enter the einops library and the Rearrange layer. We can assign each dimension a variable and use those variables to rearrange our values. Additionally, we can use parentheses () to identify a set of variables that are multiplied together.

For example, in the code block below, we have:

Rearrange("b c (h p1) (w p2) -> b (c p1 p2) h w", p1=2, p2=2)

b is our batch dimension
c is our channel dimension
h is our height dimension
w is our width dimension

We also have p1 and p2 values that are both equal to 2 . The left portion of the expression before the arrow says "split the height and width dimensions in half." The right portion after the arrow says "stack the split dimensions along the channel dimension."

The code block below sets up a test_image to practice on. Try swapping h
with p1 on the left side of the arrow. What happens? How about when w and
p2 are swapped? What happens when p1 is set to 3 instead of 2 ?

In [4]: from einops.layers.torch import Rearrange

rearrange = Rearrange("b c (h p1) (w p2) -> b (c p1 p2) h w", p1=2, p2=2)

test_image = [
    [
        [
            [1, 2, 3, 4, 5, 6],
            [7, 8, 9, 10, 11, 12],
            [13, 14, 15, 16, 17, 18],
            [19, 20, 21, 22, 23, 24],
            [25, 26, 27, 28, 29, 30],
            [31, 32, 33, 34, 35, 36],
        ]
    ]
]
test_image = torch.tensor(test_image)
print(test_image)
output = rearrange(test_image)
output

tensor([[[[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24],
[25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36]]]])
Out[4]: tensor([[[[ 1, 3, 5],
[13, 15, 17],
[25, 27, 29]],

[[ 2, 4, 6],
[14, 16, 18],
[26, 28, 30]],

[[ 7, 9, 11],
[19, 21, 23],
[31, 33, 35]],

[[ 8, 10, 12],
[20, 22, 24],
[32, 34, 36]]]])
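
Note that nothing was discarded: the 1 × 1 × 6 × 6 input became a 1 × 4 × 3 × 3 output, so the spatial dimensions were halved while the channel dimension grew by p1 * p2 = 4.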

Next, we can pass this through our GELUConvBlock to let the neural network
decide how it wants to weigh the values within our "pool". Notice the
4*in_chs as a parameter of the GELUConvBlock ? This is because the
channel dimension is now p1 * p2 larger.

In [5]: class RearrangePoolBlock(nn.Module):
    def __init__(self, in_chs, group_size):
        super().__init__()
        self.rearrange = Rearrange("b c (h p1) (w p2) -> b (c p1 p2) h w", p1=2, p2=2)
        self.conv = GELUConvBlock(4 * in_chs, in_chs, group_size)

    def forward(self, x):
        x = self.rearrange(x)
        return self.conv(x)
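
As a quick sanity check (a sketch with made-up sizes), the block should halve the spatial dimensions while keeping the channel count, since the convolution maps the 4 * in_chs stacked channels back down to in_chs:

pool = RearrangePoolBlock(in_chs=8, group_size=4)
x = torch.randn(2, 8, 16, 16)
print(pool(x).shape)   # torch.Size([2, 8, 8, 8])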

We now have the components to redefine our DownBlocks and UpBlocks. Multiple GELUConvBlocks have been added to help combat the checkerboard problem.


In [6]: class DownBlock(nn.Module):
    def __init__(self, in_chs, out_chs, group_size):
        super(DownBlock, self).__init__()
        layers = [
            GELUConvBlock(in_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size),
            RearrangePoolBlock(out_chs, group_size)
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

TODO: There's an input to the UpBlock that separates it from the DownBlock. What was it again?

If needed, click the ... below for the correct answer.

In [7]: class UpBlock(nn.Module):
    def __init__(self, in_chs, out_chs, group_size):
        super(UpBlock, self).__init__()
        layers = [
            nn.ConvTranspose2d(2 * in_chs, out_chs, 2, 2),
            GELUConvBlock(out_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size)
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = torch.cat((x, skip), 1)
        x = self.model(x)
        return x

In [18]: class UpBlock(nn.Module):
    def __init__(self, in_chs, out_chs, group_size):
        super(UpBlock, self).__init__()
        layers = [
            nn.ConvTranspose2d(2 * in_chs, out_chs, 2, 2),
            GELUConvBlock(out_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size),
            GELUConvBlock(out_chs, out_chs, group_size),
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = torch.cat((x, skip), 1)
        x = self.model(x)
        return x


3.3 Time Embeddings


The better the model understands the timestep it is in for the reverse diffusion
process, the better it will be able to correctly identify the added noise. In the
previous notebook, we created an embedding for t/T . Can we help the model
interpret this better?

Before diffusion models, this was a problem that plagued natural language
processing. For long dialogues, how can we capture where we are? The goal was
to find a way to uniquely represent a large range of discrete numbers with a
small number of continuous numbers. Using a single float is ineffective since the
neural network will interpret timesteps as continuous rather than discrete.
Researchers ultimately settled on a sum of sines and cosines.

For an excellent explanation for why this works and how this technique was likely
developed, please refer to Jonathan Kernes' Master Positional Encoding.
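
Concretely, the classic formulation from the Transformer paper maps a timestep t to a d-dimensional vector of sines and cosines at geometrically spaced frequencies (the block below implements a close variant of this):

$$PE(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$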

In [8]: import math

class SinusoidalPositionEmbedBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

TODO: We will feed the output of the SinusoidalPositionEmbedBlock into our EmbedBlock. Thankfully, our EmbedBlock remains unchanged from before.

It looks like the one below has been overrun with FIXMEs. Can you remember how it was supposed to look?

If needed, click the ... below for the correct answer.

In [19]: class EmbedBlock(nn.Module):
    def __init__(self, input_dim, emb_dim):
        super(EmbedBlock, self).__init__()
        self.input_dim = input_dim
        layers = [
            nn.Linear(input_dim, FIXME),
            nn.GELU(),
            nn.Linear(emb_dim, FIXME),
            nn.Unflatten(1, (FIXME, 1, 1))
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        x = x.view(-1, self.input_dim)
        return self.model(x)

In [9]: class EmbedBlock(nn.Module):
    def __init__(self, input_dim, emb_dim):
        super(EmbedBlock, self).__init__()
        self.input_dim = input_dim
        layers = [
            nn.Linear(input_dim, emb_dim),
            nn.GELU(),
            nn.Linear(emb_dim, emb_dim),
            nn.Unflatten(1, (emb_dim, 1, 1))
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        x = x.view(-1, self.input_dim)
        return self.model(x)
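
As a quick shape check (a sketch, with a hypothetical batch of four timesteps), the two blocks chain together like this: the sinusoidal block turns each timestep into a t_dim-dimensional vector, and EmbedBlock projects it to a [batch, emb_dim, 1, 1] tensor that can be broadcast-added to a feature map:

t = torch.arange(4).float() / T             # four normalized timesteps
sin_emb = SinusoidalPositionEmbedBlock(8)   # t_dim = 8, matching the UNet below
t_emb = sin_emb(t)
print(t_emb.shape)                          # torch.Size([4, 8])

embed = EmbedBlock(8, 64)                   # project to 64 channels
print(embed(t_emb).shape)                   # torch.Size([4, 64, 1, 1])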

3.4 Residual Connections


The last trick to eliminate the checkerboard problem is to add more residual or
skip connections. We can create a ResidualConvBlock for our initial
convolution. We could add residual connections in other places as well, such as
within our "DownBlocks" and "UpBlocks".

In [10]: class ResidualConvBlock(nn.Module):
    def __init__(self, in_chs, out_chs, group_size):
        super().__init__()
        self.conv1 = GELUConvBlock(in_chs, out_chs, group_size)
        self.conv2 = GELUConvBlock(out_chs, out_chs, group_size)

    def forward(self, x):
        x1 = self.conv1(x)
        x2 = self.conv2(x1)
        out = x1 + x2
        return out

Below is the updated model. Notice the change at the very last line? Another
skip connection has been added from the output of our ResidualConvBlock
to the final self.out block. This connection is surprisingly powerful, and of all
the changes listed above, had the biggest influence on the checkerboard
problem for this dataset.


TODO: A couple of new variables have been added: small_group_size and big_group_size for group normalization. They are both dependent on the variable group_size_base. Set group_size_base to either 3 , 4 , 5 , 6 , or 7 . One of these values is correct and the rest will result in an error.

Hint: The group sizes and down_chs are related.
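
If you'd like to check a candidate before running the full model, here is a small sketch: nn.GroupNorm requires the number of channels to be divisible by the number of groups, and the channel counts in this model come from down_chs (64 and 128). Running it reveals which candidates are valid.

for group_size_base in [3, 4, 5, 6, 7]:
    small, big = 2 * group_size_base, 8 * group_size_base
    ok = all(ch % g == 0 for ch in (64, 128) for g in (small, big))
    print(group_size_base, "works" if ok else "fails")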

If needed, click the ... below for the correct answer.

In [11]: class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        img_chs = IMG_CH
        down_chs = (64, 64, 128)
        up_chs = down_chs[::-1]  # Reverse of the down channels
        latent_image_size = IMG_SIZE // 4  # 2 ** (len(down_chs) - 1)
        t_dim = 8
        group_size_base = FIXME
        small_group_size = 2 * group_size_base  # New
        big_group_size = 8 * group_size_base  # New

        # Initial convolution
        self.down0 = ResidualConvBlock(img_chs, down_chs[0], small_group_size)

        # Downsample
        self.down1 = DownBlock(down_chs[0], down_chs[1], big_group_size)
        self.down2 = DownBlock(down_chs[1], down_chs[2], big_group_size)
        self.to_vec = nn.Sequential(nn.Flatten(), nn.GELU())

        # Embeddings
        self.dense_emb = nn.Sequential(
            nn.Linear(down_chs[2]*latent_image_size**2, down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[2]*latent_image_size**2),
            nn.ReLU()
        )
        self.sinusoidaltime = SinusoidalPositionEmbedBlock(t_dim)  # New
        self.temb_1 = EmbedBlock(t_dim, up_chs[0])
        self.temb_2 = EmbedBlock(t_dim, up_chs[1])

        # Upsample
        self.up0 = nn.Sequential(
            nn.Unflatten(1, (up_chs[0], latent_image_size, latent_image_size)),
            GELUConvBlock(up_chs[0], up_chs[0], big_group_size)  # New
        )
        self.up1 = UpBlock(up_chs[0], up_chs[1], big_group_size)  # New
        self.up2 = UpBlock(up_chs[1], up_chs[2], big_group_size)  # New

        # Match output channels and one last concatenation
        self.out = nn.Sequential(
            nn.Conv2d(2 * up_chs[-1], up_chs[-1], 3, 1, 1),
            nn.GroupNorm(small_group_size, up_chs[-1]),  # New
            nn.ReLU(),
            nn.Conv2d(up_chs[-1], img_chs, 3, 1, 1)
        )

    def forward(self, x, t):
        down0 = self.down0(x)
        down1 = self.down1(down0)
        down2 = self.down2(down1)
        latent_vec = self.to_vec(down2)

        latent_vec = self.dense_emb(latent_vec)
        t = t.float() / T  # Convert from [0, T] to [0, 1]
        t = self.sinusoidaltime(t)  # New
        temb_1 = self.temb_1(t)
        temb_2 = self.temb_2(t)

        up0 = self.up0(latent_vec)
        up1 = self.up1(up0 + temb_1, down2)
        up2 = self.up2(up1 + temb_2, down1)
        return self.out(torch.cat((up2, down0), 1))  # New

In [12]: class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        img_chs = IMG_CH
        down_chs = (64, 64, 128)
        up_chs = down_chs[::-1]  # Reverse of the down channels
        latent_image_size = IMG_SIZE // 4  # 2 ** (len(down_chs) - 1)
        t_dim = 8
        group_size_base = 4
        small_group_size = 2 * group_size_base  # New
        big_group_size = 8 * group_size_base  # New

        # Initial convolution
        self.down0 = ResidualConvBlock(img_chs, down_chs[0], small_group_size)

        # Downsample
        self.down1 = DownBlock(down_chs[0], down_chs[1], big_group_size)
        self.down2 = DownBlock(down_chs[1], down_chs[2], big_group_size)
        self.to_vec = nn.Sequential(nn.Flatten(), nn.GELU())

        # Embeddings
        self.dense_emb = nn.Sequential(
            nn.Linear(down_chs[2]*latent_image_size**2, down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[1]),
            nn.ReLU(),
            nn.Linear(down_chs[1], down_chs[2]*latent_image_size**2),
            nn.ReLU()
        )
        self.sinusoidaltime = SinusoidalPositionEmbedBlock(t_dim)  # New
        self.temb_1 = EmbedBlock(t_dim, up_chs[0])
        self.temb_2 = EmbedBlock(t_dim, up_chs[1])

        # Upsample
        self.up0 = nn.Sequential(
            nn.Unflatten(1, (up_chs[0], latent_image_size, latent_image_size)),
            GELUConvBlock(up_chs[0], up_chs[0], big_group_size)  # New
        )
        self.up1 = UpBlock(up_chs[0], up_chs[1], big_group_size)  # New
        self.up2 = UpBlock(up_chs[1], up_chs[2], big_group_size)  # New

        # Match output channels and one last concatenation
        self.out = nn.Sequential(
            nn.Conv2d(2 * up_chs[-1], up_chs[-1], 3, 1, 1),
            nn.GroupNorm(small_group_size, up_chs[-1]),  # New
            nn.ReLU(),
            nn.Conv2d(up_chs[-1], img_chs, 3, 1, 1)
        )

    def forward(self, x, t):
        down0 = self.down0(x)
        down1 = self.down1(down0)
        down2 = self.down2(down1)
        latent_vec = self.to_vec(down2)

        latent_vec = self.dense_emb(latent_vec)
        t = t.float() / T  # Convert from [0, T] to [0, 1]
        t = self.sinusoidaltime(t)  # New
        temb_1 = self.temb_1(t)
        temb_2 = self.temb_2(t)

        up0 = self.up0(latent_vec)
        up1 = self.up1(up0 + temb_1, down2)
        up2 = self.up2(up1 + temb_2, down1)
        return self.out(torch.cat((up2, down0), 1))  # New

In [13]: model = UNet()
print("Num params: ", sum(p.numel() for p in model.parameters()))
model = torch.compile(model.to(device))

Num params: 1979777

Finally, it's time to train the model. Let's see if all these changes made a
difference.

In [ ]: optimizer = Adam(model.parameters(), lr=0.001)
epochs = 5

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()

        t = torch.randint(0, T, (BATCH_SIZE,), device=device).float()
        x = batch[0].to(device)
        loss = ddpm.get_loss(model, x, t)
        loss.backward()
        optimizer.step()

        if epoch % 1 == 0 and step % 100 == 0:
            print(f"Epoch {epoch} | step {step:03d} Loss: {loss.item()} ")
            ddpm.sample_images(model, IMG_CH, IMG_SIZE, ncols)

Epoch 0 | step 000 Loss: 0.07548007369041443

Epoch 0 | step 100 Loss: 0.08011850714683533

Epoch 0 | step 200 Loss: 0.07913320511579514

Epoch 0 | step 300 Loss: 0.09114592522382736

Epoch 0 | step 400 Loss: 0.07626697421073914


Epoch 0 | step 500 Loss: 0.08785144984722137

Epoch 1 | step 000 Loss: 0.08220058679580688

Epoch 1 | step 100 Loss: 0.08423619717359543

Epoch 1 | step 200 Loss: 0.0972428247332573

Epoch 1 | step 300 Loss: 0.08549884706735611


How about a closer look? Can you recognize a shoe, a purse, or a shirt?

In [15]: model.eval()
plt.figure(figsize=(8,8))
ncols = 3  # Should evenly divide T
for _ in range(10):
    ddpm.sample_images(model, IMG_CH, IMG_SIZE, ncols)

<Figure size 800x800 with 0 Axes>


3.5 Next
If you don't see a particular class such as a shoe or a shirt, try running the above
code block again. Currently, our model does not accept category input, so the
user can't define what kind of output they would like. Where's the fun in that?

In the next notebook, we will finally add a way for users to control the model!

In [ ]: import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
