5. CLIP
Contrastive Language-Image Pre-Training, or CLIP, is a text and image encoding tool used by many popular Generative AI models such as DALL-E and Stable Diffusion.
CLIP is not itself a Generative AI model; instead, it is used to align text encodings with image encodings. If there is such a thing as the perfect text description of an image, the goal of CLIP is to produce the same vector embedding for both the image and the text. Let's see what this means in practice.
5.1 Encodings
First, let's load the libraries needed for this exercise.
# Visualization tools
import matplotlib.pyplot as plt
from PIL import Image
from torchvision.utils import save_image, make_grid
from textwrap import wrap
# Core libraries used later in this notebook (assumed imported here or in an earlier cell of the original lab)
import clip
import numpy as np
import torch
from torchvision import transforms
There are a few different variations of CLIP based on popular image recognition
neural networks:
In [2]: clip.available_models()
Out[2]: ['RN50',
'RN101',
'RN50x4',
'RN50x16',
'RN50x64',
'ViT-B/32',
'ViT-B/16',
'ViT-L/14',
'ViT-L/14@336px']
For this notebook, we will be using ViT-B/32 , which is based on the Vision
Transformer architecture. It has 512 features, which we will later feed into our
diffusion model.
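A minimal sketch of loading the model, assuming the openai clip package and a device variable as in the earlier notebooks (CLIP_FEATURES matches the 512 features mentioned above):

clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()
CLIP_FEATURES = 512

clip.load returns both the model and the preprocessing pipeline, which is why clip_preprocess is available in the next cell.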
In [4]: clip_preprocess
Out[4]: Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    <function _convert_image_to_rgb at 0x7f89f0485510>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
We can test this on one of our flower photos. Let's start with a picturesque daisy.
img_path = DATA_DIR + "daisy/2877860110_a842f8b14a_m.jpg"  # the daisy from the img_paths list below
img = Image.open(img_path)
img.show()
We can find the CLIP embedding by first transforming our image with
clip_preprocess and converting the result to a tensor. Since the
clip_model expects a batch of images, we can use np.stack to turn the
processed image into a single element batch.
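The encoding cell itself is not shown in this export; a minimal sketch of what it likely looks like, assuming clip_model and device are defined as above:

processed_img = clip_preprocess(img)
clip_imgs = torch.tensor(np.stack([processed_img])).to(device)  # batch of one
clip_img_encoding = clip_model.encode_image(clip_imgs)
clip_img_encoding.shape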
torch.Size([1, 512])
In [8]: text_list = [
    "A round white daisy with a yellow center",
    "An orange sunflower with a big brown center",
    "A red rose bud"
]

text_tokens = clip.tokenize(text_list).to(device)
text_tokens

Out[8]: tensor([[49406,   320,  2522,  1579, 12865,   593,   320,  4481,  2119, 49407,
                 0,     0,     0,  ...,     0,     0,     0],
        [49406,   550,  4287, 21559,   593,   320,  1205,  2866,  2119, 49407,
                 0,     0,     0,  ...,     0,     0,     0],
        [49406,   320,   736,  3568, 10737, 49407,     0,     0,     0,     0,
                 0,     0,     0,  ...,     0,     0,     0]],
       device='cuda:0', dtype=torch.int32)
Then, we can pass the tokens to encode_text to get our text encodings. Uncomment clip_text_encodings if you would like to see what an encoding looks like. Just like our image encoding, there are 512 features for each of our 3 text descriptions.
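A sketch of that step, assuming the same clip_model:

clip_text_encodings = clip_model.encode_text(text_tokens).float()
# clip_text_encodings  # uncomment to see what an encoding looks like
clip_text_encodings.shape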
torch.Size([3, 512])
5.1.3 Similarity
To see which of our text descriptions best describes the daisy, we can calculate the cosine similarity between the text encodings and the image encoding. A cosine similarity of 1 is a perfect match; a cosine similarity of -1 means the two encodings are opposites.
The cosine similarity is equivalent to the dot product of the two vectors after each has been normalized by its magnitude. In other words, the magnitude of each vector becomes 1.
$X \cdot Y = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$
$\text{similarity}(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$
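The comparison cell is not shown in this export; a minimal sketch, assuming the encodings computed above (the names img_enc and txt_enc are ours):

# Normalize both encodings so a plain dot product equals the cosine similarity
img_enc = clip_img_encoding.float()
img_enc = img_enc / img_enc.norm(dim=-1, keepdim=True)
txt_enc = clip_text_encodings / clip_text_encodings.norm(dim=-1, keepdim=True)
similarity = txt_enc @ img_enc.T  # shape [3, 1]: one score per description
similarity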
What do you think? Does the most descriptive text get the highest score?
Let's practice a little more. Below, we've added a sunflower and a rose image.
In [12]: img_paths = [
DATA_DIR + "daisy/2877860110_a842f8b14a_m.jpg",
DATA_DIR + "sunflowers/2721638730_34a9b7a78b.jpg",
DATA_DIR + "roses/8032328803_30afac8b07_m.jpg"
]
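The display cell is omitted in this export; a minimal sketch of loading the images into the imgs list used below:

imgs = [Image.open(path) for path in img_paths]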
(The daisy, sunflower, and rose images are displayed here.)
TODO: Find text that describes the above images well and will result in a high similarity score. After calculating the similarity scores, feel free to repeat this exercise with modified descriptions. We will be using this text list again later.
In [17]: text_list = [
"A daisy",
"A sunflower",
"A rose"
]
text_list = [
"A round white daisy with a yellow center",
"An orange sunflower with a big brown center",
"A deep red rose flower"
]
n_imgs = len(imgs)
n_text = len(text_list)
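The cell that builds repeated_clip_text_encodings is truncated in this export. A minimal sketch, assuming we re-encode text_list and repeat each text encoding once per image so every description can be scored against every image (the repeat layout is a guess):

text_tokens = clip.tokenize(text_list).to(device)
clip_text_encodings = clip_model.encode_text(text_tokens).float()
repeated_clip_text_encodings = clip_text_encodings.repeat(n_imgs, 1)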
repeated_clip_text_encodings
Let's compare. Ideally, the diagonal from the top left to the bottom right should be bright yellow, corresponding to high similarity values. The rest of the values should be low and blue.
ax = fig.add_subplot(gs[1, :])  # fig and gs (a GridSpec) come from earlier in the original cell
plt.imshow(similarity.detach().cpu().numpy().T, vmin=0.1, vmax=0.3)
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[x, y]:.2f}", ha="center", va="center")
If the goal of CLIP is to align text encodings with image encodings, do we need a
text description for each of the images in our dataset? Hypothesis: we do not
need text descriptions and only need the image CLIP encodings to create a text-
to-image pipeline.
To test this out, let's add the CLIP image encodings as the "label" for each image in our dataset. Running CLIP on each batch of data-augmented images would be more accurate, but it is also slower. We can speed things up by computing and storing the encodings ahead of time.
Out[24]: ['data/cropped_flowers/sunflowers/3062794421_295f8c2c4e.jpg',
'data/cropped_flowers/sunflowers/5076821914_c21b58fd4c_m.jpg',
'data/cropped_flowers/sunflowers/5994569021_749d5e2da3_n.jpg',
'data/cropped_flowers/sunflowers/24459750_eb49f6e4cb_m.jpg',
'data/cropped_flowers/sunflowers/4814106562_7c3564d2d9_n.jpg']
The next code block runs the following loop for each filepath:
Open the image associated with the path and store it in img
Preprocess the image, find the CLIP encoding, and store it in clip_img
Convert the CLIP encoding from a tensor to a Python list
Store the filepath and the CLIP encoding as a row in a csv file
It may take a few seconds to process the full dataset. When complete, open
clip.csv to see the results.
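A minimal sketch of that loop, assuming img_paths now holds every filepath in the dataset (the Out[24] listing above shows a slice of it):

import csv

with open("clip.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path in img_paths:
        img = Image.open(path)
        with torch.no_grad():
            clip_img = clip_model.encode_image(clip_preprocess(img).unsqueeze(0).to(device))
        # One row per image: the filepath followed by 512 encoding values
        writer.writerow([path] + clip_img.flatten().tolist())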
We can use the same image transformations as we did in the previous notebook:
pre_transforms = [
transforms.Resize(IMG_SIZE),
transforms.ToTensor(), # Scales data into [0,1]
transforms.Lambda(lambda t: (t * 2) - 1) # Scale between [-1, 1]
]
pre_transforms = transforms.Compose(pre_transforms)
random_transforms = [
transforms.RandomCrop(IMG_SIZE),
transforms.RandomHorizontalFlip(),
]
random_transforms = transforms.Compose(random_transforms)
def __len__(self):
return len(self.imgs)
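For context, a minimal sketch of the dataset class this __len__ likely belongs to; the class name, CSV parsing, and transform order are assumptions:

import csv
from torch.utils.data import Dataset

class ClipFlowerDataset(Dataset):  # hypothetical name
    def __init__(self, csv_path):
        self.imgs = []
        self.labels = []
        with open(csv_path) as f:
            for row in csv.reader(f):
                self.imgs.append(row[0])  # filepath
                # The stored CLIP encoding becomes the "label"
                self.labels.append(torch.tensor([float(v) for v in row[1:]]))

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        img = pre_transforms(Image.open(self.imgs[idx]))
        return random_transforms(img), self.labels[idx]

Each batch then yields (x, c) pairs, matching the training loop later in this notebook.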
The U-Net model is the same architecture as last time, but with one small
difference. Instead of using the number of classes as our c_embed_dim , we will
use the number of CLIP_FEATURES . Last time, c might have stood for "class",
but this time, it stands for "context". Thankfully, they both start with c , so we
do not need to refactor the code to reflect this change in intention.
In [29]: T = 400
B_start = 0.0001
B_end = 0.02
B = torch.linspace(B_start, B_end, T).to(device)
The get_context_mask function will change a little bit. Since we're replacing categorical labels with continuous CLIP encodings, we no longer need to one-hot encode our context; instead, we randomly drop the encoding during training so the model also learns to generate without context.
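A minimal sketch of the updated function; whether the mask drops whole vectors or individual features is an assumption here:

def get_context_mask(c, drop_prob):
    # With probability drop_prob, zero out a sample's entire CLIP encoding
    keep = torch.bernoulli(torch.ones(c.shape[0], 1, device=device) - drop_prob)
    return keep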
Let's also recreate the sample_flowers function. This time, it will take our
text_list as a parameter and convert it to a CLIP encoding. The sample_w
function remains mostly the same and has been moved to the bottom of
ddpm_utils.py.
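A minimal sketch of the recreated function, assuming sample_w takes the model and the context encodings (its exact signature lives in ddpm_utils.py):

@torch.no_grad()
def sample_flowers(text_list):
    # Convert the text descriptions to CLIP encodings to use as context
    text_tokens = clip.tokenize(text_list).to(device)
    c = clip_model.encode_text(text_tokens).float()
    x_gen, x_gen_store = sample_w(model, c)
    return x_gen, x_gen_store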
Time to get training! After about 50 epochs, the model will start generating something recognizable, and at 100 it will hit its stride. What do you think? Do the generated images match your descriptions?
In [32]: epochs = 100
c_drop_prob = 0.1
lrate = 1e-4
save_dir = "05_images/"

# optimizer, dataloader, model, and model_flowers are defined in earlier cells of the original lab
model.train()
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()
        t = torch.randint(0, T, (BATCH_SIZE,), device=device).float()
        x, c = batch
        c_mask = get_context_mask(c, c_drop_prob)
        loss = ddpm.get_loss(model_flowers, x, t, c, c_mask)
        loss.backward()
        optimizer.step()
Now that the model is trained, let's play with it! What happens when we give it a
prompt of something not in the dataset? Or can you craft the perfect prompt to
generate an image you can imagine?
The art of crafting a prompt to get the results you desire is called prompt engineering, and, as shown here, it depends on the kind of data the model was trained on.
In [36]: # Change me
#text_list = [
# "A daisy",
# "A sunflower",
# "A rose"
#]
text_list = [
"A round green daisy with a black center",
"An orange sunflower with a big brown center",
"A deep red rose flower"
]
model.eval()
x_gen, x_gen_store = sample_flowers(text_list)
grid = make_grid(x_gen.cpu(), nrow=len(text_list))
other_utils.show_tensor_image([grid])
plt.show()
Once you've found a set of images you enjoy, run the cell below to turn the sampling process into an animation. It will be saved to 05_images/flowers.gif.
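The animation cell is not included in this export; a minimal sketch, assuming x_gen_store holds the intermediate samples from sample_flowers and that matplotlib's pillow writer is available:

from matplotlib.animation import FuncAnimation

fig = plt.figure()

def animate(i):
    # Redraw the image grid for the i-th stored sampling step
    plt.clf()
    grid = make_grid(torch.as_tensor(x_gen_store[i]), nrow=len(text_list))
    other_utils.show_tensor_image([grid])

anim = FuncAnimation(fig, animate, frames=len(x_gen_store))
anim.save(save_dir + "flowers.gif", writer="pillow")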
5.3 Next
Congratulations on making it to the end of the course! We hope the journey was enjoyable and that you were able to generate something worthy of sharing with friends and family.
Ready to put your skills to the test? Head on over to the assessment to earn a certificate!