5. CLIP
Contrastive Language-Image Pre-Training, or CLIP, is a text and image encoding tool used by many popular Generative AI models such as DALL-E and Stable Diffusion.
CLIP is not itself a Generative AI model; instead, it is used to align text encodings with image encodings. If there is such a thing as the perfect text description of an image, the goal of CLIP is to produce the same vector embedding for both the image and the text. Let's see what this means in practice.
5.1 Encodings
First, let's load the libraries needed for this exercise.
# Visualization tools
import matplotlib.pyplot as plt
from PIL import Image
from torchvision.utils import save_image, make_grid
from textwrap import wrap
# Core libraries used later in this notebook (assumed imported here or in an earlier cell of the original lab)
import clip
import numpy as np
import torch
from torchvision import transforms
There are a few different variations of CLIP based on popular image recognition
neural networks:
In [2]: clip.available_models()
Out[2]: ['RN50',
'RN101',
'RN50x4',
'RN50x16',
'RN50x64',
'ViT-B/32',
'ViT-B/16',
'ViT-L/14',
'ViT-L/14@336px']
For this notebook, we will be using ViT-B/32 , which is based on the Vision
Transformer architecture. It has 512 features, which we will later feed into our
diffusion model.
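A minimal sketch of loading the model, assuming the openai clip package and a device variable as in the earlier notebooks (CLIP_FEATURES matches the 512 features mentioned above):

clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()
CLIP_FEATURES = 512

clip.load returns both the model and the preprocessing pipeline, which is why clip_preprocess is available in the next cell.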
In [4]: clip_preprocess
Out[4]: Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    <function _convert_image_to_rgb at 0x7f89f0485510>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
We can test this on one of our flower photos. Let's start with a picturesque daisy.
img_path = DATA_DIR + "daisy/2877860110_a842f8b14a_m.jpg"  # the daisy from the img_paths list below
img = Image.open(img_path)
img.show()
We can find the CLIP embedding by first transforming our image with
clip_preprocess and converting the result to a tensor. Since the
clip_model expects a batch of images, we can use np.stack to turn the
processed image into a single element batch.
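The encoding cell itself is not shown in this export; a minimal sketch of what it likely looks like, assuming clip_model and device are defined as above:

processed_img = clip_preprocess(img)
clip_imgs = torch.tensor(np.stack([processed_img])).to(device)  # batch of one
clip_img_encoding = clip_model.encode_image(clip_imgs)
clip_img_encoding.shape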
torch.Size([1, 512])
In [8]: text_list = [
    "A round white daisy with a yellow center",
    "An orange sunflower with a big brown center",
    "A red rose bud"
]

text_tokens = clip.tokenize(text_list).to(device)
text_tokens

Out[8]: tensor([[49406,   320,  2522,  1579, 12865,   593,   320,  4481,  2119, 49407,
                 0,     0,     0,  ...,     0,     0,     0],
        [49406,   550,  4287, 21559,   593,   320,  1205,  2866,  2119, 49407,
                 0,     0,     0,  ...,     0,     0,     0],
        [49406,   320,   736,  3568, 10737, 49407,     0,     0,     0,     0,
                 0,     0,     0,  ...,     0,     0,     0]],
       device='cuda:0', dtype=torch.int32)
Then, we can pass the tokens to encode_text to get our text encodings. Uncomment clip_text_encodings if you would like to see what an encoding looks like. Just like our image encoding, there are 512 features for each of our 3 text descriptions.
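A sketch of that step, assuming the same clip_model:

clip_text_encodings = clip_model.encode_text(text_tokens).float()
# clip_text_encodings  # uncomment to see what an encoding looks like
clip_text_encodings.shape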
torch.Size([3, 512])
5.1.3 Similarity
To see which of our text descriptions best describes the daisy, we can calculate the cosine similarity between the text encodings and the image encoding. A cosine similarity of 1 is a perfect match; a cosine similarity of -1 means the two encodings are opposites.
The cosine similarity is equivalent to the dot product of the two vectors after each has been normalized by its magnitude. In other words, the magnitude of each vector becomes 1.
$X \cdot Y = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$
$\text{similarity}(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$
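The comparison cell is not shown in this export; a minimal sketch, assuming the encodings computed above (the names img_enc and txt_enc are ours):

# Normalize both encodings so a plain dot product equals the cosine similarity
img_enc = clip_img_encoding.float()
img_enc = img_enc / img_enc.norm(dim=-1, keepdim=True)
txt_enc = clip_text_encodings / clip_text_encodings.norm(dim=-1, keepdim=True)
similarity = txt_enc @ img_enc.T  # shape [3, 1]: one score per description
similarity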
What do you think? Does the most descriptive text get the highest score?
Let's practice a little more. Below, we've added a sunflower and a rose image.
In [12]: img_paths = [
DATA_DIR + "daisy/2877860110_a842f8b14a_m.jpg",
DATA_DIR + "sunflowers/2721638730_34a9b7a78b.jpg",
DATA_DIR + "roses/8032328803_30afac8b07_m.jpg"
]
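The display cell is omitted in this export; a minimal sketch of loading the images into the imgs list used below:

imgs = [Image.open(path) for path in img_paths]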
(The daisy, sunflower, and rose images are displayed here.)
TODO: Find text that describes the above images well and will result in a high similarity score. After calculating the similarity scores, feel free to repeat this exercise with modified descriptions. We will be using this text list again later.
In [17]: text_list = [
"A daisy",
"A sunflower",
"A rose"
]
text_list = [
"A round white daisy with a yellow center",
"An orange sunflower with a big brown center",
"A deep red rose flower"
]
n_imgs = len(imgs)
n_text = len(text_list)
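The cell that builds repeated_clip_text_encodings is truncated in this export. A minimal sketch, assuming we re-encode text_list and repeat each text encoding once per image so every description can be scored against every image (the repeat layout is a guess):

text_tokens = clip.tokenize(text_list).to(device)
clip_text_encodings = clip_model.encode_text(text_tokens).float()
repeated_clip_text_encodings = clip_text_encodings.repeat(n_imgs, 1)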
repeated_clip_text_encodings
Let's compare. Ideally, the diagonal from the top left to the bottom right should be bright yellow, corresponding to high similarity values. The rest of the values should be low and blue.
ax = fig.add_subplot(gs[1, :])  # fig and gs (a GridSpec) come from earlier in the original cell
plt.imshow(similarity.detach().cpu().numpy().T, vmin=0.1, vmax=0.3)
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[x, y]:.2f}", ha="center", va="center")
If the goal of CLIP is to align text encodings with image encodings, do we need a
text description for each of the images in our dataset? Hypothesis: we do not
need text descriptions and only need the image CLIP encodings to create a text-
to-image pipeline.
To test this out, let's add the CLIP image encodings as the "label" for each image in our dataset. Running CLIP on each batch of data-augmented images would be more accurate, but it is also slower. We can speed things up by computing and storing the encodings ahead of time.
Out[24]: ['data/cropped_flowers/sunflowers/3062794421_295f8c2c4e.jpg',
'data/cropped_flowers/sunflowers/5076821914_c21b58fd4c_m.jpg',
'data/cropped_flowers/sunflowers/5994569021_749d5e2da3_n.jpg',
'data/cropped_flowers/sunflowers/24459750_eb49f6e4cb_m.jpg',
'data/cropped_flowers/sunflowers/4814106562_7c3564d2d9_n.jpg']
The next code block runs the following loop for each filepath:
Open the image associated with the path and store it in img
Preprocess the image, find the CLIP encoding, and store it in clip_img
Convert the CLIP encoding from a tensor to a Python list
Store the filepath and the CLIP encoding as a row in a csv file
It may take a few seconds to process the full dataset. When complete, open
clip.csv to see the results.
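A minimal sketch of that loop, assuming img_paths now holds every filepath in the dataset (the Out[24] listing above shows a slice of it):

import csv

with open("clip.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path in img_paths:
        img = Image.open(path)
        with torch.no_grad():
            clip_img = clip_model.encode_image(clip_preprocess(img).unsqueeze(0).to(device))
        # One row per image: the filepath followed by 512 encoding values
        writer.writerow([path] + clip_img.flatten().tolist())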
We can use the same image transformations as we did in the previous notebook:
pre_transforms = [
transforms.Resize(IMG_SIZE),
transforms.ToTensor(), # Scales data into [0,1]
transforms.Lambda(lambda t: (t * 2) - 1) # Scale between [-1, 1]
]
pre_transforms = transforms.Compose(pre_transforms)
random_transforms = [
transforms.RandomCrop(IMG_SIZE),
transforms.RandomHorizontalFlip(),
]
random_transforms = transforms.Compose(random_transforms)
def __len__(self):
return len(self.imgs)
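For context, a minimal sketch of the dataset class this __len__ likely belongs to; the class name, CSV parsing, and transform order are assumptions:

import csv
from torch.utils.data import Dataset

class ClipFlowerDataset(Dataset):  # hypothetical name
    def __init__(self, csv_path):
        self.imgs = []
        self.labels = []
        with open(csv_path) as f:
            for row in csv.reader(f):
                self.imgs.append(row[0])  # filepath
                # The stored CLIP encoding becomes the "label"
                self.labels.append(torch.tensor([float(v) for v in row[1:]]))

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        img = pre_transforms(Image.open(self.imgs[idx]))
        return random_transforms(img), self.labels[idx]

Each batch then yields (x, c) pairs, matching the training loop later in this notebook.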
The U-Net model is the same architecture as last time, but with one small
difference. Instead of using the number of classes as our c_embed_dim , we will
use the number of CLIP_FEATURES . Last time, c might have stood for "class",
but this time, it stands for "context". Thankfully, they both start with c , so we
do not need to refactor the code to reflect this change in intention.
In [29]: T = 400
B_start = 0.0001
B_end = 0.02
B = torch.linspace(B_start, B_end, T).to(device)
The get_context_mask function will change a little bit. Since we're replacing categorical labels with continuous CLIP encodings, we no longer need to one-hot encode our context; instead, we randomly drop the encoding during training so the model also learns to generate without context.
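A minimal sketch of the updated function; whether the mask drops whole vectors or individual features is an assumption here:

def get_context_mask(c, drop_prob):
    # With probability drop_prob, zero out a sample's entire CLIP encoding
    keep = torch.bernoulli(torch.ones(c.shape[0], 1, device=device) - drop_prob)
    return keep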
Let's also recreate the sample_flowers function. This time, it will take our
text_list as a parameter and convert it to a CLIP encoding. The sample_w
function remains mostly the same and has been moved to the bottom of
ddpm_utils.py.
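A minimal sketch of the recreated function, assuming sample_w takes the model and the context encodings (its exact signature lives in ddpm_utils.py):

@torch.no_grad()
def sample_flowers(text_list):
    # Convert the text descriptions to CLIP encodings to use as context
    text_tokens = clip.tokenize(text_list).to(device)
    c = clip_model.encode_text(text_tokens).float()
    x_gen, x_gen_store = sample_w(model, c)
    return x_gen, x_gen_store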
Time to get training! After about 50 epochs, the model will start generating something recognizable, and at 100 it will hit its stride. What do you think? Do the generated images match your descriptions?
In [32]: epochs = 100
c_drop_prob = 0.1
lrate = 1e-4
save_dir = "05_images/"

# optimizer, dataloader, model, and model_flowers are defined in earlier cells of the original lab
model.train()
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()
        t = torch.randint(0, T, (BATCH_SIZE,), device=device).float()
        x, c = batch
        c_mask = get_context_mask(c, c_drop_prob)
        loss = ddpm.get_loss(model_flowers, x, t, c, c_mask)
        loss.backward()
        optimizer.step()
Now that the model is trained, let's play with it! What happens when we give it a
prompt of something not in the dataset? Or can you craft the perfect prompt to
generate an image you can imagine?
The art of crafting a prompt to get the results you desire is called prompt engineering, and, as shown here, it depends on the kind of data the model was trained on.
In [36]: # Change me
#text_list = [
# "A daisy",
# "A sunflower",
# "A rose"
#]
text_list = [
"A round green daisy with a black center",
"An orange sunflower with a big brown center",
"A deep red rose flower"
]
model.eval()
x_gen, x_gen_store = sample_flowers(text_list)
grid = make_grid(x_gen.cpu(), nrow=len(text_list))
other_utils.show_tensor_image([grid])
plt.show()
Once you've found a set of images you enjoy, run the cell below to turn the sampling process into an animation. It will be saved to 05_images/flowers.gif.
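The animation cell is not included in this export; a minimal sketch, assuming x_gen_store holds the intermediate samples from sample_flowers and that matplotlib's pillow writer is available:

from matplotlib.animation import FuncAnimation

fig = plt.figure()

def animate(i):
    # Redraw the image grid for the i-th stored sampling step
    plt.clf()
    grid = make_grid(torch.as_tensor(x_gen_store[i]), nrow=len(text_list))
    other_utils.show_tensor_image([grid])

anim = FuncAnimation(fig, animate, frames=len(x_gen_store))
anim.save(save_dir + "flowers.gif", writer="pillow")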
5.3 Next
Congratulations on making it to the end of the course! We hope the journey was enjoyable and that you were able to generate something worthy of sharing with friends and family.
Ready to put your skills to the test? Head on over to the assessment to earn a certificate!