https://all-things-vits.github.io/atv/
Who are we?
Cross attention
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., 2015
Effective Approaches to Attention-based Neural Machine Translation, Luong et al., 2015
Attention
〇 At the heart of the Transformer lies the simple attention mechanism.
○ Creates a contextualized representation for each input token.
Cross attention
[Figure: cross attention between image and text tokens]
Attention Is All You Need, Vaswani et al., 2017
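To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention in PyTorch; the shapes and names are illustrative, not taken from the tutorial's code.

```python
# A minimal sketch of scaled dot-product attention (Vaswani et al., 2017).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, heads, seq_len, head_dim)."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (B, H, Lq, Lk)
    weights = F.softmax(scores, dim=-1)              # rows sum to 1
    # Each output token is a weighted mix of all value vectors:
    # this is the "contextualized representation" of the token.
    return weights @ v, weights

# Cross attention is the same computation with queries from one modality
# (e.g., text) and keys/values from another (e.g., image patches).
q = torch.randn(1, 8, 16, 64)    # e.g., 16 text tokens
kv = torch.randn(1, 8, 196, 64)  # e.g., 196 image patches
out, attn = scaled_dot_product_attention(q, kv, kv)
print(out.shape, attn.shape)     # (1, 8, 16, 64), (1, 8, 16, 196)
```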
Resources
〇 Papers:
○ Attention Is All You Need
〇 Blog posts:
○ The Illustrated Transformer
Part 2
Probing Vision Transformers
Overview
〇 Mean attention distance
〇 Centered kernel alignment (CKA)
〇 Role of skip connections
Vision Transformers (ViT), quickly
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.
Lower layers have variable MAD: both local and global attention.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
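A hedged sketch of how MAD can be computed from a softmax'd attention matrix over patch tokens; the grid and patch sizes are illustrative assumptions.

```python
# Mean attention distance (Dosovitskiy et al.): the attention-weighted
# average spatial distance between a query patch and the patches it
# attends to, computed per head.
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (heads, num_patches, num_patches) attention over patch
    tokens (CLS token already removed). Returns MAD per head, in pixels."""
    ys, xs = torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()
    # Pairwise Euclidean distances between patch centers, in pixels.
    dist = torch.cdist(coords, coords) * patch_size   # (N, N)
    # Attention-weighted distance per query, averaged over queries.
    return (attn * dist).sum(-1).mean(-1)             # (heads,)

attn = torch.rand(12, 196, 196)
attn = attn / attn.sum(-1, keepdim=True)  # rows sum to 1
print(mean_attention_distance(attn, grid_size=14))  # one value per head
```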
Mean attention distance (MAD)
It doesn’t change much when we use a conv prior.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
MAD = f(data, depth)
Strong connection between MAD, pre-training data, and ViT architecture:
Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
MAD = f(data, depth)
But not so much for …
Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Some observations so far
〇 Without enough data, lower layers in ViTs don’t learn locality. This becomes
evident in deeper architectures.
〇 With enough pre-training data, lower layers learn to encode locality early on, which could be an indicator of good performance.
〇 ViT layers have access to global information almost uniformly. What are the repercussions of this?
sim(representation_ViT, representation_ResNet)
〇 There's a key difference here: CNNs don't combine global and local information the way ViTs do.
〇 Does this lead to differences in their representation space?
Yes, it does!
sim(representation_ViT, representation_ResNet)
A quantifiable way to compare representations across neural architectures:
Centered Kernel Alignment (CKA)
CKA(K, L) = HSIC(K, L) / √(HSIC(K, K) · HSIC(L, L)), where HSIC(K, L) = tr(KHLH)/(n − 1)² and H = I − (1/n)11ᵀ is the centering matrix.
sim(representation_ViT, representation_ResNet)
Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
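As a rough illustration, the linear variant of CKA (Kornblith et al., 2019) takes only a few lines; the activation shapes below are assumptions.

```python
# Linear CKA: the similarity used by Raghu et al. to compare ViT and
# ResNet layer representations on the same set of examples.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) -- activations for the same n examples."""
    X = X - X.mean(0, keepdims=True)  # centering: the role played by
    Y = Y - Y.mean(0, keepdims=True)  # the centering matrix H above
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 768))                     # e.g., a ViT layer
print(linear_cka(X, X))                             # 1.0: identical
print(linear_cka(X, rng.normal(size=(512, 256))))   # near 0: unrelated
```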
Role of skip connection
〇 ViT’s representation space is uniform.
〇 Information from lower layers is propagated to the higher layers more
faithfully.
〇 How?
Role of skip connection
The setup: plot the norm ratio ‖z_i‖ / ‖f(z_i)‖, where z_i is the representation carried by the skip connection and f(z_i) is the transformation the block applies (self-attention or MLP).
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
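A minimal sketch of this probe, assuming access to the block input z and the block's additive update f(z) (e.g., collected with forward hooks); all names are illustrative.

```python
# Norm-ratio probe: how much of the output comes from the skip branch
# versus the block's transformation.
import torch

def norm_ratio(z, f_z):
    """z: hidden states entering the block; f_z: what the block adds.
    Returns the mean ||z|| / ||f(z)|| per token position."""
    return (z.norm(dim=-1) / f_z.norm(dim=-1)).mean(0)  # (seq_len,)

z = torch.randn(8, 197, 768)          # batch of token representations
f_z = 0.1 * torch.randn(8, 197, 768)  # a small update => large ratio
ratios = norm_ratio(z, f_z)
print(ratios[0], ratios[1:].mean())   # CLS token vs. spatial tokens
```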
Role of skip connection
[Figure: norm ratios across layers, shown separately for spatial tokens and the CLS token; overall low norm ratios.]
Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Role of skip connection
Removing skip connections disrupts the uniformity of the representation
structure:
Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Connections to robustness
ViT’s uniform representation structure impacts robustness:
“Why Should I Trust You?”: Explaining the Predictions of Any Classifier (LIME), Ribeiro et al., 2016
Intro to Explainability
As the number of parameters and the complexity of networks increase, it becomes increasingly challenging to develop such explainability tools.
〇 Decompose an input prompt into the set of features used by the model.
〇 Grad-CAM (Selvaraju et al.)
〇 Input X Gradient (Shrikumar et al.)
〇 Integrated Gradients (Sundararajan et al.)
〇 LIME (Ribeiro et al.)
〇 KernelSHAP (Lundberg et al.)
〇 DeepLift (Shrikumar et al.)
〇 The attention matrix determines the amount of context that each token will
receive from the other tokens.
          T1 (The)   T2 (cat)   T3 (sat)   T4 (on)
T1 (The)    0.01       0.89       0.1        0
T2 (cat)     …          …          …         …
T3 (sat)     …          …          …         …
T4 (on)      …          …          …         …
Attention as Explanation?
For classification tasks, the CLS token is appended to the sequence.
〇 There's significant variance between the attention heads (recall the MAD scores from the previous section).
〇 How can we average across the different attention heads?
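A sketch of the naive baseline: take the CLS token's attention to the patches and average it over heads; the head variance noted above is exactly what this simple mean glosses over. CLS at index 0 is the standard ViT convention.

```python
# Extract the CLS row of a softmax'd attention matrix and average heads.
import torch

def cls_attention_map(attn, grid_size=14):
    """attn: (heads, seq_len, seq_len), CLS token at index 0.
    Returns the head-averaged map and the per-head maps."""
    cls_to_patches = attn[:, 0, 1:]                  # (heads, N)
    per_head = cls_to_patches.reshape(-1, grid_size, grid_size)
    return per_head.mean(0), per_head

attn = torch.softmax(torch.randn(12, 197, 197), dim=-1)
mean_map, head_maps = cls_attention_map(attn)
print(mean_map.shape, head_maps.shape)  # (14, 14), (12, 14, 14)
```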
Image source: Grad-CAM: Why did you say that?, Selvaraju et al., 2017
Grad-CAM
Grad-CAM for CNNs:
〇 Average the feature maps, weighted by the gradients w.r.t. the target class.
Image source: Grad-CAM: Why did you say that?, Selvaraju et al., 2017
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2017
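A hedged Grad-CAM sketch for a torchvision CNN; the choice of model, layer, and input is an assumption for illustration.

```python
# Grad-CAM (Selvaraju et al., 2017): weight each feature map by its
# pooled gradient w.r.t. the target class, sum, and apply ReLU.
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
feats, grads = {}, {}
layer = model.layer4  # last conv stage; any conv layer works
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                      # stand-in image
logits = model(x)
logits[0, logits.argmax()].backward()                # target class score

weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
cam = torch.relu((weights * feats["a"]).sum(1))      # (1, 7, 7)
cam = cam / cam.max()                                # normalize to [0, 1]
print(cam.shape)
```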
Solution #3- TiBA
TiBA scales the attention maps by the gradients.
Ā = I + E_h[(∇A ⊙ A)⁺]
(gradients ⊙ attention; the identity I accounts for the residual connections)
Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition
Relevance flows backward from the output, redistributed at each layer in proportion to each neuron's contribution to the next layer's activations.
Formally,
R_i = Σ_j (a_i · w_ij / Σ_i′ a_i′ · w_i′j) · R_j
Initialize with a one-hot vector for the target class (similar to gradient propagation).
Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
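A minimal sketch of one LRP step for a linear layer under the rule above (with an epsilon stabilizer in the denominator, a common variant).

```python
# One LRP step: redistribute output relevance to the inputs in
# proportion to each input's contribution a_i * w_ij.
import torch

def lrp_linear(a, w, R_out, eps=1e-6):
    """a: (d_in,) input activations; w: (d_in, d_out) weights;
    R_out: (d_out,) output relevance. Returns input relevance."""
    z = a @ w                              # total contribution per output
    s = R_out / (z + eps * torch.sign(z))  # relevance per unit contribution
    return a * (w @ s)                     # R_in; conserves sum(R)

a = torch.rand(4)
w = torch.rand(4, 3)
R_out = torch.tensor([1.0, 0.0, 0.0])      # one-hot init at the output
R_in = lrp_linear(a, w, R_out)
print(R_in, R_in.sum())                    # sum(R_in) ≈ sum(R_out)
```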
Solution #3- TiBA
Instead of using the raw attention values for the relevance matrix R, use the LRP
values.
Ā = I + E_h[(∇A ⊙ LRP(A))⁺]
(gradients ⊙ LRP relevance; the identity I accounts for the residual connections)
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
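A sketch of gradient-weighted attention rollout in the spirit of TiBA/GAE; this simplified version fuses heads with (∇A ⊙ A)⁺ and is not the papers' exact procedure.

```python
# Accumulate per-layer relevance: fuse heads with the positive part of
# gradient * attention, add the identity for the residual connection,
# and multiply across layers (rollout).
import torch

def relevance_rollout(attns, grads):
    """attns, grads: lists of (heads, L, L) tensors, one per layer."""
    n = attns[0].size(-1)
    R = torch.eye(n)                          # self-relevance init
    for A, G in zip(attns, grads):
        A_bar = (G * A).clamp(min=0).mean(0)  # E_h[(grad * attn)^+]
        A_bar = A_bar + torch.eye(n)          # residual connection
        A_bar = A_bar / A_bar.sum(-1, keepdim=True)
        R = A_bar @ R
    return R                                  # row 0 = CLS relevance

attns = [torch.softmax(torch.randn(12, 197, 197), -1) for _ in range(12)]
grads = [torch.randn(12, 197, 197) for _ in range(12)]
print(relevance_rollout(attns, grads)[0, 1:].shape)  # relevance per patch
```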
Solution #4- GAE Results
Extracting semantic segmentation from object detection
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Solution #4- GAE Results
CLIP
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Solution #4- GAE Results
CLIP explanations are valuable for downstream tasks such as image-to-sketch,
real image editing, etc.
Target edit text: red hat
Text2LIVE: Text-Driven Layered Image and Video Editing, Bar-Tal et al., 2022
CLIPasso: Semantically-Aware Object Sketching, Vinker et al., 2022
Additional Methods
There have been many other great works examining explanations for
Transformers, which we do not cover due to time constraints.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
What if it could be better separated?
Delegating responsibilities better – Class Attention
[Pipeline: frozen patch embeddings → Transformer Block 11 (ViT-DINO) → extract attention scores (softmax'd scores) → average from multiple heads → attention map → reshape and upsample the map]
Investigating Vision Transformer representations Blog, Aritra Roy Gosthipaty and Sayak Paul
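A hedged sketch of this pipeline using the DINO repository's ViT via torch.hub; get_last_selfattention is provided by facebookresearch/dino, and the rest of the plumbing here is illustrative.

```python
# Extract, head-average, and upsample DINO's last-block attention map.
import torch

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
img = torch.randn(1, 3, 224, 224)              # stand-in for a real image

with torch.no_grad():
    attn = model.get_last_selfattention(img)   # (1, heads, 197, 197)

cls_attn = attn[0, :, 0, 1:]                   # CLS -> patch scores
avg = cls_attn.mean(0).reshape(14, 14)         # average over heads
# Reshape and upsample the map back to image resolution:
heatmap = torch.nn.functional.interpolate(
    avg[None, None], size=(224, 224), mode="bilinear")[0, 0]
print(heatmap.shape)
```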
Resources
〇 Papers:
○ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
○ Going deeper with Image Transformers
○ Emerging Properties in Self-Supervised Vision Transformers
○ DINOv2: Learning Robust Visual Features without Supervision (further reading)
〇 Blog posts:
○ Investigating Vision Transformer representations Blog
〇 Colab Notebooks
○ DINO attention maps
〇 Demos
○ Class attention maps
Guest section
Ron Mokady
Part 5
Attention for Downstream Tasks
Overview- Attend and Excite
〇 Catastrophic Neglect
〇 Intro to latent diffusion models
〇 Generative semantic nursing
〇 Results
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Text-Based Image Generation
[Figure: examples of successful ☺ and failed ☹ text-based generations]
Latent Diffusion Models
〇 💡 Idea:
○ Encoder maps the input to an embedding space.
○ Diffusion model is applied in the latent space.
〇 Lower training cost and faster inference.
High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the input x into a latent vector z.
〇 The diffusion process is applied on the latent vector z.
〇 Conditioning (e.g., text) is added via cross-attention layers.
〇 Finally, the latent is decoded back to produce the output image.
High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
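For orientation, a hedged usage sketch with the diffusers library (not the tutorial's own code) that maps the slide's components onto a concrete LDM pipeline.

```python
# Text conditioning enters the U-Net through cross-attention, the
# diffusion runs in latent space, and the VAE decodes the result.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pieces named on the slide: pipe.vae (x <-> z), pipe.unet
# (diffusion on z, with cross-attention to the embeddings produced by
# pipe.text_encoder from the prompt).
image = pipe("a lion with a crown", num_inference_steps=30).images[0]
image.save("lion.png")
```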
Why Does the Model Fail?
〇 DDPM process:
○ Given an input text prompt, the DDPM gradually denoises a pure noise
latent to obtain the output image.
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
[Figure: cross attention between the image latent features (P = 16) and the text embedding]
A_t[i, n] = presence of token n in patch i
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Problem: “crown” gets low attention values for all patches.
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Low attention = No generation
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
How Can We Fix This?
💡 Intuition: a generated subject should have an image patch that significantly attends to the subject's token.
For each subject token n (e.g., “lion”, “crown”), take the maximum attention value over all patches: max_i A_t[i, n].
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
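A simplified sketch of the resulting loss and latent update (“generative semantic nursing”); how A_t is extracted from the U-Net's cross-attention layers is assumed away here, and the step size is illustrative.

```python
# Attend-and-Excite style loss: push the most neglected subject token
# to have at least one strongly-attending patch.
import torch

def attend_and_excite_loss(A_t, subject_token_ids):
    """A_t: (num_patches, num_tokens) cross-attention map."""
    per_token_max = A_t[:, subject_token_ids].max(dim=0).values
    return (1 - per_token_max).max()   # driven by the weakest subject

z_t = torch.randn(4, 64, 64, requires_grad=True)    # noised latent
A_t = torch.softmax(torch.randn(256, 77), dim=-1)   # placeholder map
loss = attend_and_excite_loss(A_t, subject_token_ids=[2, 5])
# In the real method A_t is a function of z_t, so each denoising step
# can nudge the latent along the loss gradient:
# z_t = z_t - alpha * torch.autograd.grad(loss, z_t)[0]
print(loss)
```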
Can We Improve This?
Yes. With Iterative Refinement.
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Putting It All Together
Attend to and Excite all subject tokens!
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A cat and a dog”
[Figure: Stable Diffusion vs. Composable Diffusion vs. StructureDiffusion vs. Attend-and-Excite]
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A red bench and a yellow clock”
[Figure: Stable Diffusion vs. Composable Diffusion vs. StructureDiffusion vs. Attend-and-Excite]
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A playful kitten chasing a butterfly in a wildflower meadow”
[Figure: Stable Diffusion vs. Attend-and-Excite]
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A grizzly bear catching a salmon in a crystal clear river surrounded by a forest”
[Figure: Stable Diffusion vs. Attend-and-Excite]
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results- Cross Attention Maps
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Resources
〇 Papers
○ High-Resolution Image Synthesis with Latent Diffusion Models
○ Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image
Diffusion Models
○ Compositional Visual Generation with Composable Diffusion Models
○ Training-Free Structured Diffusion Guidance for Compositional Text-to-Image
Synthesis
○ Stable Diffusion
〇 Demos
○ Generation with Attend and Excite interactive demo
Conclusion
Summary and open questions
Summary
〇 Probing what Vision Transformers learn
○ Mean attention distance
○ Similarity between ViTs and ResNets (representation space with CKA)
○ Skip connections
〇 Explaining attention
○ Attention flow and rollout
○ TiBA
○ Multiple modalities
〇 Attention as an explanation
○ Class attention
○ Semantic layouts in attention maps
〇 Attention for downstream applications
○ Prompt-to-prompt
○ Null-text inversion
○ Attend and Excite
Q1: Evaluating explainability tools
How is a good explanation even defined?
〇 Faithfulness to the model: What features were used to arrive at the decision?
What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods, Colin et al., 2022
A Benchmark for Interpretability Methods in Deep Neural Networks, Hooker et al., 2019
Q2: Are smaller models more explainable?
Input prompt: “a man with eyeglasses”
CLIP w/ ViT-B/16
CLIP w/ ViT-L/14
Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Q3: Can we go beyond attention for interpretation?
Transformers are not just the attention layers!