
All Things ViTs

Understanding and Interpreting


Attention in Vision
CVPR’23 Tutorial
Hila Chefer, Sayak Paul

https://all-things-vits.github.io/atv/
Who are we?

Hila Chefer is a PhD candidate at Tel-Aviv University and a
research intern at Google Tel-Aviv. You will probably not find
her away from work (hilach70@gmail.com).

Sayak works on 🧨 diffusers at Hugging Face 🤗. Outside of
work, you can find him binge-watching Suits for the nth time
(spsayakpaul@gmail.com).
Our guest speaker - Ron Mokady

Ron is a Computer Science Ph.D. student at Tel-Aviv


University. He is currently a Research Lead at Bria.AI.
Previously, he worked at FAIR and Google
(ron.mokady@gmail.com).
Overview of the tutorial
〇 Introductions
〇 Attention in a jiffy
〇 Probing Vision Transformers
〇 Explaining Transformers’ predictions
〇 Attention as a (visual) explanation
〇 Attention to aid downstream applications
〇 Open questions and conclusion
💡 We’ll have multiple short breaks
in between the sections.
💡 For Q&A, we can use the breaks,
RocketChat, or both.
💡 Disclaimer: This tutorial is NOT
an exhaustive overview of all
possible methods.
💡 All the content (slides, demos,
code) is available here:
bit.ly/atv-cvpr
💡 The slides cover the main
discussion; at the end of each
section, we link to relevant resources.
Part 1
Intro to Transformers
Overview
〇 From RNNs to Transformers
〇 Attention- Intuition
〇 The Beast with Many Heads
〇 Positional Encoding
〇 Cross-Attention
In the Beginning There Was an RNN
〇 RNN = Recurrent Neural Network.
〇 Widely used for Natural Language Processing (until 2017).
〇 Processing text sequentially, token by token.
In the Beginning There Was an RNN
〇 Why not stick to RNNs?
○ Sequential processing- time-consuming (tokens are processed one by one).
○ Localization- the hidden state is mostly influenced by recent tokens.
○ Single direction context- from left to right (partially solved by BiLSTMs).
A Transformer is Born

(Architecture diagram: encoder self-attention, decoder self-attention, and cross-attention between them.)

Attention is All you Need, Vaswani et al., 2017


A Transformer is Born

Slide courtesy: Lucas Beyer


A Transformer is Born

Slide courtesy: Lucas Beyer


A Transformer is Born- RNN vs. Attention
〇 Sequential processing vs. encoding done in parallel.
〇 Localization vs. context gained simultaneously from all tokens.
〇 Single direction context vs. context from the entire sequence.

Attention is All you Need, Vaswani et al., 2017


Attention Before Transformers
〇 Variants of attention have been employed for RNNs and LSTMs.
○ Mostly as a mechanism to support non-rigid information transfer
between the encoder and the decoder.

Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., 2015
Effective Approaches to Attention-based Neural Machine Translation, Luong et al., 2015
Attention
〇 At the heart of the Transformer lies the simple attention mechanism.
○ Creates a contextualized representation for each input token.

The cat sat on the mat


Attention- Intuition
〇 Similar to retrieval from databases:
○ Query = a query we wish to run on a database.
○ Key = the keys to search on in the database.
○ Value = values corresponding to each key in the database.
〇 Intuition- each token “searches the database” for tokens related to it.

The Illustrated Transformer Blog, Jay Alammar


Attention- Calculation
〇 Attention is calculated in two steps-
○ First- an attention score is
calculated, as a dot product
between the queries and the
keys.
○ The attention scores determine
the amount of context that will
be transferred.

The Illustrated Transformer Blog, Jay Alammar


Attention- Calculation
〇 Attention is calculated in two steps-
○ Next- the values are weighted
by the attention scores.
○ The weighted values are the
contextualized representations
of each token.
○ Intuitively: each token becomes
a convex combination of all the
tokens in the sequence.

The Illustrated Transformer Blog, Jay Alammar
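To make the two-step calculation concrete, here is a minimal NumPy sketch of single-head attention (the variable names are ours, not from the slides):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K, V: (n_tokens, d) matrices of queries, keys, and values."""
    # Step 1: attention scores = scaled dot product of queries and keys, softmax-normalized.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)   # each row sums to 1
    # Step 2: each output token is a convex combination of the values.
    return scores @ V, scores
```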


The Beast with Many Heads
〇 Softmax tends to zero out almost all entries.
○ Coefficients of the linear combination are very sparse.
〇 To encourage diversity, several attention heads are used, each calculating its
own attention values.

The Illustrated Transformer Blog, Jay Alammar


The Beast with Many Heads
〇 The heads’ outputs are concatenated, and a linear layer produces the final output.

The Illustrated Transformer Blog, Jay Alammar
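A quick way to see the per-head maps in practice is PyTorch’s built-in multi-head attention module (the shapes below are illustrative):

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(1, 6, 64)                         # e.g., the 6 tokens of "The cat sat on the mat"
out, attn = mha(x, x, x, average_attn_weights=False)
print(out.shape)                                  # (1, 6, 64): concatenated heads -> linear layer
print(attn.shape)                                 # (1, 8, 6, 6): one 6x6 attention map per head
```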


Positional Encoding
〇 The attention mechanism is invariant to order.
○ The attention scores are calculated by simple dot products.
〇 Therefore, to account for the order of tokens, a positional encoding is added
to each token before processing.
○ Order is crucial:
“She likes ice-cream, he does not” vs. “He likes ice-cream, she does not”

Attention is All you Need, Vaswani et al., 2017
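A minimal sketch of the fixed sinusoidal encodings from Vaswani et al. (the function name is ours); the encoding is simply added to the token embeddings before the first attention layer:

```python
import numpy as np

def sinusoidal_positional_encoding(n_tokens, d_model):
    pos = np.arange(n_tokens)[:, None]              # token positions 0..n-1
    i = np.arange(d_model // 2)[None, :]            # frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```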


A Transformer is Born

(Architecture diagram: encoder self-attention, decoder self-attention, and cross-attention between them.)

Attention is All you Need, Vaswani et al., 2017


From Self-Attention to Cross-Attention
〇 Cross-attention is used to gain context from another modality / input type.
○ For example- gain context from text for image processing.
○ Simply extract the key and value matrices from the other modality (the queries come from the sequence being processed).

Image Text
Attention is All you Need, Vaswani et al., 2017
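A minimal sketch of cross-attention between image and text tokens, assuming the keys and values come from the text (all names are ours):

```python
import numpy as np

def cross_attention(x_img, x_txt, Wq, Wk, Wv):
    """x_img: (n_img, d), x_txt: (n_txt, d); projection matrices W*: (d, d)."""
    Q, K, V = x_img @ Wq, x_txt @ Wk, x_txt @ Wv   # queries from image, keys/values from text
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over the text tokens
    return weights @ V                               # image tokens, contextualized by the text
```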
Resources
〇 Papers:
○ Attention Is All You Need
〇 Blog posts:
○ The Illustrated Transformer
Part 2
Probing Vision Transformers
Overview
〇 Mean attention distance
〇 Centered kernel alignment (CKA)
〇 Role of skip connections
Vision Transformers (ViT), quickly

Going to shamelessly steal slides from Lucas Beyer!


An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, A Dosovitskiy, L Beyer, A Kolesnikov, D Weissenborn, X Zhai, T Unterthiner, M Dehghani, M Minderer, G Heigold, S Gelly, J Uszkoreit, N Houlsby

Vision Transformer (ViT)

Many prior works attempted to introduce self-attention at the pixel level.
For 224px², that's a sequence length of ~50k, which is too much!

Thus, most works restrict attention to local pixel neighborhoods, or use it as a
high-level mechanism on top of detections.

The key breakthrough in using the full Transformer architecture, standalone, was to
"tokenize" the image by cutting it into patches of 16px², and treating each patch as a
token, e.g. embedding it into input space.
Slide courtesy: Lucas Beyer
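A rough sketch of the "tokenization" step: cut the image into 16x16 patches, flatten them, and linearly embed them (the projection here is random, just to show the shapes):

```python
import numpy as np

def patchify(image, patch_size=16):
    """image: (H, W, C) array -> (num_patches, patch_size*patch_size*C) patch tokens."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

image = np.random.rand(224, 224, 3)
embed = np.random.rand(16 * 16 * 3, 768)   # stand-in for the learned linear embedding
tokens = patchify(image) @ embed
print(tokens.shape)                         # (196, 768): a 14x14 grid of patch tokens
```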
How do we know if it’s effective at all?
〇 Locality is important for computer vision.
〇 Having a global context is equally important.
〇 How do ViTs learn locality?
〇 Is there any similarity between CNNs and ViTs w.r.t their representation
spaces?
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

MAD is the geometric distance between patches, weighted by the attention values
(and averaged over the query patches).

〇 High MAD = distant patches receive high attention


values.
〇 Low MAD = close patches receive high attention
values.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
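A sketch of how mean attention distance could be computed for one head, under our reading of the definition above (patch centers in pixels, attention-weighted distances, mean over query patches):

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (num_patches, num_patches) attention map of one head (CLS token dropped)."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys, xs], -1).reshape(-1, 2) * patch_size      # patch positions in pixels
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    attn = attn / attn.sum(-1, keepdims=True)                        # ensure rows sum to 1
    return (attn * dists).sum(-1).mean()                             # attention-weighted distance
```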
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

Investigating Vision Transformer representations (blog post), credits - Ritwik Raha


Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

Lower layers have variable MAD: local + global.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

Lower layers have variable MAD (local + global); higher layers have higher MAD (global).

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
Mean attention distance (MAD)
It doesn’t change much when we use a conv prior.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
MAD = f(data, depth)
Strong connection between MAD, pre-training data, ViT architecture:

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
MAD = f(data, depth)
But not so much for …

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Some observations so far
〇 Without enough data, lower layers in ViTs don’t learn locality. This becomes
evident in deeper architectures.
〇 With enough pre-training data, lower layers learn to encode locality early on
which could be an indicator for good performance.
〇 ViT layers have access to global information almost uniformly. What are its
repercussions?
sim(representation_ViT, representation_ResNet)
〇 There’s a primary difference here – CNNs don’t combine both global and
local information like ViTs do.
〇 Does this lead to differences in their representation space?

Yes, it does!
sim(representation_ViT, representation_ResNet)
A quantifiable way to compare representations from neural architectures -
Centered Kernel Alignment (CKA)

〇 Invariant to orthogonal transformation of representations


〇 Invariant to isotropic scaling (scaling each dimension uniformly)

Similarity of Neural Network Representations Revisited, Kornblith et al., 2019


sim(representation_ViT, representation_ResNet)
Centered Kernel Alignment

CKA(K, L) = HSIC(K, L) / √( HSIC(K, K) · HSIC(L, L) )

K and L = Gram matrices, K = XXᵀ, L = YYᵀ (X and Y are representations),
HSIC = Hilbert-Schmidt independence criterion
sim(representation_ViT, representation_ResNet)
Centered Kernel Alignment

Centered Gram matrices: K′ = HKH, L′ = HLH,
with centering matrix H = I_n − (1/n)·11ᵀ
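A minimal NumPy implementation of linear CKA consistent with the definitions above (for a linear kernel, HSIC reduces to Frobenius norms of cross-covariances; names are ours):

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) representations of the same n examples."""
    X = X - X.mean(axis=0)          # centering features == centering the Gram matrices with H
    Y = Y - Y.mean(axis=0)
    hsic_xy = np.linalg.norm(Y.T @ X, "fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, "fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, "fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```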
sim(representation_ViT, representation_ResNet)

Intra-network comparison with CKA

〇 ViTs show more uniform similarities between


both lower and higher layers.
〇 ResNets show uniform similarities within the lower and upper halves separately.

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
sim(representation_ViT, representation_ResNet)

Inter-network comparison with CKA

〇 ViTs compute similar features as ResNets


but with a smaller set of layers.
〇 ViTs propagate features more faithfully
across layers.
〇 Features across the higher layers in ViTs
and ResNets vary.

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Role of skip connection
〇 ViT’s representation space is uniform.
〇 Information from lower layers is propagated to the higher layers more
faithfully.
〇 How?
Role of skip connection
The setup – plot the norm ratio ‖z_i‖ / ‖f(z_i)‖ between:
○ the hidden representation z_i of the i-th layer (skip connection), and
○ the transformation f(z_i) of z_i from the long branch (MLP or self-attention).

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
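A small sketch of the measurement, assuming we can call a block’s long branch (self-attention or MLP) directly as `long_branch` (names are ours):

```python
import torch

@torch.no_grad()
def norm_ratio(z, long_branch):
    """z: (tokens, dim) hidden representation entering the block (the skip-connection branch)."""
    f_z = long_branch(z)                       # output of the self-attention or MLP branch
    return z.norm(dim=-1) / f_z.norm(dim=-1)   # large ratio => the skip connection dominates
```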
Role of skip connection

〇 Clear phase transition between CLS


and spatial tokens.
〇 1st half: skip connections propagate
the CLS token.
〇 2nd half: skip connections propagate
the spatial tokens.

(Plot: norm ratios for the spatial tokens vs. the CLS token across layers.)

Image source: Do Vision Transformers See Like Convolutional Neural Networks?


Role of skip connection
Skip connections behave differently in ViTs and ResNets:

Overall low
norm ratios.

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Role of skip connection
Removing skip connections disrupts the uniformity of the representation
structure:

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Connections to robustness
ViT’s uniform representation structure impacts robustness:

Vision Transformers are Robust Learners, Paul et al., 2022


Resources
〇 Papers:
○ An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale
○ Do Vision Transformers See Like Convolutional Neural Networks?
○ Vision Transformers are Robust Learners
○ What do Vision Transformers Learn? A Visual Exploration (further
reading)
〇 Colab Notebooks
○ Mean attention distance
Part 3
Explaining Transformers
Overview
〇 Intro to Explainability
〇 Why Not Use CNN Algorithms?
〇 Is Attention an Explanation?
〇 Algorithms to Explain Transformers
Intro to Explainability

“We all fear what we do not understand.”


― Dan Brown, The Lost Symbol
Intro to Explainability
Goal: developing a set of tools and frameworks to help you understand and
interpret predictions made by your machine learning models.

Quantifying ChatGPT’s gender bias, Sayash Kapoor and Arvind Narayanan


Intro to Explainability
〇 Goal: developing a set of tools and frameworks to help you understand and
interpret predictions made by your machine learning models.

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier (LIME), Ribeiro et al., 2016
Intro to Explainability
As the number of parameters and the complexity of networks increase, it
becomes increasingly challenging to develop such tools.

Image credits: Wikipedia


Sidenote- Explaining Generative Models
〇 Explaining generative models (e.g., GPT, Stable Diffusion) is challenging.
○ How does the model produce a generation from scratch?

Image credits: Wikipedia


Sidenote- Explaining Generative Models
New pre-print on interpreting diffusion models!

〇 Decompose an input prompt into the set of features used by the model.

The Hidden Language of Diffusion Models, Chefer et al., 2023


Sidenote- Explaining Generative Models
New pre-print on interpreting diffusion models!

〇 Decompose an input prompt into the set of features used by the model.

The Hidden Language of Diffusion Models, Chefer et al., 2023


Sidenote- Explaining Generative Models
New pre-print on interpreting diffusion models!

〇 The learned decomposition reveals non-trivial biases.

The Hidden Language of Diffusion Models, Chefer et al., 2023


Intro to Explainability
Approaches can be (roughly) divided into two categories:

〇 Model-specific - using the model’s activations, parameters, structure, etc.
in the explanation (e.g., Grad-CAM).
〇 Model-agnostic - treating the model as a black box and developing a
general method (e.g., SHAP, LIME).

Image origin: Wikipedia


Intro to Explainability
Disclaimer:

The following is a non-exhaustive list of notable methods to interpret deep


neural networks.

Due to time constraints, we will not go over all methods in detail.


Intro to Explainability

Grad-CAM (Selvaraju et al.)
Input X Gradient (Shrikumar et al.)
Integrated Gradients (Sundararajan et al.)
LIME (Ribeiro et al.)
KernelSHAP (Lundberg et al.)
DeepLift (Shrikumar et al.)

And many more!


CNNs vs. Transformers
CNNs and Transformers differ significantly in their architecture:

〇 Attention vs. convolution.


〇 For Transformers- classification is mostly obtained by a CLS token.

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Attention as Explanation?
Is the attention mechanism naturally interpretable?

〇 The attention matrix determines the amount of context that each token will
receive from the other tokens.

          T1 (The)  T2 (cat)  T3 (sat)  T4 (on)
T1 (The)    0.01      0.89      0.1       0
T2 (cat)     …         …         …        …
T3 (sat)
T4 (on)
Attention as Explanation?
For classification tasks, the CLS token is appended to the sequence.

〇 The CLS alone determines the classification.


〇 Can we obtain the explanation directly from the attention values?
                T1 (The)  T2 (cat)  T3 (sat)  T4 (on)
[CLS] row → CLS   0.01      0.89      0.1       0
           T1 (The)  …        …         …       …
           T2 (cat)
           T3 (sat)
           T4 (on)
Attention as Explanation?
💡 Idea: extracting the attention values that correspond to the CLS token as an
explanation.

       T1 (P1)  T2 (P2)  T3 (P3)  T4 (P4)
CLS      0.01     0.89     0.1       0
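In code, this idea is just slicing the last layer’s attention tensor (the shape convention below is an assumption: token 0 is the CLS token):

```python
import torch

def cls_attention_explanation(attn: torch.Tensor) -> torch.Tensor:
    """attn: last-layer attention, shape (heads, tokens, tokens); token 0 is the CLS token."""
    cls_row = attn[:, 0, 1:]          # CLS -> patch attention, one row per head
    return cls_row.mean(dim=0)        # naive head aggregation (its issues are discussed next)
```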
Attention as Explanation?
Issue #1: The beast with many heads:

〇 There’s significant variance between the attention heads (recall the MAD scores
from the previous section).
〇 How can we average across the different attention heads?

       T1 (P1)  T2 (P2)  T3 (P3)  T4 (P4)
CLS      0.01     0.89     0.1       0
Attention as Explanation?
Issue #2: Deep neural network:

〇 Each layer mixes the tokens.


〇 In deeper layers- how can we account for the mixture of tokens in the
previous layers?

       T1 (P1)  T2 (P2)  T3 (P3)  T4 (P4)
CLS      0.01     0.89     0.1       0
Solution #1- Attention Rollout
〇 Aggregation across heads: averaging.
〇 Aggregation across layers: matrix multiplication of the attention maps to
track context.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Solution #1- Attention Rollout

Quantifying Attention Flow in Transformers- Blog, Samira Abnar


Solution #1- Attention Rollout

Quantifying Attention Flow in Transformers- Blog, Samira Abnar


Solution #1- Attention Rollout
〇 Aggregation across layers: matrix multiplication to track context:
○ The attention maps of all layers are multiplied.
○ The identity matrix is added to each self-attention matrix to account for
the residual connections.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Image source: Attention is All you Need, Vaswani et al., 2017
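A compact sketch of attention rollout as described above (one attention tensor per layer; names are ours):

```python
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer attention maps, each of shape (heads, tokens, tokens)."""
    n = attentions[0].shape[-1]
    rollout = torch.eye(n)
    for attn in attentions:
        a = attn.mean(dim=0)                  # aggregate heads by averaging
        a = a + torch.eye(n)                  # add identity for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = a @ rollout                 # accumulate context mixing layer by layer
    return rollout                            # rollout[0, 1:] ~ CLS-to-patch relevance
```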
Solution #2- Attention Flow
〇 Aggregation across heads: averaging.
〇 Aggregation across layers: solving a max-flow problem on the attention
graph.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Solution #2- Attention Flow

Quantifying Attention Flow in Transformers Blog, Samira Abnar


Solution #2- Attention Flow
〇 Aggregation across layers: solving a max-flow problem on the attention
graph.
○ More computationally expensive:
■ d = depth, n = number of tokens in the sequence.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Attention Rollout- Issues
〇 Averaging across the attention heads may be overly simplistic.
○ Each head has its own functionality, some may be less relevant than
others.
〇 Solution focuses on the attention alone and ignores other parts of the
network e.g., activations, linear layers, etc.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Solution #3- TiBA
〇 Averaging across the attention heads may be overly simplistic.
○ 💡 Idea: use gradients to scale the attention heads.
○ Inspired by Grad-CAM for CNNs.

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA
〇 Gradients are used to average across the attention heads.
○ Inspired by Grad-CAM for CNNs.

Image source: Grad-CAM: Why did you say that?, Selvaraju et al., 2017
Grad-CAM
Grad-CAM for CNNs -

〇 Averaging the feature maps using the gradient w.r.t. the target class.

Image source: Grad-CAM: Why did you say that?, Selvaraju et al., 2017
Grad-CAM
Grad-CAM for CNNs -

〇 Gradients are averaged across the spatial dimension of the maps.

〇 The spatial activation maps are scaled by the gradient coefficients.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2017
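A bare-bones Grad-CAM sketch following the two steps above (it assumes you already have the last conv layer’s activations from a forward pass that kept the autograd graph; names are ours):

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, logits, class_idx):
    """feature_maps: (C, H, W) activations of the last conv layer (part of the autograd graph);
    logits: (num_classes,) from the same forward pass."""
    grads, = torch.autograd.grad(logits[class_idx], feature_maps, retain_graph=True)
    weights = grads.mean(dim=(1, 2))                        # average gradients over the spatial dims
    cam = (weights[:, None, None] * feature_maps).sum(0)    # weight the activation maps
    return F.relu(cam)                                      # keep only positive evidence
```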
Solution #3- TiBA
TiBA scales the attention maps by the gradients.

Ā = I + E_h[(∇A ⊙ A)⁺]
(∇A = gradients, A = attention, I = identity for the residuals)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA
〇 Rollout ignores other parts of the network e.g., activations, linear layers, etc.
○ 💡 Idea: Use LRP (Layer-Wise Relevance Propagation) values instead
of raw attention values to account for other layers.
〇 Why LRP?
○ LRP conserves the sum of relevance in each layer, similar to the
attention mechanism that upholds the convention that the sum of each
row is 1 (softmax).

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


LRP- Intuition
Disclaimer:

LRP has many formulations and explanations.

We focus on the formulation presented for TiBA, and provide a high-level


intuition.

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition
Formally,

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition
Formally,

Initialize with a one-hot vector for the target class (similar to gradient
propagation).

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
Solution #3- TiBA
Instead of using the raw attention values for the relevance matrix R, use the LRP
values.

Ā = I + E_h[(∇A ⊙ R)⁺], with R = LRP(Attention)
(∇A = gradients, R = LRP relevance of the attention, I = identity for the residuals)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA
〇 Layers are aggregated by matrix multiplication, similar to rollout.
〇 Overall, the per-layer maps are multiplied: C = Ā^(1) · Ā^(2) · … · Ā^(B) for a model with B blocks.

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021
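A simplified sketch of the gradient-weighted aggregation (using raw attention in place of the LRP relevance, as the later GAE variant does; names are ours):

```python
import torch

def gradient_weighted_rollout(attentions, gradients):
    """attentions, gradients: per-layer tensors of shape (heads, tokens, tokens),
    where `gradients` holds d(target logit)/d(attention)."""
    n = attentions[0].shape[-1]
    result = torch.eye(n)
    for A, grad_A in zip(attentions, gradients):
        A_bar = (grad_A * A).clamp(min=0).mean(dim=0)   # E_h[(grad(A) ⊙ A)^+]
        result = (torch.eye(n) + A_bar) @ result        # identity accounts for the residuals
    return result                                       # result[0, 1:] explains the CLS prediction
```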


Solution #3- TiBA Results

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA Results
Class-specific (thanks to the gradients)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA Results
Text sentiment analysis (BERT)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., 2019
Solution #3- TiBA Results
VQA (ViLBERT)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, Lu et al., 2019
Solution #4- GAE
〇 Remove LRP for faster computation.
〇 Extend propagation rules to apply to cross-attention modules as well.

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Solution #4- GAE Results
Extracting semantic segmentation from object detection

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Solution #4- GAE Results
CLIP

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Solution #4- GAE Results
CLIP explanations are valuable for downstream tasks such as image-to-sketch,
real image editing, etc.
Target edit text: red hat

Text2LIVE: Text-Driven Layered Image and Video Editing, Bar-Tal, et al. 2022
CLIPasso: Semantically-Aware Object Sketching, Vinker et al., 2022
Additional Methods
There have been many other great works examining explanations for
Transformers, which we do not cover due to time constraints.

〇 Learning to Estimate Shapley Values with Vision Transformers (Covert et al.).


〇 Explaining Information Flow Inside Vision Transformers Using Markov Chain
(Yuan et al.).
〇 AttCAT: Explaining Transformers via Attentive Class Activation Tokens (Qiang
et al.).
〇 And many more…
Resources
〇 Papers
○ The Hidden Language of Diffusion Models
○ Quantifying Attention Flow in Transformers
○ Transformer Interpretability Beyond Attention Visualization
○ Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
○ Grad-CAM: Why did you say that
○ Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
○ Layer-wise relevance propagation for neural networks with local renormalization layers
○ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
○ Text2LIVE: Text-Driven Layered Image and Video Editing
○ CLIPasso: Semantically-Aware Object Sketching
○ Learning Transferable Visual Models From Natural Language Supervision
〇 Blog posts
○ Quantifying Attention Flow in Transformers
〇 Colab Notebooks:
○ ViT explainability notebook
○ CLIP explainability notebook
〇 Demos
○ CLIP explainability interactive demo
○ Comparative explainability demo
Part 4
Attention as explanation
Overview
〇 Separation of the CLS token and spatial tokens
〇 Role of pre-training in the development of saliency
The self-attention (SA) block from ViT

〇 Inputs: CLS token + Image patch tokens


〇 Responsible for:
○ Learning dependencies between the
image patch tokens.
○ Summarizing info into a CLS token
for the head.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
What if it could be better separated?
Delegating responsibilities better – Class Attention

〇 A set of attention layers focusing on just the image patches.


〇 Another set of attention layers focusing on the interplay between the
patches and the CLS token.
What if it could be better separated?

Frozen patch
embeddings

Going deeper with Image Transformers, Touvron et al., 2021


Better separation of concerns
Head 1 Head 2 Head 3 Head 4

Going deeper with Image Transformers, Touvron et al., 2021


What’s in those attention maps?

Pipeline: extract the attention scores (softmax’d scores) from Transformer Block 11
(ViT-DINO) → average from multiple heads → reshape and upsample the map → attention map.

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021
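A sketch of the pipeline above with 🤗 transformers (the DINO checkpoint name and the input image are assumptions; any ViT trained with DINO should behave similarly):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

model = ViTModel.from_pretrained("facebook/dino-vits8", output_attentions=True)
processor = ViTImageProcessor.from_pretrained("facebook/dino-vits8")

inputs = processor(images=Image.open("dog.jpg"), return_tensors="pt")   # placeholder image
attn = model(**inputs).attentions[-1][0]           # last block: (heads, tokens, tokens)
cls_attn = attn[:, 0, 1:]                           # CLS -> patch scores, one map per head
side = int(cls_attn.shape[-1] ** 0.5)
maps = cls_attn.reshape(-1, side, side)             # back to the patch grid
maps = torch.nn.functional.interpolate(maps[None], scale_factor=8, mode="nearest")[0]
```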


What’s in those attention maps?
Well …

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021


What’s in those attention maps?
Well …

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021


What’s in those attention maps?

〇 The attention maps seem to contain the semantic layout of different objects
present in the input images.
〇 But with supervised pre-training, the layouts become sparse.

Investigating Vision Transformer representations Blog, Aritra Roy Gosthipaty and Sayak Paul
Resources
〇 Papers:
○ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
○ Going deeper with Image Transformers
○ Emerging Properties in Self-Supervised Vision Transformers
○ DINOv2: Learning Robust Visual Features without Supervision (further reading)

〇 Blog posts:
○ Investigating Vision Transformer representations Blog
〇 Colab Notebooks
○ DINO attention maps
〇 Demos
○ Class attention maps
Guest section
Ron Mokady
Part 5
Attention for Downstream Tasks
Overview- Attend and Excite
〇 Catastrophic Neglect
〇 Intro to latent diffusion models
〇 Generative semantic nursing
〇 Results

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Text-Based Image Generation



Latent Diffusion Models
〇 💡 Idea:
○ Encoder maps the input to an embedding space.
○ Diffusion model is applied in the latent space.
〇 Lower training cost and faster inference.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.
〇 The diffusion process is applied
on the latent vector z.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.
〇 The diffusion process is applied
on the latent vector z.
〇 Conditioning (e.g., text) is added
via cross-attention layers.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.
〇 The diffusion process is applied
on the latent vector z.
〇 Conditioning (e.g., text) is added
via cross-attention layers.
〇 Finally, the latent is decoded back
to produce the output image.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
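In practice, all of these pieces (VAE encoder/decoder, text encoder for the cross-attention conditioning, denoising UNet) are bundled in a single 🧨 diffusers pipeline; a minimal usage sketch (the checkpoint name is just an example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Text -> latent denoising with cross-attention conditioning -> VAE decode.
image = pipe("a lion with a crown").images[0]
image.save("lion_with_crown.png")
```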
Why Does the Model Fail?
〇 DDPM process:
○ Given an input text prompt, the DDPM gradually denoises a pure noise
latent to obtain the output image.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
(Cross-attention: the image latent features (P = 16) attend to the text embedding.)

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
A_t[i, n] = presence of token n in patch i
(cross-attention between the image latent features (P = 16) and the text embedding)

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Problem: “crown” gets low attention values for all patches.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Low attention = No generation

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
How Can We Fix This?
💡 Intuition: a generated subject should have an image patch that significantly
attends to the subject’s token.

For each subject token (e.g., lion, crown), take the maximum attention value over
the patches (max over patch): how close are we to having a strong patch?

💡 Idea: strengthen the activation of the most neglected subject token
(via the cross-attention between the image latent features (P = 16) and the text embedding).

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
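A simplified sketch of the objective described above (the paper additionally smooths the attention maps with a Gaussian filter and re-normalizes them without the start-of-text token; the function and variable names are ours):

```python
import torch

def attend_and_excite_loss(cross_attn, subject_token_ids):
    """cross_attn: (num_patches, num_text_tokens) cross-attention map A_t at the 16x16 resolution,
    softmax-normalized over tokens; subject_token_ids: indices of the subject tokens."""
    max_per_subject = cross_attn[:, subject_token_ids].max(dim=0).values   # strongest patch per subject
    losses = 1.0 - max_per_subject
    return losses.max()   # focus the update on the most neglected subject

# Generative semantic nursing: nudge the latent to reduce the loss at each denoising step, e.g.
#   loss = attend_and_excite_loss(A_t, subject_ids)
#   z_t = z_t - alpha * torch.autograd.grad(loss, z_t)[0]
```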
Can We Improve This?
Yes. With Iterative Refinement.

Insight: presence and spatial location are determined in the early timesteps.

Idea: keep updating the latent until a threshold is reached for all subjects.

Motivation: encourage each subject to be generated by gradually requiring higher
activation values.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Can We Improve This?
Yes. With Iterative Refinement.

We perform iterative refinement at steps 0, 5, and 20,
with minimal attention thresholds of 0.05, 0.2, and 0.8, respectively.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Putting It All
Together

Attend to and
Excite all subject
tokens!

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A cat and a dog”

Stable Diffusion | Composable Diffusion | StructureDiffusion | Attend-and-Excite

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A red bench and a yellow clock”

Stable Diffusion | Composable Diffusion | StructureDiffusion | Attend-and-Excite

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A playful kitten chasing a butterfly in a wildflower
meadow”

Stable Diffusion | Attend-and-Excite
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A grizzly bear catching a salmon in a crystal clear
river surrounded by a forest”

Stable Diffusion | Attend-and-Excite
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results- Cross Attention Maps

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Resources
〇 Papers
○ High-Resolution Image Synthesis with Latent Diffusion Models
○ Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image
Diffusion Models
○ Compositional Visual Generation with Composable Diffusion Models
○ Training-Free Structured Diffusion Guidance for Compositional Text-to-Image
Synthesis
○ Stable Diffusion
〇 Demos
○ Generation with Attend and Excite interactive demo
Conclusion
Summary and open questions
Summary
〇 Probing what Vision Transformers learn
○ Mean attention distance
○ Similarity between ViTs and ResNets (representation space with CKA)
○ Skip connections
〇 Explaining attention
○ Attention flow and rollout
○ TiBA
○ Multiple modalities
〇 Attention as an explanation
○ Class attention
○ Semantic layouts in attention maps
〇 Attention for downstream applications
○ Prompt-to-prompt
○ Null-text inversion
○ Attend and Excite
Q1: Evaluating explainability tools
How is a good explanation even defined?

〇 Human-centered: The “Usefulness metric”


○ Help identify the source of bias
○ Understand failure cases

What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods, Colin, et al. 2022
Q1: Evaluating explainability tools
How is a good explanation even defined?

〇 Faithfulness to the model: What features were used to arrive at the decision?

Sanity Checks for Saliency Maps, Adebayo et al., 2018


Q1: Evaluating explainability tools
Given a definition, how do we evaluate explanations?

〇 Typically evaluated by negative + positive perturbation tests.


○ Remove the most / least important pixels by the method and observe
the decrease in accuracy.
○ These metrics are problematic, since they create out-of-distribution
input images.
○ Evaluating explanations is an active field of research.

A Benchmark for Interpretability Methods in Deep Neural Networks, Hooker et al., 2019
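A rough sketch of a positive/negative perturbation test (a hypothetical classifier `model` and a per-pixel relevance map are assumed; zeroing pixels is one simple, though out-of-distribution, removal strategy):

```python
import torch

@torch.no_grad()
def perturbation_curve(model, image, relevance, steps=10, most_relevant_first=True):
    """image: (C, H, W); relevance: (H, W) explanation map; model: image batch -> class logits."""
    target = model(image[None]).argmax(-1)                       # class predicted on the clean image
    order = relevance.flatten().argsort(descending=most_relevant_first)
    probs = []
    for k in range(0, order.numel() + 1, order.numel() // steps):
        masked = image.clone().reshape(image.shape[0], -1)
        masked[:, order[:k]] = 0                                 # "remove" the first k pixels
        logits = model(masked.reshape_as(image)[None])
        probs.append(logits.softmax(-1)[0, target].item())
    return probs   # steep drop when removing most-relevant pixels first => faithful explanation
```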
Q2: Are smaller models more explainable?
Input prompt: “a man with eyeglasses”

CLIP w/ ViT-B/16

CLIP w/ ViT-L/14

Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Q3: Can we go beyond attention for interpretation?
Transformers are not just the attention layers!

〇 Where is the learned information encoded? Is it correct to focus the


research on just attention?
〇 For LLMs, it has been shown that a lot of the information is encoded in the
feed-forward layers: “key-value memories”.

Transformer Feed-Forward Layers Are Key-Value Memories, Geva et al., 2021


Resources
〇 What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation
Framework for Explainability Methods
〇 Sanity Checks for Saliency Maps
〇 A Benchmark for Interpretability Methods in Deep Neural Networks
〇 Learning Transferable Visual Models From Natural Language Supervision
〇 Generic Attention-model Explainability for Interpreting Bi-Modal and
Encoder-Decoder Transformers
〇 Transformer Feed-Forward Layers Are Key-Value Memories
