
All Things ViTs

Understanding and Interpreting


Attention in Vision
CVPR’23 Tutorial
Hila Chefer, Sayak Paul

https://all-things-vits.github.io/atv/
Who are we?

Hila Chefer is a PhD candidate at Tel-Aviv University and a
research intern at Google Tel-Aviv. You will probably not find
her away from work (hilach70@gmail.com).

Sayak works on 🧨 diffusers at Hugging Face 🤗. Outside of
work, you can find him binge-watching Suits for the nth time
(spsayakpaul@gmail.com).
Our guest speaker - Ron Mokady

Ron is a Computer Science Ph.D. student at Tel-Aviv


University. He is currently a Research Lead at Bria.AI.
Previously, he worked at FAIR and Google
(ron.mokady@gmail.com).
Overview of the tutorial
〇 Introductions
〇 Attention in a jiffy
〇 Probing Vision Transformers
〇 Explaining Transformers’ predictions
〇 Attention as a (visual) explanation
〇 Attention to aid downstream applications
〇 Open questions and conclusion
💡 We’ll have multiple short breaks
in between the sections.
💡 For Q&A, we can use the breaks,
RocketChat, or both.
💡 Disclaimer: This tutorial is NOT
an exhaustive overview of all
possible methods.
💡 All the content (slides, demos,
code) is available here:
bit.ly/atv-cvpr
💡 The slides cover the main
discussion; at the end of each
section, we link to relevant resources.
Part 1
Intro to Transformers
Overview
〇 From RNNs to Transformers
〇 Attention- Intuition
〇 The Beast with Many Heads
〇 Positional Encoding
〇 Cross-Attention
In the Beginning There Was an RNN
〇 RNN = Recurrent Neural Network.
〇 Widely used for Natural Language Processing (until 2017).
〇 Processing text sequentially, token by token.
In the Beginning There Was an RNN
〇 Why not stick to RNNs?
○ Sequential processing- time-consuming (tokens are processed one by one).
○ Localization- the hidden state is mostly influenced by recent tokens.
○ Single direction context- from left to right (partially solved by BiLSTMs).
A Transformer is Born

(Architecture diagram: encoder self-attention, decoder self-attention, and cross-attention between them.)

Attention is All you Need, Vaswani et al., 2017


A Transformer is Born

Slide courtesy: Lucas Beyer


A Transformer is Born

Slide courtesy: Lucas Beyer


A Transformer is Born- RNN vs. Attention
〇 Sequential processing vs. encoding done in parallel.
〇 Localization vs. context gained simultaneously from all tokens.
〇 Single direction context vs. context from the entire sequence.

Attention is All you Need, Vaswani et al., 2017


Attention Before Transformers
〇 Variants of attention have been employed for RNNs and LSTMs.
○ Mostly as a mechanism to support non-rigid information transfer
between the encoder and the decoder.

Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., 2015
Effective Approaches to Attention-based Neural Machine Translation, Luong et al., 2015
Attention
〇 At the heart of the Transformer lies the simple attention mechanism.
○ Creates a contextualized representation for each input token.

The cat sat on the mat


Attention- Intuition
〇 Similar to retrieval from databases:
○ Query = a query we wish to run on a database.
○ Key = the keys to search on in the database.
○ Value = values corresponding to each key in the database.
〇 Intuition- each token “searches the database” for tokens related to it.

The Illustrated Transformer Blog, Jay Alammar


Attention- Calculation
〇 Attention is calculated in two steps-
○ First- an attention score is
calculated, as a dot product
between the queries and the
keys.
○ The attention scores determine
the amount of context that will
be transferred.

The Illustrated Transformer Blog, Jay Alammar


Attention- Calculation
〇 Attention is calculated in two steps-
○ Next- the values are weighted
by the attention scores.
○ The weighted values are the
contextualized representations
of each token.
○ Intuitively: each token becomes
a convex combination of all the
tokens in the sequence.

The Illustrated Transformer Blog, Jay Alammar
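To make the two-step calculation concrete, here is a minimal NumPy sketch of single-head attention (the variable names are ours, not from the slides):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K, V: (n_tokens, d) matrices of queries, keys, and values."""
    # Step 1: attention scores = scaled dot product of queries and keys, softmax-normalized.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)   # each row sums to 1
    # Step 2: each output token is a convex combination of the values.
    return scores @ V, scores
```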


The Beast with Many Heads
〇 Softmax tends to zero out almost all entries.
○ Coefficients of the linear combination are very sparse.
〇 To encourage diversity, several attention heads are used, each calculating its
own attention values.

The Illustrated Transformer Blog, Jay Alammar


The Beast with Many Heads
〇 The heads’ outputs are concatenated, and a linear layer produces the final output.

The Illustrated Transformer Blog, Jay Alammar
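A quick way to see the per-head maps in practice is PyTorch’s built-in multi-head attention module (the shapes below are illustrative):

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(1, 6, 64)                         # e.g., the 6 tokens of "The cat sat on the mat"
out, attn = mha(x, x, x, average_attn_weights=False)
print(out.shape)                                  # (1, 6, 64): concatenated heads -> linear layer
print(attn.shape)                                 # (1, 8, 6, 6): one 6x6 attention map per head
```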


Positional Encoding
〇 The attention mechanism is invariant to order.
○ The attention scores are calculated by simple dot products.
〇 Therefore, to account for the order of tokens, a positional encoding is added
to each token before processing.
○ Order is crucial:
“She likes ice-cream, he does not” vs. “He likes ice-cream, she does not”

Attention is All you Need, Vaswani et al., 2017
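A minimal sketch of the fixed sinusoidal encodings from Vaswani et al. (the function name is ours); the encoding is simply added to the token embeddings before the first attention layer:

```python
import numpy as np

def sinusoidal_positional_encoding(n_tokens, d_model):
    pos = np.arange(n_tokens)[:, None]              # token positions 0..n-1
    i = np.arange(d_model // 2)[None, :]            # frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```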


A Transformer is Born

(Architecture diagram: encoder self-attention, decoder self-attention, and cross-attention between them.)

Attention is All you Need, Vaswani et al., 2017


From Self-Attention to Cross-Attention
〇 Cross-attention is used to gain context from another modality / input type.
○ For example- gain context from text for image processing.
○ Simply extract the key and value matrices from the other modality (the queries come from the sequence being processed).

Image Text
Attention is All you Need, Vaswani et al., 2017
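A minimal sketch of cross-attention between image and text tokens, assuming the keys and values come from the text (all names are ours):

```python
import numpy as np

def cross_attention(x_img, x_txt, Wq, Wk, Wv):
    """x_img: (n_img, d), x_txt: (n_txt, d); projection matrices W*: (d, d)."""
    Q, K, V = x_img @ Wq, x_txt @ Wk, x_txt @ Wv   # queries from image, keys/values from text
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over the text tokens
    return weights @ V                               # image tokens, contextualized by the text
```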
Resources
〇 Papers:
○ Attention Is All You Need
〇 Blog posts:
○ The Illustrated Transformer
Part 2
Probing Vision Transformers
Overview
〇 Mean attention distance
〇 Centered kernel alignment (CKA)
〇 Role of skip connections
Vision Transformers (ViT), quickly

Going to shamelessly steal slides from Lucas Beyer!


An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, A Dosovitskiy, L Beyer, A Kolesnikov, D Weissenborn, X Zhai, T Unterthiner, M Dehghani, M Minderer, G Heigold, S Gelly, J Uszkoreit, N Houlsby

Vision Transformer (ViT)

Many prior works attempted to introduce self-attention at the pixel level.
For 224px², that's a sequence length of ~50k, which is too much!

Thus, most works restrict attention to local pixel neighborhoods, or use it as a
high-level mechanism on top of detections.

The key breakthrough in using the full Transformer architecture, standalone, was to
"tokenize" the image by cutting it into patches of 16px², and treating each patch as a
token, e.g. embedding it into input space.
Slide courtesy: Lucas Beyer
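A rough sketch of the "tokenization" step: cut the image into 16x16 patches, flatten them, and linearly embed them (the projection here is random, just to show the shapes):

```python
import numpy as np

def patchify(image, patch_size=16):
    """image: (H, W, C) array -> (num_patches, patch_size*patch_size*C) patch tokens."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

image = np.random.rand(224, 224, 3)
embed = np.random.rand(16 * 16 * 3, 768)   # stand-in for the learned linear embedding
tokens = patchify(image) @ embed
print(tokens.shape)                         # (196, 768): a 14x14 grid of patch tokens
```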
How do we know if it’s effective at all?
〇 Locality is important for computer vision.
〇 Having a global context is equally important.
〇 How do ViTs learn locality?
〇 Is there any similarity between CNNs and ViTs w.r.t their representation
spaces?
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

MAD is the geometric distance between patches, weighted by the attention values
(and averaged over the query patches).

〇 High MAD = distant patches receive high attention


values.
〇 Low MAD = close patches receive high attention
values.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
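A sketch of how mean attention distance could be computed for one head, under our reading of the definition above (patch centers in pixels, attention-weighted distances, mean over query patches):

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (num_patches, num_patches) attention map of one head (CLS token dropped)."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys, xs], -1).reshape(-1, 2) * patch_size      # patch positions in pixels
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    attn = attn / attn.sum(-1, keepdims=True)                        # ensure rows sum to 1
    return (attn * dists).sum(-1).mean()                             # attention-weighted distance
```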
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

Investigating Vision Transformer representations (blog post), credits - Ritwik Raha


Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

Lower layers have variable MAD: local + global.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
Mean attention distance (MAD)
Dosovitskiy et al. (ICLR’21) investigated the idea of attention distance.

Lower layers have variable MAD (local + global); higher layers have higher MAD (global).

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
Mean attention distance (MAD)
It doesn’t change much when we use a conv prior.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
MAD = f(data, depth)
Strong connection between MAD, pre-training data, ViT architecture:

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
MAD = f(data, depth)
But not so much for …

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Some observations so far
〇 Without enough data, lower layers in ViTs don’t learn locality. This becomes
evident in deeper architectures.
〇 With enough pre-training data, lower layers learn to encode locality early on
which could be an indicator for good performance.
〇 ViT layers have access to global information almost uniformly. What are its
repercussions?
sim(representation_ViT, representation_ResNet)
〇 There’s a primary difference here – CNNs don’t combine both global and
local information like ViTs do.
〇 Does this lead to differences in their representation space?

Yes, it does!
sim(representation_ViT, representation_ResNet)
A quantifiable way to compare representations from neural architectures -
Centered Kernel Alignment (CKA)

〇 Invariant to orthogonal transformation of representations


〇 Invariant to isotropic scaling (scaling each dimension uniformly)

Similarity of Neural Network Representations Revisited, Kornblith et al., 2019


sim(representation_ViT, representation_ResNet)
Centered Kernel Alignment

CKA(K, L) = HSIC(K, L) / √( HSIC(K, K) · HSIC(L, L) )

K and L = Gram matrices, K = XXᵀ, L = YYᵀ (X and Y are representations),
HSIC = Hilbert-Schmidt independence criterion
sim(representation_ViT, representation_ResNet)
Centered Kernel Alignment

Centered Gram matrices: K′ = HKH, L′ = HLH,
with centering matrix H = I_n − (1/n)·11ᵀ
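A minimal NumPy implementation of linear CKA consistent with the definitions above (for a linear kernel, HSIC reduces to Frobenius norms of cross-covariances; names are ours):

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) representations of the same n examples."""
    X = X - X.mean(axis=0)          # centering features == centering the Gram matrices with H
    Y = Y - Y.mean(axis=0)
    hsic_xy = np.linalg.norm(Y.T @ X, "fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, "fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, "fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```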
sim(representation_ViT, representation_ResNet)

Intra-network comparison with CKA

〇 ViTs show more uniform similarities between


both lower and higher layers.
〇 ResNets show uniform similarities within the lower and upper halves separately.

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
sim(representation_ViT, representation_ResNet)

Inter-network comparison with CKA

〇 ViTs compute similar features as ResNets


but with a smaller set of layers.
〇 ViTs propagate features more faithfully
across layers.
〇 Features across the higher layers in ViTs
and ResNets vary.

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Role of skip connection
〇 ViT’s representation space is uniform.
〇 Information from lower layers is propagated to the higher layers more
faithfully.
〇 How?
Role of skip connection
The setup – plot the norm ratio ‖z_i‖ / ‖f(z_i)‖ between:
○ the hidden representation z_i of the i-th layer (skip connection), and
○ the transformation f(z_i) of z_i from the long branch (MLP or self-attention).

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
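A small sketch of the measurement, assuming we can call a block’s long branch (self-attention or MLP) directly as `long_branch` (names are ours):

```python
import torch

@torch.no_grad()
def norm_ratio(z, long_branch):
    """z: (tokens, dim) hidden representation entering the block (the skip-connection branch)."""
    f_z = long_branch(z)                       # output of the self-attention or MLP branch
    return z.norm(dim=-1) / f_z.norm(dim=-1)   # large ratio => the skip connection dominates
```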
Role of skip connection

〇 Clear phase transition between CLS


and spatial tokens.
〇 1st half: skip connections propagate
the CLS token.
〇 2nd half: skip connections propagate
the spatial tokens.

(Plot: norm ratios for the spatial tokens vs. the CLS token across layers.)

Image source: Do Vision Transformers See Like Convolutional Neural Networks?


Role of skip connection
Skip connections behave differently in ViTs and ResNets:

Overall low
norm ratios.

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Role of skip connection
Removing skip connections disrupts the uniformity of the representation
structure:

Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al., 2021
Connections to robustness
ViT’s uniform representation structure impacts robustness:

Vision Transformers are Robust Learners, Paul et al., 2022


Resources
〇 Papers:
○ An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale
○ Do Vision Transformers See Like Convolutional Neural Networks?
○ Vision Transformers are Robust Learners
○ What do Vision Transformers Learn? A Visual Exploration (further
reading)
〇 Colab Notebooks
○ Mean attention distance
Part 3
Explaining Transformers
Overview
〇 Intro to Explainability
〇 Why Not Use CNN Algorithms?
〇 Is Attention an Explanation?
〇 Algorithms to Explain Transformers
Intro to Explainability

“We all fear what we do not understand.”


― Dan Brown, The Lost Symbol
Intro to Explainability
Goal: developing a set of tools and frameworks to help you understand and
interpret predictions made by your machine learning models.

Quantifying ChatGPT’s gender bias, Sayash Kapoor and Arvind Narayanan


Intro to Explainability
〇 Goal: developing a set of tools and frameworks to help you understand and
interpret predictions made by your machine learning models.

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier (LIME), Ribeiro et al., 2016
Intro to Explainability
As the number of parameters and the complexity of networks increase, it
becomes increasingly challenging to develop such tools.

Image credits: Wikipedia


Sidenote- Explaining Generative Models
〇 Explaining generative models (e.g., GPT, Stable Diffusion) is challenging.
○ How does the model produce a generation from scratch?

Image credits: Wikipedia


Sidenote- Explaining Generative Models
New pre-print on interpreting diffusion models!

〇 Decompose an input prompt into the set of features used by the model.

The Hidden Language of Diffusion Models, Chefer et al., 2023


Sidenote- Explaining Generative Models
New pre-print on interpreting diffusion models!

〇 Decompose an input prompt into the set of features used by the model.

The Hidden Language of Diffusion Models, Chefer et al., 2023


Sidenote- Explaining Generative Models
New pre-print on interpreting diffusion models!

〇 The learned decomposition reveals non-trivial biases.

The Hidden Language of Diffusion Models, Chefer et al., 2023


Intro to Explainability
Approaches can be (roughly) divided into two categories:

〇 Model-specific - using the model’s activations, parameters, structure, etc.
in the explanation (e.g., Grad-CAM).
〇 Model-agnostic - treating the model as a black box and developing a
general method (e.g., SHAP, LIME).

Image origin: Wikipedia


Intro to Explainability
Disclaimer:

The following is a non-exhaustive list of notable methods to interpret deep


neural networks.

Due to time constraints, we will not go over all methods in detail.


Intro to Explainability

Grad-CAM (Selvaraju et al.)
Input X Gradient (Shrikumar et al.)
Integrated Gradients (Sundararajan et al.)
LIME (Ribeiro et al.)
KernelSHAP (Lundberg et al.)
DeepLift (Shrikumar et al.)

And many more!


CNNs vs. Transformers
CNNs and Transformers differ significantly in their architecture:

〇 Attention vs. convolution.


〇 For Transformers- classification is mostly obtained by a CLS token.

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Attention as Explanation?
Is the attention mechanism naturally interpretable?

〇 The attention matrix determines the amount of context that each token will
receive from the other tokens.

          T1 (The)  T2 (cat)  T3 (sat)  T4 (on)
T1 (The)    0.01      0.89      0.1       0
T2 (cat)     …         …         …        …
T3 (sat)
T4 (on)
Attention as Explanation?
For classification tasks, the CLS token is appended to the sequence.

〇 The CLS alone determines the classification.


〇 Can we obtain the explanation directly from the attention values?
                T1 (The)  T2 (cat)  T3 (sat)  T4 (on)
[CLS] row → CLS   0.01      0.89      0.1       0
           T1 (The)  …        …         …       …
           T2 (cat)
           T3 (sat)
           T4 (on)
Attention as Explanation?
💡 Idea: extracting the attention values that correspond to the CLS token as an
explanation.

       T1 (P1)  T2 (P2)  T3 (P3)  T4 (P4)
CLS      0.01     0.89     0.1       0
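In code, this idea is just slicing the last layer’s attention tensor (the shape convention below is an assumption: token 0 is the CLS token):

```python
import torch

def cls_attention_explanation(attn: torch.Tensor) -> torch.Tensor:
    """attn: last-layer attention, shape (heads, tokens, tokens); token 0 is the CLS token."""
    cls_row = attn[:, 0, 1:]          # CLS -> patch attention, one row per head
    return cls_row.mean(dim=0)        # naive head aggregation (its issues are discussed next)
```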
Attention as Explanation?
Issue #1: The beast with many heads:

〇 There’s significant variance between the attention heads (recall the MAD scores
from the previous section).
〇 How can we average across the different attention heads?

       T1 (P1)  T2 (P2)  T3 (P3)  T4 (P4)
CLS      0.01     0.89     0.1       0
Attention as Explanation?
Issue #2: Deep neural network:

〇 Each layer mixes the tokens.


〇 In deeper layers- how can we account for the mixture of tokens in the
previous layers?

       T1 (P1)  T2 (P2)  T3 (P3)  T4 (P4)
CLS      0.01     0.89     0.1       0
Solution #1- Attention Rollout
〇 Aggregation across heads: averaging.
〇 Aggregation across layers: matrix multiplication of the attention maps to
track context.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Solution #1- Attention Rollout

Quantifying Attention Flow in Transformers- Blog, Samira Abnar


Solution #1- Attention Rollout

Quantifying Attention Flow in Transformers- Blog, Samira Abnar


Solution #1- Attention Rollout
〇 Aggregation across layers: matrix multiplication to track context:
○ The attention maps of all layers are multiplied.
○ The identity matrix is added to each self-attention matrix to account for
the residual connections.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Image source: Attention is All you Need, Vaswani et al., 2017
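A compact sketch of attention rollout as described above (one attention tensor per layer; names are ours):

```python
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer attention maps, each of shape (heads, tokens, tokens)."""
    n = attentions[0].shape[-1]
    rollout = torch.eye(n)
    for attn in attentions:
        a = attn.mean(dim=0)                  # aggregate heads by averaging
        a = a + torch.eye(n)                  # add identity for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = a @ rollout                 # accumulate context mixing layer by layer
    return rollout                            # rollout[0, 1:] ~ CLS-to-patch relevance
```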
Solution #2- Attention Flow
〇 Aggregation across heads: averaging.
〇 Aggregation across layers: solving a max-flow problem on the attention
graph.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Solution #2- Attention Flow

Quantifying Attention Flow in Transformers Blog, Samira Abnar


Solution #2- Attention Flow
〇 Aggregation across layers: solving a max-flow problem on the attention
graph.
○ More computationally expensive:
■ d = depth, n = number of tokens in the sequence.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Attention Rollout- Issues
〇 Averaging across the attention heads may be overly simplistic.
○ Each head has its own functionality, some may be less relevant than
others.
〇 Solution focuses on the attention alone and ignores other parts of the
network e.g., activations, linear layers, etc.

Quantifying Attention Flow in Transformers, Abnar et al., 2020


Solution #3- TiBA
〇 Averaging across the attention heads may be overly simplistic.
○ 💡 Idea: use gradients to scale the attention heads.
○ Inspired by Grad-CAM for CNNs.

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA
〇 Gradients are used to average across the attention heads.
○ Inspired by Grad-CAM for CNNs.

Image source: Grad-CAM: Why did you say that?, Selvaraju et al., 2017
Grad-CAM
Grad-CAM for CNNs -

〇 Averaging the feature maps using the gradient w.r.t. the target class.

Image source: Grad-CAM: Why did you say that?, Selvaraju et al., 2017
Grad-CAM
Grad-CAM for CNNs -

〇 Gradients are averaged across the spatial dimension of the maps.

〇 The spatial activation maps are scaled by the gradient coefficients.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2017
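A bare-bones Grad-CAM sketch following the two steps above (it assumes you already have the last conv layer’s activations from a forward pass that kept the autograd graph; names are ours):

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, logits, class_idx):
    """feature_maps: (C, H, W) activations of the last conv layer (part of the autograd graph);
    logits: (num_classes,) from the same forward pass."""
    grads, = torch.autograd.grad(logits[class_idx], feature_maps, retain_graph=True)
    weights = grads.mean(dim=(1, 2))                        # average gradients over the spatial dims
    cam = (weights[:, None, None] * feature_maps).sum(0)    # weight the activation maps
    return F.relu(cam)                                      # keep only positive evidence
```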
Solution #3- TiBA
TiBA scales the attention maps by the gradients.

Ā = I + E_h[(∇A ⊙ A)⁺]
(∇A = gradients, A = attention, I = identity for the residuals)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA
〇 Rollout ignores other parts of the network e.g., activations, linear layers, etc.
○ 💡 Idea: Use LRP (Layer-Wise Relevance Propagation) values instead
of raw attention values to account for other layers.
〇 Why LRP?
○ LRP conserves the sum of relevance in each layer, similar to the
attention mechanism that upholds the convention that the sum of each
row is 1 (softmax).

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


LRP- Intuition
Disclaimer:

LRP has many formulations and explanations.

We focus on the formulation presented for TiBA, and provide a high-level


intuition.

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition
Formally,

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
LRP- Intuition
Formally,

Initialize with a one-hot vector for the target class (similar to gradient
propagation).

Layer-wise relevance propagation for neural networks with local renormalization layers, Binder et al., 2016
Solution #3- TiBA
Instead of using the raw attention values for the relevance matrix R, use the LRP
values.

Ā = I + E_h[(∇A ⊙ R)⁺], with R = LRP(Attention)
(∇A = gradients, R = LRP relevance of the attention, I = identity for the residuals)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA
〇 Layers are aggregated by matrix multiplication, similar to rollout.
〇 Overall, the per-layer maps are multiplied: C = Ā^(1) · Ā^(2) · … · Ā^(B) for a model with B blocks.

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021
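A simplified sketch of the gradient-weighted aggregation (using raw attention in place of the LRP relevance, as the later GAE variant does; names are ours):

```python
import torch

def gradient_weighted_rollout(attentions, gradients):
    """attentions, gradients: per-layer tensors of shape (heads, tokens, tokens),
    where `gradients` holds d(target logit)/d(attention)."""
    n = attentions[0].shape[-1]
    result = torch.eye(n)
    for A, grad_A in zip(attentions, gradients):
        A_bar = (grad_A * A).clamp(min=0).mean(dim=0)   # E_h[(grad(A) ⊙ A)^+]
        result = (torch.eye(n) + A_bar) @ result        # identity accounts for the residuals
    return result                                       # result[0, 1:] explains the CLS prediction
```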


Solution #3- TiBA Results

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA Results
Class-specific (thanks to the gradients)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


Solution #3- TiBA Results
Text sentiment analysis (BERT)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., 2019
Solution #3- TiBA Results
VQA (ViLBERT)

Transformer Interpretability Beyond Attention Visualization, Chefer et al., 2021


ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, Lu et al., 2019
Solution #4- GAE
〇 Remove LRP for faster computation.
〇 Extend propagation rules to apply to cross-attention modules as well.

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Solution #4- GAE Results
Extracting semantic segmentation from object detection

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Solution #4- GAE Results
CLIP

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Solution #4- GAE Results
CLIP explanations are valuable for downstream tasks such as image-to-sketch,
real image editing, etc.
Target edit text: red hat

Text2LIVE: Text-Driven Layered Image and Video Editing, Bar-Tal, et al. 2022
CLIPasso: Semantically-Aware Object Sketching, Vinker et al., 2022
Additional Methods
There have been many other great works examining explanations for
Transformers, which we do not cover due to time constraints.

〇 Learning to Estimate Shapley Values with Vision Transformers (Covert et al.).


〇 Explaining Information Flow Inside Vision Transformers Using Markov Chain
(Yuan et al.).
〇 AttCAT: Explaining Transformers via Attentive Class Activation Tokens (Qiang
et al.).
〇 And many more…
Resources
〇 Papers
○ The Hidden Language of Diffusion Models
○ Quantifying Attention Flow in Transformers
○ Transformer Interpretability Beyond Attention Visualization
○ Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
○ Grad-CAM: Why did you say that
○ Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
○ Layer-wise relevance propagation for neural networks with local renormalization layers
○ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
○ Text2LIVE: Text-Driven Layered Image and Video Editing
○ CLIPasso: Semantically-Aware Object Sketching
○ Learning Transferable Visual Models From Natural Language Supervision
〇 Blog posts
○ Quantifying Attention Flow in Transformers
〇 Colab Notebooks:
○ ViT explainability notebook
○ CLIP explainability notebook
〇 Demos
○ CLIP explainability interactive demo
○ Comparative explainability demo
Part 4
Attention as explanation
Overview
〇 Separation of the CLS token and spatial tokens
〇 Role of pre-training in the development of saliency
The self-attention (SA) block from ViT

〇 Inputs: CLS token + Image patch tokens


〇 Responsible for:
○ Learning dependencies between the
image patch tokens.
○ Summarizing info into a CLS token
for the head.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2021
What if it could be better separated?
Delegating responsibilities better – Class Attention

〇 A set of attention layers focusing on just the image patches.


〇 Another set of attention layers focusing on the interplay between the
patches and the CLS token.
What if it could be better separated?

Frozen patch
embeddings

Going deeper with Image Transformers, Touvron et al., 2021


Better separation of concerns
Head 1 Head 2 Head 3 Head 4

Going deeper with Image Transformers, Touvron et al., 2021


What’s in those attention maps?

Pipeline: extract the attention scores (softmax’d scores) from Transformer Block 11
(ViT-DINO) → average from multiple heads → reshape and upsample the map → attention map.

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021
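A sketch of the pipeline above with 🤗 transformers (the DINO checkpoint name and the input image are assumptions; any ViT trained with DINO should behave similarly):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

model = ViTModel.from_pretrained("facebook/dino-vits8", output_attentions=True)
processor = ViTImageProcessor.from_pretrained("facebook/dino-vits8")

inputs = processor(images=Image.open("dog.jpg"), return_tensors="pt")   # placeholder image
attn = model(**inputs).attentions[-1][0]           # last block: (heads, tokens, tokens)
cls_attn = attn[:, 0, 1:]                           # CLS -> patch scores, one map per head
side = int(cls_attn.shape[-1] ** 0.5)
maps = cls_attn.reshape(-1, side, side)             # back to the patch grid
maps = torch.nn.functional.interpolate(maps[None], scale_factor=8, mode="nearest")[0]
```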


What’s in those attention maps?
Well …

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021


What’s in those attention maps?
Well …

Emerging Properties in Self-Supervised Vision Transformers, Caron et al., 2021


What’s in those attention maps?

〇 The attention maps seem to contain the semantic layout of different objects
present in the input images.
〇 But with supervised pre-training, the layouts become sparse.

Investigating Vision Transformer representations Blog, Aritra Roy Gosthipaty and Sayak Paul
Resources
〇 Papers:
○ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
○ Going deeper with Image Transformers
○ Emerging Properties in Self-Supervised Vision Transformers
○ DINOv2: Learning Robust Visual Features without Supervision (further reading)

〇 Blog posts:
○ Investigating Vision Transformer representations Blog
〇 Colab Notebooks
○ DINO attention maps
〇 Demos
○ Class attention maps
Guest section
Ron Mokady
Part 5
Attention for Downstream Tasks
Overview- Attend and Excite
〇 Catastrophic Neglect
〇 Intro to latent diffusion models
〇 Generative semantic nursing
〇 Results

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Text-Based Image Generation



Latent Diffusion Models
〇 💡 Idea:
○ Encoder maps the input to an embedding space.
○ Diffusion model is applied in the latent space.
〇 Lower training cost and faster inference.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.
〇 The diffusion process is applied
on the latent vector z.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.
〇 The diffusion process is applied
on the latent vector z.
〇 Conditioning (e.g., text) is added
via cross-attention layers.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
Latent Diffusion Models
〇 The encoder compresses the
input x into a latent vector z.
〇 The diffusion process is applied
on the latent vector z.
〇 Conditioning (e.g., text) is added
via cross-attention layers.
〇 Finally, the latent is decoded back
to produce the output image.

High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022
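In practice, all of these pieces (VAE encoder/decoder, text encoder for the cross-attention conditioning, denoising UNet) are bundled in a single 🧨 diffusers pipeline; a minimal usage sketch (the checkpoint name is just an example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Text -> latent denoising with cross-attention conditioning -> VAE decode.
image = pipe("a lion with a crown").images[0]
image.save("lion_with_crown.png")
```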
Why Does the Model Fail?
〇 DDPM process:
○ Given an input text prompt, the DDPM gradually denoises a pure noise
latent to obtain the output image.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
(Cross-attention: the image latent features (P = 16) attend to the text embedding.)

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
A_t[i, n] = presence of token n in patch i
(cross-attention between the image latent features (P = 16) and the text embedding)

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Problem: “crown” gets low attention values for all patches.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Low attention = No generation

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
How Can We Fix This?
💡 Intuition: a generated subject should have an image patch that significantly
attends to the subject’s token.

For each subject token (e.g., lion, crown), take the maximum attention value over
the patches (max over patch): how close are we to having a strong patch?

💡 Idea: strengthen the activation of the most neglected subject token
(via the cross-attention between the image latent features (P = 16) and the text embedding).

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
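A simplified sketch of the objective described above (the paper additionally smooths the attention maps with a Gaussian filter and re-normalizes them without the start-of-text token; the function and variable names are ours):

```python
import torch

def attend_and_excite_loss(cross_attn, subject_token_ids):
    """cross_attn: (num_patches, num_text_tokens) cross-attention map A_t at the 16x16 resolution,
    softmax-normalized over tokens; subject_token_ids: indices of the subject tokens."""
    max_per_subject = cross_attn[:, subject_token_ids].max(dim=0).values   # strongest patch per subject
    losses = 1.0 - max_per_subject
    return losses.max()   # focus the update on the most neglected subject

# Generative semantic nursing: nudge the latent to reduce the loss at each denoising step, e.g.
#   loss = attend_and_excite_loss(A_t, subject_ids)
#   z_t = z_t - alpha * torch.autograd.grad(loss, z_t)[0]
```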
Can We Improve This?
Yes. With Iterative Refinement.

Insight: presence and spatial location are determined in the early timesteps.

Idea: keep updating the latent until a threshold is reached for all subjects.

Motivation: encourage each subject to be generated by gradually requiring higher
activation values.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Can We Improve This?
Yes. With Iterative Refinement.

We perform iterative refinement at steps 0, 5, and 20,
with minimal attention thresholds of 0.05, 0.2, and 0.8, respectively.

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Putting It All
Together

Attend to and
Excite all subject
tokens!

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A cat and a dog”

Stable Diffusion | Composable Diffusion | StructureDiffusion | Attend-and-Excite

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A red bench and a yellow clock”

Stable Diffusion | Composable Diffusion | StructureDiffusion | Attend-and-Excite

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A playful kitten chasing a butterfly in a wildflower
meadow”

Stable Diffusion | Attend-and-Excite
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results
“A grizzly bear catching a salmon in a crystal clear
river surrounded by a forest”

Stable Diffusion | Attend-and-Excite
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Results- Cross Attention Maps

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023
Resources
〇 Papers
○ High-Resolution Image Synthesis with Latent Diffusion Models
○ Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image
Diffusion Models
○ Compositional Visual Generation with Composable Diffusion Models
○ Training-Free Structured Diffusion Guidance for Compositional Text-to-Image
Synthesis
○ Stable Diffusion
〇 Demos
○ Generation with Attend and Excite interactive demo
Conclusion
Summary and open questions
Summary
〇 Probing what Vision Transformers learn
○ Mean attention distance
○ Similarity between ViTs and ResNets (representation space with CKA)
○ Skip connections
〇 Explaining attention
○ Attention flow and rollout
○ TiBA
○ Multiple modalities
〇 Attention as an explanation
○ Class attention
○ Semantic layouts in attention maps
〇 Attention for downstream applications
○ Prompt-to-prompt
○ Null-text inversion
○ Attend and Excite
Q1: Evaluating explainability tools
How is a good explanation even defined?

〇 Human-centered: The “Usefulness metric”


○ Help identify the source of bias
○ Understand failure cases

What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods, Colin, et al. 2022
Q1: Evaluating explainability tools
How is a good explanation even defined?

〇 Faithfulness to the model: What features were used to arrive at the decision?

Sanity Checks for Saliency Maps, Adebayo et al., 2018


Q1: Evaluating explainability tools
Given a definition, how do we evaluate explanations?

〇 Typically evaluated by negative + positive perturbation tests.


○ Remove the most / least important pixels by the method and observe
the decrease in accuracy.
○ These metrics are problematic, since they create out-of-distribution
input images.
○ Evaluating explanations is an active field of research.

A Benchmark for Interpretability Methods in Deep Neural Networks, Hooker et al., 2019
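A rough sketch of a positive/negative perturbation test (a hypothetical classifier `model` and a per-pixel relevance map are assumed; zeroing pixels is one simple, though out-of-distribution, removal strategy):

```python
import torch

@torch.no_grad()
def perturbation_curve(model, image, relevance, steps=10, most_relevant_first=True):
    """image: (C, H, W); relevance: (H, W) explanation map; model: image batch -> class logits."""
    target = model(image[None]).argmax(-1)                       # class predicted on the clean image
    order = relevance.flatten().argsort(descending=most_relevant_first)
    probs = []
    for k in range(0, order.numel() + 1, order.numel() // steps):
        masked = image.clone().reshape(image.shape[0], -1)
        masked[:, order[:k]] = 0                                 # "remove" the first k pixels
        logits = model(masked.reshape_as(image)[None])
        probs.append(logits.softmax(-1)[0, target].item())
    return probs   # steep drop when removing most-relevant pixels first => faithful explanation
```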
Q2: Are smaller models more explainable?
Input prompt: “a man with eyeglasses”

CLIP w/ ViT-B/16

CLIP w/ ViT-L/14

Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al., 2021
Q3: Can we go beyond attention for interpretation?
Transformers are not just the attention layers!

〇 Where is the learned information encoded? Is it correct to focus the


research on just attention?
〇 For LLMs, it has been shown that a lot of the information is encoded in the
feed-forward layers: “key-value memories”.

Transformer Feed-Forward Layers Are Key-Value Memories, Geva et al., 2021


Resources
〇 What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation
Framework for Explainability Methods
〇 Sanity Checks for Saliency Maps
〇 A Benchmark for Interpretability Methods in Deep Neural Networks
〇 Learning Transferable Visual Models From Natural Language Supervision
〇 Generic Attention-model Explainability for Interpreting Bi-Modal and
Encoder-Decoder Transformers
〇 Transformer Feed-Forward Layers Are Key-Value Memories
