Professional Documents
Culture Documents
Abstract
This paper presents VTN, a transformer-based frame-
work for video recognition. Inspired by recent developments
in vision transformers, we ditch the standard approach in
video action recognition that relies on 3D ConvNets and
introduce a method that classifies actions by attending
to the entire video sequence information. Our approach
is generic and builds on top of any given 2D spatial
network. In terms of wall runtime, it trains 16.1× faster
and runs 5.1× faster during inference while maintaining
competitive accuracy compared to other state-of-the-art
methods. It enables whole video analysis, via a single
end-to-end pass, while requiring 1.5× fewer GFLOPs. We
report competitive results on Kinetics-400 and Moments
in Time benchmarks and present an ablation study of
VTN properties and the trade-off between accuracy and
Figure 1. Video Transformer Network architecture. Connecting
inference speed. We hope our approach will serve as a
three modules: A 2D spatial backbone (f (x)), used for feature ex-
new baseline and start a fresh line of research in the video
traction. Followed by a temporal attention-based encoder (Long-
recognition domain. Code and models are available at: former in this work), that uses the feature vectors (φi ) combined
https://github.com/bomri/SlowFast/blob/ with a position encoding. The [CLS] token is processed by a clas-
master/projects/vtn/README.md. sification MLP head to get the final class prediction.
mation in a video of hours, or even a few minutes, can be the two-stream architecture to allow a direct link between
grasped using only a snippet clip of a few seconds. Nev- RGB and OF layers. The idea of inflating 2D ConvNets into
ertheless, current networks are not designed to share long- their 3D counterpart (I3D) was introduced in [3]. I3D takes
term information across the full video. 2D ConvNets and expands its layers into 3D. Therefore it
VTN’s temporal processing component is based on a allows to leverage pre-trained state-of-the-art image recog-
Longformer [1]. This type of transformer-based model can nition architectures in the spatial-temporal domain and ap-
process a long sequence of thousands of tokens. The atten- ply them for video-based tasks.
tion mechanism proposed by the Longformer makes it fea- Non-local Neural Networks (NLN) [37] introduced a
sible to go beyond short clip processing and maintain global non-local operation, a type of self-attention, that computes
attention, which attends to all tokens in the input sequence. responses based on relationships between different loca-
In addition to long sequence processing, we also explore tions in the input signal. NLN demonstrated that the core
an important trade-off in machine learning – speed vs. ac- attention mechanism in Transformers can produce good re-
curacy. Our framework demonstrates a superior balance sults on video tasks, however it is confined to processing
of this trade-off, both during training and also at inference only short clips. In order to extract long temporal context,
time. In training, even though wall runtime per epoch is [39] introduced a long-term feature bank that acts as the
either equal or greater, compared to other networks, our ap- entire video memory and a Feature Bank Operator (FBO)
proach requires much fewer passes of the training dataset that computes interactions between short-term and long-
to reach its maximum performance; end-to-end, compared term features. However, it requires precomputed features,
to state-or-the-art networks, this results in a 16.1× faster and it is not efficient enough to support end-to-end training
training. At inference time, our approach can handle both of the feature extraction backbone.
multi-view and full video analysis while maintaining simi- SlowFast [13] explored a network architecture that op-
lar accuracy. In contrast, other networks’ performance sig- erates in two pathways and different frame rates. Lateral
nificantly decreases when analyzing the full video in a sin- connections fuse the information between the slow pathway,
gle pass. In terms of GFLOPS x Views, their inference cost focused on the spatial information, and the fast pathway fo-
is considerably higher than those of VTN, which concludes cused on temporal information.
to a 1.5× fewer GFLOPS and a 5.1× faster validation wall The X3D study [12] builds on top of SlowFast. It argues
runtime. that in contrast to image classification architectures, which
Our framework’s structure components are modular have been developed via a rigorous evolution, the video
(Fig. 1). First, the 2D spatial backbone can be replaced architectures have not been explored in detail, and histor-
with any given network. The attention-based module can ically are based on expanding image-based networks to fit
stack up more layers, more heads or can be set to a differ- the temporal domain. X3D introduces a set of networks that
ent Transformers model that can process long sequences. progressively expand in different axes, e.g., temporal, frame
Finally, the classification head can be modified to facilitate rate, spatial, width, bottleneck width, and depth. Compared
different video-based tasks, like temporal action localiza- to SlowFast, it offers a lightweight network (in terms of
tion. GFLOPS and parameters) with similar performance.
Table 2. Ablation experiments on Kinetics-400. The results are top-1 and top-5 accuracy (%) on the validation set using the full video
inference approach.
to the addition of a temporal dimension, these networks are ViT-B-VTN. Combining the state-of-the-art image classi-
limited by memory and runtime to clips of a small spatial fication model, ViT-Base [10], as the backbone in VTN. We
scale and a low number of frames. In [3], the authors use the use a ViT-Base network that was pre-trained on ImageNet-
whole video during inference, averaging predictions tempo- 21K. Using ViT as the backbone for VTN produces an end-
rally. More recent studies that achieved state-of-the-art re- to-end transformers-based network that uses attention both
sults processed numerous, but relatively short, clips during for the spatial and temporal domains.
inference. In [37], inference is done by sampling ten clips
R50/101-VTN. As a comparison, we also use a standard
evenly from the full-length video and average the softmax
2D ResNet-50 and ResNet-101 networks [18], pre-trained
scores to achieve the final prediction. SlowFast [13] follows
on ImageNet.
the same practice and introduces the term “view” – a tem-
poral clip with a spatial crop. SlowFast uses ten temporal DeiT-B/BD/Ti-VTN. Since ViT-Base was trained on
clips with three spatial crops at inference time; thus, 30 dif- ImageNet-21K we also want to compare VTN by using sim-
ferent views are averaged for the final prediction. X3D [12] ilar networks trained on ImageNet. We use the recent work
follows the same practice, but in addition, it uses larger spa- of [33] and apply DeiT-Tiny, DeiT-Base, and DeiT-Base-
tial scales to achieve its best results on 30 different views. Distilled as the backbone for VTN.
This common practice of multi-view inference is some- 4.1. Implementation Details
what counterintuitive, especially when handling long
videos. A more intuitive way is to “look” at the entire video Training. The spatial backbones we use were pre-trained
context before deciding on the action, rather than viewing on either ImageNet or ImageNet-21k. The Longformer and
only small portions of it. Fig. 2 shows 16 frames extracted the MLP classification head were randomly initialized from
evenly from a video of the abseiling category. The actual a normal distribution with zero mean and 0.02 std. We train
action is obscured or not visible in several parts of the video; the model end-to-end using video clips. These clips are
this might lead to a false action prediction in many views. formed by choosing a random frame as the starting point,
The potential in focusing on the segments in the video that then sampling 2.56 or 5.12 seconds as the video’s temporal
are most relevant is a powerful ability. However, full video footprint. The final clip frames are subsampled uniformly
inference produces poor performance in methods that were to a fixed number of frames N (N = 16, 32), depending on
trained using short clips (Table 3 and 4). In addition, it is the setup.
also limited in practice due to hardware, memory, and run- For the spatial domain, we randomly resize the shorter
time aspects. side of all the frames in the clip to a [256, 320] scale and
randomly crop all frames to 224 × 224. Horizontal flip is
also applied randomly on the entire clip.
4. Video Action Recognition with VTN The ablation experiments were done on a 4-GPU ma-
chine. Using a batch size of 16 for the ViT-VTN (on
In order to evaluate our approach and the impact of con- 16 frames per clip input) and a batch size of 32 for the
text attention on video action recognition, we use several R50/101-VTN. We use an SGD optimizer with an initial
spatial backbones pre-trained on 2D images. learning rate of 10−3 and a different learning rate reduction
Figure 3. Illustrating all the single-head first attention layer weights of the [CLS] token vs. 16 frames pulled evenly from a video. High
weight values are represented by a warm color (yellow) while low values by a cold color (blue). The video’s segments in which abseiling
category properties are shown (e.g., shackle, rope) exhibit higher weight values compared to segments in which non-relevant information
appears (e.g., shoes, people). The model prediction is abseiling for this video.
Table 3. To measure the overall time needed to train each model, we observe how long it takes to train a single epoch and how many epochs
are required to achieve the best performance. We compare these numbers to the validation top-1 and top-5 accuracy on Kinetics-400 and the
number of parameters per model. To measure the training wall runtime, we ran a single epoch for each model, on the same 8-V100-GPU
machine, with a 16GB memory per GPU. The models marked by (*) were taken from the PySlowFast GitHub repository [11]. We report
the accuracy as written in the Model Zoo, which was done using the 30 multi-view inference approach. To measure the wall runtime, we
used the code base of PySlowFast. To calculate the SlowFast-16X8-R101 time on the same GPU machine, we used a batch size of 16.
The number of epochs is reported, when possible, based on the original model paper. All other models, including the NL I3D, are trained
using our codebase and evaluated with a full video inference approach. (†) The model in the last row was trained with extensive data
augmentation.
Table 4. Comparing the number of GFLOPs during inference. The models marked by (*) were taken from the PySlowFast GitHub
repository [11]. We reproduced the SlowFast-8X8-R50 results by using the repository and our Kinetics-400 validation set and got 76.45%
compared to the reported value of 77%. When running this model using a full video inference approach, we get a significant drop in
performance of about 8%. We did not run the SlowFast-16X8-R101 because it was not published. The inference GFLOPs is reported by
multiplying the number of views with the GFLOPs calculated per view. ViT-B-VTN with one layer achieves 78.6% top-1 accuracy, a 0.3%
drop compared to SlowFast-16X8-R101 while using 1.5× fewer GFLOPS.
the one without any positional embedding achieved slightly Fine-tuning the backbone improves the results by 7% in
better results than the fixed and learned versions. Kinetics-400 top-1 accuracy.
As this is an interesting result, we also use the same
trained models and evaluate them after randomly shuffling Does attention matter? A key component in our approach
the input frames only in the validation set videos. This is is the impact of attention functionally on the way VTN per-
done by first taking the unshuffled frame embeddings, then ceives the full video sequence. To convey this impact we
shuffle their order, and finally add the positional embedding. train two VTN networks, using three layers in the Long-
This raised another surprising finding, in which the shuffle former, but with a single head for each layer. In one net-
version gives better results, reaching 78.9% top-1 accuracy work the head is trained as usual, while in the second net-
on the no positional embedding version. Even in the case work instead of computing attention based on query/key dot
of learned embeddings it does not have a diminishing ef- products and softmax, we replace the attention matrix with
fect. Similar to the Longformer depth, we believe that this a hard-coded uniform distribution that is not updated during
might be related to the relatively short videos in Kinetics- back-propagation.
400, and longer sequences might benefit more from posi- Fig. 4 shows the learning curves of these two networks.
tional information. We also argue that this could mean that Although the training has a similar trend, the learned atten-
Kinetics-400 is primarily a static frame, appearance based tion performs better. In contrast, the validation of the uni-
classification problem rather than a motion problem [29]. form attention collapses after a few epochs demonstrating
poor generalization of that network. Further, we visualize
Temporal footprint and number of frames in a clip. We the [CLS] token attention weights by processing the same
also explore the effect of using longer clips in the temporal video from Fig. 2 with the single-head trained network and
domain and compare a temporal footprint of 2.56 vs. 5.12 depicted, in Fig. 3, all the weights of the first attention layer
seconds. And also how the number of frames in the clip im- aligned to the video’s frames. Interestingly, the weights are
pact the network performance. The comparison is done on much higher in segments related to the abseiling category.
a ViT-B-VTN with one attention layer in the Longformer. In Appendix A. we show a few more examples.
Table 2c shows that top-1 and top-5 accuracy are similar,
implying that VTN is agnostic to these hyperparameters. Training and validation runtime. An interesting observa-
tion we make concerns the training and validation wall run-
Finetune the 2D spatial backbone. Instead of fine-tuning time of our approach. Although our networks have more pa-
the spatial backbone, by continuing the back-propagation rameters, and therefore, are longer to train and test, they are
process, when training VTN, we can use a frozen 2D net- actually much faster to converge and reach their best perfor-
work solely for feature extraction. Table 2d shows the vali- mance earlier. Since they are evaluated using a single view
dation accuracy when training a ViT-B-VTN with three at- of all video frames, they are also faster during validation.
tention layers with and without also training the backbone. Table 3 shows a comparison of different models and several
model pretrain MiT version inputs top-1 top-5
ResNet50 ImageNet v1 RGB 27.2 [24] 51.7 [24]
I3D ImageNet v1 RGB + OF 29.5 [24] 56.1 [24]
AssembelNet-50 [27] Kinetics400 v1 RGB + OF 33.9 60.9
AssembelNet-101 [27] Kinetics400 v1 RGB + OF 34.3 62.7
I3D ImageNet v2 RGB 28.4* 54.5*
NL I3D (our impl.) Kinetics400 v2 RGB 30.1 57.3
ViT-B-VTN Kinetics400 v2 RGB 37.4 65.3
ViT-B-VTN (w/ shuffle) Kinetics400 v2 RGB 37.4 65.4
Table 5. Comparison with the state-of-the-art on MiT-v1 and MiT-v2. The results marked by (*) are based on MiT GitHub repository3 .
VTN variants. Compared to the state-of-the-art SlowFast 5.2. Experiments on Moments in Time
model, our ViT-B-VTN with one layer achieves almost the
The Moments in Time (MiT) dataset is a large-scale col-
same results but completes an epoch faster while requiring
lection of short (3 seconds) videos [24]. MiT is a chal-
fewer epochs. This accumulates to a 16.1× faster end-to-
lenging dataset, with state-of-the-art results just above 34%
end training. The validation wall runtime is also 5.1× faster
top-1 accuracy [27]. In this work, we use MiT-v2, con-
due to the full video inference approach.
sisting of 727,305 training videos and 30,500 validation
To better demonstrate the fast convergence of our ap-
videos. Each video is labeled with one of 305 classes of
proach, we wanted to show an apples-to-apples compari-
dynamic events. Although previous studies worked on MiT-
son of different training and evaluating curves for various
v1 (802,264 training videos, 33,900 validation videos, 339
models. However, since other methods use the multi-view
classes), this dataset is no longer available. In Table 5, we
inference only post-training, but use a single view evalua-
show the results of various models on MiT-v1 and MiT-
tion while training their models, this was hard to achieve.
v2. Since the relation between v1 and v2 in terms of per-
Thus, to show such comparison and give the reader addi-
formance was not established and thus unknown, we also
tional visual information, we trained a NL I3D (pre-trained
trained our implementation of NL I3D on MiT-v2 using
on ImageNet) with a full video inference protocol during
RGB inputs and achieved comparable results to those of
validation (using our codebase and reproduced the original
I3D (RGB+OF) published on MiT-v1 [24]. Furthermore,
model results). We compare it to DeiT-B-VTN which was
ViT-B-VTN achieves the highest top-1 accuracy on MiT-v2
also pre-trained on ImageNet. Fig. 5 shows that the VTN-
while using only RGB frames as input.
based network converges to better results much faster than
the NL I3D and enables a much faster training process com-
pared to 3D-based networks. 6. Conclusion
We presented a modular transformer-based framework
Data augmentation. Recent studies showed that data for video recognition tasks. Our approach introduces an
augmentation significantly improves the performance of efficient way to evaluate videos at scale, both in terms of
transformers-based models [33]. To demonstrate its im- computational resources and wall runtime. It allows full
pact on VTN, we apply extensive data augmentation as sug- video processing during test time, making it more suitable
gested in DeiT [33] and RandAugment [6]. Table 3 shows for dealing with long videos. Although current video clas-
that our method reaches 79.8% top-1 accuracy, a 1.2% im- sification benchmarks are not ideal for testing long-term
provement vs. the same model trained without such aug- video processing ability, hopefully, in the future, when such
mentations. Training with augmentations requires 10 more datasets become available, models like VTN will show even
epochs but didn’t impact the training wall runtime. larger improvements compared to 3D ConvNets.
Final inference computational complexity. Finally, we Acknowledgements. We thank Ross Girshick for provid-
examine what is the final inference computational com- ing valuable feedback on this manuscript and for helpful
plexity for various models by measuring GFLOPs. Al- suggestions on several experiments.
though other models need to evaluate multiple views to
reach their highest performance, ViT-B-VTN performs al- References
most the same for both inference protocols. Table 4 shows a
[1] Iz Beltagy, Matthew E Peters, and Arman Cohan. Long-
significant drop of about 8% when evaluating the SlowFast- former: The long-document transformer. arXiv preprint
8X8-R50 model using the full video approach. In contrast, arXiv:2004.05150, 2020.
ViT-B-VTN maintains the same performance while requir-
ing, end-to-end, fewer GFLOPs at inference. 3 https://github.com/zhoubolei/moments_models
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Proceedings of the IEEE/CVF International Conference on
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- Computer Vision, pages 6202–6211, 2019.
end object detection with transformers. In European Confer- [14] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zis-
ence on Computer Vision, pages 213–229. Springer, 2020. serman. Video action transformer network. In Proceedings
[3] Joao Carreira and Andrew Zisserman. Quo vadis, action of the IEEE/CVF Conference on Computer Vision and Pat-
recognition? a new model and the kinetics dataset. In pro- tern Recognition, pages 244–253, 2019.
ceedings of the IEEE Conference on Computer Vision and [15] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE inter-
Pattern Recognition, pages 6299–6308, 2017. national conference on computer vision, pages 1440–1448,
[4] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Sey- 2015.
bold, David A Ross, Jia Deng, and Rahul Sukthankar. Re- [16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra
thinking the faster r-cnn architecture for temporal action lo- Malik. Rich feature hierarchies for accurate object detection
calization. In Proceedings of the IEEE Conference on Com- and semantic segmentation. In Proceedings of the IEEE con-
puter Vision and Pattern Recognition, pages 1130–1139, ference on computer vision and pattern recognition, pages
2018. 580–587, 2014.
[5] R Christoph and Feichtenhofer Axel Pinz. Spatiotemporal [17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
residual networks for video action recognition. Advances in shick. Mask r-cnn. In Proceedings of the IEEE international
Neural Information Processing Systems, pages 3468–3476, conference on computer vision, pages 2961–2969, 2017.
2016. [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[6] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Deep residual learning for image recognition. In Proceed-
Le. Randaugment: Practical automated data augmenta- ings of the IEEE conference on computer vision and pattern
tion with a reduced search space. In Proceedings of the recognition, pages 770–778, 2016.
IEEE/CVF Conference on Computer Vision and Pattern [19] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolu-
Recognition Workshops, pages 702–703, 2020. tional neural networks for human action recognition. IEEE
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, transactions on pattern analysis and machine intelligence,
and Li Fei-Fei. Imagenet: A large-scale hierarchical image 35(1):221–231, 2012.
database. In 2009 IEEE conference on computer vision and [20] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang,
pattern recognition, pages 248–255. Ieee, 2009. Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola,
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu-
Toutanova. BERT: Pre-training of deep bidirectional trans- man action video dataset. arXiv preprint arXiv:1705.06950,
formers for language understanding. In Proceedings of the 2017.
2019 Conference of the North American Chapter of the As- [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
sociation for Computational Linguistics: Human Language Imagenet classification with deep convolutional neural net-
Technologies, Volume 1 (Long and Short Papers), pages works. Advances in neural information processing systems,
4171–4186, Minneapolis, Minnesota, June 2019. Associa- 25:1097–1105, 2012.
tion for Computational Linguistics. [22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
[9] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, moyer, and Veselin Stoyanov. Roberta: A robustly optimized
and Trevor Darrell. Long-term recurrent convolutional net- bert pretraining approach. arXiv preprint arXiv:1907.11692,
works for visual recognition and description. In Proceed- 2019.
ings of the IEEE conference on computer vision and pattern [23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
recognition, pages 2625–2634, 2015. convolutional networks for semantic segmentation. In Pro-
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, ceedings of the IEEE conference on computer vision and pat-
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, tern recognition, pages 3431–3440, 2015.
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- [24] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ra-
vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is makrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown,
worth 16x16 words: Transformers for image recognition at Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments
scale. In International Conference on Learning Representa- in time dataset: one million videos for event understanding.
tions, 2021. IEEE transactions on pattern analysis and machine intelli-
[11] Haoqi Fan, Yanghao Li, Bo Xiong, Wan-Yen Lo, and gence, 42(2):502–508, 2019.
Christoph Feichtenhofer. Pyslowfast. https://github. [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
com/facebookresearch/slowfast, 2020. Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al-
[12] Christoph Feichtenhofer. X3d: Expanding architectures for ban Desmaison, Luca Antiga, and Adam Lerer. Automatic
efficient video recognition. In Proceedings of the IEEE/CVF differentiation in pytorch. 2017.
Conference on Computer Vision and Pattern Recognition, [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
pages 203–213, 2020. Faster r-cnn: towards real-time object detection with region
[13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and proposal networks. IEEE transactions on pattern analysis
Kaiming He. Slowfast networks for video recognition. In and machine intelligence, 39(6):1137–1149, 2016.
[27] Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, and ings of the IEEE/CVF Conference on Computer Vision and
Anelia Angelova. Assemblenet: Searching for multi-stream Pattern Recognition, pages 284–293, 2019.
neural connectivity in video architectures. In International [40] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher,
Conference on Learning Representations, 2020. and Caiming Xiong. End-to-end dense video captioning with
[28] Florian Schroff, Dmitry Kalenichenko, and James Philbin. masked transformer. In Proceedings of the IEEE Conference
Facenet: A unified embedding for face recognition and clus- on Computer Vision and Pattern Recognition, pages 8739–
tering. In Proceedings of the IEEE conference on computer 8748, 2018.
vision and pattern recognition, pages 815–823, 2015.
[29] Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj
Goswami, Matt Feiszli, and Lorenzo Torresani. Only time
can tell: Discovering temporal data for temporal modeling.
In Proceedings of the IEEE/CVF Winter Conference on Ap-
plications of Computer Vision, pages 535–544, 2021.
[30] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. In Inter-
national Conference on Learning Representations, 2015.
[31] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior
Wolf. Deepface: Closing the gap to human-level perfor-
mance in face verification. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
1701–1708, 2014.
[32] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model
scaling for convolutional neural networks. In International
Conference on Machine Learning, pages 6105–6114. PMLR,
2019.
[33] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco
Massa, Alexandre Sablayrolles, and Hervé Jégou. Training
data-efficient image transformers & distillation through at-
tention. arXiv preprint arXiv:2012.12877, 2020.
[34] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani,
and Manohar Paluri. Learning spatiotemporal features with
3d convolutional networks. In Proceedings of the IEEE inter-
national conference on computer vision, pages 4489–4497,
2015.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Il-
lia Polosukhin. Attention is all you need. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural Infor-
mation Processing Systems, volume 30, pages 5998–6008.
Curran Associates, Inc., 2017.
[36] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua
Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment net-
works: Towards good practices for deep action recognition.
In European conference on computer vision, pages 20–36.
Springer, 2016.
[37] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-
ing He. Non-local neural networks. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 7794–7803, 2018.
[38] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen,
Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-
end video instance segmentation with transformers. arXiv
preprint arXiv:2011.14503, 2020.
[39] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaim-
ing He, Philipp Krahenbuhl, and Ross Girshick. Long-term
feature banks for detailed video understanding. In Proceed-
Appendix
A. Additional Qualitative Results
Figure 6. Additional qualitative examples of why attention matters. Similar to Fig. 3, we illustrate the [CLS] token weights of the first
attention layer. We show some examples for successful predictions (a, b, c) and some of the failure modes of our approach (d, e).