
Telling Stories for Common Sense Zero-Shot Action Recognition
Shreyank N Gowda* and Laura Sevilla-Lara

*University of Oxford, United Kingdom.
University of Edinburgh, United Kingdom.

*Corresponding author(s). E-mail(s): shreyank.narayanagowda@eng.ox.ac.uk;
Contributing authors: l.sevilla@ed.ac.uk;

arXiv:2309.17327v1 [cs.CV] 29 Sep 2023
Abstract
Video understanding has long suffered from reliance on large labeled datasets, motivating research into
zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot
video analysis, but constructing an effective semantic space relating action classes remains challenging.
We address this by introducing a novel dataset, Stories, which contains rich textual descriptions
for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence
narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This
contextual data enables modeling of nuanced relationships between actions, paving the way for zero-
shot transfer. We also propose an approach that harnesses Stories to improve feature generation for
training zero-shot classification. Without any target dataset fine-tuning, our method achieves new
state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories
provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual
narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled
data that has long impeded advancements in this exciting domain. The data can be found here:
https://github.com/kini5gowda/Stories.

Keywords: Zero-shot learning, Action recognition, Semantic space

1 Introduction

Action recognition technology has improved remarkably over the years, with methods becoming more accurate and pushing the boundaries to include novel tasks [28]. However, one of the main challenges that remain today is the dependency of these methods on annotated data for novel categories. The availability of large labeled datasets like ImageNet [10] has played a pivotal role in propelling the field of supervised learning, achieving better-than-human-level performance on image classification [11, 27, 30] tasks. In practice, obtaining large amounts of annotated examples for each new class we aim to recognize is not realistic. This is particularly true as we grow the number of classes and wish to incorporate more flexible, natural language, for example, for retrieval. This general problem has led to research in the zero-shot (ZS) domain.

In the typical ZS setting, there are seen classes that contain visual examples and their class label, and there are unseen classes where only the class labels are available at training time. Given a visual sample of the unseen set at test time, the task is to output the corresponding class label. Approaches typically learn a mapping from the visual space to the class labels using the seen classes and leverage that mapping in different ways to approximate the mapping in the unseen classes. Some examples of solutions include learning to map visual information and class labels to the same space, or learning to generate visual features using a generative adversarial network (GAN) [55].

Fig. 1 Comparison of accuracy across state-of-the-art ZS approaches using different semantic embeddings: the proposed Stories, word2vec [41] (W2V) and elaborative definitions [8] (ER), on UCF101 [53]. Using the proposed Stories to create the semantic space of class labels improves the performance by a large margin across all methods, showing that it is model-agnostic.

One of the underlying assumptions of these approaches is that the distances between the data points are meaningful both in the visual space and the semantic space. In other words, data points that are close together should be similar in content across both seen and unseen classes. In visual space, this tends to happen naturally, as related classes will share objects, scenes, etc. For example, if we compare classes such as "penalty shot" and "playing soccer", they will share the ball, the soccer field, etc. However, in the space of action labels, which we also refer to as semantic space, this property is not straightforward to achieve. While some similar classes will contain overlapping terms (such as "horse-back riding" and "horse racing"), others might be similar but not contain overlapping words (such as the previous example of "penalty shot" and "playing soccer"). This makes the step of transferring knowledge between seen and unseen classes harder. Previous efforts to improve the semantic space of class labels have included the use of manually annotated attributes or embedding functions trained on language corpora, such as word2vec [41], sentence2vec [46], and definitions of actions [8].

In this work, we address the problem of building a meaningful space of action labels by leveraging the story around each action. In particular, we use the descriptions of the steps needed to achieve each action, which are the Stories around this action, and encode them using a language embedding. These steps typically contain the objects, tools, scenes, verbs, adjectives, etc., associated with the action label. One could think of all these additional pieces of information around an action as the "common sense" of associations humans would typically consider. For example, in the case of the penalty shot, the steps would describe to first place the ball, run the hand through the grass, fluff the ball, take some steps back, kick, etc. When playing soccer, the steps include kicking the ball with the inside of your shoe for short passes across the grass, tapping the ball from foot to foot, etc. When we compare the stories of steps around these two classes, the overlap of terms becomes much more obvious. It is more likely that these related classes are closer in semantic space, facilitating the transfer of knowledge between seen and unseen classes. We show that this relatively simple approach to creating a semantic space is extremely effective across datasets and methods, improving performance by up to 20% compared to the standard word2vec [41]. Figure 1 shows that all state-of-the-art methods improve significantly, and therefore the proposed semantic embedding is general and model-agnostic.

Finally, we leverage these Stories to go beyond simply learning semantic representations, to actually generating additional features that improve the semantic space further. We follow a feature-generating approach [39, 55] using a GAN [19] to synthesize visual data points from these semantic embeddings. These synthetic visual data points are then used to learn a mapping from visual to semantic space in the unseen classes. We show that this method improves state-of-the-art by an additional 6.1%. We observe a strong trend of feature-generating networks benefiting particularly from using Stories. The largest jumps using Stories are in feature-generating methods (SDR-I3D, SDR-CLIP and OD), as shown in Figure 1.
2 Related Work

Fully Supervised Action Recognition. In this setting, there is a large amount of training samples with their associated labels, and the label spaces are the same at train and test time. Many of the advances in this area of research are often reused in ZS. Early work in deep learning for action recognition used many tools to represent the spatio-temporal nature of videos, including 3D CNNs [7], 2D CNNs with temporal modules [35], 2D CNNs with relational modules [24] and two-stream networks [20, 52]. More recently, the transformer architecture [11] has proven particularly well suited to represent sequential information and has been successfully adapted to the video domain [2, 3, 13, 22, 33, 37]. While these are extremely powerful tools, they are difficult to train with limited data. Instead, we use the standard I3D [7] as our backbone feature generator to compare directly with recent state-of-the-art papers [25, 39].

Zero-Shot Action Recognition. Early work [51] in this setting used script data in cooking activities to transfer to unseen classes. Considering each action class as a domain, Gan et al. [17] address the identification of semantic representations as a multi-source domain generalization problem. To obtain semantic embeddings of class names, label embeddings such as word2vec [41] have proven popular, as only class names are needed. Some approaches use a common embedding space between video features and class labels [58, 59], error-correcting codes [49], pairwise relationships between classes [15], inter-class relationships [16], out-of-distribution detectors [39], and graph neural networks [18]. Recently, it was shown that clustering of joint visual-semantic features helps obtain better representations for ZS action recognition [25]. Similar to CLASTER, ReST [34] jointly encodes video data and text labels for ZS action recognition. In ReST, transformers are used to perform modality-specific attention. JigSawNet [48] also models visual and textual features jointly but decomposes videos into atomic actions in an unsupervised manner and bridges group-to-group relationships between visual and semantic representations instead of the one-to-one relationships that CLASTER and ReST do. Unlike JigSawNet, which works on the visual features, we create enriched textual descriptions by decomposing actions into a series of steps and hence obtain richer semantic features. Other solutions to deal with limited labeled data include augmentation [23] and relational modules [47].

Semantic Embeddings. To obtain semantic embeddings of action class labels, earlier works use word2vec [41] directly on the class labels. However, word2vec averages the embedding for class labels with multiple words, giving equal weight to each word. This causes class names to lose context. For example, the class "pommel horse" is a gymnastics class which does not involve the animal horse. Unfortunately, using word2vec makes the embedding of that class close to "horse riding" or "horse racing" in the word2vec space, even though they are not gymnastics sports. A recent solution is to use elaborate descriptions [8] based on the principle of Elaborative Rehearsals (ER), which replace each class name with a class definition. An object description was also used to describe the objects in the particular action. This resulted in a significant boost in performance. Still, ER uses descriptions of each word in a class label independently to create the semantic space, which potentially leads to errors.

Feature Generating Networks for ZS. Bucher et al. [5] proposed to bridge the gap between seen and unseen classes by synthesizing features for the unseen classes using four different generator models. Xian et al. [55] trained their generators using a conditional generative adversarial network (GAN) [42]. In contrast, Verma et al. [54] trained a variational autoencoder. Mishra et al. [43] introduced the generative approach to the domain of ZS action recognition. Their approach models each action class as a probability distribution in the visual space. Mandal et al. [39] modified the work by Xian et al. [55] to work directly on video features. They also introduced an out-of-distribution detector to help with the generalized zero-shot learning setting. Here we propose a variant of the work on out-of-distribution detectors, which suffers less from long convergence times yet improves its accuracy. More recently, Gowda et al. [21] proposed a selector that chooses generated features based on their importance for training the classifier rather than on the realness of the features.
3 The Stories Dataset

Research in semantic representation of action labels has shown over the years that more sophisticated representations help to build a meaningful semantic space for ZS action recognition. Here we go beyond previous work by representing not only the class label but the story around it. That is, we build a representation that captures all the steps needed to perform the action, which include the objects, verbs, etc., typically associated with that action. We call these representations Stories, and we now describe how we build them.

3.1 Building the Stories Dataset

We leverage textual descriptions of actions from WikiHow¹, a website that gives instructions on how to perform actions. These instructions consist of long paragraphs that describe each step in completing the action. For example, for the action classes in the HMDB51 dataset, the WikiHow articles contain an average of 9.8 steps, ranging from 4 to 20 steps. The most closely related work to ours, ER [8], uses a single or at most two sentences per class. Instead, for the proposed Stories, the average number of lines is 14.4 for the classes in the Olympics dataset, 9.6 for HMDB51, 13.2 for UCF101, and 13.5 for Kinetics. These rich descriptions inherently contain information about the objects needed to perform an action. For example, the class "biking" has a paragraph that explains where to bike, what equipment you need, and the steps to do it. In comparison, ER [8] has only a single-line definition of the class.

Not all classes have articles on WikiHow. For example, if we search for "draw sword" (a class in UCF101), we will get instructions on how to paint or sketch swords instead of the steps needed to remove a sword from its sheath. Hence, collecting clean, meaningful articles requires a more complex process than a simple search. After scraping the articles from WikiHow corresponding to all classes, we use sentence-BERT encoders [50] to represent the sentences in the article and use cosine similarity to find the 25 that are the most similar to the class definition [8].

Next, we manually check the sentences for each class. If we find a mismatch between the article and the action class, as was the case with the "draw sword" example, we do a manual search to pick the most relevant article from other sources such as Wikipedia. However, these alternative articles do not tend to contain the sequence of steps and hence need more manual intervention to order the sentences into a sequence of steps.

We finally clean each story by re-arranging the sentences in sequential order and removing irrelevant sentences. In total, we had 6 people who manually cleaned the descriptions after the initial stage of noisy collection and a further 10 who verified the descriptions. This was done using the Prolific² platform. The time taken for cleaning Stories for UCF101, HMDB51 and Kinetics was 7.2 hours, 3.3 hours and 25.3 hours on average, respectively. We followed this process to create a dataset of these textual representations for classes in UCF101, HMDB51, Olympics, and Kinetics-400, as they are the most commonly used datasets for ZS action recognition. Detailed descriptions of each class can be found at https://github.com/kini5gowda/Stories.

¹https://www.wikihow.com/
²https://www.prolific.co/
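For concreteness, the sentence-selection step above can be sketched as follows. This is a minimal illustration assuming the sentence-transformers library; the encoder checkpoint name is a placeholder, not necessarily the one used to build Stories.

```python
# Minimal sketch of the sentence-selection step: rank the sentences of a
# scraped WikiHow article by cosine similarity to the ER class definition
# and keep the top 25, preserving the original step order.
from sentence_transformers import SentenceTransformer, util

def select_story_sentences(class_definition, article_sentences, top_k=25):
    model = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint
    def_emb = model.encode(class_definition, convert_to_tensor=True)
    sent_embs = model.encode(article_sentences, convert_to_tensor=True)
    scores = util.cos_sim(def_emb, sent_embs)[0]       # similarity per sentence
    top = scores.topk(k=min(top_k, len(article_sentences)))
    keep = sorted(top.indices.tolist())                # keep article order
    return [article_sentences[i] for i in keep]
```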
3.2 Learning from Stories

In order to test the impact of using richer semantic representations, we used Stories as input to the state-of-the-art methods in ZS. To provide fair points of comparison, we also supply the standard word2vec [41] embeddings as well as the more recently introduced ER [8] embeddings to these same models. By visually depicting the performance gains achieved by the models when using our proposed embeddings instead of the word2vec or ER embeddings in Fig. 1, we are able to clearly demonstrate the substantial improvements obtained through our approach. The gain is up to 21% compared to the widely used word2vec embeddings and 11.8% over the ER embeddings. This comprehensive experimental analysis highlights how our semantically richer embeddings can notably boost performance across a diverse range of state-of-the-art models and is hence model-agnostic.
Fig. 2 Comparing nearest neighbors using Stories. We see an example where ER fails and Stories provides more context and helps in obtaining better neighbors. This is one example of where ER fails; there are multiple such examples. The dataset is UCF101.

We visualize the effect of using ER and Stories to generate features, using t-SNE [38], for 10 classes in UCF101, all related to gymnastics and therefore easier to confuse. We see that using Stories helps keep a more meaningful neighborhood for visual instances and keeps classes apart. The visualization also shows why ER fails, as there is not enough information to clearly distinguish classes.

3.3 Why Are Stories Necessary?

The proposed Stories dataset produces semantic embeddings for the class labels that are much more meaningful than previous work, and that is reflected in the experimental improvement shown in Fig. 1. Here we delve deeper into what properties make Stories superior.

Capturing meaning of words jointly. Previous work [8, 41] represents a class label by computing a representation for each word in the class label and then computing the average.³ While this in general is a sensible choice, it often leads to errors caused by words that have multiple meanings depending on the context. Fig. 2 illustrates this issue with an example. We use the class "Hammer Throw" from the UCF101 [53] dataset, which is a sporting event in which the athlete throws a spherical object. If we retrieve the nearest neighbor with ER [8], we obtain the class "Hammering", which is not actually related in meaning, but both contain the description of the tool hammer. However, if we use Stories, the nearest neighbor is "Shot Put", which is also a sporting event where the athlete throws a spherical object. Similar problems happen, for example, with classes such as "Sword Exercise" or "Swing Baseball". Overall, this shows the need for the more sophisticated joint description that Stories provides, instead of the per-word definition of ER.

³In the case of ER, this is done for some of the classes.

Size of dataset. We also compare Stories to ER from a statistical perspective. We have briefly mentioned before that Stories contains much more detailed descriptions of the classes. Here we look at the numerical difference between ER and Stories, shown in Table 1. The number of sentences is one order of magnitude larger, going from over 1 sentence on average per class to over 10 sentences on average, depending on the dataset. This ratio is also consistent in the number of nouns, verbs, etc.

Diversity of dataset. Another aspect that increases the specificity of the class descriptions, in addition to the size of the dataset, is the diversity of the vocabulary. We look at the number of unique words in Table 1 and observe that Stories contains more unique words than ER in all datasets, and that the difference is particularly remarkable in smaller datasets. We argue that this diversity contributes to representing each class label in a more unique way, leading to a sparser yet meaningful space.
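The statistics reported in Table 1 can be reproduced with any off-the-shelf part-of-speech tagger; the sketch below uses spaCy purely to illustrate the counting procedure, not as a statement of the tooling used to build the table.

```python
# Sketch of the per-class statistics of Table 1: nouns (N), verbs (V),
# adverbs (A), adjectives (Ad), unique words (UW) and sentences (S).
# spaCy and the en_core_web_sm model are illustrative choices.
import spacy

nlp = spacy.load("en_core_web_sm")

def story_stats(text):
    doc = nlp(text)
    pos_map = {"NOUN": "N", "VERB": "V", "ADV": "A", "ADJ": "Ad"}
    counts = {"N": 0, "V": 0, "A": 0, "Ad": 0}
    for tok in doc:
        if tok.pos_ in pos_map:
            counts[pos_map[tok.pos_]] += 1
    counts["UW"] = len({tok.text.lower() for tok in doc if tok.is_alpha})
    counts["S"] = len(list(doc.sents))
    return counts
```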
Cleaning data manually. Generally, training with more data tends to produce better results. There often is a tension between using a smaller amount of clean data or a larger amount of noisy data. Here we have explored the effect of cleaning the data of Stories manually. We see that using the noisy version of the dataset (see Fig. 3) improves the performance over ER across methods, but is still consistently worse than the cleaned version, even though it is roughly twice as large. This shows that the effect of cleaning up the data manually is not trivial. In the ER work, however, there exist errors that can be solved through manual revision. For example, "Table Tennis Shot" has the definition "put (food) into the mouth and chew and swallow it", which clearly corresponds to the wrong action class.
Method   D   N     V     A     Ad    UW    S
Stories  K   68.5  37.9  7.1   0.04  35.1  13.5
ER       K   9.2   3.8   1.0   0.01  30.2  1.8
Stories  U   69.6  37.2  7.7   0.1   32.1  13.2
ER       U   9.7   4.3   1.4   0.06  29.5  1.8
Stories  H   59.2  33.4  6.5   0.3   34.5  9.6
ER       H   5.8   2.1   0.9   0.05  17.7  1.2
Stories  O   68.9  38.6  11.8  2.0   57.2  14.4
ER       O   9.1   4.2   1.2   0.3   31.4  1.7
Table 1 Statistical comparison of Stories to ER [8]. We observe a larger number of Nouns (N), Verbs (V), Adverbs (A), Adjectives (Ad), Unique Words (UW), and Sentences (S). Values are averages across all the classes in a particular dataset (D). 'K' refers to Kinetics-600, 'U' to UCF101, 'H' to HMDB51, and 'O' to Olympics.

Fig. 3 Using the cleaned version of Stories to create the semantic space of class labels improves the performance by a large margin. The dataset is UCF101.

Fig. 4 Visualization of the features generated from the embedding vs ER, using t-SNE [38]. We observe that the samples of each class instance, depicted in a single color, are better clustered together, pointing to a more semantically meaningful space.

3.4 Possible Limitations

One possible limitation of our approach is that the Stories may focus on one specific way of performing each action, while other valid methods may exist to do the same action. For example, the story for "shuffling cards" details the riffle shuffling technique, but other shuffling techniques such as the overhand shuffle could occur in videos of this class. Similarly, some parts of the stories may describe non-visual aspects, like memorizing lines for the "acting in play" class, that are not depicted in the videos. Our current approach does not explicitly address potential mismatches between the textual stories and visual contents. Still, we believe non-visual cues actually help make the semantic embeddings more distinct, as these cues are unique to each class's story. However, this remains a limitation worth noting. One area of future work is exploring how to make the stories more comprehensive by incorporating multiple variations of actions. We also plan to investigate techniques for identifying and excluding non-visual sentences that do not translate to visual features. Overall, handling the diversity of real videos compared to procedural descriptions remains an open challenge that we aim to address.

4 Experimental Details

We now describe the details of the experiments that we have performed to validate our claims of the superiority of Stories, both as a general semantic embedding input to a wide suite of methods and for feature generation.

4.1 Hyperparameter Selection

Choosing the number of nearest neighbors for both the data-based noise and the ranking loss is done empirically. We use the generalized zero-shot action recognition performance to decide these hyperparameters. We choose UCF101 as our dataset for the hyperparameter tuning, but also plot the results on HMDB51, as it ended up following the same pattern. The results are shown in Figure 5. Based on these results, we choose the number of nearest neighbors as 3 for the data-based noise and 5 for the ranking loss. The results reported are on the TruZe split.
Fig. 5 Comparison of using different numbers of nearest neighbors on both (left) the data-based noise and (right) the ranking loss.

4.2 The Zero-Shot and Generalized Zero-Shot Settings

Let S be the training set of seen classes. S is composed of tuples (x, y, a(y)), where x represents the spatiotemporal features of a video, y represents the class label in the set of Y_S seen class labels, and a(y) denotes the category-specific semantic representation of class y, which is either manually annotated or computed automatically, for example using word2vec [41] or the proposed Stories. Let U be the set of pairs (u, a(u)), where u is a class in the set of unseen classes Y_U and a(u) are the corresponding semantic representations. The seen classes Y_S and the unseen classes Y_U do not overlap.

In the ZS setting, given an input video, the task is to predict a class label in the unseen classes, as f_ZSL : X → Y_U. In the generalized zero-shot (GZS) setting, given an input video, the task is to predict a class label in the union of the seen and unseen classes, as f_GZSL : X → Y_S ∪ Y_U.
Network” in Fig. 6) is that synthetic semantic
In the standard feature generation pipeline [29,
embeddings can be too similar to each other. To
36, 39, 43, 55] the high-level idea is to learn to
avoid this, we introduce a ranking loss [14] that
generate visual features for unseen classes using
pushes apart the generated semantic representa-
a GAN [19], and then, given these synthetic
tion (âi ) from those of their neighboring classes.
features, train a classifier that takes in visual fea-
Details of this can be found in the Section 4.6.
tures and predicts unseen class labels. Figure 6
We refer to this version of the feature generating
illustrates the overall method.
approach as SDR (for Stories , Data-based Noise
The GAN comprises a generator (G), discrimi-
4.4 Technical Details

We now describe a few technical choices that we have made to improve the standard pipeline.

First, the typical generator takes as input attributes or semantic embeddings and normally distributed noise to generate visual features. The underlying assumption is that the normal distribution can represent all classes. However, this is not necessarily true [36]. Instead, we use the distribution of the seen classes' features to create the "noise" for the generator [36], such that the synthetic unseen classes will follow the same distribution. To this end, we use a variational auto-encoder (VAE), which takes as input visual features from the seen classes and is trained to reconstruct them. Once trained, we use the low-dimensional representation of the encoder as the "noise". This simple change in the noise distribution, which we call data-driven noise, benefits us in two ways: it improves the overall accuracy and reduces training time by 65% compared to the standard Gaussian noise. See Fig. 7 and Fig. 8.

Second, one of the risks of learning to generate semantic embeddings (through the "Projection Network" in Fig. 6) is that synthetic semantic embeddings can be too similar to each other. To avoid this, we introduce a ranking loss [14] that pushes apart the generated semantic representation (â_i) from those of its neighboring classes. Details of this can be found in Section 4.6. We refer to this version of the feature-generating approach as SDR (for Stories, Data-based Noise and Ranking), and we observe that it achieves state-of-the-art results across all datasets and settings.
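The data-driven noise can be sketched as a small feature VAE whose latent code replaces the Gaussian noise fed to G; the layer sizes and latent dimension below are assumptions rather than the exact configuration.

```python
# Sketch of data-driven noise: fit a VAE to seen-class visual features and use
# its latent code (instead of N(0, I)) as the generator noise z.
import torch
import torch.nn as nn

class FeatureVAE(nn.Module):
    def __init__(self, d_vis=8192, d_latent=1024):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_vis, 4096), nn.ReLU())
        self.mu = nn.Linear(4096, d_latent)
        self.logvar = nn.Linear(4096, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, 4096), nn.ReLU(), nn.Linear(4096, d_vis))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    reconstruction = ((recon - x) ** 2).mean()
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kld

def data_driven_noise(vae, x_seen):
    # After training the VAE on seen-class features, its low-dimensional code
    # replaces the Gaussian noise fed to the generator G.
    with torch.no_grad():
        _, mu, logvar = vae(x_seen)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```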
Fig. 6 Using Stories for feature generation. The elements depicted in yellow are the standard vanilla approach to feature generation for ZS. Depicted in green are the elements that we introduce.

Fig. 7 Training the generator using data-driven noise converges much faster than using the standard Gaussian noise.

Fig. 8 The generator loss using data-driven noise is much more stable, leading to faster convergence and better accuracy.

4.5 Implementation Details

4.6 Ranking Loss

One of the risks of learning to generate semantic embeddings (through the "Projection Network" in Fig. 6) is that synthetic semantic embeddings can be too similar to each other. To avoid this, we introduce a ranking loss [14] that pushes apart the generated semantic representation (â_i) from those of its neighboring classes:

L_{rank} = \mathbb{E}[\max(0,\ \delta - a^T \hat{a}_i + (a')^T \hat{a}_i)],   (3)

where a is the ground-truth semantic embedding, a' is the semantic embedding of a class randomly sampled from the 5 classes closest to the ground truth (empirical results in Sec. 4.1), and δ is a hyperparameter. Including this loss in the overall objective function, we obtain:

\min_G \min_P \max_D \; L_D + \lambda_1 L_{CLS}(G) + \lambda_2 L_{rank}(P) + \lambda_3 L_{MI}(G).   (4)
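Eq. (3) translates directly into a hinge loss; the margin value below is a placeholder, and the negative embedding a' is assumed to be drawn from the 5 nearest classes as described above.

```python
# Eq. (3) as code: pull the projected embedding a_hat towards its own class
# embedding a and push it away from a neighbouring class embedding a_neg.
import torch

def ranking_loss(a_hat, a, a_neg, delta=0.2):   # delta is a placeholder margin
    pos = (a * a_hat).sum(dim=-1)               # a^T a_hat
    neg = (a_neg * a_hat).sum(dim=-1)           # (a')^T a_hat
    return torch.clamp(delta - pos + neg, min=0).mean()
```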
4.7 Features

For our visual features we consider two scenarios. In the first case, the appearance and flow features are extracted from the Mixed 5c layer of the RGB and flow I3D networks, respectively. Both I3D models are pre-trained on the Kinetics-400 dataset [7]. Given an input video, the extracted appearance and flow features are averaged across the temporal dimension, pooled by 4 in the spatial dimension, and then flattened to obtain a vector of size 4096 each. These vectors are then concatenated to obtain video features of size 8192.

In the second case, we first train X-CLIP-B/16 [44] on 16 frames of the non-overlapping classes of Kinetics [4], dubbed Kinetics-664 [4], using the proposed Stories as the semantic embedding.

For the text embeddings we use the large S-BERT [50], which is a sentence encoder. For ER we use the class definition as input to the S-BERT and use the 1024-sized vector output as the semantic embedding. In the case of Stories, we use S-BERT for each sentence and average all the vectors to obtain a single vector of size 1024.
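The two feature pipelines can be sketched as follows; the tensor layout of the Mixed 5c maps and the S-BERT checkpoint are assumptions, chosen only so that the shapes match the 4096/8192 and 1024 dimensions described above.

```python
# Sketch of the visual and textual feature pipelines of Sec. 4.7.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

def video_feature(rgb_5c, flow_5c):
    # rgb_5c / flow_5c: Mixed_5c maps per stream, assumed shape (T, 1024, H, W).
    feats = []
    for maps in (rgb_5c, flow_5c):
        x = maps.mean(dim=0)                 # average over the temporal dimension
        x = F.adaptive_avg_pool2d(x, 2)      # spatial pooling so each stream flattens to 4096
        feats.append(x.flatten())
    return torch.cat(feats)                  # concatenated 8192-d video feature

def story_embedding(sentences, model_name="all-roberta-large-v1"):
    # Any large S-BERT encoder with 1024-d output fits the description;
    # the exact checkpoint is an assumption.
    model = SentenceTransformer(model_name)
    vecs = model.encode(sentences, convert_to_tensor=True)   # (num_sentences, 1024)
    return vecs.mean(dim=0)                                   # single 1024-d story embedding
```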
4.8 Network Architecture

We use the Wasserstein GAN [55], which has been successful in both zero-shot image classification [56] and zero-shot action recognition [39] tasks. This also allows us to compare directly to OD [39] and the Wasserstein GAN [55] in the experimental analysis. The feature generator G is a three-layer fully-connected network that has an output layer dimension equal to that of the video feature size. The hidden layers are of size 4096. The discriminator D is also a three-layer fully-connected network with hidden layers of size 4096; however, its output size equals 1. The projection network P is a fully-connected network that has an output layer size equal to the size of the semantic embeddings (in our case 1024).

4.9 Training Details

All the modules are trained using the Adam optimizer with a weight decay of 0.0005 and with an adaptive learning rate using a learning rate scheduler. We set λ1 as 0.1, λ2 as 0.9 and λ3 as 0.1. At test time, we follow OD [39] and train a single classifier for ZSAR and two classifiers for GZSAR, along with an out-of-distribution (OOD) detector. The classifiers are single-layer fully-connected networks with an input size equal to the video feature size and output sizes equal to the number of classes (seen or unseen). The OOD detector is a three-layer fully-connected network with output and hidden layer sizes equal to the number of seen classes and 512, respectively. We use 8 RTX 2080 Ti NVIDIA GPUs having 16 GB RAM each for our experiments.
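The architecture and optimisation choices of Secs. 4.8–4.9 map almost directly onto module definitions; in the sketch below the hidden activation, the learning rate and the scheduler type are assumptions, while the layer sizes and loss weights follow the text.

```python
# Sketch of the networks and optimisers described in Secs. 4.8-4.9.
import torch.nn as nn
import torch.optim as optim

d_vis, d_sem, n_seen = 8192, 1024, 51        # n_seen depends on the dataset

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.LeakyReLU(0.2))  # hidden activation is an assumption
    return nn.Sequential(*layers)

G = mlp([d_sem + d_sem, 4096, 4096, d_vis])   # generator: embedding + noise -> feature
D = mlp([d_vis + d_sem, 4096, 4096, 1])       # discriminator: feature (+ embedding) -> score
P = mlp([d_vis, 4096, 4096, d_sem])           # projection network: feature -> embedding
ood = mlp([d_vis, 512, 512, n_seen])          # OOD detector used at test time for GZSL

opt_G = optim.Adam(G.parameters(), lr=1e-4, weight_decay=0.0005)   # lr is an assumption
opt_D = optim.Adam(D.parameters(), lr=1e-4, weight_decay=0.0005)
opt_P = optim.Adam(P.parameters(), lr=1e-4, weight_decay=0.0005)
sched = optim.lr_scheduler.ReduceLROnPlateau(opt_G)                # "adaptive learning rate"
lambda1, lambda2, lambda3 = 0.1, 0.9, 0.1                          # loss weights from Sec. 4.9
```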
4.10 Ablation Study

We propose modifications to the vanilla pipeline of feature generation, and based on this we show the importance of each component here. We see that every proposed contribution improves over the baseline. However, crucially, a combination of all three gives us our best results. We note that the improvement from the ranking loss is much more prominent in the generalized zero-shot setting than in the zero-shot setting.

Stories  DBN  Lrank  ZSL HMDB     ZSL UCF      GZSL HMDB    GZSL UCF
×        ×    ×      29.1 ± 3.8   37.5 ± 3.1   32.7 ± 3.4   44.4 ± 3.0
×        ×    ✓      29.7 ± 3.5   38.0 ± 3.1   35.3 ± 3.1   47.1 ± 3.2
×        ✓    ×      30.6 ± 2.2   38.6 ± 3.4   33.3 ± 3.0   44.9 ± 2.9
✓        ×    ×      44.6 ± 2.9   60.4 ± 3.8   44.9 ± 3.6   51.0 ± 2.9
×        ✓    ✓      31.9 ± 3.2   40.9 ± 2.9   35.7 ± 2.9   47.9 ± 4.1
✓        ×    ✓      45.0 ± 2.5   60.9 ± 3.5   49.0 ± 3.2   54.4 ± 3.7
✓        ✓    ×      45.9 ± 2.7   61.4 ± 2.8   47.5 ± 2.6   53.7 ± 3.5
✓        ✓    ✓      46.5 ± 5.3   61.9 ± 2.5   49.7 ± 2.9   54.9 ± 4.4
Table 2 Ablation study to explore the impact of each proposed component.

4.11 Why Not Just Use VAE for Feature Generation?

Another possible question concerns the choice of the feature generator model. There are multiple options to use as feature generators, including VAEs and other versions of GANs (not just the WGAN [55] that we use). We chose to adapt the WGAN for our feature generator based on two reasons. First, we wanted to compare directly to existing literature on zero-shot action recognition, and to the best of our knowledge the most recent such generator is the one used in OD [39]. Second, it performed best in practice: for the sake of sanity, we also ran additional experiments on the HMDB51 dataset
incorporating f-VAEGAN [57], adapted FREE [9] (feature refinement of f-VAEGAN) for zero-shot action recognition, and a simple VAE. The results can be seen in Table 3.

Feature Generator   Accuracy
VAE                 25.5 ± 2.9
Vanilla GAN         31.5 ± 2.4
f-VAEGAN            45.9 ± 3.2
FREE                46.6 ± 3.5
SDR (Ours)          48.1 ± 3.6
Table 3 Comparing different choices for feature generator. Reported results are over 10 different runs and all models use the same split. The dataset is HMDB51.

4.12 Datasets and Evaluation Protocol

We use the Olympic Sports [45], HMDB-51 [32], UCF-101 [53] and Kinetics [7] datasets, as they are the standard choice in ZS action recognition, so that we can compare with recent state-of-the-art models [4, 17, 25, 39, 48, 49]. The first three datasets contain 783, 6766 and 13320 videos, and have 16, 51 and 101 classes, respectively. We follow the commonly used 50/50 splits of Xu et al. [58], where 50% of the classes are seen and 50% are unseen. Similar to previous approaches [17, 31, 40, 49, 60], we report average accuracy and standard deviation over 10 independent runs. We also report on the recently introduced TruZe split [26]. This split accounts for the fact that some classes present in the dataset used for pre-training (Kinetics [7]) overlap with some of the unseen classes in the datasets used in the zero-shot setting, therefore breaking the premise that those classes have not been seen. We also report on the Kinetics-220 [7] split as proposed in ER [8]. Here, the 220 classes from Kinetics-600 [6] are treated as unseen classes for a model trained on the Kinetics-400 dataset. All the datasets we use have action classes that are singular and not compositional.
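The evaluation protocol amounts to repeating training and testing over random class splits and reporting mean and standard deviation; a minimal sketch is shown below, where `evaluate` is a placeholder for training on the seen half and testing on the unseen half.

```python
# Sketch of the 50/50 split protocol with results averaged over 10 runs.
import random
import statistics

def zsl_protocol(all_classes, evaluate, n_runs=10, seed=0):
    rng = random.Random(seed)
    accs = []
    for _ in range(n_runs):
        classes = list(all_classes)
        rng.shuffle(classes)
        half = len(classes) // 2
        seen, unseen = classes[:half], classes[half:]
        accs.append(evaluate(seen, unseen))        # placeholder train/test routine
    return statistics.mean(accs), statistics.stdev(accs)
```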
5 Results

5.1 Zero-Shot Learning Results

We look at the use of Stories as the semantic embedding for a wide variety of models here, from older models all the way to the most recent ones in the ZS literature. We use I3D and CLIP-based features [44] to compare their effect, and list these results as SDR+I3D and SDR+CLIP respectively. We observe an improvement across all of them and across all datasets, demonstrating that Stories is clearly model-agnostic. Results can be seen in Table 4.

We also observe that the proposed changes to the vanilla feature generation method, which we call SDR, consistently outperform all approaches across all datasets, achieving a new state-of-the-art. We experiment with using a single model for all datasets, by training on Kinetics and not doing any fine-tuning for the smaller datasets. This is the last row of the table, which we call "SDR (Ours) + SM". It is remarkable and quite promising that, without the need to fine-tune, this single model achieves even better performance.

We also evaluate on the Kinetics-220 dataset as proposed in ER [8]. Fewer methods report on this split, but it is interesting as it is much larger. Results can be seen in Table 5. We observe that the proposed SDR outperforms all previous work, with significant gains of up to 5%.

Finally, we evaluate on the stricter TruZe [26] split, which ensures no overlap between the pre-trained model and test classes. Results are shown in Table 6. We report the mean class accuracy in the ZS setting and the harmonic mean of seen and unseen class accuracies in GZS. The split refers to the train/test split used.

5.2 Generalized Zero-Shot Learning Results

Generalized ZS action recognition is less explored in comparison to the ZS setting. Nonetheless, we choose models such as Bi-Dir GAN, GGM, WGAN, OD and CLASTER as recent state-of-the-art methods evaluating on this setting. We train an out-of-distribution (OOD) detector following OD [39] and two separate classifiers for the seen and unseen classes, along with the OOD network. Table 7 shows the results, with the harmonic mean of the seen and unseen class accuracies. Similar to the zero-shot case, we use both I3D and CLIP-based [44] backbones and list these results as SDR+I3D and SDR+CLIP respectively. We follow the earlier approach of using different semantic embeddings to show the performance gain that using SDR gives us.
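At inference time in the generalized setting, the OOD detector routes each test feature either to the seen-class or the unseen-class classifier. The sketch below shows one simple routing rule; the actual confidence measure and threshold used in OD-style detectors differ, so this should be read as an assumption rather than the exact procedure.

```python
# Illustrative GZSL inference with an OOD detector: route a test feature to the
# seen or unseen classifier depending on the detector's confidence.
import torch

def gzsl_predict(x, ood_detector, seen_clf, unseen_clf, seen_labels, unseen_labels, tau=0.5):
    probs = torch.softmax(ood_detector(x), dim=-1)
    if probs.max() >= tau:                                  # treated as a seen-class sample
        return seen_labels[int(seen_clf(x).argmax())]
    return unseen_labels[int(unseen_clf(x).argmax())]
```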
D  SE   Bi-Dir GAN    WGAN         OD           E2E          CLASTER      JigSawNet    ActionCLIP   ResT         X-CLIP       SDR+I3D      SDR+CLIP
O  W    40.2 ± 10.6   47.1 ± 6.4   50.5 ± 6.9   61.4 ± 5.5   63.8 ± 5.7   66.4 ± 6.8   -            -            -            55.1 ± 4.8   62.3 ± 4.5
O  ER   54.1 ± 6.8    65.5 ± 7.2   67.6 ± 6.2   66.5 ± 4.5   68.4 ± 4.1   71.5 ± 6.1   -            -            -            69.8 ± 2.8   73.5 ± 2.9
O  Sto  55.5 ± 6.5    66.2 ± 7.1   69.1 ± 5.6   69.9 ± 5.8   73.1 ± 6.6   74.9 ± 5.2   -            -            -            72.5 ± 2.1   80.1 ± 2.3
O  SM   -             -            -            -            -            -            -            -            -            74.8 ± 2.3   82.2 ± 1.6
H  W    21.3 ± 3.2    29.1 ± 3.8   30.2 ± 2.7   33.1 ± 3.4   36.6 ± 4.6   35.4 ± 3.2   40.8 ± 5.4   39.3 ± 3.5   43.7 ± 6.5   35.8 ± 4.7   38.9 ± 3.5
H  ER   25.9 ± 2.9    31.6 ± 3.1   36.1 ± 2.9   36.2 ± 1.9   43.2 ± 1.9   39.3 ± 3.9   44.2 ± 4.4   43.6 ± 2.9   46.6 ± 6.1   41.2 ± 4.3   46.2 ± 3.1
H  Sto  27.2 ± 2.7    35.5 ± 2.8   39.2 ± 2.8   38.1 ± 3.6   45.5 ± 2.6   42.5 ± 3.2   48.8 ± 3.2   47.1 ± 3.5   50.1 ± 6.1   46.8 ± 5.0   52.7 ± 3.4
H  SM   -             -            -            -            -            -            -            -            -            48.9 ± 4.4   54.4 ± 4.1
U  W    21.8 ± 3.6    25.8 ± 3.2   26.9 ± 2.8   46.2 ± 3.8   46.7 ± 5.4   53.3 ± 3.1   58.3 ± 3.4   58.7 ± 3.3   70.1 ± 3.4   41.5 ± 2.5   44.8 ± 4.2
U  ER   28.0 ± 3.4    37.9 ± 2.5   42.4 ± 3.4   52.4 ± 3.3   53.9 ± 2.5   56.8 ± 2.8   64.3 ± 3.8   62.6 ± 4.1   72.2 ± 2.3   50.3 ± 1.1   54.2 ± 3.5
U  Sto  29.5 ± 3.2    40.1 ± 3.7   50.3 ± 3.0   55.1 ± 3.3   59.6 ± 2.8   61.6 ± 3.5   71.8 ± 2.7   65.3 ± 2.5   73.8 ± 2.9   62.9 ± 1.6   73.4 ± 2.7
U  SM   -             -            -            -            -            -            -            -            -            64.9 ± 2.1   75.5 ± 3.2
Table 4 Results on ZSL. SE: semantic embedding, W: word2vec embedding, ER: Elaborate Rehearsals, Sto: Stories. SM corresponds to the single model training. We use the datasets Olympics (O), HMDB51 (H) and UCF101 (U).

Method          Top-1 Acc    Top-5 Acc
DEVISE [14]     23.8 ± 0.3   51.0 ± 0.6
SJE [1]         22.3 ± 0.6   48.2 ± 0.4
ER [8]          42.1 ± 1.4   73.1 ± 0.3
JigSawNet [48]  45.9 ± 1.6   78.8 ± 1.0
SDR+I3D         50.8 ± 1.9   82.9 ± 1.3
SDR+CLIP        55.1 ± 2.2   86.1 ± 3.1
Table 5 Results of ZS on Kinetics-220.

Method        UCF101: Split ZSL  GZSL   HMDB51: Split ZSL  GZSL
WGAN          67/34   22.5  36.3         29/22   21.1  31.8
OD            67/34   22.9  42.4         29/22   21.7  35.5
CLASTER       67/34   45.8  47.3         29/22   33.2  44.5
SDR+I3D       67/34   49.7  51.3         29/22   34.9  45.5
SDR+CLIP      67/34   53.9  56.2         29/22   38.7  49.5
VCAP [12]     0/34    49.1  -            0/22    20.4  -
SDR+I3D SM    0/34    51.5  52.2         0/22    36.1  46.6
SDR+CLIP SM   0/34    55.1  57.7         0/22    40.8  51.1
Table 6 Results of ZS and GZS on TruZe.

5.3 Generalized Zero-Shot Action Recognition Results in Detail

In order to better analyze the performance of the model on GZSL, we report the average seen and unseen accuracies along with their harmonic mean. The results using different embeddings on the UCF101, HMDB51 and Olympics datasets are reported in Table 8. The reported results are on the same set of 10 random splits for fair comparison. There are no manual attributes for the HMDB dataset. We see that the proposed SDR approach obtains the best results in all three categories. Another observation is that the performance of all models using Stories is better than even the older manual attributes.

Dataset   SE   Bi-Dir GAN    GGM           WGAN         OD           CLASTER      SDR+I3D      SDR+CLIP
Olympics  W    44.2 ± 11.2   52.4 ± 12.2   59.9 ± 5.3   66.2 ± 6.3   69.1 ± 5.4   67.1 ± 4.2   69.7 ± 4.1
Olympics  ER   53.6 ± 6.2    59.1 ± 12.1   63.7 ± 6.6   69.7 ± 6.5   72.5 ± 3.5   70.4 ± 2.3   75.5 ± 3.3
Olympics  Sto  55.9 ± 4.2    59.9 ± 11.6   67.1 ± 5.1   72.4 ± 4.9   74.9 ± 6.1   74.5 ± 3.9   79.5 ± 3.5
Olympics  SM   -             -             -            -            -            76.6 ± 3.6   81.1 ± 3.0
HMDB51    W    17.5 ± 2.4    20.1 ± 2.1    32.7 ± 3.4   36.1 ± 2.2   48.0 ± 2.4   36.3 ± 5.1   39.5 ± 3.1
HMDB51    ER   26.1 ± 2.3    28.2 ± 3.3    35.2 ± 3.5   38.1 ± 2.4   50.8 ± 2.8   42.1 ± 4.5   45.4 ± 3.6
HMDB51    Sto  27.7 ± 2.1    29.1 ± 2.9    38.5 ± 3.3   40.9 ± 3.8   53.5 ± 2.4   49.7 ± 2.9   53.5 ± 3.3
HMDB51    SM   -             -             -            -            -            50.9 ± 2.6   56.1 ± 3.2
UCF101    W    22.7 ± 2.5    23.7 ± 1.2    44.4 ± 3.0   49.4 ± 2.4   51.3 ± 3.5   41.9 ± 2.7   44.3 ± 4.6
UCF101    ER   26.2 ± 4.2    26.5 ± 2.5    46.1 ± 3.5   53.2 ± 3.1   52.8 ± 2.1   45.5 ± 1.1   49.1 ± 2.9
UCF101    Sto  29.1 ± 3.4    27.7 ± 3.1    48.3 ± 3.2   55.5 ± 3.3   54.1 ± 3.3   54.9 ± 4.4   57.8 ± 4.1
UCF101    SM   -             -             -            -            -            57.2 ± 3.5   59.7 ± 3.1
Table 7 Results on GZSL. SE: semantic embedding, W: word2vec embedding, ER: Elaborate Rehearsals, Sto: Stories. SM corresponds to the single model training.

Model         SE   Olympics (u / s / H)   HMDB51 (u / s / H)   UCF-101 (u / s / H)
WGAN [55]     M    50.8  71.4  59.4       -     -     -        30.4  83.6  44.6
OD [39]       M    61.8  71.1  66.1       -     -     -        36.2  76.1  49.1
CLASTER [25]  M    66.2  71.7  68.8       -     -     -        40.2  69.4  50.9
SDR           M    71.6  76.9  74.2       -     -     -        43.1  77.5  54.6
WGAN [55]     W    35.4  65.6  46.0       23.1  55.1  32.5     20.6  73.9  32.2
OD [39]       W    41.3  72.5  52.6       25.9  55.8  35.4     25.3  74.1  37.7
CLASTER       W    49.2  71.1  58.1       35.5  52.8  42.4     30.4  68.9  42.1
WGAN [55]     S    36.1  66.2  46.7       28.6  57.8  38.2     27.5  74.7  40.2
OD [39]       S    42.9  73.5  54.1       33.4  57.8  42.3     32.7  75.9  45.7
CLASTER       S    49.9  71.3  58.7       42.7  53.2  47.4     36.9  69.8  48.3
CLASTER       C    66.8  71.6  69.1       43.7  53.3  48.0     40.8  69.3  51.3
WGAN [55]     Sto  52.5  73.4  61.2       35.2  65.1  45.7     33.8  84.2  48.2
OD [39]       Sto  63.3  75.1  68.7       37.2  67.5  47.9     40.1  81.7  53.8
CLASTER       Sto  69.1  74.1  71.5       44.3  57.2  49.9     42.1  71.5  53.0
SDR+I3D       Sto  73.5  79.9  76.6       46.9  55.8  50.9     44.4  80.7  57.2
SDR+CLIP      Sto  78.9  83.5  81.1       52.5  60.4  56.1     47.3  81.2  59.7
Table 8 Seen and unseen accuracies for different methods on different datasets using different embeddings. 'SE' corresponds to the type of embedding used, wherein 'M', 'W', 'S', 'C' and 'Sto' refer to manual annotations, word2vec, sen2vec, a combination of the embeddings, and Stories, respectively. 'u', 's' and 'H' correspond to average unseen accuracy, average seen accuracy and the harmonic mean of the two. All the reported results are on the same splits. SDR+I3D corresponds to the backbone network being I3D, and similarly for SDR+CLIP.

6 Why Does Using a Single Model Work?

One curious question to ponder is why the single model trained on a large dataset like Kinetics-400 [7] results in better performance than the models fine-tuned on the smaller datasets. Our hypothesis is that the feature generator trained on a larger dataset has a better distribution of data to learn from, as the data-driven noise that we use is more representative of the real visual world. This in turn generates more realistically distributed features, which in turn results in the improved performance.

7 Conclusion

The introduction of the novel Stories dataset provides rich textual narratives that establish connections between diverse action classes. Leveraging this contextual data enables modeling of nuanced relationships between actions, overcoming previous limitations in ZS action recognition. Stories enables significant improvements for multiple SOTA models. Building on Stories, our proposed feature-generating approach achieves new state-of-the-art performance on multiple benchmarks without any target dataset fine-tuning. This demonstrates the value of Stories as a resource to enable ZS transfer and significant progress in video understanding without reliance on large labeled datasets.
References

[1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
[2] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
[3] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In ICML, volume 2, page 4, 2021.
[4] B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka. Rethinking zero-shot video classification: End-to-end training for realistic applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4613–4623, 2020.
[5] M. Bucher, S. Herbin, and F. Jurie. Generating visual representations for zero-shot classification. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2666–2673, 2017.
[6] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
[7] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[8] S. Chen and D. Huang. Elaborative rehearsal for zero-shot action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13638–13647, 2021.
[9] S. Chen, W. Wang, B. Xia, Q. Peng, X. You, F. Zheng, and L. Shao. Free: Feature refinement for generalized zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 122–131, 2021.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[12] V. Estevam, R. Laroca, D. Menotti, and H. Pedrini. Tell me what you see: A zero-shot action recognition method based on natural language descriptions. arXiv preprint arXiv:2112.09976, 2021.
[13] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
[14] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems, 26, 2013.
[15] C. Gan, M. Lin, Y. Yang, G. De Melo, and A. G. Hauptmann. Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[16] C. Gan, M. Lin, Y. Yang, Y. Zhuang, and A. G. Hauptmann. Exploring semantic inter-class relationships (sir) for zero-shot action recognition. In Proceedings of the National Conference on Artificial Intelligence, 2015.
[17] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 87–97, 2016.
[18] J. Gao, T. Zhang, and C. Xu. I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8303–8311, 2019.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[20] S. N. Gowda. Human activity recognition using combinatorial deep belief networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–6, 2017.
[21] S. N. Gowda. Synthetic sample selection for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 58–67, 2023.
[22] S. N. Gowda, A. Arnab, and J. Huang. Optimizing vivit training: Time and memory reduction for action recognition. arXiv preprint arXiv:2306.04822, 2023.
[23] S. N. Gowda, M. Rohrbach, F. Keller, and L. Sevilla-Lara. Learn2augment: Learning to composite videos for data augmentation in action recognition. In European Conference on Computer Vision, pages 242–259. Springer, 2022.
[24] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara. Smart frame selection for action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1451–1459, 2021.
[25] S. N. Gowda, L. Sevilla-Lara, F. Keller, and M. Rohrbach. Claster: Clustering with reinforcement learning for zero-shot action recognition. In European Conference on Computer Vision, pages 187–203. Springer, 2022.
[26] S. N. Gowda, L. Sevilla-Lara, K. Kim, F. Keller, and M. Rohrbach. A new split for evaluating true zero-shot action recognition. arXiv preprint arXiv:2107.13029, 2021.
[27] S. N. Gowda and C. Yuan. Colornet: Investigating the importance of color spaces for image classification. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14, pages 581–596. Springer, 2019.
[28] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V. Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, C. Fuegen, A. Gebreselasie, C. Gonzalez, J. Hillis, X. Huang, Y. Huang, W. Jia, W. Khoo, J. Kolar, S. Kottur, A. Kumar, F. Landini, C. Li, Y. Li, Z. Li, K. Mangalam, R. Modhugu, J. Munro, T. Murrell, T. Nishiyasu, W. Price, P. R. Puentes, M. Ramazanova, L. Sari, K. Somasundaram, A. Southerland, Y. Sugano, R. Tao, M. Vo, Y. Wang, X. Wu, T. Yagi, Y. Zhu, P. Arbelaez, D. Crandall, D. Damen, G. M. Farinella, B. Ghanem, V. K. Ithapu, C. V. Jawahar, H. Joo, K. Kitani, H. Li, R. Newcombe, A. Oliva, H. S. Park, J. M. Rehg, Y. Sato, J. Shi, M. Z. Shou, A. Torralba, L. Torresani, M. Yan, and J. Malik. Ego4d: Around the World in 3,000 Hours of Egocentric Video. In CVPR, 2022.
[29] Z. Han, Z. Fu, G. Li, and J. Yang. Inference guided feature generation for generalized zero-shot learning. Neurocomputing, 430:150–158, 2021.
[30] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[31] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2452–2460, 2015.
[32] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
[33] K. Li, Y. Wang, P. Gao, G. Song, Y. Liu, H. Li, and Y. Qiao. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.
[34] C.-C. Lin, K. Lin, L. Wang, Z. Liu, and L. Li. Cross-modal representation learning for zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19978–19988, 2022.
[35] J. Lin, C. Gan, and S. Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.
[36] J. Liu, H. Bai, H. Zhang, and L. Liu. Beyond normal distribution: More factual feature generation network for generalized zero-shot learning. IEEE MultiMedia, 2022.
[37] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
[38] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[39] D. Mandal, S. Narayan, S. K. Dwivedi, V. Gupta, S. Ahmed, F. S. Khan, and L. Shao. Out-of-distribution detection for generalized zero-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9985–9993, 2019.
[40] P. Mettes and C. G. Snoek. Spatial-aware object embeddings for zero-shot localization and classification of actions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4443–4452, 2017.
[41] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[42] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[43] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018.
[44] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision, pages 1–18. Springer, 2022.
[45] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, pages 392–405. Springer, 2010.
[46] M. Pagliardini, P. Gupta, and M. Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of NAACL-HLT, pages 528–540, 2018.
[47] T. Perrett, A. Masullo, T. Burghardt, M. Mirmehdi, and D. Damen. Temporal-relational crosstransformers for few-shot action recognition. arXiv preprint arXiv:2101.06184, 2021.
[48] Y. Qian, L. Yu, W. Liu, and A. G. Hauptmann. Rethinking zero-shot action recognition: Learning from latent atomic actions. In European Conference on Computer Vision, pages 104–120. Springer, 2022.
[49] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, and Y. Wang. Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2833–2842, 2017.
[50] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019.
[51] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele. Script data for attribute-based recognition of composite activities. In European Conference on Computer Vision, 2012.
[52] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[53] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[54] V. K. Verma, G. Arora, A. Mishra, and P. Rai. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4281–4289, 2018.
[55] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5542–5551, 2018.
[56] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.
[57] Y. Xian, S. Sharma, B. Schiele, and Z. Akata. f-vaegan-d2: A feature generating framework for any-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10275–10284, 2019.
[58] X. Xu, T. Hospedales, and S. Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3):309–333, 2017.
[59] X. Xu, T. M. Hospedales, and S. Gong. Multi-task zero-shot action recognition with prioritised data augmentation. In European Conference on Computer Vision, pages 343–359. Springer, 2016.
[60] Y. Zhu, Y. Long, Y. Guan, S. Newsam, and L. Shao. Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9436–9445, 2018.

You might also like