Recognition
Shreyank N Gowda* and Laura Sevilla-Lara
* University of Oxford, United Kingdom.
University of Edinburgh, United Kingdom.
Abstract
Video understanding has long suffered from reliance on large labeled datasets, motivating research into
zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot
video analysis, but constructing an effective semantic space relating action classes remains challenging.
We address this by introducing a novel dataset, Stories, which contains rich textual descriptions
for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence
narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This
contextual data enables modeling of nuanced relationships between actions, paving the way for zero-
shot transfer. We also propose an approach that harnesses Stories to improve feature generation for
training zero-shot classification. Without any target dataset fine-tuning, our method achieves new
state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories
provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual
narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled
data that has long impeded advancements in this exciting domain. The data can be found here:
https://github.com/kini5gowda/Stories.
1 Introduction
Fig. 1 Comparison of accuracy across state-of-the-art ZS approaches using different semantic embeddings: the proposed Stories, word2vec [41] (W2V) and elaborative definitions [8] (ER), on UCF101 [53]. Using the proposed Stories to create the semantic space of class labels improves the performance by a large margin across all methods, showing that it is model-agnostic.

…output the corresponding class label. Approaches typically learn a mapping from the visual space to the class labels using the seen classes and leverage that mapping in different ways to approximate the mapping in the unseen classes. Some examples of solutions include learning to map visual information and class labels to the same space, or learning to generate visual features using a generative adversarial network (GAN) [55].

One of the underlying assumptions of these approaches is that the distances between the data points are meaningful both in the visual space and the semantic space. In other words, data points that are close together should be similar in content across both seen and unseen classes. In visual space, this tends to happen naturally as related classes will share objects, scenes, etc. For example, if we compare classes such as "penalty shot" and "playing soccer", they will share the ball, the soccer field, etc. However, in the space of action labels, which we also refer to as semantic space, this property is not straightforward to achieve. While some similar classes will contain overlapping terms (such as "horse-back riding" and "horse racing"), some others might be similar but not contain overlapping words (such as the previous example of "penalty shot" and "playing soccer"). This makes the step of transferring knowledge between seen and unseen classes harder. Previous efforts to improve the semantic space of class labels have included the use of manually annotated attributes or embedding functions trained on language corpora, such as word2vec [41], sentence2vec [46] and using definitions of actions [8].

In this work, we address the problem of building a meaningful space of action labels by leveraging the story around each action. In particular, we use the descriptions of the steps needed to achieve each action, which are the Stories around this action, and encode them using a language embedding. These steps typically contain the objects, tools, scenes, verbs, adjectives, etc., associated with the action label. One could think of all these additional pieces of information around an action as the "common sense" of associations humans would typically consider. For example, in the case of the penalty shot, the steps would describe to first place the ball, run the hand through the grass, fluff the ball, take some steps back, kick, etc. When playing soccer, the steps include kicking the ball with the inside of your shoe for short passes across the grass, tapping the ball from foot to foot, etc. When we compare the stories of steps around these two classes, the overlap of terms becomes much more obvious. It is more likely that these related classes are closer in semantic space, facilitating the transfer of knowledge between seen and unseen classes. We show that this relatively simple approach to creating a semantic space is extremely effective across datasets and methods, improving performance by up to 20% compared to the standard word2vec [41]. Figure 1 shows that all state-of-the-art methods improve significantly and therefore the proposed semantic embedding is general and model-agnostic.

Finally, we leverage these Stories to go beyond simply learning semantic representations, to actually generating additional features that improve the semantic space further. We follow a feature-generating approach [39, 55] using a GAN [19] to synthesize visual data points from these semantic embeddings. These synthetic visual data points are then used to learn a mapping from visual to semantic space in the unseen classes. We show that this method improves state-of-the-art by an additional 6.1%. We observe a strong trend of feature generating networks benefiting particularly from using Stories. The largest jumps using Stories are in feature generating methods (SDR-I3D, SDR-CLIP and OD), as shown in Figure 1.
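As a toy illustration of the overlap argument, comparing bag-of-words representations of two invented step descriptions yields a higher cosine similarity than comparing the bare class labels. The step texts and helper names below are hypothetical, not taken from the dataset:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b))

# Invented step descriptions: they share terms ("ball", "kick",
# "grass") that the raw labels "penalty shot" / "playing soccer" do not.
penalty = bow("place the ball take steps back kick the ball across the grass")
soccer = bow("kick the ball with your shoe short passes across the grass")

assert cosine(penalty, soccer) > cosine(bow("penalty shot"), bow("playing soccer"))
```

The bare labels have no words in common, so their similarity is zero, while the step descriptions overlap substantially; a learned language embedding plays the same role with a denser notion of similarity.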
2 Related Work

Fully Supervised Action Recognition.
In this setting, there is a large amount of training samples with their associated labels, and the label spaces are the same at train and test time. Much of the advances in this area of research is often used in ZS. Early work in deep learning for action recognition used many tools to represent the spatio-temporal nature of videos, including 3D CNNs [7], 2D CNNs with temporal modules [35], 2D CNNs with relational modules [24] and two-stream networks [20, 52]. More recently, the transformer architecture [11] is particularly well suited to represent sequential information and has been successfully adapted to the video domain [2, 3, 13, 22, 33, 37]. While these are extremely powerful tools, they are difficult to train with limited data. Instead, we use the standard I3D [7] as our backbone feature generator to compare directly with recent state-of-the-art papers [25, 39].

Zero-Shot Action Recognition.
Early work [51] in this setting used script data in cooking activities to transfer to unseen classes. Considering each action class as a domain, Gan et al. [17] address the identification of semantic representations as a multisource domain generalization problem. To obtain semantic embeddings of class names, label embeddings such as word2vec [41] have proven popular as only class names are needed. Some approaches use a common embedding space between video features and class labels [58, 59], error-correcting codes [49], pairwise relationships between classes [15], interclass relationships [16], out-of-distribution detectors [39], and graph neural networks [18]. Recently, it was seen that clustering of joint visual-semantic features helped obtain better representations [25] for ZS action recognition. Similar to CLASTER, ReST [34] jointly encodes video data and text labels for ZS action recognition. In ReST, transformers are used to perform modality-specific attention. JigSawNet [48] also models visual and textual features jointly but decomposes videos into atomic actions in an unsupervised manner and bridges group-to-group relationships between visual and semantic representations instead of the one-to-one relationships that CLASTER and ReST do. Unlike JigSawNet, which works on the visual features, we create enriched textual descriptions by decomposing actions into a series of steps and hence obtain richer semantic features. Other solutions to deal with limited labeled data include augmentation [23] and relational modules [47].

Semantic Embeddings.
To obtain semantic embeddings of action class labels, earlier works use word2vec [41] directly on the class labels. However, word2vec averages the embedding for class labels with multiple words, giving equal weight to each word. This causes class names to lose context. For example, the class "pommel horse" is a gymnastics class which does not involve the animal horse. Unfortunately, using word2vec makes the embedding of that class close to "horse riding" or "horse racing" in the word2vec space, even though they are not gymnastics sports. A recent solution is to use elaborate descriptions [8] based on the principle of Elaborative Rehearsals (ER), which replace each class name with a class definition. An object description was also used to describe the objects in the particular action. This resulted in a significant boost in performance. Still, ER uses descriptions of each word in a class label independently to create the semantic space, which potentially leads to errors.

Feature Generating Networks for ZS.
Bucher et al. [5] proposed to bridge the gap between seen and unseen classes by synthesizing features for the unseen classes using four different generator models. Xian et al. [55] trained their generators by using a conditional generative adversarial network (GAN) [42]. In contrast, Verma et al. [54] trained a variational autoencoder. Mishra et al. [43] introduced the generative approach to the domain of ZS action recognition. Their approach models each action class as a probability distribution in the visual space. Mandal et al. [39] modified the work by Xian et al. [55] to work directly on video features. They also introduced an out-of-distribution detector to help with the generalized zero-shot learning setting. Here we propose a variant of the work on out-of-distribution detectors, which suffers less from long converging times yet improves its accuracy. More recently, Gowda et al. [21] proposed to have a selector that selects generated features based on their importance in training the classifier rather than on the realness of the features.

3 The Stories Dataset

Research in semantic representation of action labels has shown over the years that more sophisticated representations help to build a meaningful semantic space for ZS action recognition. Here we go beyond previous work by representing not only the class label but the story around it. That is, we build a representation that captures all the steps needed to perform the action, which include objects, verbs, etc., typically associated with that action. We call these representations Stories, and we now describe how we build them.

3.1 Building the Stories Dataset

We leverage textual descriptions of actions from WikiHow (https://www.wikihow.com/), a website that gives instructions on how to perform actions. These instructions consist of long paragraphs that describe each step in completing the action. For example, for the action classes in the HMDB51 dataset, the WikiHow articles contain an average of 9.8 steps, ranging from 4 to 20 steps. The most closely related work to us, ER [8], uses a single or at most two sentences per class. Instead, for the proposed Stories, the average number of lines is 14.4 for the classes in the Olympics dataset, 9.6 for HMDB51, 13.2 for UCF101, and 13.5 for Kinetics. These rich descriptions inherently contain information on the objects needed to perform an action. For example, the class 'biking' has a paragraph that explains where to bike, what equipment you need, and the steps to do it. In comparison, ER [8] has only a single-line definition of the class.

Not all classes have articles on WikiHow. For example, if we search for "draw sword" (a class in UCF101), we will get instructions on how to paint or sketch swords instead of the steps needed to essentially remove a sword from its sheath. Hence, collecting clean, meaningful articles requires a more complex process than a simple search. After scraping the articles from WikiHow corresponding to all classes, we use sentence-BERT encoders [50] to represent the sentences in the article and use cosine similarity to find the 25 that are the most similar to the class definition [8].

Next, we manually check the sentences for each class. If we find a mismatch between the article and the action class, as was the case with the "draw sword" example, we do a manual search to pick the most relevant article from other sources such as Wikipedia. However, these alternative articles do not tend to contain the sequence of steps and hence need more manual intervention to order the sentences into a sequence of steps.

We finally clean each story by re-arranging the sentences in sequential order and removing irrelevant sentences. In total, we had 6 people who manually cleaned the descriptions after the initial stage of noisy collection and a further 10 who verified the descriptions. This was done using the Prolific (https://www.prolific.co/) platform. The time taken for cleaning Stories for UCF101, HMDB51 and Kinetics was 7.2 hours, 3.3 hours and 25.3 hours on average, respectively. We followed this process to create a dataset of these textual representations for classes in UCF101, HMDB51, Olympics, and Kinetics-400, as they are the most commonly used datasets for ZS action recognition. Detailed descriptions of each class can be found at https://github.com/kini5gowda/Stories.

3.2 Learning from Stories

In order to test the impact of using richer semantic representations, we used Stories as input to the state-of-the-art methods in ZS. To provide fair points of comparison, we also supply the standard word2vec [41] embeddings as well as the more recently introduced ER [8] embeddings to these same models. By visually depicting the performance gains achieved by the models when using our proposed embeddings instead of the word2vec or ER embeddings in Fig. 1, we are able to clearly demonstrate the substantial improvements obtained through our approach. It increases up to 21% compared to the widely used word2vec embeddings and 11.8% over the ER embeddings. This comprehensive experimental analysis highlights how our semantically richer embeddings can notably boost performance across a diverse range of state-of-the-art models and is hence model agnostic.
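The sentence-ranking step of Sec. 3.1 (encode the article sentences, keep the 25 most similar to the class definition) can be sketched as follows. The sentence-BERT encoding is assumed to have already produced the embedding matrices, and the function name is ours:

```python
import numpy as np

def top_k_sentences(sentence_embs, class_def_emb, k=25):
    """Rank article sentences by cosine similarity to the class
    definition embedding and keep the indices of the k most similar.

    sentence_embs: (n, d) array of sentence embeddings (e.g. S-BERT).
    class_def_emb: (d,) embedding of the class definition.
    """
    s = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    c = class_def_emb / np.linalg.norm(class_def_emb)
    sims = s @ c                   # cosine similarities, shape (n,)
    order = np.argsort(-sims)[:k]  # best-first indices
    return order, sims[order]
```

The selected sentences then go through the manual checking and re-ordering described above before becoming a story.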
Fig. 2 Comparing nearest neighbors using Stories. We see an example where ER fails and Stories provides more context and helps in obtaining better neighbors. This is one example where ER fails; there are multiple such examples. The dataset is UCF101.
We visualize the effect of using ER and Stories to generate features, using t-SNE [38] for 10 classes in UCF101, all related to gymnastics and therefore easier to confuse. We see that using Stories helps keep a more meaningful neighborhood for visual instances, and keeps classes apart. The visualization also shows why ER fails, as there is not enough information to clearly distinguish classes.

3.3 Why Are Stories Necessary?

The proposed Stories dataset produces semantic embeddings for the class labels that are much more meaningful than previous work, and that reflects in the experimental improvement shown in Fig. 1. Here we delve deeper into what properties make Stories superior.

Capturing meaning of words jointly.
Previous work [8, 41] represents a class label by computing a representation for each word in the class label and then computing the average (in the case of ER, this is done for some of the classes). While this in general is a sensible choice, it often leads to errors caused by words that have multiple meanings depending on the context. Fig. 2 illustrates this issue with an example. We use the class "Hammer Throw" from the UCF101 [53] dataset, which is a sporting event in which the athlete throws a spherical object. If we retrieve the nearest neighbor with ER [8], we obtain the class "Hammering", which is not actually related in meaning, but both contain the description of the tool hammer. However, if we use Stories, the nearest neighbor is "Shot Put", which is also a sporting event where the athlete throws a spherical object. Similar problems happen for example with classes such as "Sword Exercise" or "Swing Baseball". Overall, this shows the need for the more sophisticated joint description that Stories provides, instead of the per-word definition of ER.

Size of dataset.
We also compare Stories to ER from a statistical perspective. We have briefly mentioned before that Stories contains much more detailed descriptions of the classes. Here we look at the numerical difference between ER and Stories, shown in Table 1. The number of sentences is one order of magnitude larger, going from over 1 sentence on average per class to over 10 sentences on average, depending on the dataset. This ratio is also consistent in the number of nouns, verbs, etc.

Diversity of dataset.
Another aspect that increases the specificity of the class descriptions, in addition to the size of the dataset, is the diversity of the vocabulary. We look at the number of unique words in Table 1 and observe that Stories contains more unique words than ER in all datasets, and that the difference is particularly remarkable in smaller datasets. We argue that this diversity contributes to representing each class label in a more unique way, leading to a more sparse yet meaningful space.

Cleaning data manually.
Generally, training with more data tends to produce better results. There often is a tension between using a smaller amount of clean data or a larger amount of noisy data. Here we have
Fig. 3 Using the cleaned version of Stories to create the semantic space of class labels improves the performance by a large margin. The dataset is UCF101.

Table 1 Statistical comparison of Stories to ER [8]. The numbers of Nouns (N), Verbs (V), Adverbs (A), Adjectives (Ad), Unique Words (UW), and Sentences (S) are averages across all the classes in a particular dataset (D). 'K' refers to Kinetics-600, 'U' to UCF101, 'H' to HMDB51, and 'O' to Olympics.

Method   D   N     V     A     Ad    UW    S
Stories  K   68.5  37.9  7.1   0.04  35.1  13.5
ER       K   9.2   3.8   1.0   0.01  30.2  1.8
Stories  U   69.6  37.2  7.7   0.1   32.1  13.2
ER       U   9.7   4.3   1.4   0.06  29.5  1.8
Stories  H   59.2  33.4  6.5   0.3   34.5  9.6
ER       H   5.8   2.1   0.9   0.05  17.7  1.2
Stories  O   68.9  38.6  11.8  2.0   57.2  14.4
ER       O   9.1   4.2   1.2   0.3   31.4  1.7
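Two of the Table 1 statistics, the sentence count (S) and unique-word count (UW), can be reproduced per class description with plain string processing; the POS-based counts (N, V, A, Ad) would additionally require a part-of-speech tagger. This helper is a sketch of ours, not the paper's actual script:

```python
import re

def description_stats(description):
    """Sentence and unique-word counts for one class description,
    in the spirit of the S and UW columns of Table 1."""
    sentences = [s for s in re.split(r"[.!?]+", description) if s.strip()]
    words = re.findall(r"[a-z']+", description.lower())
    return {"sentences": len(sentences), "unique_words": len(set(words))}
```

For example, `description_stats("Place the ball. Kick the ball!")` returns `{"sentences": 2, "unique_words": 4}`; averaging such counts over all classes of a dataset gives Table 1 style numbers.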
One possible limitation of our approach is that the Stories may focus on one specific way of performing each action, while other valid methods may exist to do the same action. For example, the story for "shuffling cards" details the riffle shuffling technique, but other shuffling techniques such as the overhand shuffle could occur in videos of this class.

4.1 Hyperparameter Selection

Choosing the number of nearest neighbors for both the data-based noise and the ranking loss is done empirically. We use the generalized zero-shot action recognition performance to decide these hyperparameters. We choose UCF101 as our dataset for the hyperparameter tuning, but also plot the results on HMDB51 as it also ended up following the same
maps visual features (x_i) to a semantic embedding approximation (â_i). D separates real from synthesized features. They are jointly trained via a mini-max game using an optimization function (see Eq. 1).
Fig. 6 Using Stories for feature generation. The elements depicted in yellow are the standard vanilla approach to feature
generation for ZS. Depicted in green are the elements that we introduce.
p(x_S × a_S) is the joint distribution of visual and semantic descriptors for seen classes, p_a is the empirical distribution of their semantic embeddings, p_z is noise, and α is a penalty coefficient. Additional losses to enhance generated features are the classification regularized loss (L_CLS) and the mutual information loss (L_MI). These losses form the objective function minimized to train the vanilla pipeline (see Eq. 2):

min_G min_P max_D  L_D + λ1 L_CLS(G) + λ2 L_MI(G).   (2)

Once these networks are trained on the seen classes, the generator is used to synthesize visual features for the unseen classes. The final step is to train a simple classifier using these synthetic visual features as input and the class labels. The loss is the standard cross-entropy loss, and the classifier is a simple multilayer perceptron (MLP).

4.6 Ranking Loss

One of the risks of learning to generate semantic embeddings (through the "Projection Network" in Fig. 6) is that synthetic semantic embeddings can be too similar to each other. To avoid this, we introduce a ranking loss [14] that pushes apart the generated semantic representation (â_i) from those of their neighboring classes:

L_rank = E[max(0, δ − a^T â_i + (a′)^T â_i)],   (3)

where a is the ground truth semantic embedding, a′ is the semantic embedding of a class randomly sampled from the 5 classes closest to the ground truth (empirical results in Sec. 4.1), and δ is a hyperparameter. Including this loss in the overall objective function, we obtain:

min_G min_P max_D  L_D + λ1 L_CLS(G) + λ2 L_rank(P) + λ3 L_MI(G).   (4)
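For a single sample, the ranking loss of Eq. 3 can be sketched as below. This is our own minimal illustration; the paper applies the loss in expectation over generated embeddings, and the default margin value here is illustrative:

```python
import numpy as np

def ranking_loss(a_true, a_hat, a_neg, delta=0.5):
    """Hinge-style ranking loss from Eq. 3 for one sample.

    a_true: ground-truth semantic embedding a of the class.
    a_hat:  generated (projected) semantic embedding â_i.
    a_neg:  embedding a' of a class sampled from the 5 classes
            nearest to the ground truth (see Sec. 4.1).
    delta:  margin hyperparameter δ (value here is illustrative).
    """
    return max(0.0, delta - float(a_true @ a_hat) + float(a_neg @ a_hat))
```

The loss reaches zero once the generated embedding's inner product with its own class exceeds that with the neighboring class by at least the margin δ, which is exactly the "push apart" behavior described above.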
4.7 Features

For our visual features we consider two scenarios. In the first case, the appearance and flow features are extracted from the Mixed 5c layer of the RGB and flow I3D networks, respectively. Both I3D models are pre-trained on the Kinetics-400 dataset [7]. Given an input video, the extracted appearance and flow features are averaged across the temporal dimension, pooled by 4 in the spatial dimension, and then flattened to obtain a vector of size 4096 each. These vectors are then concatenated to obtain video features of size 8192.

In the second case, we first train the X-CLIP-B/16 [44] on 16 frames of the non-overlapping classes of Kinetics [4], dubbed Kinetics-664 [4], using the proposed Stories as the semantic embedding. For the text embeddings we use the large S-BERT [50], which is a sentence encoder. For ER we use the class definition as input to the S-BERT and use the 1024-sized vector output as the semantic embedding. In the case of Stories, we use S-BERT for each sentence and average all the vectors to obtain a single vector of size 1024.

4.8 Network Architecture

We use the Wasserstein GAN [55], which has been successful in both zero-shot image classification [56] and zero-shot action recognition [39] tasks. This also allows us to compare directly to OD [39] and Wasserstein GAN [55] in the experimental analysis.

The feature generator G is a three-layer fully-connected network that has an output layer dimension equal to that of the video feature size. The hidden layers are of size 4096. The discriminator D is also a three-layer fully-connected network with hidden layers of size 4096; however, the output size equals 1. The projection network P is a fully-connected network that has an output layer size equal to the size of the semantic embeddings (in our case 1024).

4.9 Training Details

All the modules are trained using the Adam optimizer with a weight decay of 0.0005 and with an adaptive learning rate using a learning rate scheduler. We set λ1 as 0.1, λ2 as 0.9 and λ3 as 0.1. At test time, we follow OD [39] and train a single classifier for ZSAR and two classifiers for GZSAR along with an out-of-distribution (OOD) detector. The classifiers are single-layer fully-connected networks with an input size equal to the video feature size and output sizes equal to the number of classes (seen or unseen). The OOD is a three-layer fully connected network with output and hidden layer sizes equal to the number of seen classes and 512, respectively. We use 8 RTX 2080 Ti NVIDIA GPUs having 16 GB RAM each for our experiments.

4.10 Ablation Study

We propose modifications to the vanilla pipeline of feature generation and, based on this, we show the importance of each component here. We see that every proposed contribution benefits over the baseline. However, crucially, a combination of all three gives us our best results. We note that the improvement from the ranking loss is much more prominent in the generalized zero-shot setting than in the zero-shot.

Table 2 Ablation study to explore the impact of each proposed component.

Stories  DBN  Lrank  ZSL HMDB     ZSL UCF      GZSL HMDB    GZSL UCF
×        ×    ×      29.1 ± 3.8   37.5 ± 3.1   32.7 ± 3.4   44.4 ± 3.0
×        ×    ✓      29.7 ± 3.5   38.0 ± 3.1   35.3 ± 3.1   47.1 ± 3.2
×        ✓    ×      30.6 ± 2.2   38.6 ± 3.4   33.3 ± 3.0   44.9 ± 2.9
×        ✓    ✓      31.9 ± 3.2   40.9 ± 2.9   35.7 ± 2.9   47.9 ± 4.1
✓        ×    ×      44.6 ± 2.9   60.4 ± 3.8   44.9 ± 3.6   51.0 ± 2.9
✓        ×    ✓      45.0 ± 2.5   60.9 ± 3.5   49.0 ± 3.2   54.4 ± 3.7
✓        ✓    ×      45.9 ± 2.7   61.4 ± 2.8   47.5 ± 2.6   53.7 ± 3.5
✓        ✓    ✓      46.5 ± 5.3   61.9 ± 2.5   49.7 ± 2.9   54.9 ± 4.4

4.11 Why Not Just Use VAE for Feature Generation?

Another possible question is the use of the current feature generator model. There are multiple options to use as feature generators, including VAEs and other versions of GANs (not just the WGAN [55] that we use). We chose to adapt the WGAN for our feature generator based on two reasons. First, we wanted to compare directly to existing literature on zero-shot action recognition, and to the best of our knowledge the most recent one has been the one used in OD [39]. However, for the sake of sanity we also ran additional experiments on the HMDB51 dataset incorporating f-VAEGAN [57], adapted FREE [9] (feature refinement of f-VAEGAN) for zero-shot action recognition, and using a simple VAE. The results of this can be seen in Table 3.

Table 3 Comparing different choices for feature generator. Reported results are on 10 different runs and all models use the same split. The dataset is HMDB51.

Feature Generator  Accuracy
VAE                25.5 ± 2.9
Vanilla GAN        31.5 ± 2.4
f-VAEGAN           45.9 ± 3.2
FREE               46.6 ± 3.5
SDR (Ours)         48.1 ± 3.6

4.12 Datasets and Evaluation Protocol

We use the Olympic Sports [45], HMDB-51 [32], UCF-101 [53] and Kinetics [7] datasets, as they are the standard choice in ZS action recognition, so that we can compare with recent state-of-the-art models [4, 17, 25, 39, 48, 49]. The first three datasets contain 783, 6766 and 13320 videos, and have 16, 51 and 101 classes, respectively. We follow the commonly used 50/50 splits of Xu et al. [58], where 50% are the seen classes and 50% are the unseen classes. Similar to previous approaches [17, 31, 40, 49, 60], we report average accuracy and standard deviation over 10 independent runs. We also report on the recently introduced TruZe [26]. This split accounts for the fact that some classes present in the dataset used for pre-training (Kinetics [7]) overlap with some of the unseen classes in the datasets used in the zero-shot setting, therefore breaking the premise that those classes have not been seen. We also report on the Kinetics-220 [7] split as proposed in ER [8]. Here, the 220 classes from Kinetics-600 [6] are treated as unseen classes for a model trained on the Kinetics-400 dataset. All the datasets we use have action classes that are singular and not compositional.

5 Results

5.1 Zero-Shot Learning Results

We look at the use of Stories as semantic embedding for a wide variety of models here, from older models all the way to the most recent ones in the ZS literature. We use I3D and CLIP-based features [44] to compare their effect. We list these results as SDR+I3D and SDR+CLIP respectively. We observe an improvement across all of them and across all datasets, demonstrating that Stories is clearly model agnostic. Results can be seen in Table 4.

We also observe that the proposed changes to the vanilla feature generation method, which we call SDR, consistently outperform all approaches across all datasets, achieving a new state-of-the-art. We experiment with using a single model for all datasets, by training on Kinetics and not doing any fine-tuning for the smaller datasets. This is the last row of the table, which we call "SDR (Ours) + SM". It is remarkable and quite promising that, without the need to fine-tune, this single model achieves even better performance.

We also evaluate on the Kinetics-220 dataset as proposed in ER [8]. There are fewer methods that report on this split, but it is interesting as it is much larger. Results can be seen in Table 5. We observe that the proposed SDR outperforms all previous work. We see significant gains of up to 5%.

Finally, we evaluate on the stricter TruZe [26] split that ensures no overlap between pre-trained model and test classes. Results are shown in Table 6. We report the mean class accuracy in the ZS setting and the harmonic mean of seen and unseen class accuracies in GZS. The split refers to the train/test split used.

5.2 Generalized Zero-Shot Learning Results

Generalized ZS action recognition is less explored in comparison to the ZS setting. Nonetheless, we choose models such as Bi-Dir GAN, GGM, WGAN, OD and CLASTER as recent state-of-the-art methods evaluating on this setting. We train an out-of-distribution detector (OOD) following OD [39] and two separate classifiers for the seen and unseen classes along with the OOD network.
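The GZS summary metric used throughout these tables, the harmonic mean of the seen and unseen class accuracies, can be computed as follows (a small helper of ours):

```python
def gzsl_harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen class accuracies, the
    standard GZSL summary number. It penalizes a model that is
    accurate on seen classes but weak on unseen ones."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)
```

For instance, seen/unseen accuracies of 60% and 40% give a harmonic mean of 48%, lower than the arithmetic mean of 50%, which is why the metric rewards balanced performance.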
Table 4 Results on ZSL. E: semantic embedding, W: word2vec embedding, ER: Elaborative Rehearsals, Sto: Stories. SM corresponds to the single model training. We use the datasets Olympics (O), HMDB51 (H) and UCF101 (U).

D  E    Bi-Dir GAN    WGAN         OD           E2E          CLASTER      JigSawNet    ActionCLIP   ResT         X-CLIP       SDR+I3D      SDR+CLIP
O  W    40.2 ± 10.6   47.1 ± 6.4   50.5 ± 6.9   61.4 ± 5.5   63.8 ± 5.7   66.4 ± 6.8   -            -            -            55.1 ± 4.8   62.3 ± 4.5
O  ER   54.1 ± 6.8    65.5 ± 7.2   67.6 ± 6.2   66.5 ± 4.5   68.4 ± 4.1   71.5 ± 6.1   -            -            -            69.8 ± 2.8   73.5 ± 2.9
O  Sto  55.5 ± 6.5    66.2 ± 7.1   69.1 ± 5.6   69.9 ± 5.8   73.1 ± 6.6   74.9 ± 5.2   -            -            -            72.5 ± 2.1   80.1 ± 2.3
O  SM   -             -            -            -            -            -            -            -            -            74.8 ± 2.3   82.2 ± 1.6
H  W    21.3 ± 3.2    29.1 ± 3.8   30.2 ± 2.7   33.1 ± 3.4   36.6 ± 4.6   35.4 ± 3.2   40.8 ± 5.4   39.3 ± 3.5   43.7 ± 6.5   35.8 ± 4.7   38.9 ± 3.5
H  ER   25.9 ± 2.9    31.6 ± 3.1   36.1 ± 2.9   36.2 ± 1.9   43.2 ± 1.9   39.3 ± 3.9   44.2 ± 4.4   43.6 ± 2.9   46.6 ± 6.1   41.2 ± 4.3   46.2 ± 3.1
H  Sto  27.2 ± 2.7    35.5 ± 2.8   39.2 ± 2.8   38.1 ± 3.6   45.5 ± 2.6   42.5 ± 3.2   48.8 ± 3.2   47.1 ± 3.5   50.1 ± 6.1   46.8 ± 5.0   52.7 ± 3.4
H  SM   -             -            -            -            -            -            -            -            -            48.9 ± 4.4   54.4 ± 4.1
U  W    21.8 ± 3.6    25.8 ± 3.2   26.9 ± 2.8   46.2 ± 3.8   46.7 ± 5.4   53.3 ± 3.1   58.3 ± 3.4   58.7 ± 3.3   70.1 ± 3.4   41.5 ± 2.5   44.8 ± 4.2
U  ER   28.0 ± 3.4    37.9 ± 2.5   42.4 ± 3.4   52.4 ± 3.3   53.9 ± 2.5   56.8 ± 2.8   64.3 ± 3.8   62.6 ± 4.1   72.2 ± 2.3   50.3 ± 1.1   54.2 ± 3.5
U  Sto  29.5 ± 3.2    40.1 ± 3.7   50.3 ± 3.0   55.1 ± 3.3   59.6 ± 2.8   61.6 ± 3.5   71.8 ± 2.7   65.3 ± 2.5   73.8 ± 2.9   62.9 ± 1.6   73.4 ± 2.7
U  SM   -             -            -            -            -            -            -            -            -            64.9 ± 2.1   75.5 ± 3.2
The reported results are on the same set of 10 random splits for fair comparison. There are no manual attributes for the HMDB dataset. We see that the proposed SDR approach obtains the best results on all three categories. Another observation is that the performance of all models using Stories is better than even the older manual attributes.

Method          Top-1 Acc    Top-5 Acc
DEVISE [14]     23.8 ± 0.3   51.0 ± 0.6
SJE [1]         22.3 ± 0.6   48.2 ± 0.4
ER [8]          42.1 ± 1.4   73.1 ± 0.3
JigSawNet [48]  45.9 ± 1.6   78.8 ± 1.0
SDR+I3D         50.8 ± 1.9   82.9 ± 1.3
SDR+CLIP        55.1 ± 2.2   86.1 ± 3.1
Table 5 Results of ZS on Kinetics-220.

Method        UCF101                HMDB51
              Split  ZSL   GZSL     Split  ZSL   GZSL
WGAN          67/34  22.5  36.3     29/22  21.1  31.8
OD            67/34  22.9  42.4     29/22  21.7  35.5
CLASTER       67/34  45.8  47.3     29/22  33.2  44.5
SDR+I3D       67/34  49.7  51.3     29/22  34.9  45.5
SDR+CLIP      67/34  53.9  56.2     29/22  38.7  49.5
VCAP [12]     0/34   49.1  -        0/22   20.4  -
SDR+I3D SM    0/34   51.5  52.2     0/22   36.1  46.6
SDR+CLIP SM   0/34   55.1  57.7     0/22   40.8  51.1
Table 6 Results of ZS and GZS on TruZe.

Table 7 shows the results, with the harmonic mean of the seen and unseen class accuracies. Similar to the zero-shot case, we use both I3D and CLIP-based [44] backbones and list these results as SDR+I3D and SDR+CLIP, respectively. We follow the earlier approach of using different semantic embeddings to show the performance gain that using SDR gives us.

5.3 Generalized Zero-Shot Action Recognition Results in Detail

To better analyze the performance of the model on GZSL, we report the average seen and unseen accuracies along with their harmonic mean. The results using the different embeddings on the UCF101, HMDB51 and Olympics datasets are reported in Table 8.

6 Why Does Using a Single Model Work?

A natural question is why the single model trained on a large dataset like Kinetics-400 [7] outperforms the models fine-tuned on the smaller target datasets. Our hypothesis is that the feature generator trained on a larger dataset has a better distribution of data to learn from, as the data-driven noise that we use is more representative of the real visual world. This in turn generates more realistically distributed features, which results in the improved performance.

7 Conclusion

The introduction of the novel Stories dataset provides rich textual narratives that establish connections between diverse action classes. Leveraging this contextual data enables modeling of nuanced relationships between actions, overcoming previous limitations in ZS action recognition. Stories enables significant improvements for multiple SOTA models. Building on Stories, our proposed feature-generating approach achieves new state-of-the-art performance on multiple benchmarks without any target dataset fine-tuning.
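The feature-generating view discussed above (a generator conditioned on a class's semantic embedding plus noise, whose synthetic features train a classifier for unseen classes, in the spirit of feature-generating networks [55]) can be sketched as below. All names and dimensions are illustrative, and the fixed linear "generator" is a stand-in for a trained network, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
SEM_DIM, NOISE_DIM, FEAT_DIM = 8, 4, 16

# Stand-in for a trained conditional generator: a fixed linear map from
# [semantic embedding; noise] to a visual feature. In practice this is a
# learned network (e.g. a conditional WGAN generator).
W_gen = rng.normal(size=(SEM_DIM + NOISE_DIM, FEAT_DIM))

def generate_features(class_embedding, n_samples):
    """Synthesize visual features for one class from its semantic embedding.
    The hypothesis above: noise drawn from a large, diverse source dataset
    yields a feature distribution closer to real visual data."""
    noise = rng.normal(size=(n_samples, NOISE_DIM))
    cond = np.tile(class_embedding, (n_samples, 1))
    return np.tanh(np.concatenate([cond, noise], axis=1) @ W_gen)

# Hypothetical semantic embeddings for two unseen classes (e.g. from Stories).
unseen_embeddings = {"archery": rng.normal(size=SEM_DIM),
                     "kayaking": rng.normal(size=SEM_DIM)}

# Train a nearest-centroid classifier on synthetic features only.
centroids = {name: generate_features(emb, 200).mean(axis=0)
             for name, emb in unseen_embeddings.items()}

def classify(feature):
    """Assign a feature to the unseen class with the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(feature - centroids[c]))

probe = generate_features(unseen_embeddings["archery"], 1)[0]
print(classify(probe))
```

The key point the sketch illustrates is that no real video of the unseen classes is ever touched: the classifier is fit entirely on features synthesized from the semantic space.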
Dataset   SE   Bi-Dir GAN   GGM          WGAN        OD          CLASTER     SDR+I3D     SDR+CLIP
Olympics  W    44.2 ± 11.2  52.4 ± 12.2  59.9 ± 5.3  66.2 ± 6.3  69.1 ± 5.4  67.1 ± 4.2  69.7 ± 4.1
Olympics  ER   53.6 ± 6.2   59.1 ± 12.1  63.7 ± 6.6  69.7 ± 6.5  72.5 ± 3.5  70.4 ± 2.3  75.5 ± 3.3
Olympics  Sto  55.9 ± 4.2   59.9 ± 11.6  67.1 ± 5.1  72.4 ± 4.9  74.9 ± 6.1  74.5 ± 3.9  79.5 ± 3.5
Olympics  SM   -            -            -           -           -           76.6 ± 3.6  81.1 ± 3.0
HMDB51    W    17.5 ± 2.4   20.1 ± 2.1   32.7 ± 3.4  36.1 ± 2.2  48.0 ± 2.4  36.3 ± 5.1  39.5 ± 3.1
HMDB51    ER   26.1 ± 2.3   28.2 ± 3.3   35.2 ± 3.5  38.1 ± 2.4  50.8 ± 2.8  42.1 ± 4.5  45.4 ± 3.6
HMDB51    Sto  27.7 ± 2.1   29.1 ± 2.9   38.5 ± 3.3  40.9 ± 3.8  53.5 ± 2.4  49.7 ± 2.9  53.5 ± 3.3
HMDB51    SM   -            -            -           -           -           50.9 ± 2.6  56.1 ± 3.2
UCF101    W    22.7 ± 2.5   23.7 ± 1.2   44.4 ± 3.0  49.4 ± 2.4  51.3 ± 3.5  41.9 ± 2.7  44.3 ± 4.6
UCF101    ER   26.2 ± 4.2   26.5 ± 2.5   46.1 ± 3.5  53.2 ± 3.1  52.8 ± 2.1  45.5 ± 1.1  49.1 ± 2.9
UCF101    Sto  29.1 ± 3.4   27.7 ± 3.1   48.3 ± 3.2  55.5 ± 3.3  54.1 ± 3.3  54.9 ± 4.4  57.8 ± 4.1
UCF101    SM   -            -            -           -           -           57.2 ± 3.5  59.7 ± 3.1
Table 7 Results on GZSL. SE: semantic embedding, W: word2vec embedding, ER: Elaborate Rehearsals, Sto: Stories.
SM corresponds to the single model training.
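The harmonic mean used as the GZSL summary metric combines seen-class accuracy s and unseen-class accuracy u as H = 2su/(s + u), which penalizes models that do well on one side at the expense of the other. A minimal sketch:

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Standard GZSL summary metric: H = 2 * s * u / (s + u)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# A balanced model beats an unbalanced one with the same arithmetic mean:
print(harmonic_mean(50.0, 50.0))  # 50.0
print(harmonic_mean(80.0, 20.0))  # 32.0
```

This is why the harmonic mean, rather than a plain average, is the conventional headline number for generalized zero-shot evaluation.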
[8] S. Chen and D. Huang. Elaborative rehearsal for zero-shot action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13638–13647, 2021.

[9] S. Chen, W. Wang, B. Xia, Q. Peng, X. You, F. Zheng, and L. Shao. FREE: Feature refinement for generalized zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 122–131, 2021.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[16] C. Gan, M. Lin, Y. Yang, Y. Zhuang, and A. G. Hauptmann. Exploring semantic inter-class relationships (SIR) for zero-shot action recognition. In Proceedings of the National Conference on Artificial Intelligence, 2015.

[17] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 87–97, 2016.

[18] J. Gao, T. Zhang, and C. Xu. I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8303–8311, 2019.
recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1451–1459, 2021.

[25] S. N. Gowda, L. Sevilla-Lara, F. Keller, and M. Rohrbach. CLASTER: Clustering with reinforcement learning for zero-shot action recognition. In European Conference on Computer Vision, pages 187–203. Springer, 2022.

[26] S. N. Gowda, L. Sevilla-Lara, K. Kim, F. Keller, and M. Rohrbach. A new split for evaluating true zero-shot action recognition. arXiv preprint arXiv:2107.13029, 2021.

[27] S. N. Gowda and C. Yuan. ColorNet: Investigating the importance of color spaces for image classification. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14, pages 581–596. Springer, 2019.

[28] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V. Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, C. Fuegen, A. Gebreselasie, C. Gonzalez, J. Hillis, X. Huang, Y. Huang, W. Jia, W. Khoo, J. Kolar, S. Kottur, A. Kumar, F. Landini, C. Li, Y. Li, Z. Li, K. Mangalam, R. Modhugu, J. Munro, T. Murrell, T. Nishiyasu, W. Price, P. R. Puentes, M. Ramazanova, L. Sari, K. Somasundaram, A. Southerland, Y. Sugano, R. Tao, M. Vo, Y. Wang, X. Wu, T. Yagi, Y. Zhu, P. Arbelaez, D. Crandall, D. Damen, G. M. Farinella, B. Ghanem, V. K. Ithapu, C. V. Jawahar, H. Joo, K. Kitani, H. Li, R. Newcombe, A. Oliva, H. S. Park, J. M. Rehg, Y. Sato, J. Shi, M. Z. Shou, A. Torralba, L. Torresani, M. Yan, and J. Malik. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.

[29] Z. Han, Z. Fu, G. Li, and J. Yang. Inference guided feature generation for generalized zero-shot learning. Neurocomputing, 430:150–158, 2021.

[30] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[31] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2452–2460, 2015.

[32] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.

[33] K. Li, Y. Wang, P. Gao, G. Song, Y. Liu, H. Li, and Y. Qiao. UniFormer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.

[34] C.-C. Lin, K. Lin, L. Wang, Z. Liu, and L. Li. Cross-modal representation learning for zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19978–19988, 2022.

[35] J. Lin, C. Gan, and S. Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.

[36] J. Liu, H. Bai, H. Zhang, and L. Liu. Beyond normal distribution: More factual feature generation network for generalized zero-shot learning. IEEE MultiMedia, 2022.

[37] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
[38] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[39] D. Mandal, S. Narayan, S. K. Dwivedi, V. Gupta, S. Ahmed, F. S. Khan, and L. Shao. Out-of-distribution detection for generalized zero-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9985–9993, 2019.

[40] P. Mettes and C. G. Snoek. Spatial-aware object embeddings for zero-shot localization and classification of actions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4443–4452, 2017.

[41] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[42] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[43] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018.

[44] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision, pages 1–18. Springer, 2022.

[45] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, pages 392–405. Springer, 2010.

[46] M. Pagliardini, P. Gupta, and M. Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of NAACL-HLT, pages 528–540, 2018.

[47] T. Perrett, A. Masullo, T. Burghardt, M. Mirmehdi, and D. Damen. Temporal-relational crosstransformers for few-shot action recognition. arXiv preprint arXiv:2101.06184, 2021.

[48] Y. Qian, L. Yu, W. Liu, and A. G. Hauptmann. Rethinking zero-shot action recognition: Learning from latent atomic actions. In European Conference on Computer Vision, pages 104–120. Springer, 2022.

[49] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, and Y. Wang. Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2833–2842, 2017.

[50] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.

[51] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele. Script data for attribute-based recognition of composite activities. In European Conference on Computer Vision, 2012.

[52] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[53] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[54] V. K. Verma, G. Arora, A. Mishra, and P. Rai. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4281–4289, 2018.

[55] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5542–5551, 2018.