Adversarial Memory Networks for Action Prediction

Anonymous ICCV submission

Paper ID 9282
Abstract

Action prediction aims to infer the forthcoming human action from a partially observed video, which is a challenging task due to the limited information underlying the early observation. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose a novel two-stream adversarial memory networks (AMemNet) model to generate the "full video" feature conditioning on a partial video query, with two new insights. Firstly, a key-value structured memory generator is designed to memorize different partial videos as key memories and dynamically write full videos into the value memories with a gating mechanism and querying attention. Secondly, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full video features upon adversarial training. The final prediction result of AMemNet is given by late fusion over the RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, i.e., UCF101 and HMDB51, are provided to demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.

1. Introduction

Action prediction is a highly practical research topic that could be used in many real-world applications such as video surveillance, autonomous navigation, human-computer interaction, etc. Different from action recognition, which recognizes the human action category upon a complete video, action prediction aims to understand the human activity at an early stage, e.g., given a partially observed video before the entire action execution. Typically, action prediction methods [32, 2, 22, 4, 43, 47] assume that a portion of consecutive frames from the beginning is given, considered as a partial video. The challenges mainly arise from the limited information in the early progress of a full video, which leads to an incomplete temporal context and a lack of discriminative cues for recognizing the action. Thus, the key problem for solving action prediction lies in: how to enhance the discriminative information for a partially observed video?

In recent years, many research efforts centered on the above question have been made for the action prediction task. Pioneering works [32, 2, 21] mainly handle partial videos by relying on hand-crafted features, dictionary learning, and designing temporally-structured classifiers. More recently, deep convolutional neural networks (CNNs), especially those pre-trained on large-scale video benchmarks (e.g., Sports-1M [18] and Kinetics [19]), have been widely adopted to predict actions. The pre-trained CNNs, to some extent, compensate for the incomplete temporal context and empower reconstructing full video representations from partial ones. Along this line, existing works [22, 34, 43, 23] focus on model design for continuously improving the reconstruction performance, yet without considering the "malnutrition" nature (i.e., the limited temporal cues) of incomplete videos. In particular, it should be more straightforward to learn what "nutrients" (e.g., the missing temporal cues or reconstruction bases) a partial video needs, rather than to map it to the full video through a single function. Moreover, it is also challenging to handle various partial videos by solely resorting to a single model.

In this study, we propose a novel adversarial memory networks (AMemNet) model to address the above challenges. The proposed AMemNet leverages augmented memory networks to explicitly learn and store full video features to enrich incomplete videos. Specifically, we treat a partial video as a query and the corresponding full video as its memory. The "full video" is generated with relevant memory slots fetched by the query of the partial video. We summarize the contribution of this work in two aspects.

Firstly, a memory-augmented generator model is designed for generating full-video features conditioning on partial-video queries. We adopt a key-value memory network architecture [28, 46] for action prediction, where the key memory slots are used for capturing similar partial videos, and the value memory slots are extracted from the full training videos. The memory writing process is implemented by a gating mechanism and attention weights. The input/forget gates enable AMemNet to dynamically update video memories attended by different queries and thus memorize the variation between different video progress levels.

Secondly, a class-aware discriminator model is developed to guide the memory generator with adversarial training, which not only employs an adversarial loss to encourage generating realistic full video features, but also imposes a classification loss on training the generator network. By this means, the discriminator network could further push the generator to deliver discriminative full-video features.

The proposed AMemNet obtains prediction results by employing a late fusion strategy over two streams (i.e., RGB and optical flow) following [35, 41]. Extensive experiments on two benchmark datasets, UCF101 and HMDB51, are conducted to show the effectiveness of AMemNet compared with state-of-the-art methods, where our approach achieves over 90% accuracy by observing only 10% of the beginning video frames on the UCF101 dataset. A detailed ablation study against several competitive baselines is also presented.

2. Related Work

Action Recognition targets recognizing the label of a human action in a given video, which is one of the core tasks for video understanding. Previous works have extensively studied this research problem from several aspects, including hand-crafted features (e.g., spatio-temporal interest points [8, 31], poselet key-frames [30, 27]) and dense trajectories [40], 3D convolutional neural networks [17, 38, 14], recurrent neural network (RNN) based methods [29, 9], and many recent deep CNN based methods such as temporal linear encoding networks [7], non-local neural networks [42], etc. Among existing methods, the two-stream architecture [35, 10, 41] forms a landmark [3], which mainly employs deep CNNs on the RGB and optical flow streams for exploiting the spatial-temporal information inside videos. In this work, we also adopt the two-stream structure as it naturally provides complementary information for the action prediction task, where the RGB stream contributes more in the early observation and the optical flow leads the following progress.

Memory Networks, i.e., memory-augmented neural networks, generally consist of two components [45]: 1) a memory matrix and 2) a neural network controller, where the memory matrix is used for saving information as memory slots, and the neural network controller is generally designed for addressing, reading and writing memories. Several representative memory network architectures include end-to-end memory networks [37], key-value memory networks [28, 46], neural Turing machines [12], etc. Memory networks work well in practice for their flexibility in saving auxiliary knowledge and their ability to memorize long-term temporal information. The proposed AMemNet model shares the same memory architecture with [28, 46] and employs the memory writing methods provided in [12]; however, it is designed with a different purpose compared with these methods: the memory module is tailored as a generator model for the action prediction task.

Action Prediction has attracted lots of research efforts [22, 34, 43, 23, 4, 47] in recent years. It tries to predict the action label of an early-progress video and thus falls into a special case of video-based action recognition. Previous works [32, 2, 26, 21] solve this task via hand-crafted features, and recent works [22, 34, 43, 23, 5, 20, 13, 4, 47] mainly rely on pre-trained deep CNN models for encoding videos, such as the 3D convolutional networks in [22, 23, 43], deep CNNs in [34, 4], and two-stream CNNs in [5, 20, 13, 47]. Among these works, the most common way of predicting actions is to design a deep neural network model for reconstructing full videos from the partial ones, such as deep sequential context networks [22], the RBF kernelized RNN [34], progressive teacher-student learning networks [43], adversarial action prediction networks [23], etc. Moreover, some other interesting methods include LSTM based ones [33, 20, 44], part-activated deep reinforcement learning [4], residual learning [13, 47], motion prediction [1], asynchronous activity anticipation [48], etc.

The memory augmented LSTM (Mem-LSTM) [20] model and the adversarial action prediction networks (AAPNet) [23] share some similar ideas with our model. However, several essential differences between the proposed AMemNet and Mem-LSTM/AAPNet could be summarized. First, the memory networks play distinct roles in Mem-LSTM [20] and AMemNet. Mem-LSTM formulates the action labels as video memories and adopts memory networks as a nearest-neighbor classifier, whereas AMemNet develops the key-value memory as a generator model and learns the value memory slots from full videos as reconstruction bases for a generation purpose. Second, the generator models used in AAPNet [23] and AMemNet are different: AAPNet [23] employs a variational-GAN model, while AMemNet develops a memory-augmented generator to explicitly provide auxiliary information for generating full-video features for testing videos.

3. Methodology

3.1. Problem Setup

Given a set of training videos {(x, y)}, where x ∈ X denotes one video sample and y ∈ Y refers to its action category label, action prediction aims to infer y by only observing the beginning sequential frames of x instead of using the entire video. Let τ ∈ (0, 1] be the observation ratio and L be the length (i.e., the total number of frames) of x. A partial video is then defined as x_{1:⌊τL⌋}, i.e., the subsequence of the full video x containing the first frame to the ⌊τL⌋-th frame.

Figure 1. Illustration of training the proposed AMemNet on the RGB stream, where the attention weights α are first given by the query embedding q with the key memory matrix M^k in memory addressing; then α is used for updating the value memory M^v_{t-1} → M^v_t with the real full video feature v. The generated full video feature v̂ is obtained from memory reading with α and M^v_t.
We employ a set of observation ratios {τ_p}_{p=1}^{P} to mimic the partial observations at different progress levels and define x_p = x_{1:⌊τ_p L⌋} as the p-th progress level observation of x, p ∈ {1, . . . , P}. By this means, the training set is augmented to P times the original one, i.e., {(x_p, y)}. Following the existing works [22, 4, 47], we set P = 10 and increase τ_p from 0.1 to 1.0 with a fixed step of 0.1.

We adapt a question answering schema to address the action prediction task with memory networks. We formulate a partial video x_p as a query to "retrieve" its lost information from the memory of all the full training videos. To learn the full video memories, we build the training set as {(x_p, v, y)}, where v indicates the full video of x_p, i.e., v := x_P. Different from the previous work [22], which requires the progress level p during the training process, the proposed AMemNet is "progress-free". Hence, for convenience, we omit the subscript p for the partial video x_p when no confusion occurs, and always denote (x, v, y) as a triplet of the partial observation, full observation and action category of the same video sample throughout the paper.
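For illustration, the following is a minimal, PyTorch-style sketch of how the augmented triplets {(x_p, v, y)} could be assembled for one training video under the setting above (P = 10, τ_p = p/10); the function and variable names are illustrative assumptions and do not come from the released implementation.

    import math
    from typing import List, Tuple

    import torch

    def build_progress_triplets(frames: torch.Tensor, label: int,
                                P: int = 10) -> List[Tuple[torch.Tensor, torch.Tensor, int]]:
        """Build the (partial video, full video, label) triplets for one sample.

        frames: tensor of shape (L, C, H, W) holding the decoded full video x.
        Returns P triplets, one per observation ratio tau_p = p / P.
        """
        L = frames.shape[0]
        triplets = []
        for p in range(1, P + 1):
            tau_p = p / P                        # 0.1, 0.2, ..., 1.0
            cut = max(1, math.floor(tau_p * L))  # keep the first floor(tau_p * L) frames
            partial = frames[:cut]               # x_p = x_{1:floor(tau_p L)}
            full = frames                        # v := x_P, the full observation
            triplets.append((partial, full, label))
        return triplets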
Following [35, 41], we train the proposed AMemNet model on the RGB frames and optical flows, respectively, and employ a late fusion mechanism to exploit the spatial-temporal information in a two-stream framework. Each stream employs the same network architecture with its own trainable weights. We refer to (x_rgb, v_rgb, y) and (x_flow, v_flow, y) as the two modalities, and omit the subscripts (rgb/flow) when unnecessary.

3.2. Adversarial Memory Networks (AMemNet)

Fig. 1 shows the network architecture of the proposed AMemNet model. Overall, our model consists of three components: 1) a query/memory encoder, 2) a memory generator, and 3) a discriminator, where the encoder network is used to vectorize the given partial/full video as feature representations, the memory generator is learned to generate the full video representation conditioning on the partial video query, and the discriminator is trained to distinguish between fake and real full video representations and also to deliver the prediction scores over all the categories. During the training process, the memories are continuously updated by the given full videos with a gating mechanism. We show the details of each component in the following.

3.2.1 Query/Memory Encoder

Given a partial video x and its corresponding full video v, we employ deep convolutional neural networks (CNNs) as the encoding model to obtain feature representations as follows: x = f_cnn(x; θ_cnn), v = f_cnn(v; θ_cnn), where x ∈ R^d and v ∈ R^d are the encoded representations of x and v, respectively, d is the feature dimension, and θ_cnn parameterizes the CNN model. Following [5, 47], we instantiate f_cnn(·; θ_cnn) with the pre-trained TSN model [41] for its robust and impressive performance on action recognition.

The proposed AMemNet model utilizes the partial video feature x as a query to fetch the relevant memories, which are learned from full training videos, to generate its full video feature. Hence, it is straightforward to directly utilize f_cnn(·; θ_cnn) as the memory encoder for learning memory representations of full videos. On the other hand, to facilitate the querying process, we further encode the partial video representation x in a lower-dimensional embedding space by

q = f_q(x; θ_q),   (1)

where q ∈ R^h denotes the query vector, h < d refers to the dimension of the query embedding, and f_q(·; θ_q) is given by a fully-connected network. By using Eq. (1), the query encoder is formulated by stacking f_q on top of f_cnn.

In this work, the memory and query encoders share the same CNN weights, and we freeze θ_cnn with the pre-trained TSN model to avoid overfitting. Thus, the encoding component of AMemNet is mainly parameterized by θ_enc = {θ_q}.
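As a concrete reference, a minimal sketch of the query encoder f_q is given below, using the dimensions reported in Section 4.1 (d = 1024 TSN features, a 512-unit hidden layer, h = 256) with batch normalization and LeakyReLU; the exact layer ordering, the activation slope and the class name are assumptions of this sketch rather than details taken from the paper.

    import torch
    import torch.nn as nn

    class QueryEncoder(nn.Module):
        """f_q in Eq. (1): maps a frozen CNN feature x in R^d to a query q in R^h."""

        def __init__(self, d: int = 1024, hidden: int = 512, h: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d, hidden),
                nn.BatchNorm1d(hidden),
                nn.LeakyReLU(0.2),
                nn.Linear(hidden, h),
                nn.BatchNorm1d(h),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, d) TSN feature of a partial video; returns q: (batch, h)
            return self.net(x)

Since θ_cnn is frozen, only f_q (together with the memory and discriminator parameters introduced below) receives gradients during training.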
3.2.2 Memory Generator

We adopt the key-value memory network architecture [28, 46] and tailor it as a generator model defined by

v̂ = G_mem(q; θ_mem),   (2)

where G_mem(·; θ_mem) denotes the memory generator and v̂ ∈ R^d represents the generated full video representation. Particularly, G_mem(·; θ_mem) includes two memory blocks, termed the key memory matrix M^k ∈ R^{N×h} and the value memory matrix M^v ∈ R^{N×d}, where N is the number of memory slots in each memory block. A memory slot is in essence one row of M^k/M^v and is learned with the query q and the full video memory v during the training process. The benefit of using such a key-value structure lies in separating the learning process for different purposes: M^k could focus on memorizing different queries of partial videos, while M^v is trained to distill useful information from full videos for generation. To generate v̂, our memory generator G_mem conducts the following three steps in sequence.

1) Memory Addressing. The key memory matrix M^k of G_mem provides sufficient flexibility to store similar queries (partial videos) for addressing the relevant value memory slots in M^v with an attention mechanism. The addressing process is computed by

α[i] = softmax(φ(q, M^k[i])) = exp(φ(q, M^k[i])) / Σ_{j=1}^{N} exp(φ(q, M^k[j])),   (3)

where α ∈ R^N denotes the soft attention weights over all the memory slots, M^k[i] refers to its i-th row, and φ(·, ·) is a similarity score function, which could be given by the dot-product similarity φ(a, b) = a^T b or the negative ℓ2 distance φ(a, b) = −‖a − b‖. Note that Eq. (3) preserves the end-to-end differentiable property [37, 28] of our memory networks, so the key slots are learned with backpropagation gradients.

2) Memory Writing. The value memory matrix M^v of G_mem memorizes full videos for the generation purpose, where the memory slots attended by a partial video query q are written with its full video representation v. Specifically, G_mem updates the value memory matrix with a gating mechanism and attention following [12, 28]. Let t be the current training step and M^v_{t-1} be the value memory matrix from the last step; the update M^v_t ← M^v_{t-1} is obtained by

e_t = sigmoid(W_e v),   (4)
a_t = tanh(W_a v),   (5)
M̃^v_t[i] = M^v_{t-1}[i] ⊙ (1 − α_t[i] e_t),   (6)
M^v_t[i] = M̃^v_t[i] + α_t[i] a_t,   (7)

where e_t ∈ R^d and a_t ∈ R^d represent the erase vector and add vector, respectively, ⊙ denotes element-wise multiplication, and α_t is computed by Eq. (3) with the (q, v) arriving at the t-th training step. In Eqs. (4) and (5), the erase vector e_t and add vector a_t work as the forget and input gates of the LSTM model [15], and are learned with two linear projection matrices¹ W_e ∈ R^{d×d} and W_a ∈ R^{d×d}, respectively. e_t decides the forgetting degree of the memory slots in M^v_{t-1}, while a_t computes the updates for the new value memory M^v_t. Note that, by using the attention weights α_t, Eqs. (6) and (7) mainly update the most attended (α_t[i] → 1) memory slots and leave the ones irrelevant to the query q (α_t[i] → 0) nearly unchanged.

¹We omit all the bias vectors in dense layers to simplify the notations.

3) Memory Reading. After updating the value memory matrix M^v, G_mem generates the full video representation v̂ by reading the memory slots from M^v in the following way:

v̂ = x + Σ_i α[i] M^v[i],   (8)

which adds a skip-connection between the partial video feature x and the memory output. By this means, Eq. (8) allows G_mem to memorize the residual between the partial video and the entire one.

In summary, the memory generator G_mem(·; θ_mem) defined in Eq. (2) is implemented through Eq. (3) to Eq. (8), where θ_mem includes the key/value memory matrices and the learnable gate parameters, i.e., θ_mem = {M^k, M^v, W_e, W_a}.
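Putting Eqs. (2)-(8) together, a minimal PyTorch sketch of the memory generator could look as follows. It scores queries against key slots with a dot product (one admissible choice of φ), applies the gated write of Eqs. (4)-(7) only in training mode, and treats M^v as a buffer updated solely by those writes; the batch-averaged write and all names are simplifying assumptions of this sketch, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MemoryGenerator(nn.Module):
        """G_mem of Eqs. (2)-(8): a key-value memory tailored as a generator."""

        def __init__(self, N: int = 512, h: int = 256, d: int = 1024):
            super().__init__()
            self.Mk = nn.Parameter(0.01 * torch.randn(N, h))       # key memory M^k
            # The paper lists M^v among the learnable parameters; this sketch keeps it as a
            # buffer that is only changed by the gated writes of Eqs. (4)-(7) (an assumption).
            self.register_buffer("Mv", 0.01 * torch.randn(N, d))   # value memory M^v
            self.We = nn.Linear(d, d, bias=False)                  # erase/forget gate, Eq. (4)
            self.Wa = nn.Linear(d, d, bias=False)                  # add/input gate, Eq. (5)

        def address(self, q: torch.Tensor) -> torch.Tensor:
            # Eq. (3): softmax attention over the N key slots, with phi(a, b) = a^T b
            return F.softmax(q @ self.Mk.t(), dim=-1)              # (batch, N)

        def forward(self, x, q, v=None):
            alpha = self.address(q)                                # (batch, N)
            Mv = self.Mv
            if self.training and v is not None:
                e = torch.sigmoid(self.We(v))                      # erase vector e_t, (batch, d)
                a = torch.tanh(self.Wa(v))                         # add vector a_t, (batch, d)
                # Eqs. (6)-(7), averaged over the batch (another sketch-level simplification)
                erase = (alpha.unsqueeze(-1) * e.unsqueeze(1)).mean(dim=0)  # (N, d)
                add = (alpha.unsqueeze(-1) * a.unsqueeze(1)).mean(dim=0)    # (N, d)
                Mv = self.Mv * (1.0 - erase) + add
                self.Mv = Mv.detach()                              # keep the updated memory
            # Eq. (8): skip connection plus the attention-weighted memory read
            return x + alpha @ Mv                                  # v_hat, (batch, d)

Calling the module in eval mode (or with v=None) skips the write, which matches the suppression of Eqs. (4)-(7) at test time described in Section 3.4.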
3.2.3 Discriminator

The discriminator network is designed with two purposes: 1) predicting the true action category label given the real/generated (v/v̂) full video representation; 2) distinguishing the real full video representation v from the fake one v̂. Inspired by [6, 23], we build the discriminator in a compositional way: D(·; θ_D) := {D_cls(·; θ_cls), D_adv(·; θ_adv)}, where D_cls : R^d → R^{|Y|} works as a classifier to predict probability scores over the |Y| action classes, and D_adv : R^d → {0, 1} follows the same definition as in the GAN model [11] to infer the probability of a given sample being real. The discriminator D in our model is formulated as fully-connected networks parameterized by θ_D = {θ_cls, θ_adv}.
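A corresponding sketch of the composed discriminator, with one fully-connected layer per head as stated in Section 4.1 (softmax for D_cls, sigmoid for D_adv), is given below; the number of classes and the returned probability format are illustrative choices.

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        """D = {D_cls, D_adv}: an action classifier head and a real/fake head."""

        def __init__(self, d: int = 1024, num_classes: int = 101):  # 101 for UCF101, 51 for HMDB51
            super().__init__()
            self.cls_head = nn.Linear(d, num_classes)  # D_cls logits over |Y| classes
            self.adv_head = nn.Linear(d, 1)            # D_adv real/fake logit

        def forward(self, feat: torch.Tensor):
            # feat: (batch, d), either a real full-video feature v or a generated v_hat
            cls_prob = torch.softmax(self.cls_head(feat), dim=-1)
            real_prob = torch.sigmoid(self.adv_head(feat))
            return cls_prob, real_prob

In practice one may prefer returning raw logits and folding the softmax/sigmoid into numerically stable loss functions; the explicit activations are kept here only to mirror the description.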
3.3. Objective Function

The main goal of this work is to deliver realistic full-video representations for partial videos to predict the correct action classes. To this end, three loss functions are jointly employed for training the proposed AMemNet model.

Adversarial Loss. Given a partial video feature x and its real full video representation v, we compute the adversarial loss L_adv as

L_adv = E_v[log D_adv(v)] + E_q[log(1 − D_adv(G_mem(q)))].   (9)

The discriminator D_adv tries to differentiate v̂ = G_mem(q) from the real one v by maximizing L_adv, while, on the contrary, the memory generator G_mem aims to fool D_adv by minimizing L_adv. By using Eq. (9), we could employ D_adv to push G_mem towards generating realistic full video features.

Reconstruction Loss. The adversarial loss L_adv encourages our model to generate video features approaching the real feature distribution of full videos, yet without considering the reconstruction error at the instance level, which might miss some useful information for recovering v from x. In light of this, we employ the reconstruction loss defined as

L_rec = E_{(x,v)} ‖G_mem(f_q(x)) − v‖_2^2,   (10)

which calculates the squared Euclidean distance between the generated feature v̂ = G_mem(f_q(x)) and its corresponding full video feature v. Eq. (10) further guides the memory generator by bridging the gap between x and v.

Classification Loss. It is important for G_mem to generate discriminative representations v̂ for different action classes. Thus, it is natural to impose a classification loss L_cls on training the memory generator as follows:

L^v_cls = E_{(v,y)} H(y, D_cls(v)),   (11)
L^x_cls = E_{(x,y)} H(y, D_cls(G_mem(x))),   (12)

where y ∈ R^{|Y|} indicates the one-hot vector of the action label y over |Y| classes and H(·, ·) computes the cross-entropy between two probability distributions. Letting ŷ ∈ R^{|Y|} be the output of D_cls, we have H(y, ŷ) = −Σ_{i=1}^{|Y|} y[i] log ŷ[i].

Different from [23], which only employs L^v_cls for training the discriminator model, we employ Eq. (11) and Eq. (12) to train the discriminator D_cls and the memory generator G_mem alternately, where the benefit lies in the following: a high-quality classifier D_cls is first obtained by minimizing L^v_cls with real full videos, and then D_cls is leveraged to "teach" G_mem to generate representations v̂ that lower L^x_cls. By this means, G_mem could learn the discriminative information from D_cls.

Final Objective. By summarizing Eq. (9) to Eq. (12), the final objective function of the proposed AMemNet model is given by

max_{θ_D}  L_adv + λ_cls L^v_cls,   (13)
min_{θ_G}  L_adv + λ_cls L^x_cls + λ_rec L_rec,   (14)

where θ_G = {θ_enc, θ_mem} includes all the trainable parameters for generating v̂ from x, θ_D = {θ_cls, θ_adv} parameterizes the discriminator, and λ_cls and λ_rec are the trade-off parameters for balancing the different loss functions. To proceed with the training procedure, we optimize θ_D and θ_G by alternately solving Eq. (13) and Eq. (14), fixing one while updating the other.
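To make the alternating optimization concrete, a heavily simplified single training step is sketched below, reusing the QueryEncoder, MemoryGenerator and Discriminator sketches above on precomputed TSN features x (partial) and v (full) with labels y, and λ_cls = 1, λ_rec = 0.1 from Section 4.1. The discriminator's classification term is minimized, following the textual description, and the generator's adversarial term uses the common non-saturating form; these choices, like all names here, are assumptions of the sketch rather than the authors' code.

    import torch
    import torch.nn.functional as F

    def bce(p, target):
        """Binary cross-entropy of probabilities p against a constant 0/1 target."""
        return F.binary_cross_entropy(p, torch.full_like(p, target))

    def train_step(f_q, g_mem, disc, opt_d, opt_g, x, v, y, lam_cls=1.0, lam_rec=0.1):
        """One alternating update: discriminator (Eq. (13)), then generator (Eq. (14))."""
        # Discriminator update: adversarial term plus the classification loss on real features.
        with torch.no_grad():
            v_fake = g_mem(x, f_q(x))                 # generated feature, no memory write here
        cls_real, p_real = disc(v)
        _, p_fake = disc(v_fake)
        d_loss = bce(p_real, 1.0) + bce(p_fake, 0.0) \
                 + lam_cls * F.nll_loss(torch.log(cls_real + 1e-8), y)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator update: adversarial + classification on the fake + reconstruction (Eq. (10)).
        v_hat = g_mem(x, f_q(x), v)                   # training-mode call also writes the memory
        cls_fake, p_fake = disc(v_hat)
        g_loss = bce(p_fake, 1.0) \
                 + lam_cls * F.nll_loss(torch.log(cls_fake + 1e-8), y) \
                 + lam_rec * F.mse_loss(v_hat, v)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()

Per Section 4.1, the discriminator step is run twice with Adam (learning rate 1e-4) before each single generator step with SGD (learning rate 1e-4, momentum 0.9).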
Figure 2. Illustration of the two-stream AMemNets.

3.4. Two-Stream Fusion for Action Prediction

After training the proposed AMemNet model via Eq. (13) and Eq. (14), we freeze the model weights θ = {θ_G, θ_D} and suppress the memory writing operations in Eqs. (4)-(7) for testing AMemNet. Particularly, given a partial video feature x, we predict its action label by

ŷ = D_cls(G_mem(f_enc(x))),   (15)

where ŷ ∈ R^{|Y|} denotes the probability distribution over the |Y| action classes for x.

As shown in Fig. 2, we adopt a two-stream framework [35, 41] to exploit the spatial and temporal information of the given videos, where we first test AMemNet on each stream (i.e., RGB frames and optical flow) individually, and then fuse the prediction scores to obtain the final result. Given x_rgb and x_flow, we obtain the prediction results ŷ_rgb and ŷ_flow by testing Eq. (15) with θ_rgb and θ_flow, respectively. The final prediction result is given by

ŷ_fusion = ŷ_rgb + β ŷ_flow,   (16)

where β is the fusion weight for integrating the scores given by the stream of spatial RGB frames and the stream of temporal optical flow images.
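The test-time pipeline of Eqs. (15) and (16) then reduces to a few lines per stream; the sketch below assumes one trained (f_q, g_mem, disc) triple per stream, precomputed TSN features for the partial video, and β = 1.5 as in Section 4.1, with illustrative names throughout.

    import torch

    @torch.no_grad()
    def predict_stream(f_q, g_mem, disc, x):
        """Eq. (15): class probabilities for one stream from a partial-video feature x."""
        f_q.eval(); g_mem.eval(); disc.eval()       # eval mode also disables memory writes
        v_hat = g_mem(x, f_q(x))                    # generate the "full video" feature
        cls_prob, _ = disc(v_hat)                   # D_cls scores over |Y| classes
        return cls_prob

    @torch.no_grad()
    def predict_fused(models_rgb, models_flow, x_rgb, x_flow, beta=1.5):
        """Eq. (16): late fusion of the RGB and optical-flow streams."""
        y_rgb = predict_stream(*models_rgb, x_rgb)
        y_flow = predict_stream(*models_flow, x_flow)
        y_fusion = y_rgb + beta * y_flow
        return y_fusion.argmax(dim=-1)              # predicted action label per video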

4. Experiments

4.1. Experimental Setting

Datasets. Two benchmark video datasets, UCF101 [36] and HMDB51 [24], are used in the experiments. The UCF101 dataset consists of 13,320 videos from 101 human action classes covering a wide range of human activities, and the HMDB51 dataset collects 6,766 video clips from movies and web videos over 51 action categories. We follow the standard training/testing splits on these two datasets following [41, 23]. We test the proposed AMemNet model over three splits and report the average prediction result for each dataset. We employ the preprocessed RGB frames and optical flow images provided by [10].

Implementation Details. The proposed AMemNet is built on top of temporal segment networks (TSN) [41], where we adopt the BN-Inception network [16] as the backbone and employ the model pre-trained on the Kinetics dataset [19]. The same data augmentation strategy (e.g., cropping and jittering) as provided in [41] is employed for encoding all the partial and full videos as d = 1024 feature representations. We formulate f_q as a fully-connected network of two layers, where the middle layer has 512 hidden states and the final query embedding size is set as h = 256. Batch normalization and LeakyReLU are both used in f_q. We employ N = 512 memory slots for the key and value memory matrices; hence we have M^k ∈ R^{512×256} and M^v ∈ R^{512×1024}. All the memory matrices and gating parameters in θ_mem are randomly initialized. We implement the discriminator network with one fully-connected layer, where the softmax and sigmoid activation functions are used for D_cls and D_adv, respectively.
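For reference, a possible per-stream instantiation of the sketches above under these settings (d = 1024, hidden size 512, h = 256, N = 512 slots) could look as follows; the optimizer grouping mirrors θ_G = {θ_enc, θ_mem} and θ_D in Eqs. (13)-(14), and everything here is an illustrative assumption rather than the released configuration.

    import itertools
    import torch

    # One stream (RGB or optical flow); features come from the frozen Kinetics-pretrained TSN.
    f_q = QueryEncoder(d=1024, hidden=512, h=256)
    g_mem = MemoryGenerator(N=512, h=256, d=1024)
    disc = Discriminator(d=1024, num_classes=101)    # 51 for HMDB51

    # theta_G = {theta_q, theta_mem} is optimized with SGD, theta_D with Adam (see below).
    opt_g = torch.optim.SGD(itertools.chain(f_q.parameters(), g_mem.parameters()),
                            lr=1e-4, momentum=0.9)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)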
For each training step, we first employ the Adam optimizer with a learning rate of 0.0001 to update θ_D with Eq. (13) twice, and then optimize θ_G once by solving Eq. (14) with the SGD optimizer with a 0.0001 learning rate and 0.9 momentum. We set the batch size to 64. For all the datasets, we set λ_cls = 1 to strengthen the impact of D_cls on the memory generator G_mem for encouraging discriminative representations, and set λ_rec = 0.1 to avoid overemphasizing the reconstruction of each video sample, which could lead to overfitting. The fusion weight β is fixed as 1.5 for all the datasets following [41]. All the code in this work was implemented with the PyTorch toolbox and run on Titan X GPUs.

Compared Methods. We compare our approach with three different kinds of methods as follows. 1) Single-stream methods: Integral BoW (IBoW) [32], mixture segments sparse coding (MSSC) [2], multiple temporal scales SVM (MTSSVM) [21], deep sequential context networks (DeepSCN) [22], part-activated deep reinforcement learning (PA-DRL) [4], and progressive teacher-student learning (PTSL) [43]. 2) Two-stream methods: memory augmented LSTM (Mem-LSTM) [20], temporal sequence learning (TSL) [5], residual generator network with Kalman filter (RGN-KF) [47], and adversarial action prediction networks (AAPNet) [23]. We implemented AAPNet with the same pre-trained TSN features as our approach and quote the authors' reported results for the other single/two-stream methods. 3) Baselines: We also compare AMemNet with temporal segment networks (TSN) [41]. Specifically, we test the TSN model pre-trained on the UCF101/HMDB51 dataset as baseline results, and finetune the TSN model pre-trained on the Kinetics dataset for UCF101 and HMDB51, respectively. Moreover, we train a k-nearest neighbors (KNN) classifier with the Kinetics pre-trained TSN features, termed TSN+KNN, and report its best performance by selecting k from {5, 10, 20, 30, 50, 100, 500}. For a fair comparison, we follow the same testing setting as previous works [22, 4, 47] by evenly dividing all the videos into 10 progress levels, i.e., P = 10 as described in Section 3.1. However, it is worth noting that the proposed AMemNet does not require any progress label in either training or testing.

Method              0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Single-stream
  IBoW [32]         36.29  65.69  71.69  74.25  74.39  75.23  75.36  75.57  75.79  75.79
  MSSC [2]          34.05  53.31  58.55  57.94  61.79  60.86  63.17  63.64  61.63  61.63
  MTSSVM [21]       40.05  72.83  80.02  82.18  82.39  83.21  83.37  83.51  83.69  82.82
  DeepSCN [22]      45.02  77.64  82.95  85.36  85.75  86.70  87.10  87.42  87.50  87.63
  PA-DRL [4]        81.36  82.63  82.90  83.51  84.01  84.38  85.09  85.41  85.81  86.15
  PTSL [43]         83.32  87.13  88.92  89.82  90.85  91.04  91.28  91.23  91.31  91.47
Two-stream
  Mem-LSTM [20]     51.02  80.97  85.73  87.76  88.37  88.58  89.09  89.38  89.67  90.49
  TSL [5]           82.20  86.70  88.50  89.50  90.10  91.00  91.50  91.90  92.40  92.50
  RGN-KF [47]       83.12  85.16  88.44  90.78  91.42  92.03  92.00  93.19  93.13  93.14
  AAPNet [23]       90.25  93.10  94.46  95.41  95.89  96.09  96.27  96.35  96.47  96.36
Baselines
  TSN [41]          86.76  89.29  90.64  91.81  91.73  92.47  92.97  93.15  93.31  93.42
  TSN+finetune      88.88  91.52  93.01  94.05  94.66  95.34  95.64  95.92  95.90  96.00
  TSN+KNN           85.69  88.71  90.34  91.29  91.78  92.33  92.41  92.75  92.96  93.11
Our model
  AMemNet-RGB       85.95  87.47  88.21  88.57  89.26  89.51  89.81  89.99  90.06  90.17
  AMemNet-Flow      83.64  88.32  90.74  92.18  93.20  93.78  94.40  94.75  94.88  94.96
  AMemNet           92.45  94.60  95.55  96.00  96.45  96.67  96.97  96.95  97.07  97.03

Table 1. Action prediction accuracy (%) under 10 observation ratios on the UCF101 dataset.

4.2. Prediction Performance

UCF101 Dataset. Table 1 summarizes the prediction accuracy of the proposed AMemNet and 13 compared methods on the UCF101 dataset. Overall, AMemNet consistently outperforms all the competitors over different observation ratios with a significant improvement. Impressively, the proposed AMemNet achieves around 92% accuracy when only 10% of the video is observed, which fully validates the effectiveness of applying AMemNet for early action prediction. This mainly benefits from the rich key-value structured memories learned from the full-video features guided by adversarial training.

The single-stream methods mainly explore the temporal information by using hand-crafted features (e.g., spatio-temporal interest points (STIP) [8], dense trajectories) as in IBoW [32], MSSC [2], and MTSSVM [21], or by utilizing 3D convolutional networks (e.g., C3D [38]) as in DeepSCN [22] and PTSL [43]. Differently, the two-stream methods deploy convolutional neural networks on two pathways to capture the spatial information of RGB images and the temporal characteristics of optical flows, respectively. On the one hand, the two-stream methods could better exploit the spatial-temporal information inside videos than a single stream. The proposed AMemNet inherits the merits of this two-stream architecture, and thus performs better than the single-stream methods even when they employ a more powerful CNN encoder, e.g., the 3D ResNeXt-101 [14] used in PTSL [43] is much deeper than the BN-Inception in the proposed AMemNet. On the other hand, compared with two-stream methods, especially the AAPNet [23] implemented with the same backbone as our model, the consistent improvement of AMemNet over AAPNet shows the effectiveness of using the memory generator to deliver "full" video features in early progress.

In Table 1, we refer to AMemNet-RGB and AMemNet-Flow as the single-stream results obtained by using AMemNet on RGB frames and flow images, respectively. Two interesting observations could be drawn: 1) RGB contributes more than flow at the beginning, as the still images encapsulating scenes and objects could provide key cues for recognizing the actions with few frames. 2) The late fusion naturally fits action prediction by integrating the complementary information between the two streams over time.

Method              0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Two-stream
  Global-Local [25] 38.80  43.80  49.10  50.40  52.60  54.70  56.30  56.90  57.30  57.30
  TSL [5]           38.80  51.60  57.60  60.50  62.90  64.60  65.60  66.20  66.30  66.30
  AAPNet [23]       56.03  60.11  64.87  67.99  70.76  72.55  74.00  74.81  75.59  75.56
Baselines
  TSN [41]          47.12  52.81  59.35  62.55  64.77  67.52  68.95  69.87  70.07  70.13
  TSN+finetune      55.13  59.82  63.88  67.02  69.74  71.72  72.98  73.43  74.08  73.55
  TSN+KNN           48.77  53.59  57.83  60.33  62.84  65.18  66.53  67.00  67.53  66.65
Our model
  AMemNet-RGB       52.55  55.52  58.27  60.55  62.53  63.87  64.41  64.61  64.99  64.86
  AMemNet-Flow      47.41  54.43  60.26  64.51  68.03  70.53  72.10  73.05  73.39  73.52
  AMemNet           57.74  62.10  66.28  70.17  72.66  74.55  75.22  75.78  76.08  76.14

Table 2. Action prediction accuracy (%) under 10 observation ratios on the HMDB51 dataset.

HMDB51 Dataset. Table 2 reports the prediction results of our approach and TSN [41] on the HMDB51 dataset, which, compared with UCF101, is a more challenging dataset for predicting actions due to the large motion variations rather than static cues across different categories [24]. As can be seen, the flow result of AMemNet exceeds AMemNet-RGB by around 8% accuracy after more progress is observed (e.g., τ_p ≥ 0.5). However, even in this case, the proposed AMemNet still consistently improves over AMemNet-Flow by incorporating the RGB results along the different progress levels. Moreover, the clear improvements of AMemNet over TSN indicate that the full video memories learned by our memory generator could well enhance the discriminativeness of the video representations, especially in the early progress.

Figure 3. Ablation study for the proposed AMemNet on the UCF101 dataset in terms of RGB, Flow and Fusion, respectively. (a) RGB on UCF101. (b) Flow on UCF101. (c) Fusion on UCF101.

Figure 4. Visual analysis of the proposed AMemNet for action categories of different properties, i.e., early predictable and late predictable. (a) Prediction accuracy of TSN and the proposed AMemNet over 10 categories with 10% video progress on UCF101. (b) and (c) t-SNE embedding results given by TSN and AMemNet, respectively.

4.3. Model Discussion

Ablation Study. Fig. 3 shows the ablation study for the proposed AMemNet model on the UCF101 dataset² in terms of RGB, Flow and Fusion, respectively, where we test all the methods with different observation ratios on each dataset. We adopt TSN+finetune as a sanity check for our approach and implement two strong ablated models to discuss the impact of the two main components of AMemNet as follows: 1) AMemNet w/o Mem refers to our model with the memory generator discarded, i.e., θ\θ_mem. Instead, we use the same generator network following AAPNet [23] for generating the full video features in AMemNet w/o Mem. 2) AMemNet w/o GAN is developed without the adversarial training and is trained by only using a classification loss.

²More ablation study results on the HMDB51 dataset are provided in the supplementary material.

As shown in Fig. 3, AMemNet improves over all the above methods with a clear margin in different cases, which fully supports the motivations of this work. It is worth noting that, for the early progress (i.e., observation ratio τ_p ≤ 0.3), AMemNet w/o GAN clearly boosts the performance over AMemNet w/o Mem. This demonstrates the effectiveness of using memory networks to compensate for the limited information in incomplete videos. As more progress is observed, the GAN model leads the generating process since it has sufficient information given by the partial videos, where AMemNet w/o Mem improves over AMemNet w/o GAN after τ_p > 0.7 on the UCF101 dataset.

Early Predictable vs. Late Predictable. In Fig. 4, we discuss the performance of AMemNet for action categories of different properties, e.g., predictability (the progress level required for recognizing the action), on the UCF101 dataset, where we compare AMemNet and TSN [41] on the 10% progress level videos of 10 different categories in Fig. 4(a) and show the corresponding t-SNE [39] embeddings of TSN features and the generated full video features given by AMemNet in Fig. 4(b) and 4(c), respectively. Inspired by [22], we select 10 action categories from UCF101 and divide them into two groups: 1) the early predictable group including Billiards, IceDancing, RockClimbingIndoor, PlayingPiano and PommelHorse, and 2) the late predictable group including VolleyballSpiking, CliffDiving, HeadMassage, PoleVault and ThrowDiscus, where the early group usually could be predicted given 10% progress and the late group is selected as the non-early ones.

As expected, the proposed AMemNet mainly improves over the TSN baseline on the late predictable actions in Fig. 4(a), which again demonstrates the realism of the full video features given by our memory generator. Moreover, as shown in Fig. 4(b) and 4(c), although TSN exhibits well-structured feature embeddings for the early predictable classes, e.g., IceDancing and PommelHorse, its embeddings are mixed up for the late predictable ones like PoleVault and CliffDiving. In contrast, AMemNet generates full video features encouraging a good cluster structure in the embedding space.

5. Conclusion

In this paper, we presented a novel two-stream adversarial memory networks (AMemNet) model for the action prediction task, where a key-value structured memory generator was proposed to generate the full video feature conditioning on the partial video query, and a class-aware discriminator was developed to supervise the generator for delivering realistic and discriminative representations towards full videos through adversarial training. The proposed AMemNet adopts input and forget gates for updating the full video memories attended by different queries, which captures the long-term temporal variation across different video progress levels. Experimental results on two benchmark datasets were provided to show that AMemNet achieves a new state-of-the-art for the action prediction problem.

References

[1] Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, Ding Liu, Jing Liu, and Nadia Magnenat Thalmann. Learning progressive joint propagation for human motion prediction. In Computer Vision - ECCV 2020, pages 226-242, 2020.
[2] Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Siskind, and Song Wang. Recognizing human activities from partially observed videos. In CVPR, 2013.
[3] Joan Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[4] Lei Chen, Jiwen Lu, Zhanjie Song, and Jie Zhou. Part-activated deep reinforcement learning for action prediction. In ECCV, September 2018.
[5] Sangwoo Cho and Hassan Foroosh. A temporal sequence learning for action recognition and prediction. In WACV, pages 352-361, 2018.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[7] Ali Diba, Vivek Sharma, and Luc Van Gool. Deep temporal linear encoding networks. In CVPR, July 2017.
[8] Piotr Dollar, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
[9] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, June 2015.
[10] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, June 2016.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672-2680, 2014.
[12] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
[13] Shuangshuang Guo, Laiyun Qing, Jun Miao, and Lijuan Duan. Deep residual feature learning for action prediction. In IEEE International Conference on Multimedia Big Data, pages 1-6, 2018.
[14] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, June 2018.
[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448-456, 2015.
[17] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 2013.
[18] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[19] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[20] Yu Kong, Shangqian Gao, Bin Sun, and Yun Fu. Action prediction from videos via memorizing hard-to-predict samples. In AAAI, 2018.
[21] Yu Kong, Dmitry Kit, and Yun Fu. A discriminative model with multiple temporal scales for action prediction. In ECCV, 2014.
[22] Yu Kong, Zhiqiang Tao, and Yun Fu. Deep sequential context networks for action prediction. In CVPR, 2017.
[23] Yu Kong, Zhiqiang Tao, and Yun Fu. Adversarial action prediction networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):539-553, 2020.
[24] H. Kuhne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[25] S. Lai, W. Zheng, J. Hu, and J. Zhang. Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 27(5):2272-2285, 2018.
[26] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
[27] I. Laptev. On space-time interest points. IJCV, 64(2):107-123, 2005.
[28] Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In EMNLP, pages 1400-1409, 2016.
[29] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[30] Michalis Raptis and Leonid Sigal. Poselet key-framing: A model for human activity recognition. In CVPR, 2013.
[31] M. S. Ryoo and J. K. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, pages 1593-1600, 2009.
[32] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[33] Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. Encouraging lstms to anticipate actions very early. In ICCV, Oct 2017.

[34] Yuge Shi, Basura Fernando, and Richard Hartley. Action anticipation with rbf kernelized feature mapping rnn. In ECCV, September 2018.
[35] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. Technical report, CRCV-TR-12-01, 2012.
[37] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In NeurIPS, pages 2440-2448, 2015.
[38] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
[39] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[40] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60-79, 2013.
[41] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, June 2018.
[43] Xionghui Wang, Jian-Fang Hu, Jian-Huang Lai, Jianguo Zhang, and Wei-Shi Zheng. Progressive teacher-student learning for early action prediction. In CVPR, June 2019.
[44] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In ICLR, 2019.
[45] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
[46] Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. Dynamic key-value memory networks for knowledge tracing. In WWW, pages 765-774, 2017.
[47] He Zhao and Richard P. Wildes. Spatiotemporal feature residual propagation for action prediction. In ICCV, October 2019.
[48] He Zhao and Richard P. Wildes. On diverse asynchronous activity anticipation. In Computer Vision - ECCV 2020, pages 781-799, 2020.
