Learning Temporal Dynamics from Cycles in Narrated Video

Anonymous ICCV submission

Paper ID 3128
Abstract
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We introduce a self-supervised approach to this problem that solves a multi-modal temporal cycle consistency objective, MMCC¹, jointly in vision and language. This objective requires a model to learn modality-agnostic functions to predict the future and past that undo each other when composed. We hypothesize that a model trained on this objective will discover long-term temporal dynamics in video. We verify this hypothesis by using the resultant visual representations and predictive models as-is to solve a variety of downstream tasks. Our method outperforms state-of-the-art self-supervised video prediction methods on future action anticipation, temporal image ordering, and arrow-of-time classification tasks, without training on target datasets or their labels.

¹Multi-Modal Temporal Cycle Consistency

Figure 1. Predicting the future is challenging. Given a frame (a) at time t, previous work focuses on predicting frames at a fixed offset, such as (b). However, these frames are often either redundant or stochastic, motivating the prediction of non-immediate futures. Predicting such a frame is highly non-trivial, as many are irrelevant, such as (c). Aided by the textual information in narrated video, we can learn long-term temporal dynamics in video, and predict (d). We learn these dynamics by solving a multi-modal temporal cycle consistency problem.

1. Introduction

Prediction is a central problem in computer vision which researchers have been grappling with since the early days of the field [10, 12, 21, 28, 33, 38, 53]. Previous deep learning methods have largely focused on predicting fixed, small offsets into the future. To understand why this formulation is flawed, consider Figure 1. This figure shows a frame (a) from a video at time t and three frames at times > t. Which of the three should be the output of a model that predicts the future? Option (d) is closest to the future that humans are likely to imagine. By predicting frames such as option (b), which occur in the immediate future [14, 15, 46, 52], we limit the scope of temporal transitions that can be learned by models and hurt downstream performance.

Motivated by this example, we identify three central challenges in training a model to predict the future. First, manually annotating videos with temporal relationships between frames is prohibitively expensive, and ground truth may be difficult to define. Therefore, models should be able to learn from large unlabeled datasets of in-the-wild action and discover transitions autonomously, to enable practical applications. Second, modeling the complex long-term transitions in the real world requires learning high-level concepts, more naturally found in abstract latent representations than raw pixels. Finally, the duration elapsed by temporal transitions can vary significantly depending on context, and models must be able to make predictions at varied offsets into the future. To satisfy these desiderata, we introduce a new self-supervised training objective, Multi-Modal Temporal Cycle Consistency (MMCC), and a model that learns a representation to solve it.

We show the MMCC objective in Figure 2. Starting from a sampled frame in a narrated video, our model learns to attend among all narration text to retrieve a relevant utterance. Combining both modalities, the model learns a function to predict a latent future, attending over the entire video to retrieve a future frame. This frame's corresponding utterance is estimated, and a function to predict a past frame is learned in a similar way. The cycle constraint requires that the final model prediction be equal to the starting frame.


MMCC addresses all three challenges discussed above. In Figure 1, only (d) is a viable solution to our cycle formulation. Selecting (c) as a future would not allow the model to return to (a), since the two frames have no clear relationship. On the other hand, because the model does not know which modality its input comes from—and therefore must operate equally on vision and language—it is discouraged from selecting lower-level future frames such as (b), which likely do not accompany a predictable change in text.

We show that our model, trained end-to-end from scratch to solve the MMCC objective on the HowTo100M dataset [36], captures long-term dynamics in its predictive model of the future, and can be used without further training to anticipate future actions, order image collections, and identify salient temporal relationships in long videos. It also learns representations of video and text that contain information relevant to modeling temporal dynamics, which we demonstrate to be crucial to the quality of prediction.

Our main contributions are:

• MMCC, a self-supervised multi-modal temporal cycle consistency objective that requires learning visual representations attuned to temporal dynamics, as well as long-term predictive models of the future and past.

• An attention-based model to solve this objective, which uses cross-modal and temporal cues to discover relationships through time in video.

• Since no previous self-supervised benchmarks exist in this area, a suite of qualitative and quantitative tasks to evaluate learned representations and predictive models. Our model outperforms the self-supervised state of the art in video prediction on all tasks.

Figure 2. Learning temporal dynamics with cycles. Given an image (here, second from left) as a start node, our model finds corresponding text to build a start state. From this, our model predicts a future image and again builds a multi-modal state. Finally, our model predicts a past image from the future state. The discrepancy between this prediction of the past and the start image gives our cycle-consistency loss. To solve this problem, we learn the temporal and cross-modal edges using soft attention.

2. Related Work

Modeling the future. Building predictive models of the future is a long-studied task in the computer vision community. Early work considers generating or warping pixels or optical flow to synthesize immediate futures [3, 11, 31, 40, 41, 45, 53, 54, 55, 56, 62]. More recent work attempts to model uncertainty in pixel space, often by learning a distribution of futures that can be sampled [8, 17, 20, 26, 28, 47, 50, 51, 60]. These approaches tend to focus on synthetic or very short-term data, since synthesis is challenging in real video. Rather than predicting pixels, another line of work uses supervision to predict future action labels [18, 23, 27, 35, 44]. Sun et al. [48] also use narrated video, but quantize the input video using Kinetics supervision and then learn a transformer-based model of vision-and-language sequences. Instead of using supervision, Vondrick et al. [52] predict representations which are trained to capture abstract concepts but are obtained automatically on large collections of data. Recent work extends this, using contrastive learning to predict future representations [14, 15, 46]. With very few exceptions [20], this line of work is concerned with predicting time t + 1 given time t. This formulation is highly constraining. Our model can predict arbitrarily far into the future and learns long-term dynamics from unlabeled, narrated video.

Learning from unlabeled narrated video. Self-supervised learning has a long history, dating back to the early 1990s, when De Sa [6] considered audiovisual data to "derive label[s] from a co-occurring input to another modality". We join an increasingly popular line of work and leverage automatic textual transcripts extracted from narrated videos uploaded online. Combining video and text has been widely explored in the deep learning era, with datasets largely focusing on manual textual annotation of video [2, 5, 59, 63] or on movies with accompanying scripts [42, 43]. Other work instead learns from automatic transcripts of narrations in instructional videos [1, 32, 61]. A main benefit of learning from unlabeled video is that it unlocks unprecedented scales of data; Miech et al. [36] introduce a dataset of over 100 million video clips and their narration transcripts, which has since been used to learn strong models of cross-modal correspondence [34]. We are inspired by their success in training vision-and-language models on large collections of narrated video, and build on their data and approach to learn temporal dynamics.


Learning with self-supervised cycles. Cycle consistency was recently proposed [64] as a natural cue for learning from unlabeled data or when ground truth is unavailable. In Zhu et al. [65], cycles are used for unpaired image-to-image translation; Recycle-GAN [4] builds on this in follow-up work that incorporates simple temporal prediction (one timestep into the future) into these cycles. Kulkarni et al. [25] uses cycles to learn mappings between canonical 3D surfaces and 2D images. Dwibedi et al. [9] uses cycles to enforce that moments from two different videos should be mutual nearest neighbors, aligning action sequences and learning features useful for downstream tasks. Another line of work uses cycles to track objects through time [19, 57], tracking a pixel forward and then backward in time and requiring that the final pixel be the same as the start pixel. We are inspired by all these applications and introduce a new type of temporal cycle, one which not only incorporates multi-modal information into its learning, but also predicts dynamically into the future, instead of at a fixed offset. In particular, we draw inspiration from Jabri et al. [19], which casts temporal edges as contrastive comparisons (i.e., attention) among candidate nodes.

Figure 3. Structure of a cycle edge: We learn to embed all visual and textual nodes with ΦV and ΦT. We then compute the cross-modal node corresponding to the start node with attention across the other modality. Both node representations are passed into an MLP gfwd ◦ f, which predicts the future using attention across the start modality. The process is then repeated to go backward in time, replacing gfwd with gback. Our loss trains the model's final output to close the cycle by predicting the start node.

3. Learning to Cycle through Narrated Video

Our model learns long-term temporal dynamics by cycling through narrated video. We formulate the cycle consistency problem as follows: Given a moment Mi in a start modality M (either video V or text T), retrieve a corresponding moment in the other modality M′, then use both modalities to select a future moment in M. From this future moment, find a correspondence in M′, then select a past moment in M. For the cycle to be complete, this final moment must be the same as the initial moment Mi. We illustrate the cycle in Figure 2. Solving this problem requires learning forward- and backward-in-time predictive functions that invert each other, as well as image and sentence embeddings that capture inter-modal correspondences and temporally relevant information.

3.1. Cycles as repeated soft attention

Let V_{t0:t1} and T_{t0:t1} be sequences of video and text, respectively, drawn from some temporal interval [t0, t1]. These sequences can be discretized into frames {V_i}_{i=1}^{N_V} and utterances {T_i}_{i=1}^{N_T}, where N_V, N_T are the number of instances the sequence is split into. We refer to each instance as a node, which allows viewing the training goal as learning a cyclic path through a graph, as depicted in Figure 2. In order to differentiate through the cycle generation process, edges uv in the graph shown in Figure 2 are implemented as soft retrievals of v given u, as shown in Figure 3. This soft retrieval operation can be viewed as an application of the well-known attention mechanism [49].

We start by running all visual and textual nodes through embedding networks ΦT and ΦV initialized with random weights. We use the architecture from [34] for embedding text nodes and a ResNet-18 [16] for visual nodes. This operation yields series of embeddings {z_{V_i}}_{i=1}^{N_V} and {z_{T_i}}_{i=1}^{N_T}.

We then compute the cycle edges, where each edge is an instance of soft attention as described above. The attention operation Attn(Q, K, V) accepts sets of query, key, and value vectors Q ∈ R^{N_Q×d}, K, V ∈ R^{N_K×d} and returns a set of new values Z ∈ R^{N_Q×d}, computed (with τ-temperature softmax along the second dimension) as

\[ Z = \operatorname{Attn}(Q, K, V) = \operatorname{Softmax}\!\left(\frac{QK^\top}{\tau}\right) V. \tag{1} \]
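To make the soft-retrieval edge concrete, here is a minimal sketch of Eq. 1 in PyTorch. The function name, tensor shapes, and example values are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def soft_attention(Q, K, V, tau=0.1):
    """Temperature-scaled soft attention (Eq. 1).

    Q: (N_Q, d) query vectors; K, V: (N_K, d) key/value vectors.
    Returns Z: (N_Q, d), a convex combination of the rows of V,
    so every edge in the cycle stays differentiable.
    """
    scores = Q @ K.T / tau                 # (N_Q, N_K) similarity logits
    weights = F.softmax(scores, dim=-1)    # softmax over candidate nodes
    return weights @ V

# Example: softly retrieve a node among 10 candidates in R^512.
Q = torch.randn(1, 512)
K = V = torch.randn(10, 512)
Z = soft_attention(Q, K, V)  # (1, 512)
```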
To cycle through narrated video, we first select a modality M ∈ {T, V} and a start node Mα ∈ M_{t0:t1} (we describe the process for selecting Mα in Section 3.3). We find the representation of the corresponding node in the other modality M′ with a cross-modal attention edge:

\[ z_{M'_\alpha} = \operatorname{Attn}\!\left(\pi_{M_\alpha},\, \{\pi_{M'_i}\}_{i=1}^{N_{M'}},\, \{z_{M'_i}\}_{i=1}^{N_{M'}}\right). \tag{2a} \]

We learn to project representations z into a shared semantic space, using a modality-specific projector ΠM : R^d → R^d that outputs vectors π. For notational convenience, we denote by Attn_{n0:n1} the same attention operation which considers keys (and values) at indices {n0, ..., n1}. In this notation we can rewrite the above as:

\[ z_{M'_\alpha} = \operatorname{Attn}_{1:N_{M'}}\!\left(\pi_{M_\alpha},\, \{\pi_{M'_i}\},\, \{z_{M'_i}\}\right). \tag{2b} \]

The representations from both modalities are concatenated and run through a multi-layer perceptron f : R^{2d} → R^d. This operation yields zα = f(z_{Mα}, z_{M′α}), embedding the joint information back into the shared semantic space. This choice also allows us to train our temporal edges without cross-modal information and accept input from only one modality with some probability p_unimodal, since ΦV and ΦT also map to z-space, in R^d.

Our model must now go from this multi-modal state representation zα to a future state representation zβ. First, we predict an estimated representation of the future in projection (π) space, with an MLP gfwd. We then retrieve the node in modality M corresponding to this future state with a forward-in-time attention edge:

\[ z_{M_\beta} = \operatorname{Attn}_{1:N_M}\!\left(g_{\mathrm{fwd}}(z_\alpha),\, \{\pi_{M_i}\},\, \{z_{M_i}\}\right). \tag{3} \]

It is important to note that attention is order-invariant in K and V, i.e. shuffling the rows of K and V yields the same output Z, since the individual matrix-row multiplications are agnostic to row index. This means that, importantly, the model is not given temporal information about input nodes, which could be used as a shortcut in learning². We then retrieve the corresponding node in M′, z_{M′β}, with another cross-modal edge, projecting queries and keys into π-space:

\[ z_{M'_\beta} = \operatorname{Attn}_{1:N_{M'}}\!\left(\pi_{M_\beta},\, \{\pi_{M'_i}\},\, \{z_{M'_i}\}\right). \tag{4} \]

As before, these vectors are combined to yield a future state representation zβ = f(z_{Mβ}, z_{M′β}).

This process is repeated to predict backward in time. We compute gback(zβ), where gback shares its first few layers with gfwd to allow learning features useful for dynamics in either direction (see Section 3.5 for more details). To close the cycle, we compute the normalized similarity scores between gback(zβ) and the π-space nodes in M:

\[ p = \operatorname{Attn}_{1:N_M}\!\left(g_{\mathrm{back}}(z_\beta),\, \{\pi_{M_i}\},\, \{1\}\right). \tag{5} \]

We train our system with the negative log likelihood loss on the score vector cycling back to the location of Mα, which we denote iα:

\[ \mathcal{L}_{\mathrm{cycle}} = -\log p(i_\alpha). \tag{6} \]

²E.g., the model could cycle by selecting the node it knows is at t + 1 or t − 1.
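The full cycle composes these attention edges. Below is a hedged sketch of one training step of the cycle objective (Eqs. 2–6), assuming precomputed node embeddings and projections; the module names mirror the notation above, but shapes, weight sharing, and the masking of Section 3.4 are simplified relative to the full method.

```python
import torch
import torch.nn.functional as F

def attend(q, keys, values, tau=0.1):
    """Soft attention edge (Eq. 1) for a single query."""
    return F.softmax(q @ keys.T / tau, dim=-1) @ values

def mmcc_cycle_loss(z_M, pi_M, z_Mp, pi_Mp, i_alpha,
                    f, g_fwd, g_back, proj_M, tau=0.1):
    """One MMCC cycle starting at node i_alpha of modality M (Eqs. 2-6).

    z_M, pi_M:   (N_M, d) z- and pi-space embeddings of the start modality.
    z_Mp, pi_Mp: (N', d) embeddings of the other modality M'.
    f: MLP R^{2d} -> R^d; g_fwd, g_back: predictive MLPs; proj_M: the Pi_M projector.
    """
    # Start state: cross-modal edge (Eq. 2) and joint embedding z_alpha.
    q = pi_M[i_alpha:i_alpha + 1]                                  # (1, d)
    z_cross = attend(q, pi_Mp, z_Mp, tau)
    z_alpha = f(torch.cat([z_M[i_alpha:i_alpha + 1], z_cross], dim=-1))

    # Forward temporal edge (Eq. 3), then another cross-modal edge (Eq. 4);
    # projecting the soft retrieval with proj_M approximates pi_{M_beta}.
    z_beta_M = attend(g_fwd(z_alpha), pi_M, z_M, tau)
    z_beta_cross = attend(proj_M(z_beta_M), pi_Mp, z_Mp, tau)
    z_beta = f(torch.cat([z_beta_M, z_beta_cross], dim=-1))

    # Backward edge (Eq. 5) scores every start-modality node; the cycle
    # loss (Eq. 6) is the negative log likelihood of the start index.
    logits = g_back(z_beta) @ pi_M.T / tau
    return F.cross_entropy(logits, torch.tensor([i_alpha]))
```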


3.2. Cross-modal correspondence

A key component of our cycle model is the ability to find correspondences between vision and language. Eqs. 2b and 4 crucially rely on this ability in order to incorporate multi-modal information into temporal edges. Recent work has demonstrated remarkable progress in training models on massive datasets for this cross-modal retrieval task: given a moment at time t in one modality of a video, Mt, find the matching moment in the other modality M′.

We build on the approach presented in [34], which uses a contrastive loss to train representations of vision and language, where temporally co-occurring information is considered ground truth (Mt should retrieve M′t) and other vision-language pairs are used as negatives. To handle the common misalignment intrinsic to real-world video, [34] allows for representations within k nodes of the ground truth node M′t to be considered as positives. We adopt this approach to learn cross-modal correspondence, training it for finer-grained discrimination among a set of candidate moments drawn from the same video as opposed to randomly across the entire dataset. We denote the loss used to train cross-modal correspondence Lcross-modal. For the full cross-modal formulation, please see Supplementary Material.

3.3. Starting the cycle

Our model will be unable to learn semantic transitions between states if the initial input node depicts noisy or unclear data. This is especially probable when training on unconstrained, real-world video datasets. Therefore, instead of randomly sampling start nodes, we sample from a distribution defined by a "concreteness" score si. We calculate this score for each node Mi ∈ M as the highest cross-modal similarity between Mi and some node M′j in the other modality. Intuitively, this score captures concreteness since frames and utterances that align strongly tend to contain objects or actions which are salient in both modalities:

\[ s_i = \max_j \left( \pi_{M_i} \cdot \pi_{M'_j} \right). \tag{7} \]

We run the above scores through a softmax with τ = 0.1, yielding a distribution from which we sample Mα.
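As an illustration, the following is a minimal sketch of this start-node sampling, assuming the π-space embeddings of both modalities are already computed; the helper name and the use of torch.multinomial are our own choices, not necessarily those of the original implementation.

```python
import torch
import torch.nn.functional as F

def sample_start_node(pi_M, pi_Mp, tau=0.1):
    """Sample a start index from the concreteness distribution (Eq. 7).

    pi_M:  (N_M, d) pi-space embeddings of the start modality.
    pi_Mp: (N', d) pi-space embeddings of the other modality.
    """
    sims = pi_M @ pi_Mp.T                 # (N_M, N') cross-modal similarities
    s = sims.max(dim=1).values            # s_i: best cross-modal match per node (Eq. 7)
    probs = F.softmax(s / tau, dim=0)     # temperature softmax over nodes
    return torch.multinomial(probs, 1).item()
```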


3.4. Avoiding collapse

Training on the above formulation of Lcycle in practice may lead to fast collapse to a simple "looping in place" solution, where temporal edges always point to the current node. We propose two strategies to prevent this collapse:

Constraining candidate nodes. We can limit the range of temporal edges during training by removing nodes from K and V in Eqs. 3 and 5. We rewrite Eq. 3 with Attn_{iα+1:N_M}, i.e., since we know the index the cycle starts from, we can consider only those nodes after the start point in the forward edge. We similarly rewrite Eq. 5 with Attn_{1:index(z_Mβ)−1}, where index(z_Mβ) = arg max_i (g_fwd(zα) · π_Mi), i.e., the index of the node with highest similarity to the latent predicted future. This constrains the backward edge to only consider nodes that precede the estimated current index. This can also be seen as resolving the sign ambiguity inherent to the unconstrained formulation, which allows the model to go back-then-forward or vice versa. Importantly, we run the model without this constraint at test time.

Penalizing visual similarity. Alternatively, we can encourage our model to select visually diverse nodes in its temporal edges:

\[ \mathcal{L}_{\mathrm{sim}} = \max\!\left(\cos(z_{V_\alpha}, z_{V_\beta}) - m,\, 0\right) + \max\!\left(\cos(z_{V_\beta}, z_{V_{\alpha,\mathrm{back}}}) - m,\, 0\right), \tag{8} \]

where z_{Vα,back} is the visual representation given by Eq. 5, replacing the values with {z_Mi}, and m = 0.5 is a margin. In practice, we combine both of the above strategies for the strongest results.

3.5. Implementation

We combine Lcycle, Lcross-modal, and Lsim in our final loss:

\[ \mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{cycle}} + \lambda_2 \mathcal{L}_{\text{cross-modal}} + \lambda_3 \mathcal{L}_{\mathrm{sim}}. \tag{9} \]

We embed images into R^d (d = 512) using a ResNet-18 [16], and embed text using a word embedding matrix followed by an MLP and global pooling, as in [34].

We implement all modules (ΠM, f, gfwd, gback) as MLPs, where each layer is followed by a ReLU and a LayerNorm except for the final layer, which is followed by l2 normalization if its output is in π-space. ΠM and f are one-layer MLPs, and gfwd, gback are four-layer MLPs, with the weights of the first two layers shared. We randomly sample batches of video segments of maximum duration t1 − t0 = 64 sec. The sparsity at which data is sampled affects the time elapsed by input videos in a batch as well as the granularity of visual information provided to the model. Denser data is less likely to miss key moments, but more likely to contain redundant information. We therefore train models at various image sampling frame rates r ∈ {0.25 fps, 0.5 fps, 1 fps}.

Because good cross-modal correspondence is necessary to learn strong, semantic cycles, we initialize λ2 = 1 and exponentially increase λ1 from some small value ε up to 1 across 30 epochs. We peg λ3 = 3λ1 when using the similarity loss. For further details on training and architecture, please see Supplementary Material.

4. Experiments

This section examines the design choices and learned temporal dynamics of our model. Since most previous benchmarks focus on supervised action anticipation with fixed categories and time offsets [5, 24], we design a suite of qualitative and quantitative experiments to evaluate different approaches.

4.1. Data

We train our model on unconstrained real-world video data. Specifically, we use a subset of the HowTo100M dataset [36], which contains around 1.23 million videos and their automatically extracted audio transcripts. Videos in this dataset are roughly categorized by subject area, and we use only the videos categorized "Recipe", around a quarter of the dataset. We build a train-validation-test split such that of the 338,033 total recipe videos, 80% are in train, 15% in validation, and 5% in test. Recipe videos are rich in complex objects, actions, and state transitions, and the subset allows us to train models faster.

For more controlled testing, we use the CrossTask dataset [66], which contains similar videos along with task-specific annotations. Videos are associated with tasks (e.g., "making pancakes"), where each task has a predefined sequence of high-level subtasks with rich long-term temporal inter-dependencies (e.g., ["pour flour into bowl", "crack egg into bowl", ..., "drizzle maple syrup"]). Video segments that depict one of these subtasks are annotated as such.

4.2. Previous work and baselines

Baselines: We evaluate purely cross-modal features (Section 3.2), given by frozen embedding nets ΨV, ΨT, and also use these features as prediction targets for RA and TAP below. We also study ImageNet supervised features [7].

Representation Anticipation (RA): As a representative of the self-supervised line of work on predicting a fixed offset into the future, we implement RA [52] on our data and architecture, training a model to predict frozen representations of a network trained for cross-modal correspondence. In vision, we train the network to anticipate one second into the future, while in text, we anticipate the subsequent utterance (on average, ∼2 seconds into the future). We train:

\[ \operatorname*{arg\,min}_{\Phi_V, \Phi_T, f_{t+1}} -\cos\!\left(f_{t+1}(\Phi_M(M_i)),\, \Psi_M(M_{i+1})\right). \tag{10} \]

Time-Agnostic Prediction (TAP): Noting the restrictive nature of the fixed-offset formulation, TAP [20] introduces a minimum-across-time formulation to allow the prediction of "bottleneck" predictable moments. We implement their loss, taking the minimum across all future moments in the sampled video segment:

\[ \operatorname*{arg\,min}_{\Phi_V, \Phi_T, f_{t+\Delta}} \; \min_{i < i' \le N_M} -\cos\!\left(f_{t+\Delta}(\Phi_M(M_i)),\, \Psi_M(M_{i'})\right). \tag{11} \]

While the above two models do not consider the exact same setting as ours, we re-implement their approaches as faithfully as possible, training them to predict state-of-the-art features trained for cross-modal correspondence.

MemDPC: In order to efficiently model multiple future hypotheses, MemDPC [15] casts future prediction as estimation of convex combinations of memories stored in a codebook, and achieves state-of-the-art performance on tasks of interest. We evaluate their trained visual future prediction model, which does not take textual information as input.
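To make the two re-implemented objectives concrete, here is a rough sketch of the fixed-offset (Eq. 10) and minimum-across-time (Eq. 11) losses, assuming frozen target embeddings Ψ and trainable predictions; batching and the text branch are omitted, and the function names are our own.

```python
import torch
import torch.nn.functional as F

def ra_loss(pred_next, psi_next):
    """Representation Anticipation (Eq. 10): predict the embedding at a
    fixed offset (the next node) and maximize cosine similarity to the
    frozen target Psi_M(M_{i+1})."""
    return -F.cosine_similarity(pred_next, psi_next, dim=-1).mean()

def tap_loss(pred_future, psi_future):
    """Time-Agnostic Prediction (Eq. 11): compare one prediction against
    every future node's frozen embedding and keep only the best match,
    letting the model latch onto predictable "bottleneck" moments.

    pred_future: (1, d) predicted future embedding for node i.
    psi_future:  (K, d) frozen embeddings of all nodes i' > i.
    """
    sims = F.cosine_similarity(pred_future, psi_future, dim=-1)  # (K,)
    return -sims.max()   # min over -cos == keep the most similar future
```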


4.3. Evaluating cycle consistency

Central to our formulation is the model's ability to learn dynamic predictions of the future and past that undo each other, as well as to find strong cross-modal correspondences. Thus, we begin by evaluating how well different model variants are able to solve our self-supervised objective on the Recipes test set. We ablate various design choices, including multi-modal information usage, cycle edge order, and temporal constraints on edges.

Multi-modal information: As an alternative to defining the state as a learned combination of visual and textual representations zα = f(z_{M,α}, z_{M′,α}), we can use only one modality at a time, giving zα = z_{M,α}. The frequency at which only unimodal information is used can be controlled by a hyperparameter p_unimodal.

Cycle path: The above formulation navigates between moments in the start modality M, optionally using information from M′ to augment representations. We denote this variant Within modalities. The order of these edges can also be permuted, such that cycles start in M, retrieve a moment in M′, find a future moment in M′, then cycle back through M. This variant is denoted Across modalities.

Evaluating variants: To compare between different variants, we measure the average percentile rank (e.g. 100 = ground truth is ranked first among all candidates, 50 = ranked in the middle, 0 = ranked last) assigned by our model to ground truth cross-modal and cycle nodes. We show this ablation study in Table 1, observing significant gains using our cycle configuration. We hypothesize that across-modality cycles perform worse since switching modalities acts as a bottleneck, forcing the model to discard information that would be useful for subsequent edges.

Table 1. Cycle-back and cross-modal accuracy: We evaluate models on the percentile rank assigned to ground truth in the cycle and cross-modal tasks on the Recipes test set (100 = ground truth ranked first, 0 = ranked last). Used options are shown in bold. *Without any temporal constraint, training collapses.

| Choice | Variant | Percentile rank (Cycle) | Percentile rank (Cross-modal) |
| --- | --- | --- | --- |
| Temporal constraint | None* | - | - |
| | Similarity loss | 93.1 | 74.4 |
| | Max-index | 92.6 | 74.3 |
| | **Max-index + sim. loss** | 93.6 | 75.7 |
| Multi-modal info. | p_unimodal = 1 | 89.8 | 74.3 |
| | **p_unimodal = 0.5** | 93.6 | 75.7 |
| | p_unimodal = 0 | 96.5 | 75.9 |
| Start point selection | **Cross-modal similarity** | 93.6 | 75.7 |
| | Random | 88.7 | 74.5 |
| Input embedding | **Fine-tuned** | 93.6 | 75.7 |
| | Frozen cross-modal [34] | 67.5 | 76.8 |
| Cycle path | **Within modalities** | 93.6 | 75.7 |
| | Across modalities | 85.0 | 73.2 |
| Chance | | 50.0 | 50.0 |

Visualizing cycles: We show examples of cycles discovered by the trained model in Figure 4. Our model correctly cycles back around 66% of the time (chance is 4%). The model appears to traverse video according to long-term dynamics, as hypothesized. Note that these transitions occur up to one minute apart, highlighting the importance of allowing dynamic prediction offsets.

Figure 4. Emergent long-term temporal dynamics: We show examples of learned model cycles in the Recipes test set. Given a start node (top: text, bottom: image) sampled as described in Section 3.3, we show the retrieved cross-modal node, the predicted future node and its cross-modal retrieval, and the model's final backward prediction. In the bottom row, we show a failure case, where the forward edge skips too far ahead and breaks the cycle.

4.4. Zero-shot prediction

Ranking transitions by likelihood: To directly evaluate the learned representations π and functions gfwd and gback, we can visualize the pairs of frames (u, v) for which the probability of v being the future of u is highest. We model this probability as the product of the likelihood of states u and v and the forward and backward likelihood of u → v:

\[ P_{\mathrm{fwd}}(v \mid u) = \frac{e^{\pi_{\mathrm{fwd}} \cdot \pi_v}}{\sum_{m \in M} e^{\pi_{\mathrm{fwd}} \cdot \pi_m}}, \quad P_{\mathrm{back}}(u \mid v) = \frac{e^{\pi_{\mathrm{back}} \cdot \pi_u}}{\sum_{m \in M} e^{\pi_{\mathrm{back}} \cdot \pi_m}}, \quad P(u \to v) = P_{\mathrm{fwd}}(v \mid u) \cdot P_{\mathrm{back}}(u \mid v) \cdot P(u) \cdot P(v), \tag{12} \]

where π_fwd, π_back are the result of running u and v (optionally with cross-modal information) through gfwd, gback, and P(x) is the concreteness score defined in Equation 7.

We compute this probability efficiently for all n² pairs in long clips of continuously sampled video (n ≈ 600).
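A sketch of this all-pairs scoring is shown below, under the simplifying assumptions that a single set of per-frame π projections is available, that g_fwd and g_back have already been applied to produce π_fwd and π_back, and that the concreteness scores are already softmax-normalized; the log-space accumulation is our own implementation choice.

```python
import torch
import torch.nn.functional as F

def transition_log_probs(pi, pi_fwd, pi_back, concreteness):
    """Score all n^2 candidate transitions u -> v (Eq. 12), in log space.

    pi:           (n, d) pi-space embeddings of the sampled frames.
    pi_fwd:       (n, d) g_fwd outputs, one per potential start frame u.
    pi_back:      (n, d) g_back outputs, one per potential future frame v.
    concreteness: (n,) positive scores P(x) from Eq. 7.
    Returns an (n, n) matrix whose (u, v) entry is log P(u -> v).
    """
    log_p_fwd = F.log_softmax(pi_fwd @ pi.T, dim=1)    # row u: log P_fwd(. | u)
    log_p_back = F.log_softmax(pi_back @ pi.T, dim=1)  # row v: log P_back(. | v)
    log_c = torch.log(concreteness)
    return (log_p_fwd                  # log P_fwd(v | u)
            + log_p_back.T             # log P_back(u | v), transposed to (u, v)
            + log_c[:, None]           # log P(u)
            + log_c[None, :])          # log P(v)
```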


We then look at the top temporal transitions discovered by the model in each video. We show results on the Recipes test set in Figure 5. The top transitions show clear state transitions such as adding chocolate chips to dough, segmenting an orange, and baking a loaf. These predictions could not be made by a model trained to predict a fixed future, since they occur at varied temporal offsets.

Figure 5. Discovering transitions in video: Once trained, the learned model of temporal dynamics can be applied to long (5-10 min.) video sequences to discover the most salient pairwise transitions u → v. We compute the probability of a transition as defined in Eq. 12 and show the highest-score transitions in 10 different Recipes test set videos.

Predicting future actions: Existing benchmarks, e.g. [5, 24], focus on predicting action from a few frames in the immediate past or present. Instead, given a few frames, we wish to predict long-term temporal dynamics, which may unfold arbitrarily far into the future. While the former task is more well-defined, the latter is more interesting and relevant. However, ground truth for this task – i.e., per-frame annotation of related future action – is not widely available. We propose using CrossTask task steps as a proxy, since they capture long-term temporal relationships in video.

For a video V belonging to task τ (with Nτ predefined subtasks), let v ∈ V be a clip with subtask label T_{τ,i} (ith in the predefined sequence). We would like to predict future actions T_future = {T_{τ,j}}_{j=i+1}^{Nτ} from v. For example, given a short clip of eggs being added to a bowl with flour, the model should assign high likelihoods to subtasks such as "mix batter" and low likelihoods to "crack eggs" or "season steak". Formally, we define a future likelihood score given a video segment v and candidate future subtask Tj. We first sample frames from the video segment and compute their average embedding z̄v. The likelihood score uses our learned representations and predictive model to define a score s_{v→j} = gfwd(z̄v) · π_{Tj}. We compute likelihood scores for all subtask descriptions in the CrossTask validation set, and consider the model's prediction correct if any of the future actions in T_future are predicted, since not all future subtasks are necessarily related to the given visual state.

Table 2 shows recall and percentile rank statistics for this task. We compare our model to [15, 20, 52], replacing gfwd with each method's predictive model. Since [15] is vision-only, we set π_{Tj} to the average visual representation of all video segments with a given subtask label Tj. We also define a cross-modal similarity score s_{v→j} = π̄v · π_{Tj} as a strong baseline, taking advantage of contextual similarities in video and text. Our model outperforms all baselines and the self-supervised state of the art on detecting the temporal relationships between visual states and future actions.

Table 2. Predicting future actions: We evaluate models' ability to anticipate action at a high level, potentially minutes into the future, without any fine-tuning. On the CrossTask dataset [66], our model outperforms the previous self-supervised state of the art in inferring possible future actions. *We evaluate MemDPC by clip retrieval since it does not have a textual representation.

| Model | Recall @1 | Recall @5 | Recall @10 | Perc. rank (Worst) | Perc. rank (Mean) | Perc. rank (Best) |
| --- | --- | --- | --- | --- | --- | --- |
| MemDPC* [15] | 3.6 | 12.1 | 22.8 | 25.0 | 45.8 | 68.9 |
| Cross-modal [34] | 2.9 | 14.2 | 24.3 | 28.2 | 47.9 | 68.2 |
| Repr. Ant. [52] | 3.0 | 13.3 | 26.0 | 25.7 | 47.7 | 71.4 |
| TAP [20] | 4.5 | 17.1 | 27.9 | 28.3 | 50.1 | 71.6 |
| MMCC (ours) | 5.4 | 19.9 | 33.8 | 33.0 | 55.0 | 76.9 |

4.5. Further analysis

Unshuffling bags of frames: The ability to order a shuffled set of states is used to evaluate human language and common-sense reasoning skills, and has been explored as a learning signal in NLP [13, 29, 30]. This same ability can also be used to discover temporal structure and summaries of events from large image datasets, as in [22]. We solve this problem by finding the optimal explanation of shuffled video given by iterative application of our temporal dynamics model. Out of all n! possible orderings {i1, i2, ..., in}, we select the one for which ∏_j P(x_{i_j} → x_{i_{j+1}}) is highest.

Given P(u → v) scores computed by Eq. 12, we induce a fully-connected directed graph with sampled frames as nodes and edge weights given by weight(u→v) ≡ −log P(u → v). Adding a special null node connected to all other nodes with edge weight 0 allows running this graph through an off-the-shelf traveling salesperson problem (TSP) solver³. The optimal TSP solution then represents the lowest-cost (ordered) path through all video clips, effectively unshuffling the input.

³https://pypi.org/project/elkai/
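The following sketch illustrates the ordering procedure on a small set of clips. For brevity it brute-forces the n! orderings with itertools rather than calling the off-the-shelf TSP solver referenced above, which would be the practical choice for larger n; the function and variable names are ours.

```python
import itertools
import torch

def best_ordering(log_p):
    """Order clips by maximizing the product of transition probabilities.

    log_p: (n, n) tensor of log P(u -> v) scores (e.g., from Eq. 12).
    Equivalent to the minimum-cost open path under weights -log P(u -> v);
    for large n, replace the brute force with a TSP solver plus a null node.
    """
    n = log_p.shape[0]
    best, best_score = None, float("-inf")
    for perm in itertools.permutations(range(n)):
        score = sum(log_p[perm[j], perm[j + 1]].item() for j in range(n - 1))
        if score > best_score:
            best, best_score = perm, score
    return list(best)

# Example with 5 shuffled clips and random scores.
scores = torch.randn(5, 5)
print(best_ordering(scores))
```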


We run this experiment on CrossTask, where videos are annotated with ordered steps and their associated temporal segments. We treat each segment as a node by computing its average visual representation, as before. We then use these representations to find P(u → v) scores between labeled segments and solve for an optimal path. We run this experiment both in vision only (by passing projected visual representations directly into g in Eq. 12) and with ground truth vision-text pairings, and show results in Table 3. We show example predicted orderings in Figure 6. Again, we can replace gfwd with the future prediction model in other methods and run the same algorithm. Our model outperforms previous work on all evaluation metrics.

Table 3. Unshuffling image collections: As defined in Eq. 12, log P(u → v) gives a log-likelihood score of v being the future state of u. These scores induce a graph which is optimally traversed using a TSP solver. Each model above defines a different P and is applied to shuffled videos from the Recipes test set. Our model outperforms previous state of the art on all metrics.

| Model | Kendall's τ (↑) | Spearman's ρ (↑) | Edit dist. (↓) |
| --- | --- | --- | --- |
| Chance | 0.0000 | 0.0000 | 6.5822 |
| Repr. Ant. [53] | 0.3383 | 0.4132 | 5.4596 |
| MemDPC [15] | 0.3492 | 0.4206 | 5.3398 |
| TAP [20] | 0.3344 | 0.4107 | 5.4178 |
| MMCC (ours) | 0.3632 | 0.4420 | 5.3343 |
| MMCC (vision only) | 0.3530 | 0.4328 | 5.3370 |

Figure 6. Unshuffling image collections: We show example video sequences in the ordering given by treating the induced graph of log-likelihoods as an instance of the traveling salesperson problem. We list the ground truth index in the sequence under each clip. Even accepting only sparse video frames as input, our model makes reasonable predictions on this challenging task.

Discovering the arrow of time: To further examine whether our model has learned to discover meaningful transitions between states, we explore the arrow of time classification task, introduced in [37, 39, 58]. In [58], a network is trained on short videos (on the order of seconds) to predict whether the input is being played forward or backward.

We consider the more challenging task of predicting the temporal relationship between two far-apart input frames – which one comes first? For frames which depict unrelated moments, this task is perhaps near-impossible, even for humans. But if frames show semantically related states, the direction of likely transition provides a useful signal for solving the arrow-of-time task.

We train linear classifiers on top of frozen features, as well as a full network from scratch, to solve the arrow of time task on randomly shuffled pairs of frames. We sample pairs of frames using our learned predictive model by selecting the highest-probability futures of start frames selected with the concreteness prior (Eq. 7). We demonstrate in Table 4 that the temporal ordering of frames mined by our model is much more classifiable than that of frames sampled using the predictive models in previous work. Further, our learned features are much better able to classify a given pair of frames, since they must capture temporal information during training. This confirms that a strong understanding of dynamics emerges from the cycle consistency task.

Table 4. Learning the arrow of time: We train a linear layer on frozen features as well as a full model from scratch (last row) to detect which of two input frames comes first. This task is near-impossible when sampling random frames. Using our temporal model to sample frames leads to a significant improvement in performance, indicating that our model can identify salient transitions in video. Our visual embedding also outperforms other models, highlighting the importance of temporally-attuned representations.

| Features \ Sampling strategy | Rand | Cos sim | TAP | RA | Model | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Random | 50.4 | 50.7 | 51.3 | 51.4 | 51.2 | 51.0 |
| ImageNet | 51.5 | 52.1 | 50.8 | 50.9 | 53.4 | 51.7 |
| Cross-modal | 52.6 | 53.3 | 50.9 | 50.6 | 55.8 | 52.6 |
| Repr. Ant. [52] | 50.7 | 51.4 | 51.2 | 51.2 | 51.7 | 51.2 |
| TAP [20] | 50.8 | 51.4 | 51.4 | 51.3 | 51.8 | 51.3 |
| MMCC (ours) | 52.3 | 53.4 | 50.5 | 50.7 | 69.2 | 55.2 |
| Average | 51.4 | 52.0 | 51.0 | 51.0 | 55.5 | 52.2 |
| Chance | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| From scratch | 51.1 | 52.1 | 51.8 | 51.4 | 62.5 | 53.8 |
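As a rough illustration of the linear-probe protocol above, the sketch below trains a logistic-regression classifier to predict which of two frozen frame embeddings comes first; the pair construction and the choice of concatenated features are our own simplifications, not the exact probe used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def arrow_of_time_probe(feats_u, feats_v):
    """Linear arrow-of-time probe on frozen features.

    feats_u, feats_v: (n, d) arrays of frame embeddings where, for every i,
    frame u_i occurs before frame v_i in the source video.
    Returns a classifier that predicts 1 if the first frame precedes the second.
    """
    forward = np.concatenate([feats_u, feats_v], axis=1)   # correct order -> label 1
    backward = np.concatenate([feats_v, feats_u], axis=1)  # swapped order -> label 0
    X = np.concatenate([forward, backward], axis=0)
    y = np.concatenate([np.ones(len(forward)), np.zeros(len(backward))])
    return LogisticRegression(max_iter=1000).fit(X, y)
```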


5. Conclusion

We introduce a self-supervised method to learn temporal dynamics by cycling through narrated video. Despite the simplicity of our architecture, our model is able to discover long-term state transitions in vision and language. We show that this model can be applied, without further training, to challenging downstream tasks such as anticipating far-away action and ordering collections of image data.

References

[1] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
[2] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, 2017.
[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv:1710.11252, 2017.
[4] Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-GAN: Unsupervised video retargeting. In ECCV, 2018.
[5] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The EPIC-Kitchens dataset. In ECCV, 2018.
[6] Virginia R de Sa. Learning classification with unlabeled data. In NeurIPS, 1994.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[8] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv:1802.07687, 2018.
[9] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In CVPR, 2019.
[10] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. arXiv:1710.05268, 2017.
[11] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. arXiv:1605.07157, 2016.
[12] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
[13] Jingjing Gong, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. End-to-end neural sentence ordering using pointer network. arXiv:1611.04953, 2016.
[14] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In ICCV Workshops, 2019.
[15] Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. arXiv:2008.01065, 2020.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In NeurIPS, 2018.
[18] De-An Huang and Kris M Kitani. Action-reaction: Forecasting the dynamics of human interaction. In ECCV, 2014.
[19] Allan Jabri, Andrew Owens, and Alexei A Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020.
[20] Dinesh Jayaraman, Frederik Ebert, Alexei A Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. In ICLR, 2019.
[21] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
[22] Gunhee Kim, Leonid Sigal, and Eric P Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, 2014.
[23] Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and Martial Hebert. Activity forecasting. In ECCV, 2012.
[24] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014.
[25] Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019.
[26] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A conditional flow-based model for stochastic video generation. arXiv:1903.01434, 2019.
[27] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
[28] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv:1804.01523, 2018.
[29] Haejun Lee, Drew A Hudson, Kangwook Lee, and Christopher D Manning. SLM: Learning a discourse language representation with sentence unshuffling. arXiv:2010.16249, 2020.
[30] Lajanugen Logeswaran, Honglak Lee, and Dragomir Radev. Sentence ordering and coherence modeling using recurrent neural networks. arXiv:1611.02654, 2016.
[31] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv:1605.08104, 2016.
[32] Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy. What's cookin'? Interpreting cooking videos using text, speech and vision. arXiv:1503.01558, 2015.
[33] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440, 2015.
[34] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020.
[35] Antoine Miech, Ivan Laptev, Josef Sivic, Heng Wang, Lorenzo Torresani, and Du Tran. Leveraging the present to anticipate the future in videos. In CVPR Workshops, 2019.
[36] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.


[37] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
[38] Nemanja Petrovic, Aleksandar Ivanovic, and Nebojsa Jojic. Recursive estimation of generative models of video. In CVPR, 2006.
[39] Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. In CVPR, 2014.
[40] Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Déjà vu. In ECCV, 2014.
[41] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: A baseline for generative models of natural videos. arXiv:1412.6604, 2014.
[42] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In CVPR, 2015.
[43] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Chris Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. IJCV, 2017.
[44] Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. Encouraging LSTMs to anticipate actions very early. In ICCV, 2017.
[45] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[46] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Learning video representations using contrastive bidirectional transformer. arXiv:1906.05743, 2019.
[47] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. arXiv:1902.09641, 2019.
[48] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[50] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv:1706.08033, 2017.
[51] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. arXiv:1704.05831, 2017.
[52] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
[53] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS, 2016.
[54] Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
[55] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.
[56] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.
[57] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
[58] Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In CVPR, 2018.
[59] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
[60] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NeurIPS, 2016.
[61] Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In ACM MM, 2014.
[62] Jenny Yuen and Antonio Torralba. A data-driven approach for event prediction. In ECCV, 2010.
[63] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
[64] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In CVPR, 2016.
[65] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[66] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In CVPR, 2019.
