
Narrative Dataset: Towards Goal-Driven Narrative Generation

Karen Stephen∗ Rishabh Sheoran∗ Satoshi Yamazaki∗


NEC Corporation National University of Singapore NEC Corporation
Japan Singapore Japan
stephenkaren@nec.com rishabh.sheoran@u.nus.edu s-yamazaki31@nec.com

ABSTRACT

In this paper, we propose a new dataset called the Narrative dataset, which is a work in progress, towards generating video and text narratives of complex daily events from long videos captured from multiple cameras. As most existing datasets are collected from publicly available videos such as YouTube videos, there are no datasets targeted towards the task of narrative summarization of complex videos that contain multiple narratives. Hence, we create story plots and conduct video shooting with hired actors to create complex video sets in which 3 to 4 narratives happen in each video. In the story plot, a narrative is composed of multiple events corresponding to video clips of key human activities. On top of the shot video sets and the story plots, the Narrative dataset contains dense annotation of actors, objects, and their relationships for each frame as the facts of narratives. The Narrative dataset therefore richly captures the holistic and hierarchical structure of facts, events, and narratives. Moreover, the Narrative Graph, a collection of scene graphs of narrative events with their causal relationships, is introduced to bridge the gap between the collection of facts and the generation of the summary sentences of a narrative. Beyond related subtasks such as scene graph generation, the Narrative dataset potentially provides challenging subtasks for bridging human event clips to narratives.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Machine learning; • General and reference → Surveys and overviews.

KEYWORDS
Narrative generation, datasets, video summarization

ACM Reference Format:
Karen Stephen, Rishabh Sheoran, and Satoshi Yamazaki. 2022. Narrative Dataset: Towards Goal-Driven Narrative Generation. In Proceedings of the 1st Workshop on User-centric Narrative Summarization of Long Videos (NarSUM '22), October 10, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3552463.3557021

∗Authors contributed equally to the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
NarSUM '22, October 10, 2022, Lisboa, Portugal
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9493-2/22/10. . . $15.00
https://doi.org/10.1145/3552463.3557021

1 INTRODUCTION

Our daily life is full of narratives: series of events connected in a causal manner. Factually well-structured and described narratives are seen in novels, biographies, newsreels, and chronicles. In our daily life, many events are visually observed and happen concurrently. However, some events are causally related to each other and form narratives, while others are not linked to any narrative, or are simply not interesting. Due to the flood of uninteresting events, people easily miss narratives of their interest. What if we could easily find narratives of our interest in video footage of our daily life?

As vision and language research areas, including video storytelling [7], scene graph generation [1, 3, 9, 12], and video summarization [6, 13], have been emerging, we envision the realization of a new technology: goal-driven narrative generation, proposed in [16]. The aim of narrative generation is to generate a coherent and succinct narrative summary of interest from long videos, separating it from uninteresting events. The proposed goal-driven narrative generation takes the user's interest as an input, finds the events of interest, and generates summarized narrative sentences or video clips. A key advantage is that goal-driven narrative generation can summarize, in other words compress, video content by removing uninteresting and irrelevant events. Such compression is especially effective when the videos are long or are shot simultaneously from multiple locations. As a result, the generated summary allows the user to focus on, and act on, only the necessary parts of daily life videos without omission. Such a narrative summary based on the long-term causalities of human events is useful for productivity monitoring, such as in smart manufacturing or smart retail. In line with Industry 4.0, there are high expectations that advanced AI models will generate informative reports from big data. In the factory domain, narrative generation can be used to report why production quality has dropped. Besides, retail markets demand customer demographic analysis reports to understand how product advertisement and promotion affect customer engagement and sales.

Despite the demand for narrative generation use cases, narrative generation is still in the early stages of research, as it poses some unique challenges. One challenge is how to retrieve the series of events of interest to the user from the massive number of events present in long videos. Another challenge is how to identify the causal relationships between the retrieved events. To benchmark this problem, tasks based on footage in which multiple stories happen simultaneously are required. However, there are no existing datasets based on complex videos that contain multiple narratives. Most existing datasets are collected from publicly available videos such as YouTube videos. The content of a dataset's source videos determines which video analytics tasks it can support.

In order to solve the challenges related to narrative generation, and as a first step to accelerate research in this area, we created a new dataset called the Narrative dataset. In this paper, we make the following contributions:


• We conducted video shooting to create sets of video footage in which 3-4 narratives are acted out based on pre-written story plots. The Narrative dataset is constructed on top of the shot videos.
• Rich and dense annotation covering the appearing actors, objects, and their relationships was conducted on the Narrative dataset. By introducing a new intermediate data structure, the Narrative Graph, the Narrative dataset richly captures the holistic and hierarchical structure of facts, events, and narratives.
• We discuss the potential research tasks for which the Narrative dataset can be utilized through the multi-level dense annotation it provides. This includes a novel task of narrative element retrieval, which involves generating intermediate metadata that fills the gap between human event clips and factual narratives.

The rest of the paper is organized as follows: Section 2 provides a literature review of related datasets in vision and language research. Section 3 covers the details of our Narrative dataset, including our proposed intermediate data structure, the Narrative Graph, and how we conducted the data curation and annotation process. In Section 4, we discuss the potential research tasks for which the Narrative dataset can be utilized. Note that the Narrative dataset is a work in progress, and we discuss the remaining work to complete its construction in Section 5.

2 RELATED WORKS

In this section, we go through the related works and review existing datasets.

2.1 Summarization & Captioning

The task of video summarization focuses on generating a shorter version of a video by aggregating segments that capture its essence [2, 11, 15]. The aim is to select the most varied and representative frames from the video. However, a visual summary created in this way is often not enough for understanding the events and narratives in a long and complex video that may involve multiple people, interactions, and long-range dependencies. Another line of work focuses on visual paragraph generation, which aims to provide detailed descriptions of images or videos. This includes fine-grained dense description of images [4, 5] as well as storytelling of photo streams [8, 10]. However, these are limited to images or short videos. Goal-driven narrative generation as introduced in [16] focuses on long videos from single or multiple cameras with more complex event dynamics, while not aiming to describe every detail presented in the videos. Instead, the focus is on extracting the important events and composing a coherent and succinct narrative summary.

2.2 Datasets

Video storytelling dataset: The Video Storytelling dataset [7] consists of long YouTube videos (average length 12 minutes 35 seconds) belonging to four classes (birthday, camping, wedding, and Christmas). Each video is labelled with one of the four classes and is annotated with captions in the form of multiple sentences describing the events happening in the video. While the inputs and outputs of the Video Storytelling dataset seem similar to ours, each of its videos is limited to a single class. Moreover, our dataset consists of long untrimmed videos where each video contains complex events captured from multiple cameras.

Datasets with scene graph annotations: Recent datasets [1, 3, 9, 12] used for the task of action recognition provide annotated atomic actions and scene graphs for each action category. The Action Genome dataset [3] is built on the Charades dataset [14], which contains videos where one person performs various actions, additionally annotated with spatio-temporal scene graphs capturing the relationships between the person and the objects. The Home Action Genome dataset [12] was introduced for the task of action recognition by leveraging multiple modalities and multiple views, including egocentric views. It includes trimmed videos of daily activities performed by a single person in a home setting. The ground truth annotations include video-level activity labels, temporally localized atomic action labels, and spatio-temporal scene graphs with human-object interaction annotations. But these datasets have relatively simple activities that are performed by a single person and span only a few seconds. The Multi-Object Multi-Actor dataset (MOMA) [9] goes further by having videos with multiple actors and multiple objects associated with each activity. It also provides multiple levels of annotation, including video-level activities, sub-activities, atomic actions, and action hyper-graphs (representing the relationships between entities). The videos in these datasets are trimmed in nature, where only a single activity spans the entire duration of the video.

Video summarization datasets: One line of work in video summarization deals with query-focused video summarization, where the aim is to generate summaries that include parts of the video that are important to the video as a whole as well as related to an input query. The dataset [13] used for this task is created from the UT Egocentric dataset [6] by providing additional annotations of concepts, queries, and video summaries for each query. The dataset contains 48 concepts, which include the objects present in the videos. It tries to capture the semantics of the videos by providing dense tags corresponding to each shot, but it is limited to object labels only and does not include any information related to actions or interactions.

Our dataset, on the other hand, is quite different from the existing datasets. We provide untrimmed long videos, captured from multiple cameras, containing complex events like shoplifting, theft, loitering, etc. In the videos, these complex events happen while other activities simultaneously take place in the same scene (like purchasing a store item, reading a book, or talking), making it a more challenging dataset. The dataset is richly annotated with multiple levels of information. On the lowest level, we provide annotations for object and human bounding boxes and labels, atomic actions of people, and spatio-temporal scene graphs containing human-object and human-human interactions. On the mid-level, we provide annotations corresponding to the narrative elements associated with each complex event. Finally, at the highest level, annotations corresponding to the temporal segments and narrative story (including summary frames) of each complex event in the video are provided.
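To make the three annotation levels concrete, they can be pictured as one nested record per video. The sketch below is purely illustrative: the field names, IDs, and values are hypothetical, not the dataset's released schema.

```python
# Hypothetical sketch of a multi-level annotation record.
# All field names and values are illustrative, not the dataset's actual schema.

frame_level = {  # lowest level: per-frame facts
    "frame": 1204,
    "camera": "cam2",
    "boxes": [
        {"id": "person_3", "label": "person", "bbox": [412, 90, 118, 260],
         "atomic_action": "walking"},
        {"id": "bicycle_1", "label": "bicycle", "bbox": [640, 210, 160, 120]},
    ],
    # spatio-temporal scene graph triples: <subject, predicate, object>
    "relationships": [("person_3", "at side of", "bicycle_1")],
}

mid_level = {  # narrative elements tied to one complex event
    "event_class": "bicycle theft",
    "narrative_elements": [
        {"ne_id": "NE1", "clip": (1180, 1290),
         "facts": [("person_2", "picking", "bicycle_1")]},
    ],
}

top_level = {  # temporal segment and narrative story for the event
    "event_class": "bicycle theft",
    "segment": (1180, 1650),
    "story": "A man parks his bicycle outside the store; a thief rides it away.",
}
```

The point of the nesting is that each level can be evaluated independently (detection, retrieval, and summarization, respectively) while sharing the same object IDs.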


Figure 1: Sample video frames of Narrative dataset.

3 NARRATIVE DATASET

As a first step towards realizing the goal of goal-driven narrative summary generation, we introduce a new dataset called the "Narrative Dataset" for benchmarking purposes. The Narrative dataset consists of long videos with multiple complex events happening in them, involving multiple people interacting with each other and with various objects. It contains videos shot both indoors and outdoors. Each scenario, called a story class, is captured using multiple cameras placed at different locations, with some scenarios having overlapping camera views. Some sample video frames are shown in Figure 1.

Through the Narrative dataset, we propose a new concept called the Narrative Graph, which can be considered a new intermediate data structure that can aid in the process of narrative generation. The concept of the Narrative Graph is explained below.

3.1 Narrative Graph

A Narrative Graph is the representation of a network that provides factual information about the different events of a story in a long video, or in multiple videos, for a given story class. The Narrative Graph can be used to generate a narrative for the given story class and video. The Narrative Graph corresponding to a story class is made up of two components: the Narrative Elements, which form the nodes of the graph, and the Narrative Element Dependencies, which are the edges between the nodes. The definitions of Narrative Elements and Narrative Element Dependencies are detailed below.

Narrative Elements: A Narrative Element (NE) comprises objects, actions, interactions between objects, and attributes from the video that represent the factual information about the events that describe a narrative corresponding to a story class.

Narrative Element Dependency: The Narrative Elements have temporal and causal dependencies between them, which are covered at a more semantic level using Narrative Element Dependencies. A Narrative Element Dependency of NE2 on NE1 means that NE1 must occur before NE2. It is important to include these dependencies, as they describe the order of the events and how one event depends on another. For example, as shown in Fig. 2, consider a case of bicycle theft from the parking area of a department store, where the victim parks the bicycle outside the store and a thief steals it. The video will have segments that cover the footage of the victim parking his bicycle and the thief stealing it. When Narrative Elements are pulled out for an event class of bicycle theft, some Narrative Elements will cover the part where the bicycle is parked by the victim, and after that there will be some Narrative Elements covering the segment where the thief steals the bicycle and rides away. The Narrative Elements representing the latter segment will have a dependency on the Narrative Elements representing the former segment, and these dependencies must be captured to write a coherent story.

3.2 Data Curation

Narrative generation should be able to retrieve video clips of a narrative of interest from long videos. For the evaluation dataset, it is preferable that the set of long videos contains multiple narratives involving causally connected events. Therefore, we decided to conduct our own video shooting to create a semantically rich and dense dataset.


Figure 2: An example of a Narrative Graph generated using the annotator's narrative story, the 'bicycle theft' story class, and the video segment of the story class. Each sentence of the narrative story describes the events in a clip from the video and has a corresponding Narrative Element. The Narrative Element Dependencies between different Narrative Elements are also shown, which maintain the coherence of the sentences in the story.
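The structure illustrated in Figure 2 can be sketched as a small directed graph: nodes are Narrative Elements and edges are Narrative Element Dependencies, so any topological order of the nodes respects every dependency and yields a coherent sentence order. This is a minimal sketch with assumed NE names; for readability the edges below are stored as (earlier, later) pairs, the reverse of the (NE2, NE1) tuple convention used in the annotation.

```python
from collections import defaultdict, deque

# Illustrative Narrative Graph for a 'bicycle theft' story:
# park bicycle (NE1) -> steal bicycle (NE2) -> ride away (NE3).
# Edge (a, b) means a must occur before b.
nodes = {"NE1", "NE2", "NE3"}
dependencies = [("NE1", "NE2"), ("NE2", "NE3")]

def narration_order(nodes, edges):
    """Kahn's algorithm: order NEs so that every dependency is respected."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

print(narration_order(nodes, dependencies))  # ['NE1', 'NE2', 'NE3']
```

A generation model could emit one sentence per node in this order, so the "steals" sentence can never precede the "parks" sentence.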

First, we listed story plots, where each story plot involves multiple events performed over a duration of 2-5 minutes. Then actors and shooting locations were arranged for the plots.

3.2.1 Story Plot. Table 4 lists the story classes used in our Narrative dataset, along with their descriptions and actors' roles. Each story involves 1-2 actors, each assigned a role such as criminal or victim. Based on the scenario descriptions in Table 4, we created story plots with concrete actions. The actions were slightly modified according to the actors' attributes and shooting locations.

3.2.2 Video Shooting. We ran video shooting with hired actors to curate target videos in which 3-4 stories from Table 4 happen. Figure 1 shows sample video frames. There are four shooting locations: a store, a bar, a library, and an outdoor area. For each video set, we deployed 3 cameras at one of the shooting locations, with about 10 to 15 actors involved. The actors performed actions corresponding to their assigned roles. Some actors were assigned story-related roles such as the criminal or victim of a theft, while others were assigned background roles that do not involve any actions related to the story plots. The story scenarios were shot as naturally as possible. The main story events were enacted by a few actors, while background actions, such as walking or talking in groups of 2-3 people, were performed by the other actors.

3.3 Annotation

We hired professional annotators to create the Narrative dataset based on the shot videos mentioned above. For annotation, the shot videos are split into images extracted at 5 FPS. Our annotators were asked to perform two levels of annotation to provide rich and dense annotations for the Narrative dataset. Details of each of these annotations are explained below.

3.3.1 Semantic Video Graph Annotation. The semantic video graph includes object bounding boxes, object identifiers (IDs), atomic actions, and relationships between objects. Toward constructing a comprehensive video graph, we conduct a 2-step annotation. In the first step, our annotators created bounding boxes of actors and objects with object classes for each image, and assigned atomic action labels to actors. The covered object classes and atomic actions are listed in Tables 1 and 2, respectively. All appearing objects are assigned unique object IDs across cameras within a video set. Note that most small objects, such as store items, are ignored in this annotation. For example, in the store location there were many small store items on the shelves, but most of them are not related to our story plots. Hence, the bounding boxes of store items were annotated only when a person interacted with them (e.g. a person picking up a store item). The relationships between non-person objects are less important in our Narrative dataset, since our target narratives are person-centric events.

After the first step, our annotators proceeded to assign relationship labels to pairs of objects, referring to Table 3. Following MOMA [9], our relationship labels cover three types of relationships: spatial, attention, and interaction. At most one label from each relationship type is given between specific objects in each frame (e.g. assign


"picking" and "at side of" to a pair of a person and a bicycle). In a that explain the events happening in the clip. These subgraphs of
similar manner as atomic actions, the interaction labels are assigned the video graph are used to create the narrative element for the clip.
only to person class. If the actions and interactions in the narrator’s story are similar
to the actions and interactions in the video graph, the annotator is
Table 1: A list of object classes. encouraged to use the elements provided in the video graph. Since
Object class there could also be a case where the video graph is missing some
person bicycle basket elements such as objects, actions, interaction, or attributes that are
baby car bicycle important to the clip, the annotator can add elements based on their
wheel chair suitcase narrative to make the Narrative Element richer. Finally, the anno-
backpack side bag tator has to label if a pair of narrative elements have a Narrative
hand bag book Element Dependency. If NE2 is dependent on NE1, tuple (NE2, NE1)
cell phone laptop will be included in the set of Narrative Element Dependency. An
umbrella wine glass example of the Narrative and clips used to generate the Narrative
bottle white cane Graph is shown in Fig. 2.
POS register shopping basket
Table bench
chair garbage bin 4 NARRATIVE DATASET TASKS
delivery item store item/goods The narrative dataset with its rich and multiple levels of annotation,
can be used for many potential tasks. Some of these tasks and the
methods of evaluation are explained below.
Table 2: A list of atomic actions.
Action label
standing walking
4.1 Semantics Video Graph Generation task
running looking around The task of video semantics generation is to capture the complex
lying sitting event dynamics and overall semantic information from the videos,
crouching sleeping including detection of the objects in the scene (bounding boxes and
labels), atomic actions of persons, and spatio-temporal scene graph
that captures the human-object interactions and human-human
interactions in the scene. As the narrative dataset is annotated with
Table 3: A list of relationships.
this information for long videos having complex events captured
Relationship label from multiple cameras, this dataset can be used for the task of
interact spatial attention semantic video graph generation.
touching at side of speaking to For the purpose of evaluation, the standard metrics for each
opening in front of looking to task can be utilized. For example, IoU (Intersection over Union)
closing behind can be used for evaluating the object detection subtask, while top-
pushing inside 1 accuracy and mAP (mean Average Precision) can be used for
pulling on atomic action recognition and temporal localization, respectively.
holding/carrying The spatio-temporal scene graph generation (<subject-predicate-
picking object> relationships) can be evaluated using Top-K recall as the
putting evaluation metric.
using
reading
chasing 4.2 Narrative Elements Retrieval task
throwing
A long video can have multiple people, various objects and a lot of
interactions between them. Given a particular event class, all the
3.3.2 Narrative Summary & Narrative Graph Annotation. The an- information captured from the video might not be relevant. The
notator is provided with the story class, part of the video corre- goal of narrative elements retrieval is, given an event class, discover
sponding to the story class, and the video graph for the whole long specific parts of the video that are important for that event class,
video or multiple videos. First, the annotator watches the video seg- ie., retrieve the elements that are important for creating the narra-
ment and writes a narrative summary/story of the video. The given tive for that event class. Since the narrative dataset is annotated
video is temporally segmented as per the annotator’s narrative. with the ground truth narrative elements for each complex event
Each sentence of the narrative corresponds to a clip or segment class, the retrieved narrative elements can be compared against the
in the given video and each clip represents a Narrative Element. ground truth narrative elements to evaluate the goodness of the
Two segments can have an overlap in the video. Next, the video narrative elements retrieval task. Precision and Recall can be used
graph of the corresponding clip is checked by the annotator and as the metrics to evaluate this task.
the annotator extracts the relevant subgraphs from the video graph
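Under a simple set-based matching assumption — that retrieved and ground-truth narrative elements can be matched exactly by ID — the precision/recall evaluation could be sketched as below; a real protocol might instead need approximate graph matching.

```python
def precision_recall(retrieved, ground_truth):
    """Set-based precision/recall sketch for narrative element retrieval.

    Assumes narrative elements are matched exactly by ID; both arguments
    are collections of (hypothetical) NE identifiers."""
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    hits = len(retrieved & ground_truth)  # correctly retrieved elements
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Illustrative: 2 of 3 retrieved NEs are correct, 2 of 3 ground-truth NEs found.
p, r = precision_recall({"NE1", "NE2", "NE4"}, {"NE1", "NE2", "NE3"})
print(p, r)  # 2/3 precision, 2/3 recall
```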


Table 4: Details of story classes


Story class Actors’ role Scenario Description
Purchasing cashier, shopper Purchasing by picking up store items, paying the cashier, and leaving the store.
Shoplifting criminal Theft by concealing store items under clothes or in a bag, and leaving the store without paying.
Bicycle theft criminal, victim Theft by taking and riding away a bicycle owned by another person.
Baggage theft criminal, victim Theft by picking up and carrying away baggage owned by another person.
Loitering criminal Loitering by staying more than twice as long as the average duration in a location.
Stalking criminal, victim Stalking by keeping a constant distance from the victim for more than 1 minute in total.
Left baggage criminal Putting down a piece of baggage and leaving without it.

4.3 Narrative Summary Generation task

The Narrative dataset can be utilized for the task of narrative summary generation. Given long videos captured from multiple cameras, the goal is to create a coherent and succinct narrative that describes the event. The outputs are (a) a set of captions that narrate the story (narrative summary) and (b) the frames corresponding to the events being described in the narrative (visual summary). The generated narrative summary and visual summary can be judged based on human evaluation.

4.4 Complex event segmentation task

The Narrative dataset consists of long videos with multiple complex events in each of them. This makes it suitable for the task of complex event segmentation, where, given a long video and an event, the task is to identify where the event happens in the video and temporally segment it. Since each event in the dataset is temporally annotated with its start and end time for each camera, the segmentation result can be evaluated against this ground truth information. Mean average precision can be used as the evaluation metric for temporal segmentation of events.

5 CONCLUSION

The Narrative dataset provides dense and rich annotation capturing the holistic and hierarchical structure of facts, events, and narratives on top of long videos with complex events. This is the first step towards realizing goal-driven narrative generation from long videos. We introduce the concept of the Narrative Graph to enable machine learning models to generate goal-driven narrative summaries from videos. The Narrative dataset is a work in progress, and the following considerations are needed to complete its construction.
• Quality control of the annotated narrative summaries, by assigning multiple annotators to produce unbiased ground truth summaries.
• Construction of training and test splits from the shot videos and their annotations, in which the same actor appears in the same location but plays a role from a different story class.
In the future, we will extend the dataset to include more objects, relationships, and story classes. We expect that our Narrative dataset will become comprehensive enough for training and benchmarking machine learning models for goal-driven narrative generation.

ACKNOWLEDGMENTS

This research is supported by the National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

REFERENCES
[1] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6047–6056.
[2] Michael Gygli, Helmut Grabner, and Luc Van Gool. 2015. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3090–3098.
[3] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. 2020. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10236–10247.
[4] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 317–325.
[5] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706–715.
[6] Yong Jae Lee and Kristen Grauman. 2015. Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114, 1 (2015), 38–55.
[7] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. 2020. Video Storytelling: Textual Summaries for Events. IEEE Trans. Multim. 22, 2 (2020), 554–565.
[8] Yu Liu, Jianlong Fu, Tao Mei, and Chang Wen Chen. 2017. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[9] Zelun Luo, Wanze Xie, Siddharth Kapoor, Yiyun Liang, Michael Cooper, Juan Carlos Niebles, Ehsan Adeli, and Fei-Fei Li. 2021. MOMA: Multi-Object Multi-Actor Activity Parsing. Advances in Neural Information Processing Systems 34 (2021), 17939–17955.
[10] Cesc C Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. Advances in Neural Information Processing Systems 28 (2015).
[11] Danila Potapov, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. 2014. Category-specific video summarization. In European Conference on Computer Vision. Springer, 540–555.
[12] Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka, Shun Ishizaka, Ehsan Adeli, and Juan Carlos Niebles. 2021. Home action genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11184–11193.
[13] Aidean Sharghi, Jacob S Laurel, and Boqing Gong. 2017. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4788–4797.
[14] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. Springer, 510–526.
[15] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5179–5187.
[16] Yongkang Wong, Shaojing Fan, Yangyang Guo, Ziwei Xu, Karen Stephen, Rishabh Sheoran, Anusha Bhamidipati, Vivek Barsopia, Jianquan Liu, and Mohan Kankanhalli. 2022. Compute to Tell the Tale: Goal-Driven Narrative Generation. In ACM Multimedia (to be published).
