NarSUM ’22, October 10, 2022, Lisboa, Portugal Karen Stephen, Rishabh Sheoran, & Satoshi Yamazaki
• We conducted video shooting to create sets of video footage in which 3-4 narratives are acted out based on pre-written story plots. The Narrative dataset is constructed on top of the shot videos.
• Rich and dense annotation covering the appearing actors, objects, and their relationships has been conducted on the Narrative dataset. By introducing a new intermediate data structure, the Narrative Graph, the Narrative dataset richly captures the holistic and hierarchical structure of facts, events, and narratives.
• We discuss the potential research tasks for which the Narrative dataset can be utilized through the multi-level dense annotation it provides. This includes a novel task of narrative element retrieval, which involves generating intermediate metadata that bridges the gap between human event clips and factual narratives.
The rest of the paper is organized as follows: Section 2 provides a literature review of related datasets in vision and language research. Section 3 covers the details of our Narrative dataset, including our proposed intermediate data structure, the Narrative Graph, and how we conducted the data curation and annotation process. In Section 4, we discuss the potential research tasks for which the Narrative dataset can be utilized. Note that the Narrative dataset is a work in progress; we discuss the remaining work needed to complete its construction in Section 5.

2 RELATED WORKS
In this section, we go through the related works and review existing datasets.

2.1 Summarization & Captioning
The task of video summarization focuses on generating a shorter version of a video by aggregating segments that capture its essence [2, 11, 15]. The aim is to select the most varied and representative frames from the video. However, a visual summary created in this way is often not enough for understanding the events and narratives in a long and complex video that may involve multiple people, interactions, and long-range dependencies. Another line of work focuses on visual paragraph generation, which aims to provide detailed descriptions of images and videos. This includes fine-grained dense description of images [4, 5] as well as story-telling of photo streams [8, 10]. However, these are limited to images or short videos. Goal-driven narrative generation, as introduced by the authors of [16], focuses on long videos from single or multiple cameras with more complex event dynamics, while not aiming to describe every detail presented in the videos. Instead, the focus is on extracting the important events and composing a coherent and succinct narrative summary.

2.2 Datasets
Video storytelling dataset: The Video storytelling dataset [7] consists of long YouTube videos (average length 12 minutes 35 seconds) belonging to four classes (birthday, camping, wedding, and Christmas). Each video is labelled with one of the four classes and is annotated with captions in the form of multiple sentences describing the events happening in the video. While the inputs and outputs of the video storytelling dataset seem similar to ours, each of its videos is limited to a single class. Moreover, our dataset consists of long untrimmed videos where each video contains complex events captured from multiple cameras.
Datasets with scene graph annotations: Recent datasets [1, 3, 9, 12] used for the task of action recognition provide annotated atomic actions and scene graphs for each action category. The Action Genome dataset [3] is built on the Charades dataset [14], which contains videos where one person performs various actions, additionally annotated with a spatio-temporal scene graph capturing the relationships between the person and the objects. The Home Action Genome dataset [12] was introduced for the task of action recognition by leveraging multiple modalities and multiple views, including egocentric views. It includes trimmed videos of daily activities performed by a single person in a home setting. The ground truth annotations include video-level activity labels, temporally localized atomic action labels, and spatio-temporal scene graphs with human-object interaction annotations. But these datasets have relatively simple activities that are performed by a single person and span only a few seconds. The Multi-Object Multi-Actor dataset (MOMA) [9] goes further by having videos with multiple actors and multiple objects associated with each activity. It also provides multiple levels of annotation, including video-level activities, sub-activities, atomic actions, and action hyper-graphs (representing the relationships between entities). The videos in these datasets are trimmed in nature, where only a single activity spans the entire duration of the video.
Video summarization datasets: One line of work in video summarization deals with query-focused video summarization, where the aim is to generate summaries that include parts of the video that are important to the video as a whole as well as related to an input query. The dataset [13] used for this task is created from the UT Egocentric dataset [6] by providing additional annotations of concepts, queries, and video summaries for each query. The dataset contains 48 concepts, which include the objects present in the videos. It tries to capture the semantics of the videos by providing dense tags corresponding to each shot of the video, but it is limited to object labels only and does not include any information related to actions or interactions.
Our dataset, on the other hand, is quite different from the existing datasets. We provide untrimmed long videos, captured from multiple cameras, containing complex events such as shoplifting, theft, and loitering. In the videos, these complex events happen while other activities are simultaneously happening in the same scene (such as purchasing a store item, reading a book, or talking), making it a more challenging dataset. The dataset is richly annotated with multiple levels of information. On the lowest level, we provide annotations for object and human bounding boxes, labels, atomic actions of people, and a spatio-temporal scene graph containing human-object and human-human interactions. On the mid-level, we provide the annotations corresponding to the narrative elements associated with each complex event. Finally, at the highest level, annotations corresponding to the temporal segments and narrative story (including summary frames) of each complex event in the video are provided.
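The three levels of annotation just described can be illustrated with a small hypothetical record. This is only a sketch: every field name below is our own invention for illustration, not the dataset's released format.

```python
# Hypothetical sketch of the three annotation levels of the Narrative dataset.
# All field names are illustrative; the actual released schema may differ.
annotation = {
    # Lowest level: per-frame boxes, labels, atomic actions, scene graph.
    "low_level": {
        "bounding_boxes": [
            {"object_id": 1, "label": "person", "box": [120, 40, 210, 330]},
            {"object_id": 2, "label": "bicycle", "box": [300, 150, 420, 320]},
        ],
        "atomic_actions": [{"object_id": 1, "action": "walking"}],
        "scene_graph": [{"subject": 1, "predicate": "at side of", "object": 2}],
    },
    # Mid level: narrative elements tied to a complex event class.
    "mid_level": {
        "event_class": "bicycle theft",
        "narrative_elements": ["victim parks bicycle", "thief steals bicycle"],
    },
    # Highest level: temporal segments and the narrative story.
    "high_level": {
        "temporal_segments": [[0.0, 35.2], [35.2, 90.7]],
        "narrative_story": "A victim parks a bicycle outside; a thief steals it.",
    },
}
```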
Narrative Dataset: Towards Goal-Driven Narrative Generation NarSUM ’22, October 10, 2022, Lisboa, Portugal
3 NARRATIVE DATASET
As a first step towards realizing the goal of goal-driven narrative summary generation, we introduce a new dataset called the "Narrative Dataset" for benchmarking purposes. The Narrative dataset consists of long videos with multiple complex events happening in them, involving multiple people interacting with each other and with various objects. It contains videos shot both indoors and outdoors. Each scenario, called a story class, is captured using multiple cameras placed at different locations, with some scenarios having overlapping camera views. Some sample video frames are shown in Figure 1.
Through the Narrative dataset, we propose a new concept called the Narrative Graph, a new intermediate data structure that can aid in the process of narrative generation. The concept of the narrative graph is explained below.

3.1 Narrative Graph
A Narrative Graph is the representation of a network that provides factual information about the different events of a story in a long video, or in multiple videos, for a given story class. The Narrative Graph can be used to generate a narrative for the given story class and video. The narrative graph corresponding to a story class is made up of two components: the Narrative Elements, which form the nodes of the narrative graph, and the Narrative Element Dependencies, which are the edges between the nodes. Both are defined below.
Narrative Elements: A Narrative Element (NE) comprises the objects, actions, interactions between objects, and attributes from the video that represent the factual information about the events that describe a narrative corresponding to a story class.
Narrative Element Dependency: The narrative elements have temporal and causal dependencies between them that are captured at a more semantic level using Narrative Element Dependencies. A Narrative Element Dependency of NE2 on NE1 means that NE1 must occur before NE2. It is important to include these dependencies, as they describe the order of the events and how one event depends on another. For example, as shown in Fig. 2, consider a case of bicycle theft from the parking area of a department store, where the victim parks the bicycle outside the store and a thief steals it. The video will have segments that cover the footage of the victim parking his bicycle and of the thief stealing it. When narrative elements are pulled out for the event class of bicycle theft, some narrative elements will cover the part where the bicycle is parked by the victim, and after that there will be narrative elements that cover the segment where the thief steals the bicycle and rides away. The narrative elements representing the latter segment will have a dependency on the narrative elements representing the former segment, and these dependencies must be captured to write a coherent story.

3.2 Data Curation
Narrative generation should be able to retrieve video clips of a narrative of interest from long videos. For the evaluation dataset, it is preferable that the set of long videos contains multiple narratives involving causally connected events. Therefore, we decided to conduct our own video shooting to create a semantically rich and dense dataset.
Figure 2: An example of a Narrative Graph generated using the annotator's narrative story, the 'bicycle theft' story class, and the video segment of the story class. Each sentence of the narrative story describes the events in a clip from the video and has a corresponding Narrative Element. The Narrative Element Dependencies between different narrative elements are also shown, maintaining the coherence of the sentences in the story.
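The node-and-edge structure described above can be sketched in code: narrative elements as nodes, (NE2, NE1) dependency tuples as edges, and a topological order as one dependency-respecting sentence order. This is a minimal sketch under our own naming (and assumes Python 3.9+ for `graphlib`), not the dataset's released tooling.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+

@dataclass
class NarrativeElement:
    """A node: factual content (objects, actions, interactions) of one clip."""
    ne_id: str
    objects: list = field(default_factory=list)
    actions: list = field(default_factory=list)

@dataclass
class NarrativeGraph:
    story_class: str
    elements: dict = field(default_factory=dict)    # ne_id -> NarrativeElement
    dependencies: set = field(default_factory=set)  # (NE2, NE1): NE1 before NE2

    def add_element(self, ne: NarrativeElement) -> None:
        self.elements[ne.ne_id] = ne

    def add_dependency(self, ne2: str, ne1: str) -> None:
        """Record that ne1 must occur before ne2, as a (NE2, NE1) tuple."""
        self.dependencies.add((ne2, ne1))

    def coherent_order(self) -> list:
        """An ordering of the elements that respects every dependency."""
        ts = TopologicalSorter({ne_id: set() for ne_id in self.elements})
        for ne2, ne1 in self.dependencies:
            ts.add(ne2, ne1)  # ne1 is a predecessor of ne2
        return list(ts.static_order())

# The bicycle-theft example of Fig. 2:
g = NarrativeGraph("bicycle theft")
g.add_element(NarrativeElement("park", objects=["person", "bicycle"], actions=["parking"]))
g.add_element(NarrativeElement("steal", objects=["person", "bicycle"], actions=["stealing"]))
g.add_dependency("steal", "park")  # stealing depends on parking
order = g.coherent_order()
assert order.index("park") < order.index("steal")
```

Listing the dependency as the tuple (NE2, NE1) mirrors the annotation convention described in Section 3.3.2.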
First, we listed story plots, where each story plot involves multiple events performed over a duration of 2-5 minutes. Then actors and shooting locations were arranged for the plots.

3.2.1 Story Plot. Table 4 lists the story classes used in our Narrative dataset, along with their descriptions and the actors' roles. Each story involves 1-2 actors assigned a role, such as criminal or victim. Based on the scenario descriptions in Table 4, we created story plots with concrete actions. The actions were slightly modified according to the actors' attributes and the shooting locations.

3.2.2 Video Shooting. We ran video shooting with hired actors to curate target videos in which 3-4 of the stories in Table 4 happen. Figure 1 shows sample video frames. There are 4 shooting locations: a store, a bar, a library, and an outdoor area. For each video set, we deployed 3 cameras at one of the shooting locations, with about 10 to 15 actors involved. The actors performed actions corresponding to their assigned roles. Some actors were assigned story-related roles, such as the criminal or victim of a theft, while others were assigned a background role that does not involve any actions related to the story plots. The story scenarios were shot as naturally as possible: the main story events were enacted by a few actors, while actions such as walking or talking in groups of 2-3 people were performed in the background by the other actors.

3.3 Annotation
We hired our own annotators as professionals to create the Narrative dataset based on the shot videos mentioned above. During the annotation, the shot videos are split into images extracted at 5 FPS. Our annotators were asked to perform 2 levels of annotation to provide rich and dense annotations for the Narrative dataset. Details of each of these annotations are explained below.

3.3.1 Semantic Video Graph Annotation. The semantic video graph includes object bounding boxes, object identifiers (IDs), atomic actions, and relationships between objects. Toward constructing a comprehensive video graph, we conduct a 2-step annotation. In the first step, our annotators created bounding boxes for actors and objects with object classes in each image, and assigned atomic action labels to actors. The covered object classes and atomic actions are listed in Tables 1 and 2, respectively. All objects that appear are assigned unique object IDs across cameras in a video set. Note that most small objects, such as store items, are ignored in this annotation. For example, in the store location there were many small store items on the shelves; however, most of them are not related to our story plots. Hence, bounding boxes for store items were annotated only when a person interacts with them (e.g., a person picking up a store item). Relationships between non-person objects are less important in our Narrative dataset, since our target narratives are person-centric events.
After the first step, our annotators proceeded to assign relationship labels to pairs of objects, referring to Table 3. Following MOMA [9], our relationship labels cover 3 types of relationships: spatial, attention, and interaction. At most one label from each relationship type is given between specific objects in each frame (e.g., assigning
"picking" and "at side of" to a pair of a person and a bicycle). In a similar manner to atomic actions, the interaction labels are assigned only to the person class.

Table 1: A list of object classes.
person, bicycle basket, baby car, bicycle, wheel chair, suitcase, backpack, side bag, hand bag, book, cell phone, laptop, umbrella, wine glass, bottle, white cane, POS register, shopping basket, table, bench, chair, garbage bin, delivery item, store item/goods

Table 2: A list of atomic actions.
standing, walking, running, looking around, lying, sitting, crouching, sleeping

Table 3: A list of relationships.
interact: touching, opening, closing, pushing, pulling, holding/carrying, picking, putting, using, reading, chasing, throwing
spatial: at side of, in front of, behind, inside, on
attention: speaking to, looking to

3.3.2 Narrative Summary & Narrative Graph Annotation. The annotator is provided with the story class, the part of the video corresponding to the story class, and the video graph for the whole long video or multiple videos. First, the annotator watches the video segment and writes a narrative summary/story of the video. The given video is temporally segmented as per the annotator's narrative. Each sentence of the narrative corresponds to a clip or segment in the given video, and each clip represents a Narrative Element. Two segments can overlap in the video. Next, the video graph of the corresponding clip is checked by the annotator, and the annotator extracts the relevant subgraphs of the video graph that explain the events happening in the clip. These subgraphs are used to create the narrative element for the clip. If the actions and interactions in the narrator's story are similar to the actions and interactions in the video graph, the annotator is encouraged to use the elements provided in the video graph. Since there could also be cases where the video graph is missing some elements, such as objects, actions, interactions, or attributes that are important to the clip, the annotator can add elements based on their narrative to make the Narrative Element richer. Finally, the annotator has to label whether a pair of narrative elements has a Narrative Element Dependency. If NE2 is dependent on NE1, the tuple (NE2, NE1) is included in the set of Narrative Element Dependencies. An example of the narrative and the clips used to generate the Narrative Graph is shown in Fig. 2.

4 NARRATIVE DATASET TASKS
The Narrative dataset, with its rich, multi-level annotation, can be used for many potential tasks. Some of these tasks and their methods of evaluation are explained below.

4.1 Semantic Video Graph Generation task
The task of semantic video graph generation is to capture the complex event dynamics and overall semantic information from the videos, including detection of the objects in the scene (bounding boxes and labels), atomic actions of persons, and a spatio-temporal scene graph that captures the human-object and human-human interactions in the scene. As the Narrative dataset is annotated with this information for long videos having complex events captured from multiple cameras, it can be used for the task of semantic video graph generation.
For the purpose of evaluation, the standard metrics for each subtask can be utilized. For example, IoU (Intersection over Union) can be used for evaluating the object detection subtask, while top-1 accuracy and mAP (mean Average Precision) can be used for atomic action recognition and temporal localization, respectively. Spatio-temporal scene graph generation (<subject-predicate-object> relationships) can be evaluated using top-K recall as the evaluation metric.

4.2 Narrative Elements Retrieval task
A long video can contain multiple people, various objects, and many interactions between them. Given a particular event class, not all the information captured from the video may be relevant. The goal of narrative elements retrieval is, given an event class, to discover the specific parts of the video that are important for that event class, i.e., to retrieve the elements that are important for creating the narrative for that event class. Since the Narrative dataset is annotated with ground truth narrative elements for each complex event class, the retrieved narrative elements can be compared against the ground truth to evaluate the quality of the retrieval. Precision and recall can be used as the metrics to evaluate this task.
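The metrics named above for object detection and narrative element retrieval can be sketched as follows. This is a minimal illustration of standard IoU and set-based precision/recall, not the paper's official evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(retrieved, ground_truth):
    """Set-based precision and recall for retrieved narrative elements."""
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    hits = len(retrieved & ground_truth)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# A predicted box half-overlapping a ground-truth box,
# and a retrieval finding one of two ground-truth elements.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))                    # ≈ 0.333
print(precision_recall(["park", "walk"], ["park", "steal"]))  # (0.5, 0.5)
```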
4.3 Narrative Summary Generation task
The Narrative dataset can be utilized for the task of narrative summary generation. Given long videos captured from multiple cameras, the goal is to create a coherent and succinct narrative that describes the event. The outputs are (a) a set of captions that narrate the story (narrative summary) and (b) the frames corresponding to the events being described in the narrative (visual summary). The generated narrative summary and visual summary can be judged by human evaluation.

4.4 Complex event segmentation task
The Narrative dataset consists of long videos with multiple complex events in each of them. This makes it suitable for the task of complex event segmentation, where, given a long video and an event, the task is to identify where the event happens in the video and temporally segment it. Since each event in the dataset is temporally annotated with its start and end time for each camera, the segmentation result can be evaluated against this ground truth. Mean average precision can be used as the evaluation metric for temporal segmentation of events.

5 CONCLUSION
The NarSUM dataset provides dense and rich annotation containing the holistic and hierarchical structure of facts, events, and narratives on top of long videos with complex events. This is a first step towards realizing goal-driven narrative generation from long videos. We introduce the concept of the narrative graph to enable machine learning models to generate goal-driven narrative summaries from videos. The Narrative dataset is a work in progress, and the following considerations remain to complete its construction:
• Quality control of the annotated narrative summaries, by assigning multiple annotators to produce unbiased ground truth summaries.
• Training and test data construction utilizing the shot videos and their annotations, where the same actor appears in the same location but plays a role from a different story class.
In the future, we will extend the dataset to include more objects, relationships, and story classes. We expect our Narrative dataset to become comprehensive enough for training and benchmarking machine learning models for goal-driven narrative generation.

ACKNOWLEDGMENTS
This research is supported by the National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

REFERENCES
[1] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6047-6056.
[2] Michael Gygli, Helmut Grabner, and Luc Van Gool. 2015. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3090-3098.
[3] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. 2020. Action Genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10236-10247.
[4] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 317-325.
[5] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706-715.
[6] Yong Jae Lee and Kristen Grauman. 2015. Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114, 1 (2015), 38-55.
[7] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. 2020. Video Storytelling: Textual Summaries for Events. IEEE Trans. Multim. 22, 2 (2020), 554-565.
[8] Yu Liu, Jianlong Fu, Tao Mei, and Chang Wen Chen. 2017. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[9] Zelun Luo, Wanze Xie, Siddharth Kapoor, Yiyun Liang, Michael Cooper, Juan Carlos Niebles, Ehsan Adeli, and Fei-Fei Li. 2021. MOMA: Multi-Object Multi-Actor Activity Parsing. Advances in Neural Information Processing Systems 34 (2021), 17939-17955.
[10] Cesc C. Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. Advances in Neural Information Processing Systems 28 (2015).
[11] Danila Potapov, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. 2014. Category-specific video summarization. In European Conference on Computer Vision. Springer, 540-555.
[12] Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka, Shun Ishizaka, Ehsan Adeli, and Juan Carlos Niebles. 2021. Home Action Genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11184-11193.
[13] Aidean Sharghi, Jacob S. Laurel, and Boqing Gong. 2017. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4788-4797.
[14] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in Homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. Springer, 510-526.
[15] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5179-5187.
[16] Yongkang Wong, Shaojing Fan, Yangyang Guo, Ziwei Xu, Karen Stephen, Rishabh Sheoran, Anusha Bhamidipati, Vivek Barsopia, Jianquan Liu, and Mohan Kankanhalli. 2022. Compute to Tell the Tale: Goal-Driven Narrative Generation. In ACM Multimedia (to be published).