for the text summarization task by letting the model pay more attention to words that are spoken with a certain tone or level of emphasis. Through our experiments, we show that not all modalities contribute equally to the output: text contributes the most, followed by video and then audio. This formed the motivation for our MAST model, which places higher importance on the text input while generating the output summary. MAST produces a more illustrative summary of the original text (see Table 1) and achieves state-of-the-art results.

In summary, our primary contributions are:

• Introduction of the audio modality for abstractive multimodal text summarization.

• Examination of the challenges of utilizing audio information and of its contribution to the generated summary.

• Proposition of a novel state-of-the-art model, MAST, for the task of multimodal abstractive text summarization.

2 Methodology

In this section we describe (1) the dataset used, (2) the modalities, and (3) our MAST model's architecture.¹

¹https://github.com/amankhullar/mast

2.2 Modalities

We use the following three inputs, corresponding to the three modalities:

• Audio: We concatenate 40-dimensional Kaldi (Povey et al., 2011) filter bank features, computed from 16 kHz raw audio using a 25 ms time window with a 10 ms frame shift, with the 3-dimensional pitch features extracted from the dataset, obtaining a final sequence of 43-dimensional audio features (a sketch of this assembly follows the list).

• Text: We use the transcript corresponding to each video. All texts are normalized and lower-cased.

• Video: We use a 2048-dimensional feature vector per group of 16 frames, extracted from the videos using a ResNeXt-101 3D CNN trained to recognize 400 different actions (Hara et al., 2018). This yields a sequence of feature vectors per video.
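As an illustration of the audio pipeline above, the following minimal sketch assembles the 43-dimensional features from a 16 kHz waveform. The helper name and the use of torchaudio's Kaldi-compatible filter bank computation are our assumptions (the paper does not specify the tooling beyond Kaldi); the 3-dimensional pitch features are treated as precomputed, as they are provided with the dataset.

# Hypothetical helper (not from the MAST repo): builds the 43-dim audio
# features by concatenating 40-dim Kaldi-style filter banks with the
# 3-dim pitch features that ship precomputed with the dataset.
import numpy as np
import torchaudio

def build_audio_features(wav_path, pitch):
    """pitch: (num_frames, 3) precomputed pitch features from the dataset."""
    waveform, sr = torchaudio.load(wav_path)  # (channels, num_samples)
    assert sr == 16000, "features are defined over 16 kHz audio"
    # 40-dim filter banks over a 25 ms window with a 10 ms frame shift.
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=40,
        frame_length=25.0,
        frame_shift=10.0,
        sample_frequency=16000.0,
    ).numpy()                                 # (num_frames, 40)
    # Frame counts can differ by a frame or two; truncate to align.
    n = min(len(fbank), len(pitch))
    return np.concatenate([fbank[:n], pitch[:n]], axis=1)  # (n, 43)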
2.3 Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Figure 1 shows the architecture of our Multimodal Abstractive Summarization with Trimodal Hierarchical Attention (MAST) model. The attention distribution over the encoder states of each modality is computed independently, as in Bahdanau et al. (2014):

    e_{ij}^{(k)} = v_a^{(k)\top} \tanh\big( W_a^{(k)} s_i + U_a^{(k)} h_j^{(k)} + b_{att}^{(k)} \big)    (1)

    \alpha_{ij}^{(k)} = \mathrm{softmax}\big( e_{ij}^{(k)} \big)    (2)

    c_i^{(k)} = \sum_{j=1}^{N_k} \alpha_{ij}^{(k)} h_j^{(k)},  \quad k \in \{\text{audio}, \text{text}, \text{video}\}    (3)

where s_i is the decoder state at timestep i, h_j^{(k)} is the j-th encoder hidden state of modality k, and N_k is the number of encoder states for modality k.
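As a minimal sketch, the per-modality attention of equations (1)-(3) can be written in PyTorch as below. The class and argument names follow the equations but are our own, not identifiers from the released MAST code.

# Additive (Bahdanau) attention, computed independently for each modality k,
# implementing equations (1)-(3). Names are illustrative, not from the repo.
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Attention over the encoder states h^(k) of a single modality."""
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, att_dim, bias=False)  # W_a^(k)
        self.U_a = nn.Linear(enc_dim, att_dim, bias=False)  # U_a^(k)
        self.b_att = nn.Parameter(torch.zeros(att_dim))     # b_att^(k)
        self.v_a = nn.Linear(att_dim, 1, bias=False)        # v_a^(k)

    def forward(self, s_i, h):
        # s_i: (batch, dec_dim) decoder state at step i
        # h:   (batch, N_k, enc_dim) encoder states of modality k
        e = self.v_a(torch.tanh(
            self.W_a(s_i).unsqueeze(1) + self.U_a(h) + self.b_att
        )).squeeze(-1)                                   # eq. (1): (batch, N_k)
        alpha = torch.softmax(e, dim=-1)                 # eq. (2)
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # eq. (3): (batch, enc_dim)
        return c, alpha

# One instance per modality, applied independently at every decoder step:
# attend = {k: ModalityAttention(512, 512, 256) for k in ("audio", "text", "video")}
# contexts = {k: attend[k](s_i, enc_states[k])[0] for k in attend}

In the full model, these per-modality context vectors are further combined by the trimodal hierarchical attention layer (visualized in Figure 4), which this sketch does not cover.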
Figure 4: Visualization of attention weights in the Trimodal Hierarchical Attention layer for a sample video in the
test set. Figures 4a to 4c show the varying attention distribution on different combinations of modalities across
the decoder timesteps. Figures 4d to 4f show the attention distribution on the encoder timesteps for each modality
across the decoder timesteps. This shows the usefulness of each modality for the generation of the summary.