
Multimedia Tools and Applications

https://doi.org/10.1007/s11042-022-12812-4

Tech-Talk-Sum: fine-tuning extractive summarization
and enhancing BERT text contextualization
for technological talk videos

Chalothon Chootong1 · Timothy K. Shih2

Received: 21 June 2020 / Revised: 23 February 2022 / Accepted: 9 March 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Automatic summarization is the task of condensing data into a shorter version while preserving
the key informational components and the meaning of the content. In this paper, we introduce Tech-
Talk-Sum, which combines BERT (Bidirectional Encoder Representations from Transformers)
and an attention mechanism to summarize technological talk videos. We first introduce the
technology talk datasets constructed from YouTube, including short- and long-talk videos.
Second, we explored various sentence representations from BERT's output; using the top hidden
layer to represent sentences proved to be the best choice for our datasets. The outputs from BERT
were fed to a Bi-LSTM network to build local context vectors. In addition, we built a document
encoder layer that leverages BERT and the self-attention mechanism to express the semantics of a
video caption and to form the global context vector. Third, a unidirectional LSTM was added to
bridge the local and global sentence contexts and predict each sentence's salience score. Finally,
the video summaries were generated based on these scores. We trained a single unified model on
the long-talk video datasets. ROUGE was utilized to evaluate our proposed methods. The
experimental results demonstrate that our model generalizes well, outperforming the baselines and
achieving results on par with the state of the art for both long and short videos.

Keywords Video summary · Spoken summarization · BERT · Technological talk · Attention mechanism

 Chalothon Chootong
chootong.c@ku.th

Timothy K. Shih
timothykshih@gmail.com

1 Faculty of Science at Sriracha, Kasetsart University, Chonburi, Thailand


2 Department of Computer Science and Information Engineering, National Central University,
Taoyuan, Taiwan

1 Introduction

In general, automatic text summarization is categorized into two main approaches: abstraction
and extraction [25]. Summaries generated by abstractive summarization are closer to human-produced
summaries. Meanwhile, extractive summarization extracts the key parts of the document that are
deemed interesting and necessary, and joins them to build a summary; the summaries are produced
by selecting the highlighted sentences from the original text. Recently, deep learning methods have
gradually become a hot topic in natural language processing (NLP), including automated text
summarization. Deep learning methods are powerful in constructing summarization models without
manual intervention or labor [3, 23, 26, 27]. However, most previous works focused on text
documents, and only a few studied speech summarization [9, 14]. Thus, it is still a challenging task.
Video is a popular medium for sharing information that people can use for entertainment,
education, and learning new knowledge. At present, various platforms such as YouTube, Facebook,
and Instagram allow people to share videos. YouTube is a prominent online video-sharing platform
that provides options for users to upload, view, and share videos, and to manage playlists. It offers
a wide variety of user-created and corporate media videos of both short and long duration,
including video clips, music videos, movies, live streams, and educational videos. Among the
several categories of videos available on YouTube, the science and technology genre contains many
of the most informative videos. Technological talk videos aim to showcase science and technological
innovations, and their content involves both software and hardware. They usually discuss
technologies, devices, gadgets, machines, or computer algorithms. Apart from that, various
organizations publish and freely distribute online talks, which help people improve and
gain new knowledge. Video metadata consist of the video ID, title, caption, description, tags,
and thumbnails. Most of these video documents, such as titles and descriptions, are created
manually. We found that numerous videos still lack a description; the description not only
summarizes the video content but also plays a vital role in improving the performance of video
retrieval systems. Without a description, it can be time-consuming for users to find relevant videos.
We believe that automatic summarization can be a potential tool to distil the key points of video
content and generate the video's description, overcoming the issue mentioned above.
In this paper, we construct a technological talk video dataset including short- and
long-length videos. To summarize video content accurately and efficiently, the summary
model needs to comprehend subtitles and extract relevant information from the given video
caption. Consequently, we propose an extractive summarization method called Tech-Talk-
Sum to summarize technological talk videos. For video content representation, we apply
contextualized word embeddings from BERT (Bidirectional Encoder Representations from
Transformers) [11] to overcome the limitations of training on a small amount of data.
Outputs from BERT are learned by a Bi-LSTM network to generate the local context
representation of sentences. Besides, we create a document encoder layer based on the
output of the Bi-LSTM and the self-attention mechanism for the global context representation
of sentences. The extractive model is trained as a single unified model, which combines three
technology talk channels from YouTube: the TED, Microsoft Research, and TensorFlow
channels. To demonstrate the model performance, we conducted experiments on both long-
and short-length videos to examine the performance of Tech-Talk-Sum. The experimental
results showed that our model outperforms baselines and state-of-the-art methods on the
average ROUGE scores.
Moreover, a significant difference of Tech-Talk-Sum is that it generates summaries based on
sentence scoring, unlike previous works that produced summaries based on sentence classification.
The rest of our paper is organized as follows: Section 2 describes the background and
related works. The proposed method is presented in Section 3, and Section 4 explains the
experimental setup, baseline models and evaluation methods. The experimental results anal-
ysis of the datasets is discussed in Section 5, while Section 6 concludes the study and
presents future directions to improve the current work.

2 Background and related works

2.1 Automatic text summarization methods

Extractive summarization produces a summary by distilling the most crucial sentences
from the source text. In neural models, it is often treated as a sentence classification problem.
Ramesh Nallapati et al. [17] presented a model named SummaRuNNer that applied a Recurrent
Neural Network (RNN)-based sequence model to summarize documents. In [18], the authors
classified sentences by utilizing an RNN to obtain vector representations of sentences and articles.
Their model combined the maximum likelihood of the cross-entropy loss with rewards from policy
gradient reinforcement learning to optimize the evaluation metric for summarization tasks.
Xingxing Zhang et al. [20] proposed an extractive model in which sentences were viewed as latent
variables, and sentences with activated variables were used to infer gold summaries. In abstractive
summarization, the summaries are generated with new words that do not exist in the source text.
Neural approaches consider abstractive summarization as a sequence-to-sequence problem: an
encoder maps the sequence of tokens in the source text to a sequence of representations, and a
decoder generates the target summary [4, 15]. However, this approach is still complicated and
poses many challenges, and the generated summaries are often not satisfactory or readable.
In the past few years, the attention mechanism has been applied to enhance the performance
of automatic summarization. Aishwarya et al. [1] presented SWAP-NET (Sentences and Words
from Alternating Pointer Networks) for extractive summarization. Their model used an
encoder-decoder with an attention mechanism to select the critical sentences. Kamal et al. [13]
studied a hierarchical self-attention-based structure to identify sentence-summary membership.
Fu Zhao et al. proposed WPABS [10], which performs document summarization using word and
part-of-speech embeddings based on an attention mechanism. These previous studies evidence
that the combination of deep learning methods and attention mechanisms has been extended to
various text summarization tasks.

2.2 Speech document summarization

Speech document summarization attempts to select representative sentences while maintaining
the core context of the source data. Chun-I Tsai et al. [9] produced an extractive speech
summarization model that integrates two convolutional neural networks (CNNs) with a multilayer
perceptron (MLP) for the sentence selection task. Documents and sentences are encoded by the
paired CNNs, and then the CNN outputs and document-sentence similarity scores are utilized to
induce a ranking score for each sentence. Kuan-Yu Chen [14] introduced a novel recurrent neural
network language modelling (RNNLM) framework for extractive broadcast news summarization.
In [28], the authors worked on film and documentary summarization based on subtitles and
scripts. They employed language models to produce the summary, and their experimental results
showed that the LexRank method outperformed the other language models.

2.3 Bidirectional encoder representations from transformers (BERT) model

In the last few years, new deep learning models have emerged in the field of Natural
Language Processing (NLP) to improve the ability of machines to understand language.
The most well-known of these models is BERT [11]. BERT is built on the transformer
architecture and unsupervised pre-training, using multiple layers of attention
with multiple attention “heads” in every layer. BERT was first trained on two
unsupervised tasks: masked language modeling (predicting a missing word in a sentence)
and next sentence prediction (predicting whether one sentence naturally follows another). The core
component of BERT is attention, which is a way for a model to assign a weight to input
features based on their importance.
Many researchers have realized the benefits of BERT and have tried to enhance its performance
for NLP tasks. For instance, in [29], the authors proposed ALBERT for language representation,
which decreases the training cost and memory consumption and increases BERT's
training speed. The Facebook AI team proposed RoBERTa [24], which included a careful
evaluation of the effects of hyper-parameter tuning and training-set size. Furthermore, BERT
has been applied to various NLP tasks such as sentiment analysis, text classification,
question answering, and language translation. Yang Liu et al. [22] studied how to usefully
apply BERT in text summarization. They proposed a general framework for both extractive
and abstractive text summarization, and found that BERT is able to encode a document
and obtain representations for its sentences by employing pre-trained language models.
Yang Liu [21] also designed different variants of BERT for extractive summarization,
in which BERT was combined with a simple classifier, a recurrent neural network,
and a transformer. Qicai Wang et al. [16] applied BERT and reinforcement
learning to abstractive summarization by taking advantage of the rich semantic features of
BERT word embeddings. Their model selects the critical sentences from the source input,
and these sentences are then rewritten into a shorter version that still conveys the main
meaning.

3 Methods

In this paper, we introduce Tech-Talk-Sum, a novel extractive summarization method for
technological talk videos, as presented in Fig. 1. The Tech-Talk-Sum model is composed of
three main components: Sentence Encoder, Document Encoder, and Sentence Score Prediction.
First, BERT is utilized to encode word tokens and produce the sentence representation
(T). Next, a Bi-LSTM is applied to the outputs from BERT. The latent features from the Bi-LSTM
are fed to the attention sub-layer to compute the attention weight for each sentence
and construct the document vector. In general, the Bi-LSTM can summarize the information of
the document from both directions. The local context feature vector (h) and the global context
feature vector (D) are bridged and fed into a unidirectional LSTM. Finally, a salience score is
computed and assigned to each sentence to indicate how crucial it is for representing the video
content.

Fig. 1 The architecture of the Tech-Talk-Sum model. The model comprises three components: sentence
encoder, document encoder, and sentence score prediction. ReLU is utilized at the last layer to predict the
sentence salience score

3.1 Sentence encoder

To create the sentence representation, we leverage contextualized word embeddings from
BERT, which supports transfer learning. In general, an embedding represents a word as a vector
in a low-dimensional space. Word embeddings can capture not only the semantic meanings
but also the contextual meanings of words. BERT has been applied very successfully to
sentence representation in text classification [6, 7] and summarization models [16, 21, 22].
When adapting BERT for extractive summarization, the expected output should be a
representation of each sentence. However, BERT is trained with a masked language model
objective, so its output vectors correspond to tokens instead of sentences. To solve this issue, we
revise the input sequence and embeddings of BERT to make it suitable for extractive
summarization, as introduced in [21].
Video captions represent a sequence of sentences describing the video content. To represent
individual sentences, we insert the special [CLS] token at the beginning of each
sentence, and each [CLS] symbol collects the features of its sentence. We utilize the WordPiece
tokenizer from BERT for the input sequence, convert each token into a token embedding,
and add its position embedding. The position embeddings use sine and cosine functions of
different frequencies, as introduced in [5], computed with (1) and (2):

PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)    (1)

PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)    (2)


where pos is the word's numerical position index in the sentence and i indexes the dimensions
of the embedding. In this work, we use BERT-base uncased with a hidden size of 768. The BERT
model has a maximum sequence length of 512; thus, pos ∈ [0, 511] and i ∈ [0, 767]. The position
embeddings can also be learned like other deep learning parameters.
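As a concrete illustration, the sketch below computes the sinusoidal position embeddings of (1)-(2)
for the dimensions used in this work (maximum length 512, hidden size 768). It is a minimal NumPy
sketch of the standard formulation in [5], not the authors' code; the function name is ours.

```python
import numpy as np

def sinusoid_position_embeddings(max_len=512, d_model=768):
    """Position embeddings of Eqs. (1)-(2): sine on even dims, cosine on odd dims."""
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angle = pos / np.power(10000.0, (2 * i) / d_model)     # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                            # dimensions 2i
    pe[:, 1::2] = np.cos(angle)                            # dimensions 2i+1
    return pe

pe = sinusoid_position_embeddings()                        # (512, 768), added to the token embeddings
```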

The word embedding x_pos is summed with PE_pos, and the result is input to the BERT model.
Relative-position attention is used to generate the hidden features in each encoder layer of BERT,
as shown in (3)-(5):

z_i = \sum_{j=1}^{n} \alpha_{ij} \left( x_j W^V + a_{ij}^V \right),    (3)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})},    (4)

e_{ij} = \frac{x_i W^Q \left( x_j W^K + a_{ij}^K \right)^T}{\sqrt{d_z}},    (5)

where z_i is the hidden feature of the word embedding x_i, a_{ij}^V and a_{ij}^K are learnable
parameters between positions i and j, \alpha_{ij} is the attention value, d_z is the number of
dimensions, and W^V, W^Q, W^K are parameter matrices to be learned.
The BERT-base model stacks 12 transformer encoder layers, and the per-token output of each
layer can be utilized as a word embedding. In our experiments, we found that the output from
the top layer of BERT performs best on our datasets. Therefore, the vector T_i of the i-th [CLS]
token from the top layer is used as the representation of sent_i.
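The paper does not name a specific BERT implementation; assuming the Hugging Face transformers
package as a stand-in, a per-sentence version of this encoding step could look as follows. Note
that the paper inserts a [CLS] token before every sentence of the full caption sequence; encoding
sentences one at a time, as done here, is a simplification for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def sentence_vectors(sentences, max_length=128):
    """One 768-d vector per caption sentence, taken from the [CLS] position
    of BERT's top (last) encoder layer."""
    vectors = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt",
                           truncation=True, max_length=max_length)
        with torch.no_grad():
            outputs = bert(**inputs)
        # last_hidden_state: (1, seq_len, 768); index 0 is the [CLS] token -> T_i
        vectors.append(outputs.last_hidden_state[0, 0])
    return torch.stack(vectors)               # (num_sentences, 768)
```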

3.2 Document encoder

Once we obtain the sentence representations from the BERT encoder, we summarize the
information of the document from both directions by applying a Bi-LSTM to the sentence vectors
T, as formulated in (6)-(7):

\overrightarrow{h_t} = \overrightarrow{LSTM}\left( T_t, \overrightarrow{h}_{t-1} \right)    (6)

\overleftarrow{h_t} = \overleftarrow{LSTM}\left( T_t, \overleftarrow{h}_{t-1} \right),    (7)

where T_t is the sentence vector of the t-th sentence at time step t.
The forward state \overrightarrow{h_t} and the backward state \overleftarrow{h_t} are then
concatenated to obtain the vector h_t, which summarizes the information of the t-th sentence and
its context. Let N denote the number of sentences in the caption document, so that the whole
sequence of Bi-LSTM hidden states can be represented as in (8), where H_D \in R^{N \times 2d_n}
and d_n is the number of hidden nodes of the Bi-LSTM:

H_D = (h_1, h_2, ..., h_N).    (8)
Next, we assign a weight to each sentence in the caption document according to its contribution.
The representation of the whole caption is modeled as a weighted sum of the concatenated hidden
states of the Bi-LSTM via a self-attention mechanism [5]. The concatenated hidden states H_D are
taken as input to produce a vector of weights a_D, a_D \in R^{1 \times N}, calculated as in (9).
The document vector is then obtained as the sum of the Bi-LSTM hidden states weighted by a_D,
as shown in (10):

a_D = softmax\left( W_2 \tanh\left( W_1 H_D^T + b \right) \right),    (9)

D = a_D H_D,    (10)

where softmax(.) is the function used to normalize the attention weights, W_1, W_2, and b are
learnable parameters, W_1 \in R^{k \times 2d_n}, W_2 \in R^{1 \times k}, d_n is the number of
hidden nodes of the LSTM, k is a hyperparameter that can be set arbitrarily, and b is a bias term.
tanh(.) is the activation function, and D represents the document vector, D \in R^{1 \times 2d_n}.
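To make the shapes in (9)-(10) concrete, the following NumPy sketch computes the attention
weights a_D and the document vector D from the Bi-LSTM hidden states. The weight matrices are
randomly initialized purely for illustration, and k = 64 is an assumed value, since the paper does
not report it.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def document_encoder(H_D, W1, W2, b):
    """Self-attentive document encoder of Eqs. (9)-(10).

    H_D : (N, 2*d_n) concatenated Bi-LSTM hidden states, one row per sentence
    W1  : (k, 2*d_n), W2 : (1, k), b : (k, 1)
    """
    a_D = softmax(W2 @ np.tanh(W1 @ H_D.T + b), axis=-1)   # Eq. (9), shape (1, N)
    D = a_D @ H_D                                          # Eq. (10), shape (1, 2*d_n)
    return a_D, D

# toy usage with the concatenated hidden size used in the paper (2*d_n = 256)
rng = np.random.default_rng(0)
N, two_dn, k = 120, 256, 64
a_D, D = document_encoder(rng.normal(size=(N, two_dn)),
                          rng.normal(size=(k, two_dn)),
                          rng.normal(size=(1, k)),
                          rng.normal(size=(k, 1)))
```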

3.3 Sentence score prediction

Again, each sentence is considered as part of a sequence and assigned a score. The sentence and
document vectors are bridged and input to a unidirectional LSTM built with 64 hidden units. The
outputs of the LSTM are then fed into fully connected layers and squashed with the ReLU function
to predict the score, as shown in (11)-(12):

L_i = LSTM\left( \left( W_c h_i + h_i^T W_s D \right), h_{i-1} \right),    (11)

\hat{s}_i = ReLU\left( W_l L_i + b \right),    (12)

where L_i is the latent feature of the i-th sentence, h_i is the hidden state of the Bi-LSTM at the
i-th time step, D is the document vector, \hat{s}_i is the predicted score indicating the saliency
of the i-th sentence, W_c, W_s, and W_l are learnable parameters whose dimensions depend on the
number of hidden units, and b is a bias term of the scoring network. W_c h_i captures the
information content of the i-th sentence, and h_i^T W_s D denotes the salience of the sentence
with respect to the video content. h_i \in R^{1 \times 2d_n}, h_i^T \in R^{2d_n \times 1}, and
D \in R^{1 \times 2d_n}.
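A minimal tf.keras sketch of this scoring component is shown below, under the assumption that the
local context vectors h and the document vector D are supplied as inputs; the layer names and the
feature dimension are illustrative rather than the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_score_predictor(feat_dim=256, lstm_units=64):
    """Sentence score prediction of Eqs. (11)-(12)."""
    h = layers.Input(shape=(None, feat_dim), name="local_context")   # Bi-LSTM states h_i
    D = layers.Input(shape=(feat_dim,), name="document_vector")      # global vector D

    content = layers.Dense(feat_dim, use_bias=False, name="W_c")(h)  # W_c h_i
    WsD = layers.Dense(feat_dim, use_bias=False, name="W_s")(D)      # W_s D
    # h_i^T (W_s D): one salience scalar per sentence, shape (batch, N, 1)
    salience = layers.Lambda(
        lambda t: tf.matmul(t[0], tf.expand_dims(t[1], -1)), name="salience")([h, WsD])
    # bridge the two terms (the scalar is broadcast over the feature axis)
    bridged = layers.Lambda(lambda t: t[0] + t[1], name="bridge")([content, salience])

    # unidirectional LSTM over the sentence sequence, then a ReLU scoring head
    L = layers.LSTM(lstm_units, return_sequences=True, name="uni_lstm")(bridged)
    s_hat = layers.TimeDistributed(layers.Dense(1, activation="relu"), name="score")(L)
    return tf.keras.Model(inputs=[h, D], outputs=s_hat)
```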
The sentences are ranked according to their scores, and the most salient sentences are selected
to generate the summary. To avoid redundancy, we select only sentences that are not similar to
those already chosen. Two sentences are considered similar if they share 70% of their words,
excluding stop words. If the similarity between the i-th sentence in the candidate set and every
sentence in the summary set is less than 70%, the i-th sentence is added to the summary set. This
process is repeated until the length of the summary reaches M. In our experiments, M is set to 120.
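The selection step can be sketched as a greedy loop over the ranked sentences. The overlap measure
below (shared non-stop-words normalized by the smaller sentence) is one plausible reading of the
70% criterion, since the paper does not spell out the exact formula, and M is interpreted here as a
word budget; both are assumptions.

```python
def word_overlap(s1, s2, stopwords):
    """Fraction of shared non-stop-words, relative to the shorter sentence."""
    w1 = {w for w in s1.lower().split() if w not in stopwords}
    w2 = {w for w in s2.lower().split() if w not in stopwords}
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / min(len(w1), len(w2))

def select_summary(sentences, scores, stopwords, max_words=120, threshold=0.70):
    """Rank sentences by predicted salience score and greedily pick
    non-redundant ones until the summary reaches max_words."""
    summary, length = [], 0
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        cand = sentences[idx]
        if any(word_overlap(cand, chosen, stopwords) >= threshold for chosen in summary):
            continue                      # too similar to an already selected sentence
        summary.append(cand)
        length += len(cand.split())
        if length >= max_words:
            break
    return summary
```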

4 Experimental setup

In this section, we first explain the dataset preparation. Afterwards, we provide the details
of the four baseline methods and the evaluation metrics, followed by the details of the training
process.

4.1 Technological talk video dataset

Using the YouTube Data API, we retrieve video information by sending a request token to the
Google Authentication Server (AS) via the OAuth 2.0 protocol (https://tools.ietf.org/html/rfc6749).
After obtaining OAuth 2.0 credentials, an access token is provided to access the YouTube Data API;
a client ID and client secret are generated for our application. To extract YouTube information,
the channel ID is used as a parameter to obtain the video playlists, each of which contains a list
of videos. We then obtain the video IDs and pass them to the downloading method provided by the
YouTube API to retrieve the corresponding captions and descriptions. The caption is an essential
resource for representing video content, and the descriptions can be used as ground truth for
training the model. For training, we consider only educational videos that provide both a caption
and a useful description; these pieces of information were provided by the author who published
the video. All the steps of extracting video captions are illustrated in Fig. 2. To convert a SubRip
Subtitle (SRT) file, we created a function that removes the timecodes, removes the tags, deletes
useless characters, and transforms abbreviated words. Finally, all captions and their corresponding
descriptions are saved as text files. The full dataset is available at
https://github.com/chalothon/Tech-Talk Sum dataset.
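The exact cleaning function is not listed in the paper; a minimal sketch of the SRT-to-text
conversion it describes (dropping cue numbers, timecodes, and markup tags) could look as follows,
with the function name and regular expressions being illustrative rather than the authors' code.

```python
import re

def srt_to_text(srt_string):
    """Strip SubRip (SRT) markup, keeping only the caption text."""
    lines = []
    for line in srt_string.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue                                   # blank lines and cue numbers
        if re.match(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->", line):
            continue                                   # timecode lines
        line = re.sub(r"<[^>]+>", "", line)            # HTML-style formatting tags
        lines.append(line)
    return " ".join(lines)
```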
During our experiments, we examine six YouTube channels related to technology as
listed below, and then group them into long- and short-length videos based on their duration.
– TED channel: videos on this channel comprise the best talks and performances from the
TED Conferences. Their videos are organized into various categories, which include
technology and science.
– Microsoft Research channel: this channel shares the talk videos from researchers, sci-
entists, and engineers. They present both business and engineering topics involving
machine learning, artificial intelligence, and cloud computing.
– TensorFlow channel: this channel publishes the lecture videos related to deep learning
knowledge and machine-learning methods implementation with TensorFlow.
– DeepLearningTV channel: this channel features topics such as reviews of software
libraries and applications, showcasing the intuition behind deep learning methods.
– Two Minute Papers channel: this channel provides research paper summaries in various
fields, publishing two new science videos every week.
– Tech Insider channel: this channel produces up-to-date technology videos that present
the details of new technologies, gadgets, machines, etc.

4.2 Baselines

To further illustrate the performance of Tech-Talk-Sum across the six datasets, we compared
it with well-known baseline methods. The first baseline is the leading-sentences method
(LEAD-3), in which the summary is produced by selecting the first three sentences of the
document. The second baseline is the TF-IDF score: this method uses TF-IDF scores to
indicate the importance of each sentence; all sentences are ranked by score, and the top N
sentences form the summary. The third baseline is TextRank, where GloVe word embeddings [12]
are used to represent sentences and a similarity matrix is computed between sentences. The matrix
is then converted into a graph, and the PageRank algorithm [2] is applied to the graph to obtain
sentence rankings; finally, the summary is produced from the top N sentences. The last baseline is
DistilBERT [19], a distilled version of BERT, which we applied to produce the summary.
DistilBERT does not use token type ids and does not consider the input positions.
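For reference, the TF-IDF baseline can be approximated with scikit-learn by summing each
sentence's term weights; the exact scoring used by the authors is not detailed, so this is an
assumed variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_baseline(sentences, top_n=3):
    """Score each caption sentence by the sum of its TF-IDF term weights
    and return the top-N sentences as the summary."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1              # one score per sentence
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]
```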

4.3 Experimental metrics

To evaluate the proposed model, we utilized the Pyrouge package with its evaluation scripts.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [8] is a standard evaluation metric
for automatic summarization and machine translation systems. It works by comparing an
automatically produced summary against a set of reference summaries. The first metric is
ROUGE-N, which measures the N-gram overlap between the system and reference summaries;
ROUGE-1 (unigram) and ROUGE-2 (bigram) are the most widely used measures. Another is
ROUGE-L, which evaluates the generated summary based on the longest common subsequence.
This metric naturally considers sentence-level structural similarity and automatically identifies
the longest co-occurring in-sequence n-grams.
Besides, Precision and Recall in the context of ROUGE are other evaluation measures that we
used. The Precision score measures how much of the system-created summary is relevant or
needed, and Recall in the context of ROUGE measures how much of the n-grams in the reference
summary are recovered or captured in the generated summary. Moreover, to report our model
performance, we computed the F_N score using (13), based on ROUGE-1, ROUGE-2, and
ROUGE-L, to balance the precision score calculated by (14) and the recall score calculated by (15).

Fig. 2 The overview of the dataset preparation

F_N\,score = 2 \times \frac{ROUGE\text{-}N_{Precision} \times ROUGE\text{-}N_{Recall}}{ROUGE\text{-}N_{Precision} + ROUGE\text{-}N_{Recall}}    (13)

ROUGE\text{-}N_{Precision} = \frac{\text{number of overlapping N words}}{\text{total words in the predicted summary}}    (14)

ROUGE\text{-}N_{Recall} = \frac{\text{number of overlapping N words}}{\text{total words in the reference summary}}    (15)
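The quantities in (13)-(15) can be illustrated with a simple n-gram counter; the experiments
themselves use the Pyrouge package, so the sketch below is only a simplified re-implementation
for clarity.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F-score from n-gram overlap (Eqs. 13-15)."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())           # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_score
```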

4.4 Training details

To build Tech-Talk-Sum, we used pre-trained BERT to represent the sentences of the caption
documents. There are two types of BERT: case-sensitive and case-insensitive. In our experiments,
we chose the case-insensitive BERT with 12 transformer encoder layers, 12 attention heads, and
a hidden size of 768. We examined three ways of using the BERT output for sentence
representation: summing all layers, summing the last four layers, and using the top encoder layer.
We noticed that using the top layer of BERT achieved the best performance on our datasets.
For the training step, we combined three technological talk datasets (TED, Microsoft Research,
and TensorFlow) and trained a single unified model. The complete training data consist of 326
videos and 76,130 sentences, gathered from the TED channel (172 videos, 20,655 sentences), the
Microsoft Research channel (100 videos, 41,094 sentences), and the TensorFlow channel (54
videos, 14,381 sentences). As for the training parameters, we set the validation ratio to 0.2, the
batch size to 32, the number of epochs to 100, and the maximum length of the input text sequence
to 128. For the Bi-LSTM, we set the number of hidden units to 128, so that the concatenated
hidden size was 256, and used the tanh function to compute the feature output of the network. To
build the Bi-LSTM network, all training parameters were set to the same values for both M2
(BERT BiLSTM) and M3 (Tech-Talk-Sum).
To train the proposed model, it is necessary to have ground-truth scores. However, our training
datasets contain abstractive gold summaries, which are not readily suitable for extractive
summarization models. Therefore, we applied the ROUGE metric to calculate a score between
each sentence in the caption and its corresponding reference description, as in (16):

s = \alpha \cdot R_1 + (1 - \alpha) \cdot R_2,    (16)

where R_1 is ROUGE-1, R_2 is ROUGE-2, and \alpha is a coefficient, set to 0.5 to keep the balance
between the two scores.
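A sketch of this labeling step, using the open-source rouge-score package in place of Pyrouge
(an assumption, as is the use of the F-measure for R_1 and R_2):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def salience_target(sentence, description, alpha=0.5):
    """Ground-truth salience score of Eq. (16) for one caption sentence."""
    scores = scorer.score(description, sentence)   # reference first, candidate second
    return (alpha * scores["rouge1"].fmeasure
            + (1 - alpha) * scores["rouge2"].fmeasure)
```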
We utilized the Mean Absolute Error (MAE) loss and set the objective function as minimizing
the cross-entropy (CE) between the sentence's salience score (s) and the predicted score (\hat{s}):

CE = -s \log_e(\hat{s}) - (1 - s) \log_e(1 - \hat{s}).    (17)

Besides, we utilized the Adadelta optimizer and set the initial learning rate to 10^{-3}. Moreover,
we employed the TensorFlow callback function to reduce the learning rate when a metric had
stopped improving.
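In tf.keras terms, this training configuration might be wired up roughly as follows; the model and
data arguments are placeholders, and the ReduceLROnPlateau factor and patience are assumed values
not reported in the paper.

```python
import tensorflow as tf

def compile_and_train(model, train_inputs, train_scores):
    """Training configuration sketch: Adadelta at 1e-3 with a learning-rate
    reduction callback, as described in Section 4.4."""
    optimizer = tf.keras.optimizers.Adadelta(learning_rate=1e-3)
    lr_callback = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                       factor=0.5, patience=3)
    model.compile(optimizer=optimizer,
                  loss="binary_crossentropy",          # cross-entropy of Eq. (17)
                  metrics=[tf.keras.metrics.MeanAbsoluteError()])
    return model.fit(train_inputs, train_scores,
                     validation_split=0.2, batch_size=32, epochs=100,
                     callbacks=[lr_callback])
```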

5 Results and analysis

To assess the model performance, we employ Tech-Talk-Sum to produce summaries of both
long- and short-length videos. The statistical information of the evaluation datasets,
including the number of videos, the number of sentences, the duration, and the average number of
sentences per video, is listed in Table 1. We compare the summaries produced by our model

Table 1 The statistical information of the technological videos for model evaluation

YouTube Channel      Number of Videos   Number of Sentences   Duration (minutes)   Average sentences/video

TED                  100                11,829                15-20                ≈ 120
Microsoft Research   47                 19,258                40-60                ≈ 400
TensorFlow           30                 9,511                 30-40                ≈ 300
DeepLearningTV       29                 1,093                 5-10                 ≈ 40
Two Minute Papers    27                 1,744                 3-5                  ≈ 60
TechInsider          50                 2,730                 3-5                  ≈ 50

with those of four well-known baseline approaches. Besides, we explore the performance of BERT
by combining it with several summarization layers. First, we constructed M1, in which the
sentence vectors from BERT were input to a CNN network with dynamic filter sizes and
max-pooling layers; linear layers with a sigmoid function were added to predict the sentence
score. Second, M2 was built by applying the Bi-LSTM network to the BERT output and feeding
the hidden features to a linear layer to predict the score. M3 is our proposed model. These models
were trained on the same dataset. The average ROUGE scores were computed between the
generated summaries and the corresponding reference summaries using unigram and bigram
overlap (ROUGE-1 and ROUGE-2) and the longest common subsequence (ROUGE-L).

5.1 Experiment results

To evaluate the effectiveness of the model, we deployed the proposed model on both long-
and short-length talk videos to generate video summaries. The comparison results for the
long-length talk videos, covering the TED, Microsoft Research, and TensorFlow channels, are
shown in Table 2. We found that all of the BERT-based models outperformed the baseline
models on the average R-1, R-2, and R-L. The proposed model produced summaries with R-1 =
26.20, R-2 = 4.54, and R-L = 22.46 on the TED channel; 27.57, 3.89, and 22.33 on the Microsoft
Research channel; and 30.33, 8.16, and 25.44 on the TensorFlow channel, respectively.

Table 2 Performance comparison of Tech-Talk-Sum on long length talk videos dataset using the average
F score based on unigram (R-1), bigram (R-2), and longest common subsequence (R-L)

Models TED Microsoft Research TensorFlow

R-1 R-2 R-L R-1 R-2 R-L R-1 R-2 R-L

LEAD-3 10.09 1.93 9.71 6.94 1.94 6.52 20.25 6.95 19.09
TF-IDF score 16.70 2.82 15.34 10.93 1.68 9.87 18.83 4.75 17.18
TextRank[2] 14.59 1.97 13.26 11.01 0.89 9.94 17.62 4.16 16.11
DistilBERT [19] 13.75 2.42 13.01 11.03 1.47 10.65 19.09 6.23 18.26
M1 (BERT CNN) 25.76 3.39 21.75 13.87 2.35 13.01 29.12 5.57 23.02
M2 (BERT BiLSTM) 25.00 3.38 20.38 26.72 2.86 21.64 28.71 4.55 24.03
M3 (Tech-Talk-Sum) 26.20 4.54 22.46 27.57 3.89 22.33 30.33 8.16 25.44

Table 3 Performance comparison of Tech-Talk-Sum on short length talk videos dataset using the average
F score based on unigram (R-1), bigram (R-2), and longest common subsequence (R-L)

Models DeepLearningTV TechInsider TwoMinutePaper

R-1 R-2 R-L R-1 R-2 R-L R-1 R-2 R-L

LEAD-3 27.87 9.71 23.95 27.57 4.41 25.83 30.20 11.13 29.07
TF-IDF score 26.38 8.30 24.09 21.66 5.23 19.39 29.03 12.98 28.31
TextRank[2] 21.62 5.47 19.42 19.88 5.91 18.71 22.92 9.67 21.54
DistilBERT [19] 23.45 7.38 21.13 20.81 5.79 19.40 21.66 9.86 19.98
M1 (BERT CNN) 13.87 2.35 13.01 27.26 5.59 24.30 32.35 11.44 30.00
M2 (BERT BiLSTM) 24.69 4.37 21.47 26.57 5.33 22.46 33.51 12.11 29.38
M3 (Tech-Talk-Sum) 27.05 8.76 24.94 28.50 6.02 24.64 34.41 13.53 30.81

Table 3 demonstrates the comparison results for the short-talk videos. For DeepLearningTV,
our model (R-1 = 27.05, R-2 = 8.76, R-L = 24.94) performed better than the baselines on R-L but
obtained slightly worse R-1 and R-2 scores than LEAD-3. The testing results on the TechInsider
dataset show that our model achieves R-1 = 28.50 and R-2 = 6.02, outperforming the baseline
models; for R-L, however, our model obtained a slightly lower score, 24.64, than LEAD-3. On
the TwoMinutePaper dataset, our model reached R-1 = 34.41, R-2 = 13.53, and R-L = 30.81,
surpassing all the baselines. Usually, in short-length talk videos, a speaker discusses the core
content of the video at the beginning, which might be the reason why LEAD-3 can provide an
understandable summary.
Besides, we experimented on a well-known text document dataset, CNN/Daily Mail, to evaluate
the performance of our model. We compared it with the state-of-the-art models SummaRuNNer [17],
REFRESH [18], and SWAP-NET [1]. Based on the experimental results, we found that our model
can produce satisfying summaries even though we use a smaller training dataset, which also
consumes less training time than a larger one. The comparison results are listed in Table 4.
Furthermore, we also examined whether integrating BERT and the attention mechanism could
distill meaningful sentences to form representative summaries for technology talk videos.
Figure 3a and b illustrate the comparison of the ROUGE precision scores of Tech-Talk-Sum and
the baseline approaches. The precision scores tell us how much of the generated summary was
relevant or needed. Our extractive summarization system is able to produce satisfying summaries
on both the long- and short-talk videos.

Table 4 The comparison of Tech-Talk-Sum on the CNN/Daily Mail dataset using the average
F1 measure based on unigram (R-1), bigram (R-2), and longest common subsequence (R-L)

Models No. of Articles R-1 R-2 R-L

SummaRuNNer [17] 90,266 35.40 13.30 32.60
REFRESH [18] 90,165 39.60 16.20 35.30
SWAP-NET [1] 83,568 43.60 17.70 35.30
Tech-Talk-Sum 72,579 40.45 17.38 37.25

Fig. 3 The comparison graph of the ROUGE precision scores between Tech-Talk-Sum and the baseline
approaches

5.2 Experiment analysis

5.2.1 BERT representation studies

To study the effectiveness of BERT contextual representation, we examined three different


BERT representations. First, we used a summation of all encoder layers (pooling-layer = -1

Table 5 The performance comparison of three different choices to use BERT as a sentence representation on
the long talk videos dataset

Models TED Microsoft Research TensorFlow

R-1 R-2 R-L R-1 R-2 R-L R-1 R-2 R-L

BERT All 25.33 3.56 21.23 24.06 2.54 19.27 29.90 5.64 23.18
BERT 4Sum 25.71 4.22 21.06 26.69 3.04 21.67 29.33 5.38 24.54
BERT Top 26.20 4.54 22.46 27.57 3.89 22.33 30.33 8.16 25.44

to -12) of BERT to represent the caption sentences, which we call BERT All. Second, the
summation of the last four layers (pooling layers -1 to -4) is used to produce the caption sentence
representation; we name it BERT 4Sum. The last one is BERT Top, which utilizes the top layer of
BERT (pooling layer -1) to represent the caption sentences. Tables 5 and 6 present the ROUGE
score comparison of the three choices of BERT output as a sentence representation on our
summarization task: Table 5 shows the comparison on the long-talk videos, while Table 6 presents
the comparison on the short-talk videos. We found that the best performing choice for our datasets
is to use the last hidden layer of BERT to represent sentences.

5.2.2 Ablation studies

Moreover, we also assessed the effect of the document encoder layer. Ablation studies were
conducted with two variants of Tech-Talk-Sum: one with the document encoder layer, denoted
“+DocEn”, and one without it, denoted “-DocEn”. The results are shown in Table 7. The
experimental results demonstrate that leveraging the attention mechanism is able to enhance
sentence extraction. Using the document encoder layer improves the base model, with R-1
increasing by 1.22 points, R-2 by 1.93 points, and R-L by 1.39 points on average for the
long-length talk videos. For the short-length talk videos, R-1 increases by 1.73 points, R-2 by
2.16 points, and R-L by 2.36 points, on average.

5.2.3 Case study

Table 8 presents an example of the generated summaries from the three different methods.
The first row presents the reference summary of the video titled “Build a deep neural net-

Table 6 The performance comparison of three different choices to use BERT as a sentence representation on
the short talk videos dataset

Models DeepLearningTV TechInsider TwoMinutePaper

R-1 R-2 R-L R-1 R-2 R-L R-1 R-2 R-L

BERT All 25.31 4.99 22.20 25.62 5.18 22.82 33.96 11.97 28.94
BERT 4Sum 25.64 4.81 22.33 26.27 4.72 22.08 33.44 11.56 29.23
BERT Top 27.05 8.76 24.94 28.50 6.02 24.64 34.41 13.53 30.81

Table 7 Results of ablation studies of Tech-Talk Sum with the documents encoder layer (+DocEn) and
without the document encoder layer (-DocEn) based on the average F score of unigram (R-1), bigram (R-2),
and longest common subsequence (R-L)

Datasets R-1 R-2 R-L


+DocEn -DocEn increase +DocEn -DocEn increase +DocEn -DocEn increase

TED 26.20 25.00 +1.20 4.54 3.38 +1.16 22.46 20.38 +2.08
MicrosoftResearch 27.57 26.72 +0.85 3.89 2.86 +1.03 22.33 21.64 +0.69
TensorFlow 30.33 28.71 +1.62 8.16 4.55 +3.61 25.44 24.03 +1.41
DeepLearningTV 27.05 24.69 +2.36 8.76 4.37 +4.39 24.94 21.47 +3.47
TechInsider 28.50 26.57 +1.93 6.02 5.33 +0.69 24.64 22.46 +2.18
TwoMinutePaper 34.41 33.51 +0.90 13.53 12.11 +1.42 30.81 29.38 +1.43

work in 4 mins with TensorFlow in Colab” from the TensorFlow YouTube channel. The
second row shows the summary generated by DistilBERT [19], and the third and fourth
rows present the summaries produced by BERT-BiLSTM and Tech-Talk-Sum, respectively.
We observe that the summary produced by the proposed model is more concise, not only
maintaining semantic relevance but also capturing the core content. Moreover, the summary
produced by Tech-Talk-Sum obtains the highest salience score when compared with the reference
summary.

Table 8 Examples of summaries for a video from the TensorFlow channel, generated by DistilBERT [19],
M2 (BERT Bi-LSTM), and Tech-Talk-Sum

Reference Summary (Video’s description)


“google colaboratory is a free jupyter notebook environment that requires no setup and runs entirely in the
cloud”, “in this episode of coding tensorflow laurence shows us how to code test and train neural networks
right in your browser without having to worry about installing any kind of runtime”, “watch to quickly see
an example of how you can use tensorflow to build a neural network for breast cancer classification all this
happens within colab”
DistilBERT [19], Salience score = 0.324, ROUGE-1 score = 0.500
“hi, and welcome to part three of this series on using google colab to code, train, and test neural networks in
the browser without needing to install any kind of a runtime”, “let’s now take a look at the code for training
this neural network using this data so you can use that network to then perform breast cancer classification
yourself”, “and we’ll then have a layer of 16, then 8, then 6, and then, finally, 1, “now we’re classifying
two features, so that’s perfect”
M2 (BERT-BiLSTM), Salience score = 0.381, ROUGE-1 score = 0.525
“in this video i will show you then how you can use tensorflow to build a neural network for breast cancer
classification”, “now we are classifying two features so that is perfect”, “we can now test that network with
data that the neural network has not yet seen”, “music playing laurence moroney hi and welcome to part
three of this series on using google colab to code train and test neural networks in the browser without
needing to install any kind of a runtime”, “as you will see once it finishes training the loss is 0 0595 showing
that it is about 94 accurate”
Tech-Talk Sum, Salience score = 0.400 , ROUGE-1 score =0.532
“hi and welcome to part three of this series on using google colab to code train and test neural networks in
the browser without needing to install any kind of a runtime”, “in this video i will show you then how you
can use tensorflow to build a neural network for breast cancer classification”, “i have pre processed the data
into several csv files so we can just focus on the neural network itself”, “and in the next video in this series
my colleague paige will show you about how to use different runtimes and processors and how to use your
code to take advantage of gpus and tpus right in your browser”

6 Conclusions and future works

Automatic text summarization has undoubtedly grown in popularity in recent years. The
stream of prior studies focuses on text documents, with only a few works having studied
spoken document summarization. In this paper, we introduce a novel approach to extractive
summarization for technological talk videos called Tech-Talk-Sum, which focuses
on sentence scoring, unlike previous works that considered sentence classification.
We constructed technological talk datasets including long- and short-talk videos.
For long-length technological videos, we retrieved videos from the TED, Microsoft Research,
and TensorFlow channels. The average duration of videos on these three channels is 40-60
minutes, and their content mainly focuses on applying technology to various fields, including
engineering, business, and agriculture, involving machine learning and artificial intelligence.
For short-length technological videos, we considered three YouTube channels: DeepLearningTV,
Two Minute Papers, and Tech Insider. Their content presents new technical knowledge and new
technologies, and the average video length on these three channels is 5-10 minutes. The full
version of the dataset is available at https://github.com/chalothon/Tech-Talk Sum dataset.
To build the video content representation, we leverage BERT (Bidirectional Encoder
Representations from Transformers), which is pre-trained on large-scale text and can capture
semantic features. Next, we applied a Bi-LSTM at the sentence level to learn the local context.
Besides, the self-attention mechanism was adopted on the Bi-LSTM output to produce the
document encoder layer that captures the global context of the sentences. The output from the
Bi-LSTM network and the output from the document layer were fed into the unidirectional LSTM
to compute the salience score of each sentence. Finally, the summaries are formed based on these
scores.
In the training procedure, we trained a single unified model combining three technology
talk channels: the TED, Microsoft Research, and TensorFlow channels. We built various
experiments to study the effectiveness of the proposed model and analyzed the impact of the
attention mechanism on six datasets. Our model has been validated for summary generation on
both short- and long-length videos. To report the model performance, we fed the testing data to
our model and the four baselines to produce summaries, and then measured the overlapping
content between the generated and reference summaries using the standard ROUGE metrics. We
analyzed the comparison scores based on the average F score using unigram and bigram overlap
(R-1 and R-2) and the longest common subsequence (R-L). The experimental results demonstrate
that Tech-Talk-Sum outperforms the baseline models on both long- and short-talk videos. In
future work, we will focus more on the encoder-decoder method to generate abstractive summaries
that are closer to the summaries created by humans.

Acknowledgements The authors would like to thank the Ministry of Science and Technology of Taiwan for
supporting our research under grant No. MOST105-2923-S-008-001-MY3.

References

1. Aishwarya J, Vaibhav R (2018) Extractive summarization with SWAP-NET: Sentences and words from
alternating pointer networks. In: Proc of the 56th annual meeting of the association for computational
linguistics, melbourne, Australia, pp 142–151. https://doi.org/10.18653/v1/P18-1014
2. Albi D, Silvester H (2017) Pagerank Algorithm. IOSR Journal of Computer Engineering 19(1):01–07.
https://doi.org/10.9790/0661-1901030107

3. Alessandro F (2014) Innovative document summarization techniques: revolutionizing knowledge under-


standing. IGI Global. https://doi.org/10.4018/978-1-4666-5019-0
4. Alexander MR, Sumit C, Jason W (2015) A neural attention model for abstractive sentence summariza-
tion. In: Proc. of the 2015 conference on empirical methods in natural language processing, pp 379–389.
https://doi.org/10.18653/v1/D15-1044
5. Ashish V, Noam S, Niki P, Jakob U, Llion J, Aidan NG, Łukasz K (2017) Attention Is All You Need.
In: Proc. of 2017 NIPS Long Beach, CA, USA, pp 6000–6010
6. Ashutosh A, Achyudh R, Raphael T, Jimmy L (2019) DocBERT: BERT for document classification.
arXiv:1904.08398
7. Chi S, Luyao H, Xipeng Q (2019) utilizing BERT for Aspect-Based sentiment analysis via constructing
auxiliary sentence. In: Proc. of NAACL-HLT 2019 minneapolis, minnesota, pp 380–385
8. Chin YL (2004) ROUGE: A package for automatic evaluation of summaries. In: Association for
computational linguistics, Barcelona, Spain, pp 74–81
9. Chun IT, Hsiao TH, Kuan YC, Berlin C (2016) Extractive speech summarization leveraging convo-
lutional neural network techniques. In: Proc 2016 IEEE spoken language technology workshop (SLT
2016), pp 158–164
10. Fu Z, Bing Q, Jing Y, Jing C, Yubo Z, Xu W (2019) Document summarization using word
and part-of-speech based on attention mechanism. Journal of Physics: Conference Series, 1168(3).
https://doi.org/10.1088/1742-6596/1168/3/032008
11. Jacob D, Ming WC, Kenton L, Kristina T (2019) BERT: Pre-training of deep bidirectional transformers
for language understanding. Computer Science. https://doi.org/10.18653/v1/n19-1423
12. Jeffrey P, Richard S, Christopher M (2014) Glove: Global vectors for word representation. In: Proc. of
the 2014 conference on empirical methods in natural language processing (EMNLP), Association for
Computational Linguistics, Doha, Qatar, pp 1532–1543. https://doi.org/10.3115/v1/D14-1162
13. Kamal AS, Zhang Z, Mohammed N (2018) A hierarchical structured self-attentive
model for extractive document summarization (HSSAS). IEEE Access 6:24205–24212.
https://doi.org/10.1109/ACCESS.2018.2829199
14. Kuan YC, Shih HL, Berlin C, Hsin MW, Ea EJ (2015) Extractive broadcast news summarization leverag-
ing recurrent neural network language modeling techniques. IEEE/ACM Transactions on Audio Speech,
and Language Processing 23(8):1322–1334. https://doi.org/10.1109/TASLP.2015.2432578
15. Lijun W, Fei T, Li Z, Jianhuang L, Tie YL (2018) Word attention for sequence to sequence text under-
standing. In: Proc. of the thirty-second AAAI conference on artificial intelligence, Louisiana, USA,
pp 5578–5585
16. Qicai W, Peiyu L, Zhenfang Z, Hongxia Y, Qiuyue Z, Lindong Z (2019) A text abstraction
summary model based on BERT word embedding and reinforcement learning. Appl Sci, 9(21).
https://doi.org/10.3390/app9214701
17. Ramesh N, Feifei Z, Bowen Z (2016) SummaRuNNer: A recurrent neural network based sequence model
for extractive summarization of documents. In: Proc. of the thirty-first AAAI conference on artificial
intelligence (AAAI-17), San Francisco, California USA, pp 3075–3081
18. Shashi N, Shay BC, Mirella L (2018) Ranking sentences for extractive summarization with reinforce-
ment learning. In: Proc. of NAACL-HLT 2018 New Orleans, Louisiana, pp 1747–1759
19. Victor S, Lysandre D, Julien C, Thomas W (2020) DistilBERT, a distilled version of BERT: smaller,
faster, cheaper and lighter. arXiv:1910.01108
20. Xingxing Z, Mirella L, Furu W, Ming Z (2018) Neural latent extractive document summarization. In:
Proc. of the 2018 conference on empirical methods in natural language processing, brussels, Belgium.
November, pp 779–784. https://doi.org/10.18653/v1/D18-1088
21. Yang L (2019) Fine-tune BERT for Extractive Summarization. arXiv:1903.10318
22. Yang L, Mirella L (2019) Text summarization with pretrained encoders. In: Proc 2019 the conference on
empirical methods in natural language processing, Hong Kong, China, pp 3730–3740
23. Yau SW, Hung YL (2018) Learning to encode text as human-readable summaries using gen-
erative adversarial networks. In: Proc. of the 2018 conference on empirical methods in natural
language processing, association for computational linguistics, Brussels, Belgium, pp 4187–4195.
https://doi.org/10.18653/v1/D18-1451
24. Yinhan L, Myle O, Naman G, Jingfei D, Mandar J, Danqi C, Omer L, Mike L, Luke Z, Veselin S (2019)
RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692
25. Yogan JK, Ong SG, Halizah B, Ngo HC, Puspalata CS (2016) A review on automatic text summarization
approaches. J Comput Sci 12(4):178–190. https://doi.org/10.3844/jcssp.2016.178.190
26. Yong Z, Meng JE, Mahardhika P (2016) Extractive document summarization based on convolutional
neural networks. In: IECON 2016 - 42nd Annu. Conf. IEEE Ind. Electron. Soc., Florence, Italy, pp 918–
922

27. Yuxiang W, Baotian H (2018) Learning to extract coherent summary via deep reinforcement learning. In:
association for the advancement of artificial intelligence (AAAI 2018), Louisiana, USA, pp 5602–5609
28. Zhenzhong L, Mingda C, Sebastian G, Kevin G, Piyush S, Radu S (2020) ALBERT: A lite bert
for Self-Supervised learning of language representations. In: Proc. ICLR 2020, Ababa, Ethiopia.
arXiv:1909.11942
29. Zhenzhong L, Mingda C, Sebastian G, Kevin G, Piyush S, Radu S (2020) ALBERT: A lite bert
for Self-Supervised learning of language representations. In: Proc. ICLR 2020, Ababa, Ethiopia.
arXiv:1909.11942

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
