https://doi.org/10.1007/s11042-022-12812-4
Abstract
Automatic summarization is the task of condensing data into a shorter version while preserving its key informational components and the meaning of the content. In this paper, we introduce Tech-Talk-Sum, a combination of BERT (Bidirectional Encoder Representations from Transformers) and the attention mechanism for summarizing technological talk videos. We first introduce technology-talk datasets constructed from YouTube, including short- and long-talk videos. Second, we explored various sentence representations derived from BERT's output; using the top hidden layer to represent sentences proved the best choice for our datasets. The outputs from BERT were fed forward to a Bi-LSTM network to build local context vectors. In addition, we built a document encoder layer that leverages BERT and the self-attention mechanism to express the semantics of a video caption and to form the global context vector. Third, a unidirectional LSTM was added to bridge the local and global sentence contexts and predict each sentence's salience score. Finally, the video summaries were generated based on these scores. We trained a single unified model on the long-talk video datasets. ROUGE was utilized to evaluate the proposed methods. The experimental results demonstrate that our model generalizes well and outperforms the baseline and state-of-the-art methods for both long and short videos.
Chalothon Chootong
chootong.c@ku.th
Timothy K. Shih
timothykshih@gmail.com
1 Introduction
In general, automatic text summarization is categorized into two main approaches: abstraction and extraction [25]. Summaries generated by abstractive summarization are closer to human-produced summaries. Meanwhile, extractive summarization extracts the key parts of the document that are deemed interesting and necessary and joins them to build a summary; the summaries are produced by selecting the highlighted sentences from the original text. Recently, deep learning methods have gradually become a hot topic in natural language processing (NLP), including automated text summarization. Deep learning methods are powerful in constructing summarization models without manual intervention or labor [3, 23, 26, 27]. However, most previous works focused on text documents, and only a few studied speech summarization [9, 14]. Thus, it is still a challenging task.
Video is a popular medium for sharing information that people can use for entertainment, education, and learning new knowledge. At present, various platforms such as YouTube, Facebook, and Instagram allow people to share videos. YouTube is a widely used online video-sharing platform that lets users upload, view, and share videos, and manage playlists. It offers a wide variety of short- and long-duration user-created and corporate media videos, including video clips, music videos, movies, live streams, and educational videos. Several categories of videos are available on YouTube, including the science and technology genre, which contains many highly informative videos. Technological talk videos aim to showcase science and technological innovations, and their content involves both software and hardware; they usually discuss technologies, devices, gadgets, machines, or computer algorithms. Apart from that, various organizations publish and freely distribute online talks, which help people to improve and gain new knowledge. Video metadata consist of a video ID, title, caption, description, tags, and thumbnails. Metadata such as titles and descriptions are mostly created manually. We found that numerous videos still lack a description; the description not only summarizes the video content but also plays a vital role in improving the performance of video retrieval systems. Without a description, it can be time-consuming for users to find relevant videos. We believe that automatic summarization can be a potential tool to distil the key points of video content and generate the video's description to overcome the issue mentioned above.
In this paper, we construct a technological talk video dataset including short- and long-length videos. To summarize video content accurately and efficiently, the summarization model needs to comprehend subtitles and extract relevant information from the given video caption. Consequently, we propose an extractive summarization method called Tech-Talk-Sum to summarize technological talk videos. For video content representation, we apply contextualized word embeddings from BERT (Bidirectional Encoder Representations from Transformers) [11] to overcome the limitations of handling a small amount of data. Outputs from BERT are processed by a Bi-LSTM network to generate the local context representation of sentences. In addition, we create a document encoder layer based on the output of the Bi-LSTM and the self-attention mechanism for the global context representation of sentences. The extractive model is trained as a single unified model, combining three technology talk channels from YouTube: the TED, Microsoft Research, and TensorFlow channels. To demonstrate the model's performance, we conducted experiments on both long- and short-length videos. The experimental results show that our model outperforms baselines and state-of-the-art methods on the average ROUGE scores.
2 Related work
In the last few years, new deep learning models have emerged in the field of Natural
Language Processing (NLP) to improve the ability of a machine to understand languages.
The most well-known of these models is BERT [8]. BERT was built based on the trans-
former architecture and unsupervised pre-training that used multiple layers of attention
incorporated with multiple attention “heads” in every layer. BERT was first trained on two
unsupervised tasks: masked language modeling (predicting a missing word in a sentence)
and next sentence prediction (predicting if one sentence naturally follows another). The core
component of BERT is attention, which is a way for a model to assign a weight to input
features based on their importance.
Many researchers have realized the benefits of BERT and tried to enhance its performance on NLP tasks. For instance, the authors of [29] proposed ALBERT for language representation, which decreases the training cost with low memory consumption and increases BERT's training speed. The Facebook AI team proposed RoBERTa [24], which included a careful evaluation of the effects of hyper-parameter tuning and training size. Furthermore, BERT has been applied to various NLP tasks such as sentiment analysis, text classification, question answering, and language translation. Yang Liu et al. [22] studied how to usefully apply BERT in text summarization. They proposed a general framework for both extractive and abstractive text summarization, and found that BERT is able to encode a document and obtain representations for its sentences by employing pre-trained language models. Yang Liu [21] also designed different variants of using BERT for extractive summarization, where BERT was combined with a simple classifier, a recurrent neural network, and a transformer. Qicai Wang et al. [16] applied BERT and reinforcement learning to abstractive text summarization by taking advantage of the rich semantic features of BERT word embeddings. Their model can select the critical sentences from the source input, and these sentences are then rewritten into a shorter version that still contains the main meaning.
3 Methods
In this paper, we introduce Tech-Talk-Sum, a novel extractive summarization method for technological talk videos, as presented in Fig. 1. The Tech-Talk-Sum model is composed of three main components: Sentence Encoder, Document Encoder, and Sentence Score Prediction. First, BERT is utilized to encode word tokens and produce the sentence representation (T). Next, a Bi-LSTM is applied to the outputs from BERT. The latent features from the Bi-LSTM are used by the attention sub-layer to compute the attention weight for each sentence and construct the document vector. In general, a Bi-LSTM can summarize the information of documents from both directions. The local context feature vector (h) and the global context feature vector (D) are bridged and fed into a unidirectional LSTM. Finally, a salience score is computed and assigned to each sentence to indicate how crucial it is for representing the video content.
Fig. 1 The architecture of the Tech-Talk-Sum model. The model comprises three components: sentence encoder, document encoder, and sentence score prediction. ReLU is utilized at the last layer to predict the sentence salience score
The word embedding x_{pos} is added to the positional encoding PE_{pos}, and the result is input to the BERT model. Relative-position attention is used to generate the hidden feature for each encoder layer in BERT, as shown in (3)-(5):

z_i = \sum_{j=1}^{n} \alpha_{ij} \left( x_j W^V + a_{ij}^V \right), \quad (3)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \quad (4)

e_{ij} = \frac{x_i W^Q \left( x_j W^K + a_{ij}^K \right)^T}{\sqrt{d_z}}, \quad (5)

where z_i is the hidden feature of the word embedding x_i, a_{ij}^V and a_{ij}^K are learnable parameters between positions i and j, \alpha_{ij} is the attention value, d_z is the number of dimensions, and W^V, W^Q, W^K are parameter matrices to be learned.
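To make (3)-(5) concrete, the following NumPy sketch computes one relative-position self-attention pass. The projection matrices W^Q, W^K, W^V and the relative-position embeddings a^K, a^V are randomly initialized here purely for illustration, whereas inside BERT they are learned parameters.

import numpy as np

def relative_position_attention(x, Wq, Wk, Wv, a_k, a_v):
    """Self-attention with relative position representations, following (3)-(5).

    x:          (n, d)       word embeddings (token + position embeddings)
    Wq, Wk, Wv: (d, d_z)     projection matrices
    a_k, a_v:   (n, n, d_z)  relative-position embeddings between positions i and j
    Returns z:  (n, d_z)     hidden features.
    """
    d_z = Wq.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # e_ij = q_i (k_j + a_ij^K)^T / sqrt(d_z)                    -- Eq. (5)
    e = np.einsum('id,ijd->ij', q, k[None, :, :] + a_k) / np.sqrt(d_z)
    # alpha_ij = softmax over j                                   -- Eq. (4)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    # z_i = sum_j alpha_ij (v_j + a_ij^V)                         -- Eq. (3)
    return np.einsum('ij,ijd->id', alpha, v[None, :, :] + a_v)

# toy usage with random parameters
n, d, d_z = 5, 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_z)) for _ in range(3))
a_k, a_v = rng.normal(size=(n, n, d_z)), rng.normal(size=(n, n, d_z))
print(relative_position_attention(x, Wq, Wk, Wv, a_k, a_v).shape)  # (5, 8)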
The BERT-base model stacks 12 transformer encoder layers, and the per-token output of each layer can be utilized as a word embedding. In our experiments, we found that the output from the top layer of BERT performs best on our datasets. Therefore, the vector T_i, which is the vector of the i-th [CLS] token from the top layer, is used as the representation of sent_i.
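As an illustration of this step, the sketch below extracts the top-layer [CLS] vector for each sentence with the Hugging Face transformers library; the library choice and the helper function are our own assumptions, since the paper does not prescribe a particular implementation.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def sentence_vectors(sentences, max_len=128):
    """Return one 768-d vector per sentence: the top-layer [CLS] hidden state."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    # last_hidden_state: (batch, seq_len, 768); position 0 is the [CLS] token
    return out.last_hidden_state[:, 0, :]

T = sentence_vectors(["Deep learning powers modern summarizers.",
                      "TensorFlow makes it easy to build models."])
print(T.shape)  # torch.Size([2, 768])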
Once we obtain the sentence representations from the BERT encoder, we summarize the information of the document from both directions by applying a Bi-LSTM to the sentence vectors T, as formulated in (6)-(7):

\overrightarrow{h}_t = \overrightarrow{LSTM}(T_t, \overrightarrow{h}_{t-1}), \quad (6)

\overleftarrow{h}_t = \overleftarrow{LSTM}(T_t, \overleftarrow{h}_{t-1}), \quad (7)

where T_t is the sentence vector of the t-th sentence at time step t. Then the forward state \overrightarrow{h}_t and the backward state \overleftarrow{h}_t are concatenated to obtain the vector h_t, which summarizes the information of the t-th sentence and its context. With N denoting the number of sentences in the caption document, the whole sequence of Bi-LSTM hidden states can be represented as in (8), where H_D \in R^{N \times 2d_n} and d_n is the number of hidden nodes of the Bi-LSTM:

H_D = (h_1, h_2, ..., h_N). \quad (8)
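A minimal Keras sketch of (6)-(8) is given below, assuming the BERT sentence vectors T are already available; the 128 hidden units mirror the training setup reported in Section 4, and the layer is only a sketch of the local-context encoder, not the authors' exact code.

import tensorflow as tf

BERT_DIM, HIDDEN = 768, 128  # 128 LSTM units, as in the training setup of Section 4

# T: a sequence of BERT sentence vectors for one caption document, (batch, N, 768)
T = tf.keras.Input(shape=(None, BERT_DIM), name="sentence_vectors")

# Eq. (6)-(7): forward and backward passes; Eq. (8): concatenated states H_D
H_D = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),
    merge_mode="concat", name="local_context")(T)        # (batch, N, 2 * HIDDEN)

local_encoder = tf.keras.Model(T, H_D, name="bilstm_sentence_encoder")
local_encoder.summary()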
Next, we assign a weight to each sentence in the caption document according to its contribution. The representation of the whole caption is modeled as a weighted sum of the concatenated Bi-LSTM hidden states via a self-attention mechanism [5]. The concatenated hidden states H_D are taken as input to produce a vector of weights a_D \in R^{1 \times N}, calculated as in (9). The document vector is then obtained as the sum of the Bi-LSTM hidden states weighted by a_D, as shown in (10):

a_D = softmax(W_2 \tanh(W_1 H_D^T + b)), \quad (9)

D = a_D H_D, \quad (10)

where softmax(·) is the function used to normalize the attention weights; W_1, W_2, and b are learnable parameters, with W_1 \in R^{k \times 2d_n} and W_2 \in R^{1 \times k}; d_n is the number of hidden nodes of the LSTM; k is a hyperparameter that can be set arbitrarily; and b is a bias term. tanh(·) is the activation function, and D represents the document vector, D \in R^{1 \times 2d_n}.
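The document encoder of (9)-(10) can be sketched as a custom Keras layer as follows; the weight shapes follow the dimensions stated above, and k = 64 is an arbitrary illustrative choice, not a value prescribed by the paper.

import tensorflow as tf

class DocumentEncoder(tf.keras.layers.Layer):
    """Self-attention document encoder sketch following (9)-(10)."""

    def __init__(self, k=64, **kwargs):   # k is an arbitrary hyperparameter
        super().__init__(**kwargs)
        self.k = k

    def build(self, input_shape):
        two_dn = int(input_shape[-1])      # 2 * d_n, the Bi-LSTM output size
        self.W1 = self.add_weight(name="W1", shape=(self.k, two_dn),
                                  initializer="glorot_uniform")
        self.W2 = self.add_weight(name="W2", shape=(1, self.k),
                                  initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(self.k, 1), initializer="zeros")

    def call(self, H_D):                                    # H_D: (batch, N, 2*d_n)
        # Eq. (9): a_D = softmax(W2 tanh(W1 H_D^T + b)),    shape (batch, 1, N)
        u = tf.tanh(tf.einsum("kd,bnd->bkn", self.W1, H_D) + self.b)
        a_D = tf.nn.softmax(tf.einsum("ok,bkn->bon", self.W2, u), axis=-1)
        # Eq. (10): D = a_D H_D, weighted sum of hidden states, (batch, 1, 2*d_n)
        return tf.matmul(a_D, H_D)

H_D = tf.keras.Input(shape=(200, 256))
D = DocumentEncoder(k=64)(H_D)
print(D.shape)  # (None, 1, 256)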
Again, each sentence is treated as a step in a sequence to be assigned a score. The sentence and document vectors are bridged and input to a unidirectional LSTM built with 64 hidden units. The outputs of the LSTM are then fed into a fully connected layer and squashed with the ReLU function to predict the score, as shown in (11)-(12):

L_i = LSTM((W_c h_i + h_i^T W_s D), h_{i-1}), \quad (11)

\hat{s}_i = ReLU(W_l L_i + b), \quad (12)

where L_i is the latent feature of the i-th sentence, h_i is the hidden state of the Bi-LSTM at the i-th time step, D is the document vector, and \hat{s}_i is the predicted score indicating the salience of the i-th sentence. W_c, W_s, and W_l are parameters learned during training, whose dimensions depend on the number of hidden units, and b is the bias term of the scoring network. Here, W_c h_i captures the information content of the i-th sentence, and h_i^T W_s D denotes the salience of the sentence with respect to the video content, with h_i \in R^{1 \times 2d_n}, h_i^T \in R^{2d_n \times 1}, and D \in R^{1 \times 2d_n}.
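A sketch of the scoring step in (11)-(12) is shown below. Note that the bilinear term h_i^T W_s D is folded in as an elementwise interaction with the document vector D; this is one plausible reading of (11) rather than the authors' exact implementation.

import tensorflow as tf

class SalienceScorer(tf.keras.layers.Layer):
    """Sentence score prediction sketch following (11)-(12)."""

    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True)  # Eq. (11)
        self.out = tf.keras.layers.Dense(1, activation="relu")          # Eq. (12)

    def build(self, input_shape):
        h_shape, _ = input_shape
        two_dn = int(h_shape[-1])
        self.Wc = self.add_weight(name="Wc", shape=(two_dn, two_dn),
                                  initializer="glorot_uniform")
        self.Ws = self.add_weight(name="Ws", shape=(two_dn, two_dn),
                                  initializer="glorot_uniform")

    def call(self, inputs):
        h, D = inputs                                       # h: (batch, N, 2*d_n), D: (batch, 1, 2*d_n)
        content = tf.einsum("bnd,de->bne", h, self.Wc)      # information content of each sentence
        salience = tf.einsum("bnd,de->bne", h, self.Ws) * D # interaction with the document vector
        L = self.lstm(content + salience)                   # unidirectional LSTM over sentences
        return tf.squeeze(self.out(L), -1)                  # predicted scores, (batch, N)

h = tf.keras.Input(shape=(200, 256))
D = tf.keras.Input(shape=(1, 256))
scores = SalienceScorer(units=64)([h, D])
print(scores.shape)  # (None, 200)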
The sentences are ranked according to their scores, and the most salient sentences are selected to generate the summary. To avoid redundancy, we only select sentences that are not similar to those already chosen. Two sentences are considered similar if they share 70% of their words, excluding stop words. If the similarity between the i-th sentence in the candidate set and every sentence already in the summary set is less than 70%, the i-th sentence is added to the summary set. This process is repeated until the length of the summary equals M. In our experiments, M is set to 120.
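A minimal sketch of this greedy selection is given below, assuming M counts words and using content-word overlap as the similarity measure; the stop-word list and tokenization are simplified placeholders.

import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that", "this"}

def content_words(sentence):
    return {w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS}

def similarity(s1, s2):
    """Fraction of shared content words (one plausible reading of the 70% criterion)."""
    w1, w2 = content_words(s1), content_words(s2)
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / min(len(w1), len(w2))

def select_summary(sentences, scores, max_words=120, threshold=0.7):
    """Greedily pick high-scoring sentences, skipping near-duplicates, until M words."""
    ranked = sorted(zip(scores, sentences), key=lambda p: p[0], reverse=True)
    summary, length = [], 0
    for _, cand in ranked:
        if any(similarity(cand, kept) >= threshold for kept in summary):
            continue                       # too similar to an already selected sentence
        summary.append(cand)
        length += len(cand.split())
        if length >= max_words:            # M = 120 in our experiments
            break
    return summary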
4 Experimental setup
In this section, we first explain the dataset preparation. Afterwards, we provide the details of the baseline methods and the evaluation metrics, followed by the details of the training process.
4.1 Dataset preparation

With the YouTube Data API, we can retrieve video information by sending a request token to the Google Authentication Server (AS) using the OAuth 2.0 protocol.1 After obtaining OAuth 2.0 credentials, an access token is provided to access the YouTube Data API; a Client ID and Client secret are generated for our application. To begin extracting YouTube information, the Channel ID is used as a parameter to get the video playlists, each of which contains a video list. We can then get the video IDs and pass them to the downloading method provided by the YouTube API to retrieve the corresponding captions and descriptions. The caption is an essential resource for representing video content, and the descriptions can be utilized as ground truth to train the model. For training, we consider only educational videos that provide both a caption and a useful description; this information was provided by the author who published the video. All the steps of extracting video captions are illustrated in Fig. 2. To convert a SubRip Subtitle (SRT) file, we created a function to remove the timecodes, remove the tags, delete useless characters, and transform abbreviated words. Finally, all captions and their corresponding descriptions are saved as text files. The full dataset is available at https://github.com/chalothon/Tech-Talk Sum dataset.

1 https://tools.ietf.org/html/rfc6749
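As an illustration of the caption-cleaning step, the sketch below converts an SRT file into plain text; the regular expressions and the abbreviation table are simplified placeholders rather than the exact rules used in our pipeline.

import re

ABBREVIATIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}  # placeholder table

def srt_to_text(srt_content):
    """Convert a SubRip (SRT) caption into plain text: drop indices, timecodes, and tags."""
    lines = []
    for line in srt_content.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue                                  # cue number or blank line
        if re.match(r"\d{2}:\d{2}:\d{2},\d{3} --> ", line):
            continue                                  # timecode line
        line = re.sub(r"<[^>]+>", "", line)           # formatting tags such as <i>...</i>
        line = re.sub(r"[^\w\s.,!?'-]", "", line)     # other useless characters
        for abbr, full in ABBREVIATIONS.items():
            line = line.replace(abbr, full)           # expand abbreviated words
        lines.append(line)
    return " ".join(lines)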
During our experiments, we examine six YouTube channels related to technology as
listed below, and then group them into long- and short-length videos based on their duration.
– TED channel: videos on this channel comprise the best talks and performances from the
TED Conferences. Their videos are organized into various categories, which include
technology and science.
– Microsoft Research channel: this channel shares the talk videos from researchers, sci-
entists, and engineers. They present both business and engineering topics involving
machine learning, artificial intelligence, and cloud computing.
– TensorFlow channel: this channel publishes lecture videos related to deep learning and the implementation of machine-learning methods with TensorFlow.
– DeepLearningTV channel: this channel features topics such as reviews of software libraries and applications, showcasing the intuition behind deep learning methods.
– TwoMinute Papers channel: this channel provides research paper summaries in various fields, publishing two new science videos every week.
– Tech Insider channel: this channel produces updated technology videos. The videos
present the details of new technologies, gadgets, and machines, etc.
4.2 Baselines
To further illustrate the performance of the Tech-Talk-Sum across six datasets, we compared
it with well-known baseline methods. The first baseline is the leading sentenced method
(LEAD-3): the summaries were produced by selecting the first three sentences of the doc-
ument. The second baseline is the TF-IDF score: this method utilized TF-IDF scores to
indicate the importance of the sentence. All sentences are ranked by score, and the top N
sentences are used to represent the summary. The third baseline is TextRank, where GloVe
word embedding [12] was used to represent a sentence, and the similarity matrix was cal-
culated between sentences. Then, the matrix was converted into a graph, and next to the
PageRank algorithm [2] was applied to the graph to arrive at sentence rankings. Finally,
the summary was produced based on the top N sentences. The last baseline is DistilBert
[19]. We applied DistilBERT, which is a distilled version of BERT in order to produce the
summary. DistilBERT does not have the token ids and does not consider the input position.
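For reference, a compact sketch of the two simplest baselines (LEAD-3 and the TF-IDF score) is given below; the sentence splitting, the scikit-learn vectorizer, and the choice of N are illustrative assumptions rather than the exact baseline implementations.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lead_3(sentences):
    """LEAD-3 baseline: take the first three sentences of the caption document."""
    return sentences[:3]

def tfidf_summary(sentences, top_n=3):
    """TF-IDF baseline: rank sentences by the sum of their TF-IDF term weights."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    top = sorted(np.argsort(scores)[::-1][:top_n])   # keep the original sentence order
    return [sentences[i] for i in top]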
To evaluate the proposed model, we utilized the Pyrouge package with evaluation scripts. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [8] is a standard evaluation metric for automatic summarization and machine translation systems. It works by comparing an automatically produced summary against a set of reference summaries. The first metric is ROUGE-N, which measures the N-gram overlap between the system and reference summaries; ROUGE-1 (unigram) and ROUGE-2 (bigram) are the most widely used. Another is ROUGE-L, which evaluates the generated summary based on the longest common subsequence. This metric naturally considers sentence-level structural similarity and automatically identifies the longest co-occurring n-gram sequence. Besides, we used Precision and Recall in the context of ROUGE as additional evaluation measures. The Precision score measures how much of the system-created summary is relevant or needed, and Recall in the context of ROUGE measures how many of the n-grams in the reference summary are recovered or captured by the generated summary. Moreover, to report our model performance, we computed the F score using (13) based on ROUGE-1, ROUGE-2, and ROUGE-L, balancing the precision score calculated by (14) and the recall score calculated by (15).
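Equations (13)-(15) are not reproduced here; assuming the standard ROUGE formulation, they correspond to the usual F, precision, and recall definitions:

F_N = \frac{2 \, P_N R_N}{P_N + R_N}, \qquad
P_N = \frac{\text{count of overlapping } N\text{-grams}}{\text{count of } N\text{-grams in the generated summary}}, \qquad
R_N = \frac{\text{count of overlapping } N\text{-grams}}{\text{count of } N\text{-grams in the reference summary}}.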
To build Tech-Talk-Sum, we used pre-trained BERT to represent the sentences of the caption documents. There are two types of BERT: case-sensitive and case-insensitive. In our experiments, we chose the case-insensitive BERT with 12 transformer encoder layers, 12 attention heads, and a hidden size of 768. We examined three approaches to using the BERT output for sentence representation: summing all layers, summing the last four layers, and using the top encoder layer. We observed that using the top layer of BERT gave the best performance on our datasets.

For the training step, we combined three technological talk datasets (TED, Microsoft Research, and TensorFlow) and trained a single unified model. The complete training data consist of 326 videos and 76,130 sentences, gathered from the TED channel (172 videos, 20,655 sentences), the Microsoft Research channel (100 videos, 41,094 sentences), and the TensorFlow channel (54 videos, 14,381 sentences). As for the training parameters, we set the validation ratio to 0.2, the batch size to 32, the number of epochs to 100, and the maximum length of the input text sequence to 128. For the Bi-LSTM, we set the number of hidden units to 128, so that the concatenated hidden size was 256, and used the tanh function to determine the feature output of the network. To build the Bi-LSTM network, all training parameters were set to the same values for both M2 (BERT BiLSTM) and M3 (Tech-Talk-Sum).
To train the proposed model, it is necessary to have ground-truth scores. However, our training datasets contain abstractive gold summaries, which are not readily suitable for extractive summarization models. Therefore, we applied the ROUGE metric to calculate a score between each sentence in the caption and its corresponding reference description, as in (16):

s = \alpha \cdot R_1 + (1 - \alpha) \cdot R_2, \quad (16)

where R_1 is ROUGE-1, R_2 is ROUGE-2, and \alpha is a coefficient, which is set to 0.5 to keep the balance between the two scores.
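A sketch of computing this target score with the rouge-score package is shown below; the library choice is an assumption, and the F-measure is used for R_1 and R_2 since the paper does not specify which ROUGE statistic enters (16).

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def sentence_target_score(sentence, reference_description, alpha=0.5):
    """Ground-truth salience per Eq. (16): a blend of ROUGE-1 and ROUGE-2 scores."""
    result = scorer.score(reference_description, sentence)
    r1, r2 = result["rouge1"].fmeasure, result["rouge2"].fmeasure
    return alpha * r1 + (1 - alpha) * r2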
We utilized the Mean Absolute Error (MAE) and set the objective function as minimizing the cross-entropy (CE) between the sentence's salience score s and the predicted score ŝ:

CE = -s \log(\hat{s}) - (1 - s) \log(1 - \hat{s}). \quad (17)

Besides, we utilized the Adadelta optimizer and set the initial learning rate to 10^{-3}. Moreover, we employed a TensorFlow callback to reduce the learning rate when a metric had stopped improving.
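One plausible reading is that the cross-entropy of (17) is the training loss while MAE is monitored as a metric; under that assumption and a Keras implementation, the setup described above could be wired up as in the sketch below. The stand-in model and the random data are placeholders, and ReduceLROnPlateau is one concrete choice of callback that reduces the learning rate when a monitored metric stops improving.

import numpy as np
import tensorflow as tf

# Stand-in model with the same interface (N sentence vectors in, N scores out);
# the actual Tech-Talk-Sum layers from Section 3 would be plugged in here.
N, DIM = 50, 768
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N, DIM)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dense(1, activation="relu"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adadelta(learning_rate=1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(),          # cross-entropy of Eq. (17)
    metrics=[tf.keras.metrics.MeanAbsoluteError()])     # MAE monitored during training

# reduce the learning rate when the monitored metric stops improving
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6)

# random stand-in data; the real inputs are BERT sentence vectors with ROUGE-based targets
X = np.random.rand(8, N, DIM).astype("float32")
y = np.random.rand(8, N, 1).astype("float32")
model.fit(X, y, validation_split=0.2, batch_size=32, epochs=100, callbacks=[reduce_lr])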
To assess the model's performance, we employ Tech-Talk-Sum to produce summaries of both long- and short-length videos. The statistical information of the evaluation dataset, including the number of videos, the number of sentences, the duration, and the average number of sentences per video, is listed in Table 1.
Table 1 The statistical information of the technological videos for model evaluation
We compare the summaries produced by our model with those of four well-known baseline approaches. Besides, we explore the performance of BERT combined with several summarization layers. First, we constructed M1, in which the sentence vectors from BERT are input to a CNN network with dynamic filter sizes and max-pooling layers; linear layers with a sigmoid function are added to predict the sentence score. Second, M2 was built by applying the Bi-LSTM network to the BERT output and feeding the hidden features forward to a linear layer to predict the score. M3 represents our proposed model. These models were trained on the same dataset. The average ROUGE scores were estimated between the generated summaries and the corresponding reference summaries using unigram and bigram overlap (ROUGE-1 and ROUGE-2) and the longest common subsequence (ROUGE-L).
To evaluate the effectiveness of the model, we deployed the proposed model on both long- and short-length talk videos to generate video summaries. The comparison results for the long-length talk videos, covering the TED, Microsoft Research, and TensorFlow channels, are shown in Table 2. We found that all of the BERT-based models outperformed the baseline models on the average R-1, R-2, and R-L. The proposed model produced summaries with R-1 = 26.20, R-2 = 4.54, and R-L = 22.46 on the TED channel; 27.57, 3.89, and 22.33 on the Microsoft Research channel; and 30.33, 8.16, and 25.44 on the TensorFlow channel, respectively.
Table 2 Performance comparison of Tech-Talk-Sum on the long-length talk video datasets using the average F score based on unigram (R-1), bigram (R-2), and longest common subsequence (R-L)

                        TED                     Microsoft Research      TensorFlow
Method                  R-1    R-2    R-L       R-1    R-2    R-L       R-1    R-2    R-L
LEAD-3                  10.09  1.93   9.71      6.94   1.94   6.52      20.25  6.95   19.09
TF-IDF score            16.70  2.82   15.34     10.93  1.68   9.87      18.83  4.75   17.18
TextRank [2]            14.59  1.97   13.26     11.01  0.89   9.94      17.62  4.16   16.11
DistilBERT [19]         13.75  2.42   13.01     11.03  1.47   10.65     19.09  6.23   18.26
M1 (BERT CNN)           25.76  3.39   21.75     13.87  2.35   13.01     29.12  5.57   23.02
M2 (BERT BiLSTM)        25.00  3.38   20.38     26.72  2.86   21.64     28.71  4.55   24.03
M3 (Tech-Talk-Sum)      26.20  4.54   22.46     27.57  3.89   22.33     30.33  8.16   25.44
Table 3 Performance comparison of Tech-Talk-Sum on the short-length talk video datasets using the average F score based on unigram (R-1), bigram (R-2), and longest common subsequence (R-L)

                        DeepLearningTV          Tech Insider            TwoMinute Papers
Method                  R-1    R-2    R-L       R-1    R-2    R-L       R-1    R-2    R-L
LEAD-3                  27.87  9.71   23.95     27.57  4.41   25.83     30.20  11.13  29.07
TF-IDF score            26.38  8.30   24.09     21.66  5.23   19.39     29.03  12.98  28.31
TextRank [2]            21.62  5.47   19.42     19.88  5.91   18.71     22.92  9.67   21.54
DistilBERT [19]         23.45  7.38   21.13     20.81  5.79   19.40     21.66  9.86   19.98
M1 (BERT CNN)           13.87  2.35   13.01     27.26  5.59   24.30     32.35  11.44  30.00
M2 (BERT BiLSTM)        24.69  4.37   21.47     26.57  5.33   22.46     33.51  12.11  29.38
M3 (Tech-Talk-Sum)      27.05  8.76   24.94     28.50  6.02   24.64     34.41  13.53  30.81
Table 3 demonstrates the comparison results for the short-talk videos. For DeepLearningTV, our model obtained R-1 = 27.05, R-2 = 8.76, and R-L = 24.94; it performed better than the baselines on R-L but got slightly lower R-1 and R-2 scores than LEAD-3. The results on the TechInsider dataset show that our model achieves R-1 = 28.50 and R-2 = 6.02, outperforming the baseline models; only its R-L score (24.64) is slightly lower than that of LEAD-3. On the TwoMinutePaper dataset, our model reached R-1 = 34.41, R-2 = 13.53, and R-L = 30.81, surpassing all the baselines. Usually, in short-length talk videos, the speaker discusses the core content of the video at the beginning, which might be the reason why LEAD-3 can provide an understandable summary.
Besides, we experimented on a well-known text document dataset, CNN/Daily Mail, to evaluate the performance of our model. We compared it with state-of-the-art models including SummaRuNNer [17], REFRESH [18], and SWAP-NET [1]. Based on the experimental results, we found that our model can produce a satisfying summary even though we use a small training dataset; using a small dataset also requires less training time than using a large one. The comparison results are listed in Table 4.
Furthermore, we also investigated whether integrating BERT and the attention mechanism can distill meaningful sentences to form representative summaries for technology talk videos. Figure 3a and b illustrate the comparison of the ROUGE precision scores of Tech-Talk-Sum and the baseline approaches. The precision scores tell us how much of the generated summary was relevant or needed. Our extractive summarization system is able to produce satisfying summaries on both the long- and short-talk videos.
Table 4 The comparison of Tech-Talk-Sum on the CNN/Daily Mail dataset using the average F1 measure based on unigram (R-1), bigram (R-2), and longest common subsequence (R-L)
Fig. 3 The comparison graph of the ROUGE precision scores between Tech-Talk-Sum and the baseline approaches
Table 5 The performance comparison of three different choices of using BERT as the sentence representation on the long-talk video datasets

                        TED                     Microsoft Research      TensorFlow
                        R-1    R-2    R-L       R-1    R-2    R-L       R-1    R-2    R-L
BERT All                25.33  3.56   21.23     24.06  2.54   19.27     29.90  5.64   23.18
BERT 4Sum               25.71  4.22   21.06     26.69  3.04   21.67     29.33  5.38   24.54
BERT Top                26.20  4.54   22.46     27.57  3.89   22.33     30.33  8.16   25.44
First, we use the summation of all encoder layers (pooling layers -1 to -12) of BERT to represent the caption sentences, which we call BERT All. Second, the summation of the last four layers (pooling layers -1 to -4) is used to produce the caption sentence representation; we name it BERT 4Sum. The last one is BERT Top, which utilizes the top layer of BERT (pooling layer -1) to represent the caption sentences. Tables 5 and 6 present the ROUGE score comparison of the three choices of using BERT's output as a sentence representation for our summarization task: Table 5 shows the comparison on the long-talk videos, while Table 6 presents the comparison on the short-talk videos. We found that the best-performing choice for our datasets is to use the last hidden layer of BERT to represent the sentences.
Moreover, we also assessed the effect of the document encoder layer. Ablation studies were conducted with two different configurations of Tech-Talk-Sum: with the document encoder layer, denoted by "+DocEn", and without it, denoted by "-DocEn". The results are shown in Table 7. The experimental results demonstrate that leveraging the attention mechanism is able to enhance the sentence extraction. Using the document encoder layer improves the performance of the base model, where R-1 increases by 1.22%, R-2 by 1.93%, and R-L by 1.39% for the long-length talk videos; for the short-length talk videos, R-1 increases by 1.73%, R-2 by 2.16%, and R-L by 2.36%, on average.
Table 6 The performance comparison of three different choices of using BERT as the sentence representation on the short-talk video datasets

                        DeepLearningTV          Tech Insider            TwoMinute Papers
                        R-1    R-2    R-L       R-1    R-2    R-L       R-1    R-2    R-L
BERT All                25.31  4.99   22.20     25.62  5.18   22.82     33.96  11.97  28.94
BERT 4Sum               25.64  4.81   22.33     26.27  4.72   22.08     33.44  11.56  29.23
BERT Top                27.05  8.76   24.94     28.50  6.02   24.64     34.41  13.53  30.81
Table 7 Results of ablation studies of Tech-Talk-Sum with the document encoder layer (+DocEn) and without the document encoder layer (-DocEn), based on the average F score of unigram (R-1), bigram (R-2), and longest common subsequence (R-L)

                        R-1                       R-2                       R-L
Channel                 +DocEn  -DocEn  Diff      +DocEn  -DocEn  Diff      +DocEn  -DocEn  Diff
TED                     26.20   25.00   +1.20     4.54    3.38    +1.16     22.46   20.38   +2.08
MicrosoftResearch       27.57   26.72   +0.85     3.89    2.86    +1.03     22.33   21.64   +0.69
TensorFlow              30.33   28.71   +1.62     8.16    4.55    +3.61     25.44   24.03   +1.41
DeepLearningTV          27.05   24.69   +2.36     8.76    4.37    +4.39     24.94   21.47   +3.47
TechInsider             28.50   26.57   +1.93     6.02    5.33    +0.69     24.64   22.46   +2.18
TwoMinutePaper          34.41   33.51   +0.90     13.53   12.11   +1.42     30.81   29.38   +1.43
Table 8 presents an example of the generated summaries from the three different methods. The first row presents the reference summary of the video titled "Build a deep neural network in 4 mins with TensorFlow in Colab" from the TensorFlow YouTube channel. The second row shows the generated summary from DistilBERT [19], and the third and fourth rows present the summaries produced by BERT-BiLSTM and Tech-Talk-Sum, respectively. We observe that the summary produced by the proposed model is more concise, not only maintaining semantic relevance but also taking into account the core content representation. Moreover, the summary produced by Tech-Talk-Sum obtains the highest salience score when compared with the reference summary.
Table 8 Examples of generated summaries on the TensorFlow channel, produced by DistilBERT [19], M2 (BERT Bi-LSTM), and Tech-Talk-Sum
Automatic text summarization has undoubtedly grown in popularity in recent years. Most prior studies focus on text documents, with only a few works having studied spoken-document summarization. In this paper, we introduced a novel approach to extractive summarization for technological talk videos called Tech-Talk-Sum, which focuses on sentence scoring, unlike previous works that treat the task as sentence classification. We constructed technological talk datasets including long- and short-talk videos. For the long-length technological videos, we retrieved videos from the TED, Microsoft Research, and TensorFlow channels; the average duration of videos on these three channels is 40-60 minutes, and their content mainly focuses on applying technology to various domains, including engineering, business, and agriculture, involving machine learning and artificial intelligence. For the short-length technological videos, we considered three YouTube channels: DeepLearningTV, TwoMinute Papers, and Tech Insider. Their content presents new technical knowledge and technologies, and the average video length on these three channels is 5-10 minutes. The full version of the dataset is available at https://github.com/chalothon/Tech-Talk Sum dataset. To build the video content representation, we leverage BERT (Bidirectional Encoder Representations from Transformers), which is pre-trained on large-scale text and can capture semantic features. Next, we applied a Bi-LSTM at the sentence level to learn the local context. Besides, the self-attention mechanism was adopted on the Bi-LSTM output to produce the document encoder layer that captures the global context of the sentences. The output from the Bi-LSTM network and the output from the document layer were fed into a unidirectional LSTM to compute the salience score of each sentence. Finally, the summaries are formed based on these scores.
In the training procedure, we trained a single unified model that combined three technology talk channels: the TED, Microsoft Research, and TensorFlow channels. We built various experiments to study the effectiveness of the proposed model, and we analyzed the impact of the attention mechanism on six datasets. Our model has been validated for summary generation on both short- and long-length videos. To report the model performance, we fed the testing data to our model and four baselines to produce summaries, and then measured the overlapping content between the generated and reference summaries using the standard ROUGE metrics. We analyzed the comparison scores based on the average F score using unigram and bigram overlap (R-1 and R-2) and the longest common subsequence (R-L). The experimental results demonstrate that Tech-Talk-Sum outperforms the baseline models on both long- and short-talk videos. In our future work, we will focus more on the encoder-decoder approach to generate abstractive summaries that are closer to those created by humans.
Acknowledgements The authors would like to thank the Ministry of Science and Technology of Taiwan for
supporting our research under grant No. MOST105-2923-S-008-001-MY3.
References
1. Aishwarya J, Vaibhav R (2018) Extractive summarization with SWAP-NET: Sentences and words from alternating pointer networks. In: Proc of the 56th annual meeting of the Association for Computational Linguistics, Melbourne, Australia, pp 142–151. https://doi.org/10.18653/v1/P18-1014
2. Albi D, Silvester H (2017) Pagerank Algorithm. IOSR Journal of Computer Engineering 19(1):01–07.
https://doi.org/10.9790/0661-1901030107
27. Yuxiang W, Baotian H (2018) Learning to extract coherent summary via deep reinforcement learning. In: Association for the Advancement of Artificial Intelligence (AAAI 2018), Louisiana, USA, pp 5602–5609
29. Zhenzhong L, Mingda C, Sebastian G, Kevin G, Piyush S, Radu S (2020) ALBERT: A Lite BERT for self-supervised learning of language representations. In: Proc. ICLR 2020, Addis Ababa, Ethiopia. arXiv:1909.11942
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.