
2020 IEEE 20th International Conference on Communication Technology

A Smooth Video Summarization Method Based on Frame-Filling

Xiaoyu Teng(1,2), Xiaolin Gui(1,2,*), Huijun Dai(1,2), Tianjiao Du(1,2), Zhenxing Wang(1,2), Hui Li(3)
(1) School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, P.R. China
(2) Shaanxi Province Key Laboratory of Computer Network, Xi'an 710049, P.R. China
(3) Department of Neonatology, The First Affiliated Hospital of Medical College, Xi'an Jiaotong University, Xi'an 710049, P.R. China
e-mail: xlgui@mail.xjtu.edu.cn

Abstract—Given that existing video summarization algorithms leave too much redundancy in video semantics and destroy the continuity between video frames, this paper designs a smooth video summarization generation method. The method is based on video keyframe selection and is mainly divided into three parts. The first part completes the de-redundancy operation based on similarity calculated from the essential characteristics of the video frames. The second part completes shot segmentation using a capsule network. The third part is divided into two sub-parts, keyframe selection and filling: the former selects keyframes based on video semantics, and the latter smooths the summarization by frame filling. Experimental results show that the video summarization algorithm designed in this paper outperforms several existing algorithms and achieves excellent results in terms of video content integrity, semantic accuracy, and so on.

Keywords—video summarization; de-redundancy; video semantics; key-frame filling

I. INTRODUCTION

Video data is an essential medium of information transmission. Especially with the rapid development of storage devices, recording technology, and video editing tools [1], video data is widely used in movies, surveillance cameras, news, and so on [2]. However, the volume of video data has increased exponentially [3]. As a result, retrieval, storage, and acquisition of adequate information from video data become very difficult. Video summarization can analyze input videos [4], find the informative video portions [5], and then generate summarization data that contains highly practical information. In essence, video summarization aims to extract key-frames/shots from a long video [6]. In terms of form, video summarization can be divided into dynamic and static summarization. In static summarization, such as [7], the method selects frame sets obtained by clustering the video frames with high information content in the original video [8].

At present, there are many research achievements on video summarization. The corresponding algorithms are mainly divided into three types: based on key-frames [9], key shots [10], and time intervals [11]. Among them, key-frame-based algorithms can be roughly divided into the following three categories by processing method: 1) those based on characteristics of the video itself, including shot boundaries [12] and events and motion [13]; 2) those based on traditional machine learning, including clustering [14] and "machine learning +" [15]; this type combines machine learning with basic video frame features such as information entropy and histograms, but while improving the accuracy of the summarization algorithm, it neglects the influence of the video theme on the summarization; 3) those based on deep learning, such as [16], which uses deep learning to extract keyframes of the video and generate summaries; this type automatically extracts video frame features while ensuring accuracy and considering video semantics, scenes, and other information, but it increases the time complexity of the algorithm, puts the implementation process in a "black box", and ignores the user's demands on the summarization.

In summary, existing generation algorithms can meet the general requirements of video summarization, but the following shortcomings remain: 1) high-performance video summarization algorithms always sacrifice space-time complexity, that is, they lack a balancing mechanism set according to users' needs; 2) during summarization, some algorithms introduce semantic redundancy in order to ensure video content integrity; 3) discontinuity between keyframes reduces the readability of the video summary. Therefore, based on the analysis and review of existing video summarization generation algorithms, this paper designs a new user-oriented static video summarization generation scheme. It contains an easy-to-implement and straightforward algorithm for removing redundant frames in the pre-processing of video frames; eliminating redundant frames before the main processing improves the quality of the video summarization [3] and reduces the complexity of summarization generation at the root. At the same time, to ensure semantic integrity, a video frame-filling strategy based on a smoothing function is designed, from which the video summary is further generated.

This work was supported partially by the National Key Research Project under Grant 2018YFB1800304 and the Key Research and Development Project of Shaanxi Province under Grants 2019GY-005 and 2017ZDXM-GY-011.


II. THE PROPOSED METHOD

A. Overall Scheme Model

The model of the static video summarization method proposed in this paper is shown in Fig. 1. The model is mainly divided into three parts. The first is the video frame pre-processing part, which includes several operations, the key one being de-redundancy. The second part is automatic shot segmentation based on CapsNet, which is used to extract spatiotemporal features. The third part is keyframe selection and filling, after which the static video summarization is generated.

Figure 1. Overall Scheme Model Diagram

B. Video Frame De-Redundancy Algorithm

This part removes redundant frames from the video frame sequence. As shown in Fig. 1, the de-redundancy algorithm comes after the video is split into frames. The mutual-information judgment between frames and the similarity judgment based on non-negative matrix factorization (NMF) are then performed simultaneously on the resulting null non-information set N_k, and the non-redundant frame set N_r is obtained by filtering out the redundant frames. This algorithm consists of two parts: the decision non-information frames algorithm and the de-redundancy algorithm.

1) Decision Non-Information Frames Algorithm

Reference [17] pointed out that consecutive frames in a video sequence are highly similar, so a de-redundancy algorithm needs to remove these high-similarity frames. However, in actual video processing, besides highly similar frames being redundant, the native video also contains a large number of non-information frames, i.e., video frames with little information content, as shown in Fig. 2. This paper divides the decision process into two parts: first judging each video frame, and then combining the information environment in which the frame is located to determine whether it is redundant.

Figure 2. Non-Information Frame Schematic

The first part judges the video frames to generate a non-information frame candidate set; the implementation process is shown in Fig. 3. In this process, in order to increase the accuracy and fineness of information detection, this paper divides each video frame f_i into 32×32 blocks f_{i,j}. The feature vectors H_{f_{i,j}} are obtained by extracting the histogram of oriented gradients (HOG) of these blocks. Let \theta_f be the perceptual hash threshold; the perceptual hash fingerprint of each block is calculated as

Ha_{f_{i,j}} = \begin{cases} 1, & H_{f_{i,j}} \ge \theta_f \\ 0, & H_{f_{i,j}} < \theta_f \end{cases}    (1)

By counting the hash fingerprints Ha_{f_{i,j}} of all blocks f_{i,j} in the video frame f_i, we can determine whether f_i is a non-information frame according to

\theta_1 \le \sum_{f_{i,j}} Ha_{f_{i,j}} \le \theta_2    (2)

where \theta_1 and \theta_2 are the thresholds for deciding non-information frames.

Figure 3. Generate a Non-Information Frame Candidate Set Schematic
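To make the decision rule concrete, the following is a minimal Python sketch of Eqs. (1)-(2). It fills in details the paper leaves open, so treat them as assumptions: blocks are taken to be 32×32 pixels, the scalar compared against theta_f is the L2 norm of each block's HOG vector, and frames are grayscale NumPy arrays; scikit-image supplies the HOG features.

import numpy as np
from skimage.feature import hog

def block_fingerprints(frame, block=32, theta_f=1.0):
    """Eq. (1): one-bit perceptual-hash fingerprint per block.

    Assumption: the vector H_{f_{i,j}} is reduced to a scalar via its
    L2 norm before comparison with theta_f."""
    h, w = frame.shape
    bits = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = frame[y:y + block, x:x + block]
            hog_vec = hog(patch, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2))
            bits.append(1 if np.linalg.norm(hog_vec) >= theta_f else 0)
    return np.array(bits)

def is_non_information(frame, theta_1, theta_2):
    """Eq. (2): f_i is a non-information candidate when the sum of its
    block fingerprints falls inside [theta_1, theta_2]."""
    return theta_1 <= block_fingerprints(frame).sum() <= theta_2

In this reading, theta_1 and theta_2 bound how many blocks may carry appreciable gradient energy before the frame stops counting as a non-information candidate.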

2) De-Redundancy Algorithm

As shown in Fig. 4, the de-redundancy algorithm is mainly divided into three parts: frame similarity determination based on NMF feature decomposition, frame similarity determination based on mutual information, and redundant frame determination.

Figure 4. Steps of the De-Redundancy Algorithm
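The paper does not spell out the two similarity tests, so the following sketch shows only one plausible reading: mutual information estimated from a joint grayscale-intensity histogram, and cosine similarity of consecutive frames in an NMF coefficient space. The histogram binning and the number of NMF components are assumptions, not taken from the paper.

import numpy as np
from sklearn.decomposition import NMF

def mutual_information(frame_a, frame_b, bins=64):
    """MI of two frames, estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)            # marginal of frame_a
    py = pxy.sum(axis=0, keepdims=True)            # marginal of frame_b
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def nmf_similarity(frames, k=16):
    """Cosine similarity of consecutive frames in NMF coefficient space.

    frames: non-negative array of shape (n_frames, h*w)."""
    W = NMF(n_components=k, init='nndsvda', max_iter=400).fit_transform(frames)
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    return (W[:-1] * W[1:]).sum(axis=1)            # similarity of frames t, t+1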


C. Automatic Shot Segmentation Based on CapsNet

Compared with the convolutional neural network (CNN), CapsNet [19] has a more robust characteristic expression ability and remedies the shortcomings of the CNN in recognizing spatial relations and rotations of objects. Its implementation process in our scheme is shown in Fig. 5.

Figure 5. CapsNet Implementation Process
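As a pointer to the feature encoding this stage relies on, here is a small sketch of the "squash" nonlinearity from the dynamic-routing paper [19], which maps each capsule's raw vector to a length in (0, 1) interpreted as an existence probability; the full routing network used for shot segmentation is beyond the scope of this sketch.

import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|): short vectors shrink
    toward zero, long vectors saturate just below unit length."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)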
D. Generation of Smooth Summarization Based on Frame Filling

1) Keyframe Selection

This paper draws on the keyframe selection method proposed in [21] and combines it with K-Means to design a keyframe selection algorithm for accurate summarization. The algorithm process is shown in Fig. 6. The selection algorithm has two parts: (i) The K-Means algorithm coarsely clusters the output data of the capsule network. The purpose of this step is twofold: first, to fix the upper and lower limits for the second step of keyframe selection and to reduce selection time through coarse screening; second, to provide candidate frames for smooth filling. (ii) Keyframe selection then applies the keyframe selection algorithm proposed in [21].

Figure 6. Key-frames Selection
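A hedged sketch of step (i) follows: K-Means coarsely clusters the capsule-network feature vectors, the frame nearest each centroid becomes a keyframe candidate for step (ii), and the remaining cluster members form the candidate pool for smooth filling. The number of clusters and the use of centroid distance as the representativeness criterion are assumptions, not taken from the paper.

import numpy as np
from sklearn.cluster import KMeans

def coarse_candidates(features, n_clusters=10):
    """features: (n_frames, d) capsule feature vectors.

    Returns keyframe-candidate indices plus per-frame cluster labels,
    which serve as the pools of fill candidates."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    centers = km.cluster_centers_[km.labels_]          # centroid assigned to each frame
    dist = np.linalg.norm(features - centers, axis=1)  # distance to own centroid
    reps = [int(np.where(km.labels_ == c)[0][np.argmin(dist[km.labels_ == c])])
            for c in range(n_clusters)]                # frame nearest each centroid
    return sorted(reps), km.labels_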
2) Keyframe Filling

Human perception of vision and hearing has an inevitable delay; for example, human hearing is not aware of low-frequency murmurs that follow higher-pitched music, and the same is true of vision. Therefore, in the process of video summarization generation, the internal frames of a shot need maximal continuity, and shot transitions need to be smooth while properly highlighting the elements in the video frame that vision naturally tends to ignore. Given the requirement of video content fidelity, this paper designs an algorithm based on human perception and memory. There are already many ways to implement smooth filling of a summarization; for example, [19] proposed a memorability-entropy-based video summarization method built mainly on the memorability of images. However, because a filled frame is usually not a native frame of the original video, the processing needs to preserve the integrity of the video and its semantic information while avoiding extra redundancy in the video summary. This paper uses the structural similarity index (SSIM) [20] to measure the similarity between the frame to be inserted and the original frame, and thereby decides whether to add it.

Based on the above, the smooth summarization algorithm based on frame filling, Smoo-Summ, is shown below:

Algorithm 1 Smoo-Summ
Input:  keyframe set V_key = {f^1_{key_1}, ..., f^1_{key_k}, ..., f^n_{key_1}, ..., f^n_{key_k}};
        video frames after CapsNet treatment V_c = {f_1, f_2, ..., f_n};
        \theta_v: fill threshold; \theta'_v: shot transition threshold; \lambda_v: SSIM fill threshold
Output: V_A = {f'_{A1}, f'_{A2}, ..., f'_{Am}}: video summarization set
1:  for 0 <= i <= k                        // k is the total number of keyframes
2:      L = similarity(f'_{key_i}, f'_{key_{i+1}}) = (f'_{key_i} . f'_{key_{i+1}}) / (||f'_{key_i}|| ||f'_{key_{i+1}}||)
3:      if L <= \theta_v
4:          do 0 <= j <= n
5:              L' = similarity(f'_{key_i}, f_j) = (f'_{key_i} . f_j) / (||f'_{key_i}|| ||f_j||)
6:          while L' >= \theta_v
7:          // find frames with high similarity and insert them
8:          V_A = V_key U {f_j}
        end end
9:  for 0 <= r <= n
10:     L'' = similarity(f^r_{key_k}, f^{r+1}_{key_1}) = (f^r_{key_k} . f^{r+1}_{key_1}) / (||f^r_{key_k}|| ||f^{r+1}_{key_1}||)
11:     if L'' <= \theta'_v
12:         f'^r_{key_k} = bilateral_filter(f^r_{key_k})
13:         L''' = SSIM(f'^r_{key_k}, f^r_{key_k})
14:         if L''' >= \lambda_v
15:             V_A = V_key U {f'^r_{key_k}}
16: end
17: end
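The following is a condensed Python sketch of Algorithm 1 under stated assumptions: frames are grayscale NumPy arrays, similarity is the cosine of flattened frames, the inequality directions are read as "fill when adjacent keyframes are too dissimilar", and the per-shot bookkeeping is collapsed into a flat keyframe list. OpenCV's bilateral filter and scikit-image's SSIM stand in for the corresponding steps; all thresholds are per-video tuning parameters.

import numpy as np
import cv2
from skimage.metrics import structural_similarity as ssim

def cos_sim(a, b):
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def smoo_summ(keyframes, all_frames, theta_v, theta_v2, lambda_v):
    summary = list(keyframes)
    # Lines 1-8: when two adjacent keyframes are not similar enough,
    # insert intermediate frames from V_c that bridge the gap.
    for f_a, f_b in zip(keyframes[:-1], keyframes[1:]):
        if cos_sim(f_a, f_b) < theta_v:
            summary.extend(f for f in all_frames if cos_sim(f_a, f) >= theta_v)
    # Lines 9-17: smooth abrupt shot transitions with a bilateral filter,
    # keeping the filtered frame only if SSIM says it still resembles
    # the original (uint8 grayscale frames assumed).
    for i, f in enumerate(keyframes[:-1]):
        if cos_sim(f, keyframes[i + 1]) < theta_v2:
            smoothed = cv2.bilateralFilter(f, d=9, sigmaColor=75, sigmaSpace=75)
            if ssim(smoothed, f) >= lambda_v:
                summary.append(smoothed)
    return summary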
III. EXPERIMENT AND RESULT ANALYSIS

A. Experimental Datasets

We evaluate our scheme on four video datasets: VSUMM [22], the "Xi'an Jiaotong University Faculty of Electronic and Information Engineering Advertising Video" [23], OVP [24], and "Gourmet China - Taste Xi'an" [25].


B. Evaluation Metrics

In order to measure the performance of the video summarization generated by this scheme, we first measure it from the perspective of common metrics, and then from the perspective of different types of summarization requirements. For the common measurement, this paper uses the F-score, calculated as follows:

F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (3)

Precision = \frac{N_{match}}{N_{msum}}    (4)

Recall = \frac{N_{match}}{N_{usum}}    (5)

where N_{match} is the total number of similar frames in the summarization, N_{msum} the total number of keyframes in the generated summarization, and N_{usum} the number of keyframes in the user summarization. That is, Precision is the ratio of the number of similar frames to the total number of frames in the generated summarization, and Recall measures the ratio of the number of similar frames to the total number of frames in the user summarization [7].

In addition, in order to measure the de-redundancy effect of the proposed scheme, we adopt the three measurement elements of [3]: the reduction rate (RR), the error factor (CUS_E), and the miss rate (MR). RR represents the percentage of redundant frames deleted relative to the user metric. Let N_{input} be the total number of input frames and N_{output} the total number of output summarization frames; then

RR = \frac{N_{input} - N_{output}}{N_{input}} \times 100\%    (6)

Reference [3] computes CUS_E as

CUS_E = \frac{N_{output} - N'_{match}}{N'_{match}}    (7)

where N'_{match} is the number of output frames that correspond to the standard. MR measures missing frames. Let N_{US} be the total number of frames in the user summarization and N_{miss} the number of frames that appear in the user summarization but not in our output; then

MR = \frac{N_{miss}}{N_{US}} \times 100\%    (8)
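For reference, the metric definitions of Eqs. (3)-(8) reduce to the following count-based helpers; how matched frames are determined (the N_match counts) is assumed to be computed upstream and is not specified here.

def f_score(n_match, n_msum, n_usum):
    precision = n_match / n_msum                            # Eq. (4)
    recall = n_match / n_usum                               # Eq. (5)
    return 2 * precision * recall / (precision + recall)    # Eq. (3)

def reduction_rate(n_input, n_output):
    return (n_input - n_output) / n_input * 100.0           # Eq. (6)

def cus_e(n_output, n_match_std):
    return (n_output - n_match_std) / n_match_std           # Eq. (7)

def miss_rate(n_miss, n_us):
    return n_miss / n_us * 100.0                            # Eq. (8)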
C. De-Redundancy Verification and Analysis

Video frame de-redundancy is the first step of static video summarization generation, and its performance directly affects the accuracy of the generated summarization and the efficiency of the algorithm. To measure the effectiveness of the de-redundancy algorithm proposed in this scheme, this section evaluates it on OVP [24] and compares it with reference [3]; the outcome of the comparison is shown in Table I.

TABLE I. PERFORMANCE COMPARISON OF THE DE-REDUNDANCY ALGORITHM

Dataset    Algorithm        RR     CUS_E   MR
OVP [24]   Reference [3]    97.8   8.5     0
OVP [24]   Proposed         97.9   8.1     0

It can be seen from Table I that the performance of the proposed algorithm is similar to that of reference [3] on OVP: MR is always equal to 0, and RR reaches over 97%. However, the algorithm in [3] is based on SIFT, which has a relatively high space-time overhead and somewhat limited generality. The algorithm proposed in this paper instead adopts the basic data processing methods of digital signals, calculating similarity from the essential characteristics of the video frames and adding a blank-frame removal step before de-redundancy, which improves the efficiency of the de-redundancy stage to a certain extent. To further test the generality and stability of the proposed de-redundancy algorithm, we apply it to dataset [25]; the RR change curve is shown in Fig. 7.

Figure 7. RR Change Curve on Dataset [25]

In Fig. 7, the RR values of video frame de-redundancy vary with the video content and its type. The highest RR of the de-redundancy algorithm reaches 98.2, and the average is 97.38 on dataset [25]. That is, our algorithm is stable across different data segments.

D. Verification and Analysis of the Summarization Generation Algorithm

To verify and analyze the performance of the algorithm, we compare it with references [17], [18], and [22].

TABLE II. SUMMARY EFFECT COMPARISON

Dataset      Algorithm        F-score
VSUMM [22]   Reference [17]   0.62
             Reference [18]   0.78
             VSUMM1 [22]      0.54
             VSUMM2 [22]      0.54
             Proposed         0.72
XJTU [23]    Reference [17]   0.56
             Reference [18]   0.64
             VSUMM1 [22]      0.48
             VSUMM2 [22]      0.49
             Proposed         0.66


As can be seen from Table II, the effect of the video summarization generation algorithm differs for videos with different themes. The algorithm proposed in [17] is the most stable across topics; the proposed algorithm is comparable to it in stability, and its performance is slightly better. On the VSUMM dataset, the proposed algorithm outperforms VSUMM1, VSUMM2, and [17], but is slightly below [18]. On dataset [23], the performance of the proposed algorithm is close to that of [18]. Therefore, the accurate summarization algorithm achieves relatively good performance.

IV. CONCLUSION

This paper revolves around static video summarization and, on that basis, designs a smooth video summarization generation method. The goal of the method is to ensure video semantic integrity. It consists of a de-redundancy operation, shot segmentation, keyframe selection based on video semantics, and frame filling. However, the threshold selection is not yet fine-grained enough, and the summarization algorithm still has much room for improvement, which is the next step of our work.

ACKNOWLEDGMENT

The authors would like to thank the editor and the anonymous reviewers for their constructive comments and suggestions, which improved the quality of this paper.

REFERENCES

[1] Sasithradevi A, Roomi S M. Video classification and retrieval through spatio-temporal Radon features[J]. Pattern Recognition, 2020.
[2] Hannane R, Elboushaki A, Afdel K, et al. An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram[J]. International Journal of Multimedia Information Retrieval, 2016, 5(2): 89-104.
[3] Mohan J, Nair M S. Domain independent redundancy elimination based on flow vectors for static video summarization[J]. Heliyon, 2019, 5(10).
[4] Javed A, Irtaza A, Khaliq Y, et al. Replay and key-events detection for sports video summarization using confined elliptical local ternary patterns and extreme learning machine[J]. Applied Intelligence, 2019, 49(8): 2899-2917.
[5] Zhang S, Zhu Y, Roy-Chowdhury A K. Context-aware surveillance video summarization[J]. IEEE Transactions on Image Processing, 2016, 25(11): 5469-5478.
[6] Jiang Y, Cui K, Peng B, et al. Comprehensive video understanding: Video summarization with content-based video recommender design[C]//Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
[7] Wu J, Zhong S, Jiang J, et al. A novel clustering method for static video summarization[J]. Multimedia Tools and Applications, 2017, 76(7): 9625-9641.
[8] Fei M, Jiang W, Mao W. A novel compact yet rich key frame creation method for compressed video summarization[J]. Multimedia Tools and Applications, 2018, 77(10): 11957-11977.
[9] Mahasseni B, Lam M, Todorovic S. Unsupervised video summarization with adversarial LSTM networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 202-211.
[10] Zhang K, Chao W L, Sha F, et al. Video summarization with long short-term memory[C]//European Conference on Computer Vision. Springer, Cham, 2016: 766-782.
[11] Yuan Y, Li H, Wang Q. Spatiotemporal modeling for video summarization using convolutional recurrent neural network[J]. IEEE Access, 2019, 7: 64676-64685.
[12] Li J, Yao T, Ling Q, et al. Detecting shot boundary with sparse coding for video summarization[J]. Neurocomputing, 2017, 266: 66-78.
[13] Chen M, Han X, Zhang H, et al. Quality-guided key frames selection from video stream based on object detection[J]. Journal of Visual Communication and Image Representation, 2019, 65: 102678.
[14] Gharbi H, Bahroun S, Zagrouba E. Key frame extraction for video summarization using local description and repeatability graph clustering[J]. Signal, Image and Video Processing, 2019, 13(3): 507-515.
[15] Aote S S, Potnurwar A. An automatic video annotation framework based on two level keyframe extraction mechanism[J]. Multimedia Tools and Applications, 2019, 78(11): 14465-14484.
[16] Huang S, Li X, Zhang Z, et al. User-ranking video summarization with multi-stage spatio-temporal representation[J]. IEEE Transactions on Image Processing, 2018, 28(6): 2654-2664.
[17] Bendraou Y, Essannouni F, Salam A, et al. From local to global key-frame extraction based on important scenes using SVD of centrist features[J]. Multimedia Tools and Applications, 2019, 78(2): 1441-1456.
[18] Huang C, Wang H. Novel key-frames selection framework for comprehensive video summarization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[19] Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules[C]//Advances in Neural Information Processing Systems. 2017: 3856-3866.
[20] Jacob H, Pádua F L C, Lacerda A, et al. A video summarization approach based on the emulation of bottom-up mechanisms of visual attention[J]. Journal of Intelligent Information Systems, 2017, 49(2): 193-211.
[21] Li Y, Shi J, Lin D. Low-latency video semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5997-6005.
[22] de Avila S E F, da Luz Jr A, de Albuquerque Araújo A. VSUMM: An approach for automatic video summarization and quantitative evaluation[C]//Proceedings of the XXI Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 2008), Campo Grande, Brazil, 2008. IEEE Computer Society, 2008.
[23] http://eie.xjtu.edu.cn/info/1073/22115.htm
[24] The Open Video Project. http://www.open-video.org.
[25] http://tv.cctv.com/2019/10/22/VIDEClUYGfsb4ia3LNBodXiw191022.shtml


