
Journal of Ambient Intelligence and Humanized Computing

https://doi.org/10.1007/s12652-020-01679-8

ORIGINAL RESEARCH

Fast key-frame image retrieval of intelligent city security video based on deep feature coding in high concurrent network environment
Chuhong Li · Bo Zhou

Received: 27 June 2019 / Accepted: 3 January 2020


© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Aiming at the problems faced by the connection management module in single-packet processing for intelligent city security video retrieval, we first propose a traffic locality quantization index, based on the traffic characteristics of the backbone link, to quantitatively analyze the locality of backbone-link traffic. We then propose a deep-learning-based key-frame abstraction and retrieval method for videos to improve the efficiency and accuracy of video retrieval: an adaptive key-frame selection algorithm is designed, an existing convolutional neural network framework is used to extract key-frame features, and unsupervised, semi-supervised and supervised retraining models are designed to improve the effectiveness of the network's feature extraction and the accuracy of video retrieval. Experimental results on public video datasets show that the proposed key-frame image retrieval model achieves good precision for key-frame representation and high accuracy and efficiency for video retrieval.

Keywords  Intelligent city · Security video · Deep feature · Deep learning · Key-frame selection · Retraining model

* Bo Zhou: zhoubo028@scu.edu.cn · Chuhong Li: 2016151476001@stu.scu.edu.cn
College of Architecture and Environment, Sichuan University, Chengdu 610000, Sichuan, China

1 Introduction

With the popularity of video information on networks, a large amount of video information is generated every day, and traditional keyword-based information retrieval technology cannot keep up with this growth (Jakub et al. 2018). As a result, content-based video retrieval has become a research hotspot in recent years. Currently, most search engines use cluster load-balancing technology, which makes the system entrance prone to becoming the network bottleneck when highly concurrent video search requests occur (Song et al. 2018). To improve the network performance of video retrieval platforms, some scholars focus on strategies for concurrent data access and data caching, the key modules that most seriously affect service performance (Chen et al. 2017; Song et al. 2017).

The literature (Wu et al. 2018) puts forward three control strategies for concurrent data access: service quality as the highest priority, concurrency number as the highest priority, and adaptive balancing. Under the first strategy, the system allocates all resources to the current task, which ensures the shortest response time for a single task while concurrency remains low. Under the second strategy, the system executes multiple retrieval tasks simultaneously through threaded concurrency; although the performance of a single task cannot be improved, the overall operating efficiency increases under high concurrency. With the adaptive balancing strategy, the system switches smoothly between the first two strategies according to the number of concurrent accesses and the specific needs of video retrieval, aiming to maximize the adaptive capacity of the system (Okabe et al. 2018).

Although these strategies alleviate the pressure of video feature retrieval and improve overall system performance in a highly concurrent network environment, retrieval still consumes considerable resources and bandwidth (Feng et al. 2019). How to process and analyze video quickly and effectively has therefore become an urgent problem. Video key-frame extraction is an effective method for fast browsing and retrieval of massive video data, and it is also the basis of video summarization (Dong et al. 2018).


Commonly used video key-frame extraction methods include the shot boundary method, image-feature-based methods, clustering-based methods and motion information analysis (Li 2017; Wang et al. 2014). These four kinds of methods do not make enough use of visual saliency, which results in a low degree of representativeness, redundancy among the extracted key-frames, and low consistency with the key-frames users would extract. Some research therefore aims to extract key-frames that are more consistent with the human vision system and better summarize the original video content. A video key-frame extraction method based on video saliency and a two-step hierarchical clustering algorithm is proposed in the literature (Cappallo et al. 2018). Its specific contents are as follows: a saliency detection algorithm is applied to the original video sequence, and a fused feature vector is constructed by combining the low-level features and motion information of the salient region. Redundant information is then removed by calculating the similarity between the feature vectors (Lin et al. 2017). Next, a two-step hierarchical clustering algorithm based on mutual information is proposed: the first clustering pass adaptively determines the clustering threshold, which makes it suitable for real-time use, while the second pass sets the clustering threshold manually to meet the needs of different users (Roy et al. 2017). Finally, the frame with the maximum mutual information with the other frames in each cluster is selected as the key-frame. Compared with current mainstream video key-frame extraction algorithms, the effectiveness and applicability of this algorithm have been verified, and the selected key-frames agree better with the human vision system.

A key-frame extraction method based on PCA and temporal K-means clustering is proposed in the literature (Yoo et al. 2017). Its main question is how to use a fast-converging clustering algorithm to cluster similar features on the basis of well-chosen features. The specific content of the work is as follows. Firstly, principal component analysis (PCA) is used to extract the principal components of the fused feature vector; the number of principal components of a video along the time axis is determined by the cumulative contribution rate and is taken as the number of key-frames (Deng and Yu 2019). Secondly, the similarity between video frames is calculated, a distance-time distribution curve is generated, and the initial clustering boundaries are determined from the peaks of the curve. Thirdly, the initial clusters are refined by the temporal K-means clustering algorithm. Finally, the frame nearest to each cluster center is selected from each cluster as the key-frame.
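Neither cited paper ships reference code, but the clustering step they share is easy to sketch. The following Python fragment (the function name and the use of scikit-learn's KMeans are our own illustrative choices, not from the cited works) selects one key-frame per cluster, namely the frame nearest each cluster center:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_features: np.ndarray, n_keyframes: int) -> list:
    """Cluster per-frame feature vectors and return one key-frame
    index per cluster (the frame closest to the cluster center)."""
    km = KMeans(n_clusters=n_keyframes, n_init=10).fit(frame_features)
    keyframes = []
    for c in range(n_keyframes):
        members = np.where(km.labels_ == c)[0]
        # pick the member frame nearest to this cluster's center
        dists = np.linalg.norm(frame_features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```

Choosing the nearest real frame, rather than the center itself, guarantees that every returned key-frame actually exists in the video.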
The experimental results show that the key-frames generated in the literature (Rameshnath and Bora 2019) match user summaries more closely: the content of the original video is highly generalized, and the key-frames agree better with the human vision system. However, the number of key-frames is not determined flexibly according to the type and length of the video, which gives the user no effective basis for extracting an appropriate number of key-frames. Meanwhile, with the rapid development of information technology, a variety of video technologies are widely used on the Internet and in wireless sensor networks. To solve the problems of traditional video coding, such as high computational complexity and poor error resilience, distributed compressive video sensing has become a hot research topic at home and abroad in recent years and is the most competitive video application solution (Chenchen et al. 2018). The technical problems still to be solved are that existing video coding and reconstruction methods need further optimization, the reconstruction rate needs to be increased, and the complexity of the algorithms needs to be reduced.

For video frame reconstruction, a multi-hypothesis bidirectional best-match strategy is proposed in the literature (Haijun et al. 2018). The idea of motion-aligned estimation is introduced, and the best matching patch is selected from the reference frame by using the patch structural similarity of the measured domain (Ying and Daiyin 2018). The optimal video generation matrix is extracted pixel by pixel in a search window centered on the best matching block, and residual compensation is applied to reconstruct the non-key-frames. Simulation results show that, at the same sampling rate, the algorithm improves the quality of reconstructed video frames without significantly increasing the reconstruction time, and the subjective visual effect is better.

Nevertheless, the existing prediction reconstruction algorithms do not reconstruct video clearly enough in vigorous sport scenes, so a best-matching interpolation reconstruction method for distributed compressive video sensing is proposed in the literature (Liu et al. 2003). A best-matching interpolation dictionary is formed by interpolation within a search window centered on the best matching patch, and the interpolation makes the residual signal sparser. Simulation results show that this method obtains higher reconstruction quality.

In conclusion, both the bidirectional best-matching strategy and the best-matching interpolation method are applied in distributed compressive video sensing: the key-frame is reconstructed by the best-matching interpolation method, and the non-key-frame by the bidirectional best-matching method. Simulation results show that the combined algorithm significantly improves reconstruction quality without significantly increasing reconstruction time.

Aiming at the problems faced by the connection management module in single-packet processing, we first propose a traffic locality quantization index, based on the traffic characteristics of the backbone link, to quantitatively analyze the locality of backbone-link traffic. Secondly, auxiliary space of constant overhead is added to the traditional hash structure, and a fast flow table lookup method (FFTL) is designed to reduce the flow-table access overhead. We then propose a deep-learning-based key-frame abstraction and retrieval method for videos to improve the efficiency and accuracy of video retrieval, where an adaptive key-frame selection algorithm is designed, an existing convolutional neural network framework is used to extract key-frame features, and unsupervised, semi-supervised and supervised retraining models are designed to improve the effectiveness of the network's feature extraction and the accuracy of video retrieval. Experimental results on public video datasets show that the proposed key-frame image retrieval model achieves good precision for key-frame representation and high accuracy and efficiency for video retrieval.


2 Fast flow table lookup for high concurrency networks

Existing flow table lookup operations fall into three types: hash tables, Bloom filters and content addressable memory (Burtnyk and Wein 1976, 1971). For flow tables with a hash table structure, one lookup includes two steps: hash value calculation and conflict chain comparison. The worst-case performance of the linked-list method for handling hash conflicts is very poor: if all N keywords are inserted at the same position, a linked list of length N is generated, so the worst lookup length is O(N).

Much research has tried to balance the length of the collision chain on each slot so that the average lookup length stays close to the best case of O(1 + a), where a is the load factor. A good hash algorithm is needed to achieve this. The literature (Liu and Fan 2005) pointed out that although complex cryptographic hash functions such as MD5 and SHA-1 can equalize the conflict chain length distribution in the hash table, a good hash function usually consumes a large amount of CPU resource; even a hardware implementation of the hash unit requires 64 clock cycles.

Multiple hashing performs better than the single hashing mentioned above. The Bloom filter is a multiple-hash structure often combined with high-speed hardware such as TCAM and FPGA to provide efficient content lookup. The literature (Radhakrishnan 2001) pointed out that the Bloom filter is commonly used to represent a set and answer membership queries, and that the query result has an error probability. Moreover, the query result of a Bloom filter has only two states, which does not satisfy the connection-state maintenance requirement of a traffic processing system. The literature (Stippick 2004) proposed a modified Fingerprint Compressed Filter (FCF). Each hash slot of the FCF contains d fixed-size sub-tables and d mutually independent hash functions. A connection computes d hash values and is finally inserted into the shortest sub-table, but this makes it necessary to search d conflict chains for every lookup; in a backbone network with a dense stream of packets this brings a large search overhead. The FCF also uses a single bit for timeout processing, a slot-level method. Although the FCF compresses the FID so that it can enjoy the fast access of SRAM, its scalability is limited by the SRAM capacity.

Regarding optimization of the search operation by means of network locality, the literature (Mademlis et al. 2018) indicates that the backbone network has high concurrency, with the number of concurrent connections reaching two million, while only about 20,000 new connections emerge per second, accounting for 1% of the total concurrency. The backbone network is thus slow to update and exhibits a certain degree of locality. That work therefore uses an FPGA and SRAM to realize a high-speed cache of the flow table and speed up access, but it is limited by the circuit complexity of the FPGA and the capacity of the SRAM. Because it mainly processes the first few bytes of each data packet, specifically for protocol identification, the model cannot solve full-payload processing problems such as those of a DPI system. In addition, TCAM (Ternary Content Addressable Memory) (Gong et al. 2013) is also used to accelerate the search operation, but TCAM is expensive and power-hungry, and the scale of the flow table is limited by its storage capacity. Finally, in a highly concurrent network environment, a large number of active connections are affected by traffic fluctuation and bursts and are forced out of the table, leading to missed detections.

To improve flow table lookup speed, we explored the flow characteristics of video surveillance backbone links. The analysis proves that backbone traffic not only has high concurrency and a high arrival rate, but also exhibits good network locality (Huang et al. 2018) within an appropriately sized cache window. Based on these characteristics and the principle of locality, a fast flow table lookup method was implemented using a naive hash table structure with a constant increase of auxiliary space. Theoretical analysis and experiments on real-life data traces show that the proposed method reduces the length and the time of flow table lookup by 20.2% and 17.1%, respectively, compared with the existing method. It is therefore adopted to achieve fast key-frame image retrieval of security videos in the intelligent city.
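The paper does not publish pseudocode for FFTL. As a minimal sketch of the locality idea, the Python class below models the constant auxiliary space as one most-recently-hit entry per bucket (this per-bucket MRU design is our assumption, not the published structure), so that bursts of packets from the same flow skip the conflict chain:

```python
class FlowTable:
    """Chained hash table with constant-size auxiliary space
    (a hypothetical rendering of the FFTL idea)."""

    def __init__(self, n_buckets: int = 1 << 16):
        self.buckets = [[] for _ in range(n_buckets)]
        self.mru = [None] * n_buckets      # constant auxiliary space
        self.n_buckets = n_buckets

    def lookup(self, fid):
        b = hash(fid) % self.n_buckets
        # 1. locality fast path: packets of one flow arrive in bursts
        if self.mru[b] is not None and self.mru[b][0] == fid:
            return self.mru[b][1]
        # 2. fall back to the conflict chain, average length O(1 + a)
        for entry in self.buckets[b]:
            if entry[0] == fid:
                self.mru[b] = entry        # promote for the next burst
                return entry[1]
        return None

    def insert(self, fid, state):
        b = hash(fid) % self.n_buckets
        entry = (fid, state)
        self.buckets[b].append(entry)
        self.mru[b] = entry
```

Under bursty backbone traffic, most lookups hit the fast path and never touch the chain, which is consistent with the reported reduction in lookup length and time.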


3 Video retrieval model based on convolutional neural network

After extracting the key-frames of each shot, the relevant video sequences can be retrieved efficiently and accurately through content-based retrieval of the video frames (Aote and Potnurwar 2018; Ponni et al. 2018). At present, the deep learning architectures used in image and video retrieval systems adopt a large number of parameters, which makes them complex and difficult to implement, and computing and training the network parameters takes too long (Naveed et al. 2018).

In this paper, the external structure of the deep neural network is improved. An improved model is designed to retrain the parameters so as to make full use of prior information and generate low-dimensional image and video representations, which reduces the computation time and memory requirements of retrieval.

The BVLC Reference CaffeNet model is used as the deep neural network model; it is the result of training the AlexNet model on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The model consists of eight trained layers: the first five are convolution layers and the last three are fully connected layers. The first, second and fifth convolution layers are followed by max pooling. The ReLU (Rectified Linear Unit) non-linear activation function is used for the five convolution layers and the first two fully connected layers. The output of the third fully connected layer is a distribution over the 1000 ImageNet classes. Softmax loss is used for model training. The overall architecture of CaffeNet is shown in Fig. 1.

Fig. 1 Eight-layer framework of the CaffeNet deep neural network (input 227 × 227 × 3; convolution layers 1-5 of sizes 55 × 55 × 96, 27 × 27 × 256, 13 × 13 × 384, 13 × 13 × 384 and 13 × 13 × 256; fully connected layers 6-8 of sizes 4096, 4096 and 1000)

In this paper, CaffeNet has been modified for the following reasons:

1. Because the activation function has a spatially arranged structure, the convolution layers retain spatial information, while the fully connected layers connect to all input neurons and therefore contain no spatial information.
2. The fully connected layers require far more model parameters than the convolution layers.
3. The fully connected layers need to know the size of the input image in advance, while the convolution layers do not.

Therefore, feature extraction is assigned to the convolution layers rather than the fully connected layers.

The modified scheme in this paper uses the fifth convolution layer, denoted CONV5, or the fourth convolution layer, denoted CONV4. The dimension of CONV5 is 13 × 13 × 256 and that of CONV4 is 13 × 13 × 384 (see Fig. 1). Therefore, the MAC (Maximum Activations of Convolutions) layer generates a 256-dimensional or 384-dimensional feature for each video frame.
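MAC pooling itself is simple to state precisely: each filter keeps only its maximum response over all spatial positions. A minimal numpy sketch (the function name is illustrative; in the paper this is realized as a network layer in Caffe):

```python
import numpy as np

def mac_descriptor(conv_map: np.ndarray) -> np.ndarray:
    """Maximum Activations of Convolutions (MAC): the maximum response
    of each filter over all spatial positions, so a 13 x 13 x 256 CONV5
    map (or 13 x 13 x 384 CONV4 map) collapses into a 256- or
    384-dimensional frame descriptor."""
    vec = conv_map.max(axis=(0, 1))              # one value per channel
    return vec / (np.linalg.norm(vec) + 1e-12)   # L2-normalize for cosine similarity

# e.g. mac_descriptor(np.random.rand(13, 13, 256)).shape == (256,)
```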


The flow chart of this retraining program is shown in Fig. 2.

Fig. 2 The flow diagram of the retraining program based on improved deep learning: MAC vectors taken from convolution layers 1-5 (384-dimensional for layer 4, 256-dimensional for layer 5) feed a retraining objective function trained by error back-propagation of the loss, in an unsupervised, semi-supervised (class labels) or supervised (semantic information) manner

3.1 Unsupervised retraining

Unsupervised retraining retrains the pre-trained convolutional neural network on a given data set to maximize the cosine similarity between each video key-frame and its nearest $n$ key-frames.

Assuming that $I = \{I_i,\ i = 1, \ldots, N\}$ represents the set of $N$ key-frames, $X = \{x_i,\ i = 1, \ldots, N\}$ represents the corresponding features of the $N$ key-frames, and $\mu_i$ represents the average vector of the nearest $n$ key-frames of $x_i$ (with $x_l^i$ denoting the $l$-th of those neighbors), $\mu_i$ can be defined as follows:

$$\mu_i = \frac{1}{n} \sum_{l=1}^{n} x_l^i \tag{1}$$

The new objective function over $I$ is determined by solving the following optimization problem:

$$\max_{x_i \in X} J = \max_{x_i \in X} \sum_{i=1}^{N} \frac{x_i^T \mu_i}{\|x_i\| \, \|\mu_i\|} \tag{2}$$

Equation (2) is solved by gradient descent, and the first-order gradient of the objective function $J$ is:

$$\frac{\partial J}{\partial x_i} = \frac{\mu_i}{\|x_i\| \, \|\mu_i\|} - \frac{x_i^T \mu_i}{\|x_i\|^3 \, \|\mu_i\|} \, x_i \tag{3}$$

The updating rule for the $v$-th iteration of each image is as follows:

$$x_i^{(v+1)} = x_i^{(v)} + \eta \left( \frac{\mu_i}{\|x_i^{(v)}\| \, \|\mu_i\|} - \frac{x_i^{(v)T} \mu_i}{\|x_i^{(v)}\|^3 \, \|\mu_i\|} \, x_i^{(v)} \right) \tag{4}$$

In order to better control the learning rate of the neural network, the updating rule is normalized and defined as follows:

$$x_i^{(v+1)} = x_i^{(v)} + \eta \, \|x_i^{(v)}\| \, \|\mu_i\| \left( \frac{\mu_i}{\|x_i^{(v)}\| \, \|\mu_i\|} - \frac{x_i^{(v)T} \mu_i}{\|x_i^{(v)}\|^3 \, \|\mu_i\|} \, x_i^{(v)} \right) \tag{5}$$

Using the above features as the targets of the layer of interest, a regression task is established for the neural network. The weights of CaffeNet are initialized, and back-propagation is used to train the network on the target data set. Euclidean loss (the Euclidean distance loss function) is used in training the regression task.
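As a compact illustration of the target-building stage, the numpy sketch below performs one pass of the normalized update of Eq. (5) over a feature matrix; the function name and the cosine-similarity neighbor search are our own illustrative details, and the actual retraining then regresses the network onto the updated targets:

```python
import numpy as np

def unsupervised_step(X: np.ndarray, n_neighbors: int, eta: float) -> np.ndarray:
    """One normalized gradient step of Eq. (5): pull each key-frame
    feature x_i toward the mean mu_i of its n nearest neighbors so that
    their cosine similarity grows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)               # exclude self-matches
    nbrs = np.argsort(-sim, axis=1)[:, :n_neighbors]

    X_new = X.copy()
    for i in range(len(X)):
        mu = X[nbrs[i]].mean(axis=0)             # Eq. (1)
        x, nx, nmu = X[i], np.linalg.norm(X[i]), np.linalg.norm(mu)
        grad = mu / (nx * nmu) - (x @ mu) / (nx ** 3 * nmu) * x   # Eq. (3)
        X_new[i] = x + eta * (nx * nmu) * grad   # Eq. (5), normalized rate
    return X_new
```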

3.2 Semi-supervised retraining

Semi-supervised retraining improves the performance of the convolutional neural network (CNN) descriptors by using information derived from class labels. Let $(x_i, y_i)$ denote a tagged descriptor, where $x_i$ is the key-frame descriptor and $y_i$ is the tag corresponding to the key-frame. This scheme adjusts the convolution layers of the CNN model (Zheng et al. 2018). Its goal is to maximize the cosine similarity between $x_i$ and its nearest $m$ related descriptors while minimizing the cosine similarity between $x_i$ and its nearest $l$ unrelated descriptors. Key-frames belonging to the same class are defined as related frames, and key-frames belonging to different classes are defined as unrelated frames.

Assuming that $I = \{I_i,\ i = 1, \ldots, N\}$ represents $N$ images with relevance information, $x = F_L(I)$ represents the $L$-level output of the pre-trained CNN model for the input image set $I$, $X = \{x_i,\ i = 1, \ldots, N\}$ represents the corresponding features of the images, and $R_i = \{r_k,\ k = 1, \ldots, K_i\}$ represents the set of relevant descriptors of the $i$-th key-frame, the new descriptors of the key-frame set $I$ are determined by solving the following two optimization problems:

$$\max_{x_i \in X} J^+ = \max_{x_i \in X} \sum_{i=1}^{N} \frac{x_i^T \mu_i^+}{\|x_i\| \, \|\mu_i^+\|} \tag{6}$$

$$\min_{x_i \in X} J^- = \min_{x_i \in X} \sum_{i=1}^{N} \frac{x_i^T \mu_i^-}{\|x_i\| \, \|\mu_i^-\|} \tag{7}$$

where $\mu_i^-$ is the average vector of the nearest $l$ unrelated descriptors of $x_i$, and $\mu_i^+$ is the average vector of the nearest $m$ related descriptors of $x_i$. The normalized update rule of the $v$-th iteration is rewritten as the following two equations:


$$x_i^{(v+1)} = x_i^{(v)} + \varsigma_1 \, \|x_i^{(v)}\| \, \|\mu_i^+\| \left( \frac{\mu_i^+}{\|x_i^{(v)}\| \, \|\mu_i^+\|} - \frac{x_i^{(v)T} \mu_i^+}{\|x_i^{(v)}\|^3 \, \|\mu_i^+\|} \, x_i^{(v)} \right) \tag{8}$$

$$x_i^{(v+1)} = x_i^{(v)} - \beta_1 \, \|x_i^{(v)}\| \, \|\mu_i^-\| \left( \frac{\mu_i^-}{\|x_i^{(v)}\| \, \|\mu_i^-\|} - \frac{x_i^{(v)T} \mu_i^-}{\|x_i^{(v)}\|^3 \, \|\mu_i^-\|} \, x_i^{(v)} \right) \tag{9}$$

Fusing the two normalized updating rules, that is, adding Eq. (8) and Eq. (9) together, gives:

$$x_i^{(v+1)} = x_i^{(v)} + \varsigma_1 \, \|x_i^{(v)}\| \, \|\mu_i^+\| \left( \frac{\mu_i^+}{\|x_i^{(v)}\| \, \|\mu_i^+\|} - \frac{x_i^{(v)T} \mu_i^+}{\|x_i^{(v)}\|^3 \, \|\mu_i^+\|} \, x_i^{(v)} \right) - \beta_1 \, \|x_i^{(v)}\| \, \|\mu_i^-\| \left( \frac{\mu_i^-}{\|x_i^{(v)}\| \, \|\mu_i^-\|} - \frac{x_i^{(v)T} \mu_i^-}{\|x_i^{(v)}\|^3 \, \|\mu_i^-\|} \, x_i^{(v)} \right) \tag{10}$$

Using the above object descriptors, the neural network can be retrained with the key-frame relevance information through back-propagation.
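A single fused update of Eq. (10) can be sketched in a few lines; the helper name is our own, and `mu_pos`/`mu_neg` stand for the related and unrelated neighborhood means $\mu_i^+$ and $\mu_i^-$:

```python
import numpy as np

def semi_supervised_step(x, mu_pos, mu_neg, sigma1, beta1):
    """One fused update of Eq. (10): move descriptor x toward the mean
    of its m related key-frames and away from the mean of its l
    unrelated ones; sigma1 and beta1 are the learning rates of
    Eqs. (8) and (9)."""
    def normalized_grad(x, mu):
        nx, nmu = np.linalg.norm(x), np.linalg.norm(mu)
        # (nx * nmu) times the gradient of cosine(x, mu) w.r.t. x
        return (nx * nmu) * (mu / (nx * nmu) - (x @ mu) / (nx ** 3 * nmu) * x)

    return x + sigma1 * normalized_grad(x, mu_pos) - beta1 * normalized_grad(x, mu_neg)
```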
3.3 Supervised retraining

The idea of supervised retraining comes from the theory of relevance feedback. Supervised retraining considers feedback from different users, which consists of key-frames relevant to the query objective. The goal of this scheme is to modify the model parameters so as to maximize the cosine similarity between the specified query and its related key-frames, and to minimize the cosine similarity between the query and its unrelated key-frames.

Assuming that $Q = \{Q_k,\ k = 1, \ldots, K\}$ represents the query set, $I_+^k = \{I_i,\ i = 1, \ldots, Z\}$ represents the key-frames relevant to the specified query, $I_-^k = \{I_j,\ j = 1, \ldots, O\}$ represents the key-frames irrelevant to it, $x = F_L(I)$ represents the $L$-level output of the pre-trained CNN model for key-frame $I$, and $q = F_L(Q)$ represents the $L$-level output of the network for query $Q$, the new objective functions of the relevant and irrelevant key-frames for the query are obtained by solving the following optimization problems, respectively:

$$\max_{x_i \in X_+^k} J^+ = \max_{x_i \in X_+^k} \sum_{i=1}^{Z} \frac{x_i^T q^k}{\|x_i\| \, \|q^k\|} \tag{11}$$

$$\min_{x_j \in X_-^k} J^- = \min_{x_j \in X_-^k} \sum_{j=1}^{O} \frac{x_j^T q^k}{\|x_j\| \, \|q^k\|} \tag{12}$$

The normalized updating rules of the $v$-th iteration can be defined as the following two equations:

$$x_i^{(v+1)} = x_i^{(v)} + a \, \|x_i^{(v)}\| \, \|q^k\| \left( \frac{q^k}{\|x_i^{(v)}\| \, \|q^k\|} - \frac{x_i^{(v)T} q^k}{\|x_i^{(v)}\|^3 \, \|q^k\|} \, x_i^{(v)} \right) \tag{13}$$

$$x_j^{(v+1)} = x_j^{(v)} - a \, \|x_j^{(v)}\| \, \|q^k\| \left( \frac{q^k}{\|x_j^{(v)}\| \, \|q^k\|} - \frac{x_j^{(v)T} q^k}{\|x_j^{(v)}\|^3 \, \|q^k\|} \, x_j^{(v)} \right) \tag{14}$$

The above key-frame descriptors are used as the targets of the layer of interest, and the deep neural network is retrained.

4 Experimental results and analysis

4.1 Evaluation criteria

In this paper, the performance of the video retrieval algorithm is evaluated by precision, recall and average precision; the higher the recall and precision, the better the performance of the retrieval system (Naruse et al. 2014). Recall is the ratio of the number of relevant video frames found by the user to the number of all clips in the video library related to the target clip during a query. Precision is the ratio of the number of relevant key-frames found by the user to the number of all frames found during a query. Precision and recall are defined as follows:

$$\mathrm{Recall} = \frac{N_{correct}}{N_{correct} + N_{miss}} \tag{15}$$

$$\mathrm{Precision} = \frac{N_{correct}}{N_{correct} + N_{false}} \tag{16}$$

where $N_{correct}$, $N_{false}$ and $N_{miss}$ are the numbers of correctly detected, falsely detected and missed key-frames, respectively. Equivalently, precision equals the number of relevant retrieved images divided by the number of retrieved images, while recall equals the number of relevant retrieved images divided by the number of relevant images (Zhu et al. 2013).

The average precision is defined as the mean query precision over all query experiments, and the precision $AP_i$ of the $i$-th query experiment is defined as follows:

$$AP_i = \frac{1}{Q_i} \sum_{n=1}^{N} \frac{R_n^i}{n} \, t_n^i \tag{17}$$

where $Q_i$ is the number of images related to the $i$-th query, $N$ is the total number of retrieved images, $R_n^i$ is the number of related images among the top $n$ retrieved results, and $t_n^i$ is the indicator function: $t_n^i = 1$ if the $n$-th retrieved image is related to the $i$-th query, and $t_n^i = 0$ otherwise.
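These measures follow directly from a ranked result list. A small Python helper (illustrative, not taken from the paper) implementing Eqs. (15)-(17) for a single query:

```python
def evaluate_query(ranked_relevant, n_relevant_total):
    """Precision, recall (Eqs. 15-16) and average precision (Eq. 17) for
    one query. `ranked_relevant` is the indicator list t_n (1 if the
    n-th retrieved image is related), `n_relevant_total` is Q_i."""
    n_correct = sum(ranked_relevant)
    precision = n_correct / len(ranked_relevant)
    recall = n_correct / n_relevant_total
    # AP_i = (1 / Q_i) * sum_n (R_n / n) * t_n, R_n = relevant hits in top n
    hits, ap = 0, 0.0
    for n, t in enumerate(ranked_relevant, start=1):
        hits += t
        ap += (hits / n) * t
    return precision, recall, ap / n_relevant_total

# e.g. evaluate_query([1, 0, 1, 1, 0], n_relevant_total=4)
```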


4.2 Experimental environment and parameter setting

We evaluate the proposed approach on five public datasets: UKBench, Corel-5K, Holidays, UQ_VIDEO and San Francisco Landmarks (SFL) (Liu et al. 2016). In UKBench and Holidays, relevant images are near-duplicates or show the same objects/scenes as the query, while Corel-5K involves category-level relevant images without any near-duplicates. SFL is a realistic large-scale dataset with a variable number of relevant images per query. We employ the performance measures from the original papers of these datasets and show that query-specific fusion improves considerably on all of them. UQ_VIDEO is used as the benchmark video data set; it is a real video set collected from YouTube for near-duplicate shot detection experiments. The data set contains 169,952 videos, from which 3,305,525 key-frames are extracted (Song et al. 2013).

The algorithm is implemented on the Caffe deep learning framework, and features are extracted from the UQ_VIDEO dataset using CONV5 and CONV4. The retraining module replaces the ReLU layer of the CaffeNet framework.

4.3 The experimental result of the deep neural network

Firstly, the learning effect of the proposed deep neural network is evaluated. The improved retraining model is tested on the UQ_VIDEO dataset, as shown in Table 1.

Table 1 The representation results of three retraining algorithms

| Feature representation | Feature dimensions | Average accuracy of key-frames |
|---|---|---|
| CONV4 | 384 | 0.45891 |
| FU + CONV4 | 384 | 0.61082 |
| RRI + CONV4 | 384 | 0.81004 |
| FRRI + CONV4 | 384 | 0.92074 |
| CONV5 | 256 | 0.53612 |
| FU + CONV5 | 256 | 0.68325 |
| RRI + CONV5 | 256 | 0.88743 |
| FRRI + CONV5 | 256 | 0.95278 |

In Table 1, FU denotes unsupervised retraining, RRI denotes semi-supervised retraining, and FRRI denotes fully supervised retraining. The table shows that the performance of CONV5 is better than that of CONV4. In addition, schemes RRI and FU significantly improve on the original CaffeNet framework, and RRI performs better than FU because the training process of FU uses neither user information nor relevance feedback. Therefore, the retraining scheme "FRRI + CONV5" is adopted as the video feature representation in the following experiments.

4.4 Experimental results of video retrieval

To comprehensively evaluate the video retrieval effect of our model, three existing retrieval models with good performance are selected for comparison: ITQ, QBH and KMH. ITQ and KMH are both hash-based retrieval schemes, while QBH improves the hashing effectively and achieves obvious gains. The UQ_VIDEO dataset is searched with the 24 query statements of the literature (Atmojo et al. 2015), and the results are shown in Fig. 3.

Fig. 3 The relationship between recall rate and precision for the four retrieval algorithms

As can be seen from the graph, our algorithm achieves the best result, and its recall rate is better than that of KMH: when its precision is 0.1, its recall rate is greater than 0.9. When the precision is less than 0.4, the recall performance of ITQ is poor. When the precision is greater than 0.3, the recall rate of QBH is higher than that of the other two baselines, but still lower than that of our scheme.

Figure 4 shows the relationship between the number of key-frames and precision for the four retrieval algorithms. As the number of key-frames k increases, the coverage and accuracy of the key-frames also increase, and the mean absolute accuracy first rises and then falls: once the key-frame number exceeds a certain value, the key-frame redundancy rate rises and the mean absolute accuracy decreases.


Fig. 4 The relationship between the number of key-frames and precision for the four retrieval algorithms

When the mean absolute accuracy reaches its maximum, the key-frame accuracy is high, the redundancy rate is low, and the overall effect is good. In summary, for videos with rich content variation and longer length, a suitable range for the key-frame number is [10, 20]; for videos with little content variation and short length, a suitable range is [20, 30].

4.5 Calculation time of our proposed algorithm

To improve the QoS of video retrieval, this paper designs a data caching technology. In retrieval-result caching, a fast lookup table of video content feature data is constructed, which relieves the pressure of video feature retrieval and improves overall system performance. In retrieval-task caching, a retrieval task table is constructed so that repeated processing of identical tasks is avoided, which greatly reduces the load that hotspot video retrieval places on the system (a sketch of this task table is given below). We set up test environments and ran simulation experiments, which verify that these strategies improve the service performance of the video retrieval platform.
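The paper describes the retrieval task table only at this level of detail. A minimal sketch, assuming an in-memory table with an LRU eviction policy (both our assumptions), could look like this:

```python
from collections import OrderedDict

class RetrievalTaskCache:
    """Retrieval-task table sketch: identical hotspot queries reuse the
    cached result instead of re-running the full retrieval."""

    def __init__(self, capacity: int = 1024):
        self.cache = OrderedDict()
        self.capacity = capacity

    def retrieve(self, query_key, run_retrieval):
        if query_key in self.cache:                 # hotspot query: cache hit
            self.cache.move_to_end(query_key)
            return self.cache[query_key]
        result = run_retrieval(query_key)           # cold query: full retrieval
        self.cache[query_key] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict least recently used
        return result
```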

The experimental environment is an Intel Core i7 7700 with a main frequency of 3.6 GHz and 16 GB of memory. Data sets of different scales (obtained by partitioning the UQ_VIDEO data set) are tested, and the retrieval times of the compared algorithms are reported in Table 2.

Table 2 The retrieving time for the compared retrieval algorithms

| Video size | QBH | KMH | ITQ | Proposed |
|---|---|---|---|---|
| 200 M | 0.034 | 0.033 | 0.041 | 0.036 |
| 400 M | 0.042 | 0.039 | 0.046 | 0.046 |
| 600 M | 0.051 | 0.049 | 0.052 | 0.047 |
| 800 M | 0.061 | 0.062 | 0.067 | 0.051 |
| 1000 M | 0.079 | 0.078 | 0.081 | 0.061 |

As can be seen from Table 2, QBH and KMH are both retrieval algorithms based on hash operations; these operations take more time, and their processing time grows with the data set size. Our algorithm is based on deep learning: it spends a certain amount of time in the training phase, but training can be performed offline. In the online retrieval phase, its processing time on small-scale data sets is close to that of QBH and KMH, while on large-scale data sets it has an obvious advantage in retrieval time. Based on these characteristics and the principle of locality, a fast flow table lookup method was implemented using a naive hash table structure with a constant increase of auxiliary space. Theoretical analysis and experiments on real-life data traces show that the proposed method reduces the length and the time of flow table lookup by 20.2% and 17.1%, respectively, compared with the existing method. Therefore, our proposed algorithm can achieve fast key-frame image retrieval of security videos in the intelligent city.

5 Conclusion

Aiming at the problems faced by the connection management module in single-packet processing for intelligent city security video retrieval, we first proposed a traffic locality quantization index, based on the traffic characteristics of the backbone link, to quantitatively analyze the locality of backbone-link traffic. We then proposed a deep-learning-based key-frame abstraction and retrieval method for videos to improve the efficiency and accuracy of video retrieval: an adaptive key-frame selection algorithm is designed, an existing convolutional neural network framework is used to extract key-frame features, and unsupervised, semi-supervised and supervised retraining models are designed to improve the effectiveness of the network's feature extraction and the accuracy of video retrieval. Experimental results on public video datasets show that the proposed key-frame image retrieval model achieves good precision for key-frame representation, and high accuracy and efficiency for video retrieval.

References

Aote SS, Potnurwar A (2018) An automatic video annotation framework based on two level keyframe extraction mechanism. Multimed Tools Appl 24(12):78–92
Atmojo UD et al (2015) System-level approach to the design of ambient intelligence systems based on wireless sensor and actuator networks. J Ambient Intell Humaniz Comput 6(2):153–169

Burtnyk N, Wein M (1971) Computer-generated key-frame animation. J SMPTE 80(3):149–153
Burtnyk N, Wein M (1976) Interactive skeleton techniques for enhancing motion dynamics in key-frame animation. Commun ACM 19(10):564–569
Cappallo S, Svetlichnaya S, Garrigues P et al (2018) The new modality: emoji challenges in prediction, anticipation, and retrieval. IEEE Trans Multimed 24(7):191–202
Chen W, Qing Z, Yeting Z et al (2017) A NoSQL-SQL hybrid organization and management approach for real-time geospatial data: a case study of public security video surveillance. ISPRS Int J Geo-Inform 6(1):21–43
Chenchen J, Zhen D, Mingtao P et al (2018) Heterogeneous hashing network for face retrieval across image and video domains. IEEE Trans Multimed 24(23):1–12
Deng Y, Yu Y (2019) Self-feedback image retrieval algorithm based on annular color moments. EURASIP J Image Video Process 20(1):36–45
Dong J, Li X, Snoek CGM (2018) Predicting visual features from text for image and video caption retrieval. IEEE Trans Multimed 24(4):91–107
Feng Y, Zhou P, Xu J et al (2019) Video big data retrieval over media cloud: a context-aware online learning approach. IEEE Trans Multimed 24(7):12–31
Gong Y, Lazebnik S, Gordo A et al (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans Pattern Anal Mach Intell 35(12):2916–2929
Haijun Z, Yuzhu J, Wang H et al (2018) Sitcom-star-based clothing retrieval for video advertising: a deep learning framework. Neural Comput Appl 44(23):208–231
Huang S, Mao C, Tao J et al (2018) A novel Chinese sign language recognition method based on keyframe-centered clips. IEEE Signal Process Lett 25(3):442–446
Jakub L, Werner B, Klaus S et al (2018) On influential trends in interactive video retrieval. IEEE Trans Multimed 24(21):13–21
Li X (2017) Tag relevance fusion for social image retrieval. Multimed Syst 23(1):29–40
Lin J, Duan LY, Wang S et al (2017) HNIP: compact deep invariant representations for video matching, localization, and retrieval. IEEE Trans Multimed 19(9):1968–1983
Liu L, Fan G (2005) Combined key-frame extraction and object-based video segmentation. IEEE Trans Circuits Syst Video Technol 15(7):869–884
Liu T, Zhang HJ, Qi F (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans Circuits Syst Video Technol 13(10):1006–1013
Liu W, Wen Y, Yu Z et al (2016) Large-margin softmax loss for convolutional neural networks. Comput Sci 2:507–516
Mademlis I, Tefas A, Pitas I (2018) A salient dictionary learning framework for activity video summarization via key-frame extraction. Inf Sci 43(2):319–331
Naruse N, Tenmyo O, Tomita K et al (2014) 1-HKUST: object detection in ILSVRC 2014. Comput Sci 42(6):837–845
Naveed E, Wook BS, Hammad M et al (2018) Multi-scale contrast and relative motion-based key-frame extraction. EURASIP J Image Video Process 2018(1):40–48
Okabe M, Dobashi Y, Anjyo K (2018) Animating pictures of water scenes using video retrieval. Vis Comput 34(3):347–358
Ponni ASS, Ramakrishnan S (2018) Fibonacci based key-frame selection and scrambling for video watermarking in DWT–SVD domain. Wirel Personal Commun 69(25):28–35
Radhakrishnan R (2001) Video summarization using descriptors of motion activity: a motion activity based approach to key-frame extraction from video shots. J Electron Imaging 10(4):90–99
Rameshnath S, Bora PK (2019) Perceptual video hashing based on temporal wavelet transform and random projections with application to indexing and retrieval of near-identical videos. Multimed Tools Appl 78:18055–18075
Roy PP, Bhunia AK, Pal U (2017) Date-field retrieval in scene image and video frames using text enhancement and shape coding. Neurocomputing 274:S0925231217306689
Song J, Yang Y, Huang Z et al (2013) Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Trans Multimed 15(8):1997–2008
Song J, Gao L, Liu L et al (2017) Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recognit 75:S0031320317301322
Song J, Gao L, Liu L et al (2018) Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recogn 75:175–187
Stippick J (2004) Advanced tray support system using orthogonal grillage. J Comput Aided Des Comput Graph 24(22):632–642
Wang Q, Si L, Zhang D (2014) Learning to hash with partial tags: exploring correlation between tags and hashing bits for large scale image retrieval. In: European Conference on Computer Vision. Springer, NY, pp 378–392
Wu S, Song H, Cheng G et al (2018) Civil engineering supervision video retrieval method optimization based on spectral clustering and R-tree. Neural Comput Appl 21(2):12–23
Ying Z, Daiyin Z (2018) Height retrieval in postprocessing-based VideoSAR image sequence using shadow information. IEEE Sens J 44(13):8–21
Yoo G, Kim H, Hong S et al (2017) Implementation of convergence P2P information retrieval system from captured video frames. Peer-to-Peer Netw Appl 11:1–11
Zheng F, Tang H, Liu YH (2018) Odometry-vision-based ground vehicle motion estimation with SE(2)-constrained SE(3) poses. IEEE Trans Cybern 127(99):1–12
Zhu X, Huang Z, Shen HT et al (2013) Linear cross-modal hashing for efficient multimedia search. Comput Sci 2013:143–152

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

