2020 Eighth International Conference on Advanced Cloud and Big Data (CBD)

Video Object Detection based on Non-local Prior of Spatiotemporal Context*

Wei Lu1, Wei Xu2, Zebin Wu1, Yang Xu1, Zhihui Wei1
1 School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China
2 China Railway Shanghai Group Co., Ltd., Nanjing Power Supply Section, Nanjing 210011, China

Abstract—The appearances of objects in video sequences are affected by complex background, motion blur and partial occlusion, which makes object detection in video sequences a hard task. Due to these problems, traditional image object detection methods cannot perform well on video sequence images. A fast and effective video object detection method is necessary to improve detection efficiency. In this paper, we propose a non-local prior based spatiotemporal attention model for video object detection. Unlike existing attention models, the proposed model can make full use of the spatiotemporal contextual information extracted from video sequence images. We apply our model in common object detection frameworks and evaluate it on an Overhead Contact System (OCS) driving recorder dataset and the OTB50 dataset. The proposed model achieves a clear increase in mAP, which proves that our model gains good performance on various complex video sequences.

Keywords—Non-local prior; video object detection; Yolov2; convolutional neural network

*This work was supported in part by the National Natural Science Foundation of China under Grant 61772274, Grant 61701238, Grant 61471199, Grant 11431015, and Grant 61671243, in part by the Jiangsu Provincial Natural Science Foundation of China under Grant BK20180018 and Grant BK20170858, in part by the Fundamental Research Funds for the Central Universities under Grant 30919011103, Grant 30917015104, and Grant 30919011234, and in part by the China Postdoctoral Science Foundation under Grant 2017M611814. (Corresponding author: Zebin Wu.)

I. INTRODUCTION

Object detection in images has achieved significant success in recent years due to the emergence of deep convolutional neural networks (CNNs). A deep neural network mainly uses convolution kernels to extract image features and preserves the input's neighborhood relations in its higher-level feature representations. Each convolution layer has multiple kernels to extract multiple features of the image. The kernels of a layer share the same parameters, so a convolutional neural network needs less computation while being more effective. Compared with traditional methods, CNNs can extract features from massive data more effectively. The application of deep convolutional neural networks in object detection mainly falls into two categories: two-stage detection algorithms such as [8][9][10], and one-stage detection frameworks like [11][12][13].

The two-stage detection algorithms divide object detection into several steps. First, regions of interest (ROI) are generated from the image to find the areas that are most likely to contain objects. If the region proposal algorithm is slow or too many areas are proposed, the detection speed will be slow. Moreover, the accuracy of the region proposal algorithm has a great impact on the detection accuracy. If the region proposal algorithm performs poorly, the detection accuracy will also be worse.

The one-stage detection algorithms adopt the anchor idea for region proposals and then directly regress the target's category probability and position offset to get the final result. The parameters of the anchors are determined by clustering the labels of the training set. Since the positions of the anchors are determined in advance, the speed is greatly accelerated compared to the two-stage detection algorithms, and the accuracy is also ensured.

Video object detection is an important extension of image object detection. With the rapid growth of video data, video object detection has attracted more attention. Compared with still images, video frames provide richer temporal information, which can be employed to improve the accuracy of object detection, but they also bring more difficulties. Due to the problems of complex background, blurry images, partial occlusion, etc., video object detection is still a challenging task. How to keep the temporal and spatial consistency of the target during the detection process and ensure the target is detected in intermediate frames are the main difficulties of video object detection.

The human visual system can find the object in the current frame with the features of the previous frame. For example, if an object is detected in the previous frame but not in the current frame, we can recover the missing object with the previous frame's features. Therefore, we can use the previous frame's features to enhance the current frame's feature map. Since video sequence images carry more temporal information than still images, the key to improving the accuracy of video object detection is making full use of the temporal information.

An intuitive idea to effectively utilize the temporal information is to combine multiple frames on the temporal channel and extract features for detection by 3D convolutions [1] in the spatiotemporal dimension. Compared with 2D CNNs, 3D CNNs can extract much more temporal information from video frames. The 3D filters can be extended from pre-trained 2D filters [2][3]. Another approach is to combine an object tracking algorithm with the object detection framework to improve the accuracy, such as T-CNN [4][5]. T-CNN can improve temporal consistency by propagating detection results to neighboring frames to reduce sudden changes of detection results. In addition, optical flow information, such as the target motion information, is also used to extract the temporal relationship between frames [6]. Optical flow can show target motion information clearly to help the CNN extract temporal information, but it also requires a huge amount of computation.

In order to improve the speed of video object detection, Google has proposed a method combining a fast and a slow network for video detection [7]. However, some of these approaches cannot make effective use of temporal information, and some of them require huge amounts of computation, leading to slow speed and low precision.

When we try to find an object in an image, we focus on the regions that contain the object and pay more attention to them. Meanwhile, we reduce attention on the other regions. The attention mechanism is inspired by this: it can distinguish the importance of each region and enhance its features. In general, attention mechanisms are mainly divided into two categories: soft attention and hard attention. Hard attention focuses only on important areas and discards the unimportant parts of the image. This way can find the main features rapidly, but it will lose some features. Soft attention uses weights learned by the network to weight the image and strengthen significant regions without losing any features, so the soft attention mechanism is more suitable for deep learning.

Self-attention is a soft attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. In [17], self-attention is mainly used for machine translation, but it can also be applied to image and video problems in computer vision with some improvements.

Inspired by the non-local mean algorithm and the self-attention mechanism, Wang et al. [15] proposed the concept of non-local neural networks, using non-local blocks to find long-range, non-local dependencies in frames. The non-local blocks obtain global information by directly calculating the similarity between any two locations. In a video, these two locations may be spatial or temporal, so the relationship can be spatial or temporal. By adding non-local blocks, the convolutional neural network can effectively use the spatiotemporal contextual information of the video, greatly improving the detection accuracy.

In this paper, we propose a video object detection method based on a non-local prior. After trying several basic detection networks, we adopt the YOLO model as the basic detection model; thus we use the same loss function and prediction method as YOLOv2. We design spatiotemporal non-local models to capture the long-range dependencies in time and space. Section II introduces the non-local models. Section III shows the details of the proposed method. In Section IV, we introduce our datasets and compare our framework with classical or state-of-the-art models on them. Experiments on a high-speed railway Overhead Contact System (OCS) driving recorder dataset and OTB50 prove that our framework achieves good performance in various complex scenarios.

II. NON-LOCAL MODULE

Due to the limitations of the convolution kernel, the receptive field is only a small rectangular region calculated by the kernel. This architecture forces the network to pay attention to local image information but ignore global information. For this problem, the traditional convolutional neural network uses a combination of downsample layers and convolutional layers to expand the receptive field of the convolution kernel, but multi-layer convolution wastes a lot of computing power.

In traditional image processing, Buades et al. [14] proposed the non-local mean filtering algorithm, which uses global image information to improve the denoising effect. This algorithm takes advantage of the fact that natural images contain many image blocks with similar structure, finding the center points of the most similar image blocks for weighted averaging. Compared with mean value filtering, which uses only the local information of the image, it removes image noise more effectively and better retains the edge features of the image.

Inspired by the non-local mean filtering algorithm, Wang et al. proposed the concept of non-local neural networks. The core of this algorithm lies in the non-local blocks. The non-local blocks directly calculate the similarity between any two points of the image to get the long-range dependency and obtain a mask to enhance the feature map. Compared with multi-layer convolution, the global image information can be used effectively without losing image feature information, and the amount of calculation is smaller. Meanwhile, these points can come from different frames, so the non-local blocks can also extract the temporal contextual information of the video sequence images to enhance the image features and improve the video object detection accuracy.

y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) g(x_j)    (1)

Equation (1) describes the calculation process of the non-local model, where x_i represents the value at position i of the feature map, x_j ranges over the values at all positions of the feature map, and g is a mapping function learned by the network. By equation (1) we can calculate the non-local dependency values of all positions to obtain the non-local mask y of the feature map.

As is summarized in equation (2), each position in the image is compared with all the others by the similarity metric function f to find the long-range non-local dependency, while a traditional convolution operation can only cover the points adjacent to a position. The fully connected layer can also find the long-range non-local dependency by calculating all positions, but compared with the fully connected layer, there are the following differences:

(1) In the non-local blocks, the similarity is obtained by the designed formula and the trained weights, while the fully connected layer only uses learned weights. Compared with the fully connected layer, the non-local blocks can measure the similarity between image positions better.

(2) The non-local blocks support variable input sizes and keep the output the same size as the input, so they can be used at any position of the network, while the fully connected layer can only accept input of fixed size and generate output of fixed size, so it is mainly used as the last layer of the network. Because the input is stretched into one column, the location information of the feature maps is lost in the fully connected layer.

Compared with the fully connected layer, non-local blocks are more flexible to use and can maintain the position information of feature maps.

(3) Non-local blocks can be used anywhere in the network and are very flexible to use, while fully connected layers are usually used at the last layer of the network. Compared with the fully connected layer, non-local modules can build a network with a richer hierarchical structure.

Due to these advantages, we use non-local blocks to enhance the features of the video frames. The non-local blocks mainly contain the following calculations.

f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}    (2)

Equation (2) describes the similarity metric function f. Referring to the non-local mean algorithm and the bilateral filtering algorithm, it uses the Gaussian function as the metric function, and further adds two convolutional layers, θ and φ, to extract feature map information more effectively.

C(x) = \sum_j f(x_i, x_j)    (3)

Equation (3) describes the normalization function C(x).

z_i = h(y_i) + x_i    (4)

Equation (4) describes how to get the final result of the non-local blocks. The non-local mask obtained by equation (1) is transformed by the mapping function h and then added to the original feature map through a residual connection to obtain the final result of the non-local blocks. This enhances the feature map of the current stage to improve the accuracy of detection. Finally, we get z_i as the result of the non-local blocks.
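To make equations (1)-(4) concrete, the sketch below evaluates them on a toy feature map in PyTorch. It is a minimal illustration rather than the trained block: the linear layers stand in for the 1×1 convolutions θ, φ, g and h that the network learns, and the feature map is flattened so that each row is one position x_i.

import torch

# Toy feature map: N = 4 positions, C = 3 channels (H*W flattened to rows).
x = torch.randn(4, 3)

# Stand-ins for the learned 1x1 convolutions (hypothetical weights).
theta = torch.nn.Linear(3, 3, bias=False)
phi = torch.nn.Linear(3, 3, bias=False)
g = torch.nn.Linear(3, 3, bias=False)
h = torch.nn.Linear(3, 3, bias=False)

# Equation (2): pairwise similarity f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)).
f = torch.exp(theta(x) @ phi(x).T)   # shape (4, 4)

# Equation (3): normalization C(x) = sum_j f(x_i, x_j), one value per position i.
C = f.sum(dim=1, keepdim=True)       # shape (4, 1)

# Equation (1): y_i = (1 / C(x)) * sum_j f(x_i, x_j) g(x_j) -- the non-local mask.
y = (f / C) @ g(x)                   # shape (4, 3)

# Equation (4): residual connection z_i = h(y_i) + x_i.
z = h(y) + x
print(z.shape)                       # torch.Size([4, 3])

Note that dividing f by C row-wise is exactly a SoftMax over j when f is the exponential of dot products, which is why the blocks in Section III are implemented with a SoftMax layer.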
III. VIDEO OBJECT DETECTION BASED ON NON-LOCAL PRIOR

We use YOLOv2 as the basic network to extract video sequence image features. Due to the problems of complex background, blurry images, and partial occlusion in video sequence images, we propose non-local blocks for extracting spatiotemporal contextual information to improve the feature extraction capability of the method. Fig. 1 illustrates the structure of the video object detection method based on non-local priors.

Fig. 1. the structure of video object detection method based on non-local priors

In video sequence images, due to the temporal continuity between frames, adjacent frames of the video are similar in space, so how to use the features of the previous frame to enhance the features of the next frame is the key point of video object detection. In this method, we use 2 continuous frames of the video sequence for detection. The previous frame only uses spatial non-local blocks, and the current frame uses both the spatial and temporal non-local blocks. The results of the previous frame and the current frame in the spatial non-local blocks are used as the input of the temporal non-local blocks. The previous frame is only used to provide temporal contextual information for the current frame's prediction. The purpose of this design is to make full use of the spatial contextual information of the video sequence images while effectively using their temporal contextual information.

In the spatial non-local block, the input is the feature map of the previous frame at a specific layer, and the output is the feature map enhanced by the spatial non-local mask, which is used as the input of the next layer of the previous frame's network. In the temporal non-local block, the inputs are the respective feature maps of the previous frame and the next frame at a specific layer, and the output is the feature map enhanced by the temporal non-local mask, which is used as the input of the next layer of the next frame's network. Fig. 2 and Fig. 3 illustrate the calculation processes of the spatial non-local block and the temporal non-local block.

As shown in Fig. 2 and Fig. 3, the non-local model of our method includes a spatial non-local block and a temporal non-local block, each containing four convolution layers and an activation layer. To make the non-local blocks more flexible, the input feature map and the output feature map have the same size. The non-local model is used repeatedly in our framework.

Fig. 2 illustrates the calculation process of the spatial non-local block, where X represents the feature map of the current frame and Z represents the feature map of the current frame enhanced by the spatial non-local module. After introducing the non-local modules in the spatial dimension, the feature map passes through the three convolutional layers θ, φ and g, which have 1×1×1 convolution kernels learned by the network. The channels of the resulting feature maps are half those of X, and are restored to the same number as X by the last convolutional layer. This follows the bottleneck design of [15] and reduces the computation of a block by about a half. The resulting feature maps of layer θ and layer φ are combined by matrix multiplication and normalized by the SoftMax function, then combined with the resulting feature map of layer g by matrix multiplication.
After that, the resulting feature map is fed through a convolutional layer (the mapping function h in equation (4)) with a 1×1×1 convolution kernel learned by the network, and the spatial non-local mask of the current frame is obtained. Finally, we combine the mask and the input feature map X of the spatial non-local block by element-wise addition to obtain the resulting feature map Z, which is enhanced by the spatial non-local block. The resulting feature map is used in the next layer of the network.

Fig. 2. spatial non-local block
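The following PyTorch module is one way the spatial non-local block of Fig. 2 could be realized. It is a sketch under the assumptions stated in the text: θ, φ and g are 1×1 convolutions that halve the channel count (the bottleneck of [15]), the θ-φ product is normalized with SoftMax, and h restores the channels before the residual addition. The layer names and the choice of 2D convolutions for single-frame feature maps are our assumptions.

import torch
import torch.nn as nn

class SpatialNonLocalBlock(nn.Module):
    """Sketch of the spatial non-local block in Fig. 2 (self-attention
    within one frame). Channel bottleneck follows [15]."""

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2  # bottleneck: half the input channels
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.h = nn.Conv2d(inter, channels, kernel_size=1)  # restore channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, hgt, wid = x.shape
        # Flatten spatial positions: (N, C/2, H*W)
        q = self.theta(x).reshape(n, -1, hgt * wid)
        k = self.phi(x).reshape(n, -1, hgt * wid)
        v = self.g(x).reshape(n, -1, hgt * wid)
        # Pairwise similarity theta^T * phi, normalized by SoftMax (eqs. 1-3)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (N, HW, HW)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)        # (N, C/2, HW)
        y = y.reshape(n, -1, hgt, wid)
        # Mask h(y) added to the input as a residual (eq. 4)
        return self.h(y) + x

Because the output has the same shape as the input, such a block can be dropped between any two layers of the backbone, e.g. SpatialNonLocalBlock(256)(torch.randn(1, 256, 17, 17)).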
Fig. 3. temporal non-local block

Fig. 3 illustrates the calculation process of the temporal non-local block, where X_t represents the feature map of the current frame, X_{t-1} represents the feature map of the previous frame, and Z represents the feature map of the current frame enhanced by the temporal non-local module. The calculation process of the temporal non-local block is similar to that of the spatial non-local block, but the input of the φ layer and the g layer is changed from X_t to X_{t-1}, to measure the similarity between the current frame and the previous frame.
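Continuing the previous sketch, a temporal version of the block lets the current frame supply the query while the previous frame supplies the key and value. Exactly which of θ, φ and g read the previous frame is our reading of Fig. 3 (the symbols in the text were lost in extraction), so treat the assignment below as an assumption.

class TemporalNonLocalBlock(nn.Module):
    """Sketch of the temporal non-local block in Fig. 3: the current frame
    x_t is enhanced with positions from the previous frame x_prev.
    Assumes theta reads x_t while phi and g read x_prev."""

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.h = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x_t: torch.Tensor, x_prev: torch.Tensor) -> torch.Tensor:
        n, c, hgt, wid = x_t.shape
        q = self.theta(x_t).reshape(n, -1, hgt * wid)    # queries: current frame
        k = self.phi(x_prev).reshape(n, -1, hgt * wid)   # keys: previous frame
        v = self.g(x_prev).reshape(n, -1, hgt * wid)     # values: previous frame
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(n, -1, hgt, wid)
        return self.h(y) + x_t  # residual on the current frame's features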
We also conduct experiments on FasterRcnn to verify the validity of our model. FasterRcnn uses Resnet-50 as the feature extraction network, so we add non-local blocks after each residual block. The non-local blocks' architecture is the same as the structure described above for Yolov2.
IV. EXPERIMENTS

A. Hardware environment

We trained our model on a GTX1080ti (11GB). The rest of the hardware environment is listed below.
RAM: 32GB
CPU: Intel(R) Core(TM) i7-8700 (3.20GHz)
GPU: GTX1080ti (11GB)
Operating System: Ubuntu 18.04
B. Dataset

We conduct experiments on two datasets: the Overhead Contact System (OCS) driving recorder dataset and the OTB50 dataset.

The OCS driving recorder dataset used in this article was captured by a high definition driving recorder installed in the cab of a train. The frame rate ranges from 24 to 60 frames per second, and the size of the captured images ranges from 1920*1080 to 2880*2160. We selected 156 video clips containing bird's nest foreign objects. The dataset contains 2597 labeled images. These OCS images were captured on different train lines and under various imaging conditions, such as motion blur and partial occlusion.

Since the only object category is the bird's nest, we randomly selected 80% of the video clips as the training set, and the remaining 20% are used as the test set.
OTB50 is a video dataset proposed by Yi Wu et al. [16]. This dataset contains 49 video clips. Among them, 48 video clips contain 1 object and 1 video clip contains 2 objects, for a total of 49 categories. We treat every object as one category and use the clip's name as the category name. A total of 26499 images are contained in OTB50, all of which are labeled. Fig. 4 shows part of the first frames of the videos, where the target object is initialized with a bounding box.

Because our model needs continuous frames to extract the temporal relationship between frames, we selected the first 80% of each video clip's frames as the training set (21181 images), and the remaining 20% are used as the test set (5318 images).
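The two splits described above could be produced along these lines; this is a sketch assuming clips are stored as ordered frame lists, with hypothetical variable names.

import random

def split_ocs(clips: list, seed: int = 0):
    """OCS: randomly assign 80% of whole clips to training, 20% to test."""
    clips = clips[:]
    random.Random(seed).shuffle(clips)
    cut = int(0.8 * len(clips))
    return clips[:cut], clips[cut:]

def split_otb50(clips: dict):
    """OTB50: per clip, the first 80% of frames train, the rest test,
    so consecutive training frames stay consecutive."""
    train, test = [], []
    for name, frames in clips.items():  # frames are in temporal order
        cut = int(0.8 * len(frames))
        train += [(name, f) for f in frames[:cut]]
        test += [(name, f) for f in frames[cut:]]
    return train, test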

Fig. 4. Part of the first frames of videos in OTB50
C. Performance Evaluation

Our work used batch gradient descent with a mini-batch size of 64. We use a weight decay of 0.0005 with a momentum of 0.9 and set the initial learning rate to 0.0001.
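In PyTorch terms, these hyperparameters correspond to an SGD configuration like the one below; the model variable and anything beyond the stated initial values are placeholders.

import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the detection network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.0001,           # initial learning rate
    momentum=0.9,        # momentum term
    weight_decay=0.0005  # L2 weight decay
)
# The mini-batch size of 64 is set in the data loader, not the optimizer.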
Following protocols widely adopted in object detection, we evaluate our method on the test set and use mean average precision (mAP) as the evaluation metric.

During the training and testing process, 2 consecutive frames of the video are read each time. The previous frame only uses spatial non-local blocks, and the current frame uses both the spatial and temporal non-local blocks, making full use of the spatiotemporal contextual information of video sequence images to improve the detection accuracy.
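Reading the video two frames at a time can be sketched as a dataset that yields (previous, current) pairs; the tensor layout and class shape are assumptions, not the authors' code.

from torch.utils.data import Dataset

class FramePairDataset(Dataset):
    """Yields (previous_frame, current_frame) pairs from one ordered
    list of frame tensors, matching the 2-frame reading scheme."""

    def __init__(self, frames):      # frames: list of (C, H, W) tensors
        self.frames = frames

    def __len__(self):
        return len(self.frames) - 1  # every frame but the first is a "current" frame

    def __getitem__(self, idx):
        return self.frames[idx], self.frames[idx + 1]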
Fig. 5 shows a video image captured in poor lighting conditions, where the abnormal object is difficult to detect. Fig. 6 shows a blurry video image captured by the driving recorder. They show that our model can detect the target against complex backgrounds.

Fig. 5. video image captured in poor lighting conditions

Fig. 6. blurry video image captured by driving recorder

TABLE I. THE MAP ON OTB50 OF DIFFERENT NETWORKS

Network                      Resolution     Dataset   mAP
FasterRcnn                   dynamic size   OTB50     76.83
Yolov2                       544*544        OTB50     83.43
Ours (based on FasterRcnn)   dynamic size   OTB50     79.70
Ours (based on Yolov2)       544*544        OTB50     84.31

As shown in Table I, we conduct experiments on OTB50 with our method and classical or state-of-the-art models. On OTB50, the mAP achieved by our network improves to 84.31% while the baseline Yolov2 achieves 83.43%, and the mAP achieved by our network improves to 79.70% while the baseline FasterRcnn achieves 76.83%. The experiments show that Darknet-19 performs better than Resnet-50 on video images, and our models are proved to be effective.

After trying several basic detection networks, we finally chose Darknet-19 as our feature extraction network. Therefore, we mainly compare our approach with the original Yolov2.

TABLE II. THE MAP ON OUR DATASET OF DIFFERENT NETWORKS

Network                  Resolution   Dataset                        mAP
Yolov2                   544*544      OCS driving recorder dataset   54.78
Ours (based on Yolov2)   544*544      OCS driving recorder dataset   58.31

As shown in Table II, on the OCS driving recorder dataset, combining the spatiotemporal non-local model with the prediction on the original images improves the result to 58.31% measured by mAP, while the baseline Yolov2 achieves 54.78%. The experiments show that the method designed in this paper can maintain good detection accuracy on various complex OCS video sequence images.
V. CONCLUSION

In this work, we propose a non-local model which can effectively use the spatiotemporal contextual information of video sequence images to improve the detection accuracy. The non-local model proposed in this paper does not require additional supervision information or modification of the loss function, and can be flexibly added to any model based on a convolutional neural network. It can capture the long-range non-local dependency between frames in the spatial and temporal dimensions to enhance the features of the current stage and effectively improve the accuracy of detection.

The results of the experiments show that the video object detection method based on the non-local prior of spatiotemporal context designed in this paper has higher detection accuracy than Yolov2, indicating that this method can better deal with various complex video sequence images and has higher application value. In the future, we will conduct research on designing an end-to-end framework to complete pillar number plate detection and recognition at the same time.

REFERENCES

[1] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri: Learning Spatiotemporal Features with 3D Convolutional Networks. IEEE International Conference on Computer Vision (ICCV). pp. 4489-4497 (2015)
[2] João Carreira, Andrew Zisserman: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Computer Vision and Pattern Recognition (CVPR). pp. 4724-4733 (2017)
[3] Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes: Spatiotemporal Residual Networks for Video Action Recognition. Neural Information Processing Systems (NIPS). pp. 3468-3476 (2016)
[4] Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang: Object Detection from Video Tubelets with Convolutional Neural Networks. Computer Vision and Pattern Recognition (CVPR). pp. 817-825 (2016)
[5] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, et al.: T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos. IEEE Transactions on Circuits & Systems for Video Technology (2018)
[6] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, Yichen Wei: Flow-Guided Feature Aggregation for Video Object Detection. IEEE International Conference on Computer Vision (ICCV). pp. 408-417 (2017)
[7] Mason Liu, Menglong Zhu, Marie White, Yinxiao Li, Dmitry Kalenichenko: Looking Fast and Slow: Memory-Guided Mobile Video Object Detection. arXiv preprint arXiv:1903.10172 (2019)
[8] Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Computer Vision and Pattern Recognition (CVPR). pp. 580-587 (2014)
[9] Ross B. Girshick: Fast R-CNN. IEEE International Conference on Computer Vision (ICCV). pp. 1440-1448 (2015)
[10] Shaoqing Ren, Kaiming He, Ross B. Girshick, Jian Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Neural Information Processing Systems (NIPS). pp. 91-99 (2015)
[11] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, Ali Farhadi: You Only Look Once: Unified, Real-Time Object Detection. Computer Vision and Pattern Recognition (CVPR). pp. 779-788 (2016)
[12] Joseph Redmon, Ali Farhadi: YOLO9000: Better, Faster, Stronger. Computer Vision and Pattern Recognition (CVPR). pp. 6517-6525 (2017)
[13] Joseph Redmon, Ali Farhadi: YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767 (2018)
[14] Antoni Buades, Bartomeu Coll, Jean-Michel Morel: A Review of Image Denoising Algorithms, with a New One. Multiscale Modeling and Simulation. 4(2). pp. 490-530 (2005)
[15] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, Kaiming He: Non-Local Neural Networks. Computer Vision and Pattern Recognition (CVPR). pp. 7794-7803 (2018)
[16] Yi Wu, Jongwoo Lim, Ming-Hsuan Yang: Online Object Tracking: A Benchmark. Computer Vision and Pattern Recognition (CVPR). pp. 2411-2418 (2013)
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: Attention is All you Need. Neural Information Processing Systems (NIPS). pp. 5998-6008 (2017)