Wei Lu1, Wei Xu2, Zebin Wu1, Yang Xu1, Zhihui Wei1
1 School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China
2 China Railway Shanghai Group Co., Ltd., Nanjing Power Supply Section, Nanjing 210011, China
Authorized licensed use limited to: ULAKBIM UASL - Uluslararasi Kibris Universitesi. Downloaded on March 22,2023 at 04:55:31 UTC from IEEE Xplore. Restrictions apply.
of calculation. To improve the speed of video object detection, Google proposed a method that combines fast and slow networks for video detection [7]. However, some of these approaches cannot make effective use of temporal information, and others are computationally expensive, leading to slow speed and low precision.

When we look for an object in an image, we focus on the regions that contain the object and pay more attention to them, while paying less attention to the other regions. The attention mechanism is inspired by this behavior. It can distinguish the importance of each region and enhance the corresponding features. In general, attention mechanisms are divided into two categories: soft attention and hard attention. Hard attention focuses only on the important areas and discards the unimportant parts of the image. This finds the main features rapidly but loses some features. Soft attention uses weights learned by the network to weight the image and strengthen the significant regions without losing any features. Soft attention is therefore more suitable for deep learning.

Self-attention is a soft attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence. In [17], self-attention is mainly used for machine translation, but with some modifications it can also be applied to image and video problems in computer vision.

Inspired by the non-local means algorithm and the self-attention mechanism, Wang et al. [15] proposed the concept of non-local neural networks, using non-local blocks to find long-range, non-local dependencies in frames. The non-local

sample layers and convolutional layers to expand the receptive field of the convolution kernel, but multi-layer convolution wastes a lot of computing power.

In traditional image processing, Buades et al. [14] proposed the non-local means filtering algorithm, which uses global image information to improve the denoising effect. This algorithm exploits the fact that natural images contain multiple image blocks with similar structure: it finds the center points of the most similar image blocks and averages them with weights. Compared with the mean filtering algorithm, which uses only the local information of the image, it removes image noise more effectively and retains the edge features of the image better.

Inspired by the non-local means filtering algorithm, Wang et al. proposed the concept of non-local neural networks. The core of this algorithm lies in non-local blocks. A non-local block directly calculates the similarity between any two points of the image to capture long-range dependencies, and obtains a mask that enhances the feature map. Compared with multi-layer convolution, the global image information can be used effectively without losing image feature information, and the amount of calculation is smaller. Moreover, these points can come from different frames, so non-local blocks can also extract the temporal contextual information of video sequence images to enhance the image features and improve the accuracy of video object detection.

$$ y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j) \tag{1} $$
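The non-local operation in equation (1) can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function name `non_local_operation` is hypothetical, the similarity `f` is assumed to be an exponentiated dot product, `g` is assumed to be the identity, and the normalizer is assumed to be the sum of the similarities over all positions j.

```python
import numpy as np

def non_local_operation(x, f, g):
    """Compute y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j) for every position i.

    x: array of shape (N, C), one C-dimensional feature vector per position.
    f: pairwise function returning a non-negative scalar similarity.
    g: unary function mapping a feature vector to its representation.
    The normalizer C(x) is assumed to be sum_j f(x_i, x_j).
    """
    n = x.shape[0]
    gx = np.stack([g(x[j]) for j in range(n)])                  # g(x_j) for all j
    y = np.zeros_like(gx, dtype=float)
    for i in range(n):
        weights = np.array([f(x[i], x[j]) for j in range(n)])   # f(x_i, x_j) for all j
        y[i] = weights @ gx / weights.sum()                     # weighted average over all positions
    return y

# Positions may come from different frames: stack the features of two toy frames.
frame_prev = np.array([[1.0, 0.0], [0.0, 1.0]])
frame_cur = np.array([[1.0, 0.1], [0.1, 1.0]])
x = np.concatenate([frame_prev, frame_cur])

# Assumed choices for illustration: exponentiated dot product and identity.
y = non_local_operation(x, f=lambda a, b: np.exp(a @ b), g=lambda v: v)
```

Because every output is a normalized weighted sum over all positions, each position can draw on features from anywhere in the input, including the other frame. The double loop is kept for readability; a practical version would use matrix products instead.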
layer, non-local blocks are more flexible to use and can maintain the position information of feature maps.

(3) Non-local blocks can be used anywhere in the network, while fully connected layers are usually used only at the last layer of the network. Compared with the fully connected layer, non-local modules can build a network with a richer hierarchical structure.

Due to these advantages, we use non-local blocks to enhance the features of the video frames. The non-local blocks mainly consist of the following calculations.

$$ f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)} \tag{2} $$

$$ C(x) = \sum_{j} f(x_i, x_j) \tag{3} $$

Equation (3) describes the normalization function $C(x)$.

$$ z_i = h(y_i) + x_i \tag{4} $$

Equation (4) describes how the final result of the non-local blocks is obtained. The non-local mask obtained by equation (1) is passed through the mapping function $h$ and then added to the original feature map through a residual connection. This enhances the feature map of the current stage and improves the detection accuracy. Finally, we get $z$ as the result of the non-local blocks.

III. VIDEO OBJECT DETECTION BASED ON NON-LOCAL PRIOR

We use YOLOv2 as the base network to extract features from video sequence images. Because video sequence images suffer from complex backgrounds, blurry images, and partial occlusion, we propose non-local blocks that extract spatiotemporal contextual information to improve the feature extraction capability of the method. Fig. 1 illustrates the structure of the video object detection method based on non-local priors.

Fig. 1. The structure of the video object detection method based on non-local priors.

In video sequence images, because of the temporal continuity between frames, adjacent frames are similar in space, so the key point of video object detection is how to use the features of the previous frame to enhance the features of the next frame. In this method, we use 2 consecutive frames of the video sequence for detection. The previous frame uses only spatial non-local blocks and provides temporal contextual information, while the current frame uses both spatial and temporal non-local blocks. The outputs of the spatial non-local blocks for the previous frame and the current frame are used as the input of the temporal non-local blocks. The previous frame is used only to provide temporal contextual information for the prediction of the current frame. The purpose of this design is to make full use of the spatial contextual information of the video sequence images while also using their temporal contextual information effectively.

In the spatial non-local block, the input is the feature map of the previous frame at a specific layer, and the output is the feature map enhanced by the spatial non-local mask, which is used as the input of the next layer of the previous-frame network. In the temporal non-local block, the input is the respective feature maps of the previous frame and the next frame at a specific layer, and the output is the feature map enhanced by the temporal non-local mask, which is used as the input of the next layer of the next-frame network. Fig. 2 and Fig. 3 illustrate the calculation process of the spatial non-local block and the temporal non-local block.

As shown in Fig. 2 and Fig. 3, the non-local model of our method includes a spatial non-local block and a temporal non-local block, each of which contains four convolutional layers and an activation layer. To make the non-local blocks more flexible, the input feature map and the output feature map have the same size. The non-local model is used repeatedly in our framework.

Fig. 2 illustrates how the spatial non-local block computes its output feature map. The input is the feature map of the current frame, and the output is the feature map of the current frame enhanced by the spatial non-local module. In the spatial non-local module, the feature map passes through three convolutional layers, $\theta$, $\phi$, and $g$, which have learned 1×1×1 convolution kernels. The resulting feature maps have half as many channels as the input, and the last convolutional layer restores the number of channels to that of the input. This follows a bottleneck design and reduces the computation of a block by about a half. The outputs of layer $\theta$ and layer $\phi$ are combined by matrix multiplication and normalized by the softmax function, and the result is then multiplied with the output of layer $g$ by matrix multiplication.
After that, the resulting feature map is fed through a convolutional layer with a learned 1×1×1 convolution kernel, and the spatial non-local mask of the current frame is obtained. Finally, we combine the mask and the input feature map of the spatial non-local block by element-wise addition to obtain the feature map enhanced by the spatial non-local block. This feature map is used in the next layer of the network.

Fig. 3 illustrates the calculation process of the temporal non-local block. The symbols in Fig. 3 denote the feature map of the current frame, the feature map of the previous frame, and the feature map of the current frame enhanced by the spatial non-local module. The calculation process of the temporal non-local block is similar to that of the spatial non-local block, but the input of two of the convolutional layers is changed from the feature map of the current frame to the feature map of the previous frame, in order to measure the similarity between the current frame and the previous frame.
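The spatial and temporal non-local blocks described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the names `non_local_block` and `init_weights` are hypothetical, each 1×1 convolution is written as a per-position matrix product without bias, the spatial weights are shared between the two frames, and in the temporal block the two layers that read the previous frame are assumed to be $\phi$ and $g$.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def init_weights(rng, channels):
    """Random weights for theta, phi, g (C -> C/2) and the output layer (C/2 -> C)."""
    half = channels // 2
    shapes = [(channels, half)] * 3 + [(half, channels)]
    return [rng.normal(scale=0.1, size=s) for s in shapes]

def non_local_block(x_query, x_context, w_theta, w_phi, w_g, w_out):
    """x_query, x_context: feature maps of shape (H, W, C).

    theta reads x_query; phi and g read x_context (an assumption for the temporal case).
    Passing the same map twice gives the spatial block.
    """
    h, w, c = x_query.shape
    q = x_query.reshape(-1, c)                    # (N, C) with N = H * W positions
    k = x_context.reshape(-1, c)
    theta = q @ w_theta                           # 1x1 convolution as a matrix product
    phi = k @ w_phi
    g = k @ w_g
    attention = softmax(theta @ phi.T, axis=1)    # equations (2) and (3) combined
    y = attention @ g                             # equation (1)
    mask = y @ w_out                              # last 1x1 convolution restores C channels
    return (q + mask).reshape(h, w, c)            # equation (4): residual addition

rng = np.random.default_rng(0)
C = 8
cur = rng.normal(size=(4, 5, C))                  # toy feature map of the current frame
prev = rng.normal(size=(4, 5, C))                 # toy feature map of the previous frame
spatial_w = init_weights(rng, C)
temporal_w = init_weights(rng, C)

cur_s = non_local_block(cur, cur, *spatial_w)         # spatial block, current frame
prev_s = non_local_block(prev, prev, *spatial_w)      # spatial block, previous frame
cur_st = non_local_block(cur_s, prev_s, *temporal_w)  # temporal block
```

Applying softmax along each row of the similarity matrix is the same as dividing each $f(x_i, x_j)$ by $C(x)$, so the sketch covers equations (1) to (4) in four matrix products. The output has the same shape as the input, which is what allows the block to be placed between any two layers.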
We also conduct experiments with Faster R-CNN to verify the validity of our model. Faster R-CNN uses ResNet-50 as its feature extraction network, so we add a non-local block after each residual block. The architecture of the non-local blocks is the same as the structure described above for YOLOv2.
IV. EXPERIMENTS
A. Hardware environment
We trained our model on a GTX 1080 Ti GPU (11 GB). The full hardware and software environment is shown below.
RAM: 32 GB
CPU: Intel(R) Core(TM) i7-8700 (3.20 GHz)
GPU: GTX 1080 Ti (11 GB)
Operating System: Ubuntu 18.04
B. Dataset
We conduct experiments on two datasets: the Overhead Contact System (OCS) driving recorder dataset and the OTB50 dataset.
The OCS driving recorder dataset used in this article was captured by a high-definition driving recorder installed in the cab of the train. The frame rate ranges from 24 to 60 frames per second, and the size of the captured images ranges from 1920×1080 to 2880×2160. We selected 156 video clips that contain bird's nests as foreign objects. The dataset contains 2597 labeled images. These OCS images were captured on different train lines and under various imaging conditions, such as motion blur and partial occlusion.

Fig. 2. Spatial non-local block.
The only object category is the bird's nest, so we randomly selected 80% of the video clips as the training set and used the remaining 20% as the test set.
OTB50 is a video dataset proposed by Wu et al. [16]. It contains 49 video clips: 48 clips contain 1 object and 1 clip contains 2 objects, for a total of 49 categories. We assign every object to one category and use the name of the clip as the category name. OTB50 contains 26499 images in total, all of which are labeled. Fig. 4 shows the first frames of some of the videos, where the target object is initialized with a bounding box.
Because our model needs consecutive frames to extract the temporal relationship between frames, we selected the first 80% of the frames of each video clip as the training set (21181 images) and used the remaining 20% as the test set (5318 images).
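A chronological per-clip split of this kind could be sketched as follows. The function names and the toy clips are hypothetical, and rounding the cut point down is an assumption, since the paper does not state how the 80% boundary is rounded. The reported counts are at least consistent with the dataset size: 21181 + 5318 = 26499.

```python
def split_clip(frames, train_fraction=0.8):
    """Split one clip's ordered frames: the first part for training, the rest for testing."""
    cut = int(len(frames) * train_fraction)   # rounding down is an assumption
    return frames[:cut], frames[cut:]

def split_dataset(clips, train_fraction=0.8):
    """clips: dict mapping clip name -> list of frames in temporal order."""
    train, test = [], []
    for name, frames in clips.items():
        head, tail = split_clip(frames, train_fraction)
        train += [(name, frame) for frame in head]   # consecutive frames stay together
        test += [(name, frame) for frame in tail]
    return train, test

# Hypothetical toy clips; frame indices stand in for images.
clips = {"clip_a": list(range(10)), "clip_b": list(range(7))}
train, test = split_dataset(clips)
```

Splitting inside each clip, rather than shuffling all frames, keeps neighboring frames in the same subset, so every training or test sample still has a valid previous frame.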
Fig. 5. Video image captured in poor lighting conditions.

Fig. 6. Blurry video image captured by driving recorder.

Table 2.
Method                    Input size   Dataset                        mAP (%)
YOLOv2                    544×544      OCS driving recorder dataset   54.78
Ours (based on YOLOv2)    544×544      OCS driving recorder dataset   58.31

As shown in Table 2, on the OCS driving recorder dataset, combining the spatiotemporal non-local model with the prediction results of the original image raises the mAP to 58.31%, while the baseline YOLOv2 achieves 54.78%, a gain of 3.53 percentage points. Experiments show that the method designed in this paper can
maintain good detection accuracy in various complex OCS video sequence images.

CONCLUSION

In this work, we propose a non-local model that effectively uses the spatiotemporal contextual information of video sequence images to improve detection accuracy. The proposed non-local model requires neither additional supervision information nor modification of the loss function, and can be flexibly added to any model based on convolutional neural networks. It captures the long-range, non-local dependencies between frames in the spatial and temporal dimensions to enhance the features of the current stage, which effectively improves the accuracy of detection.

The experimental results show that the video object detection method based on the non-local prior of spatiotemporal context designed in this paper achieves higher detection accuracy than YOLOv2, indicating that this method deals better with various complex video sequence images and has higher application value. In the future, we will study the design of an end-to-end framework that performs pillar number plate detection and recognition at the same time.

REFERENCES

[1] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri: Learning Spatiotemporal Features with 3D Convolutional Networks. IEEE International Conference on Computer Vision (ICCV). pp.4489-4497 (2015)
[2] João Carreira, Andrew Zisserman: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Computer Vision and Pattern Recognition (CVPR). pp.4724-4733 (2017)
[3] Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes: Spatiotemporal Residual Networks for Video Action Recognition. Neural Information Processing Systems (NIPS). pp.3468-3476 (2016)
[4] Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang: Object Detection from Video Tubelets with Convolutional Neural Networks. Computer Vision and Pattern Recognition (CVPR). pp.817-825 (2016)
[5] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, et al.: T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos. IEEE Transactions on Circuits & Systems for Video Technology (2018)
[6] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, Yichen Wei: Flow-Guided Feature Aggregation for Video Object Detection. IEEE International Conference on Computer Vision (ICCV). pp.408-417 (2017)
[7] Mason Liu, Menglong Zhu, Marie White, Yinxiao Li, Dmitry Kalenichenko: Looking Fast and Slow: Memory-Guided Mobile Video Object Detection. arXiv preprint arXiv:1903.10172 (2019)
[8] Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Computer Vision and Pattern Recognition (CVPR). pp.580-587 (2014)
[9] Ross B. Girshick: Fast R-CNN. IEEE International Conference on Computer Vision (ICCV). pp.1440-1448 (2015)
[10] Shaoqing Ren, Kaiming He, Ross B. Girshick, Jian Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Neural Information Processing Systems (NIPS). pp.91-99 (2015)
[11] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, Ali Farhadi: You Only Look Once: Unified, Real-Time Object Detection. Computer Vision and Pattern Recognition (CVPR). pp.779-788 (2016)
[12] Joseph Redmon, Ali Farhadi: YOLO9000: Better, Faster, Stronger. Computer Vision and Pattern Recognition (CVPR). pp.6517-6525 (2017)
[13] Joseph Redmon, Ali Farhadi: YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767 (2018)
[14] Antoni Buades, Bartomeu Coll, Jean-Michel Morel: A Review of Image Denoising Algorithms, with a New One. Multiscale Modeling and Simulation. 4(2). pp.490-530 (2005)
[15] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, Kaiming He: Non-Local Neural Networks. Computer Vision and Pattern Recognition (CVPR). pp.7794-7803 (2018)
[16] Yi Wu, Jongwoo Lim, Ming-Hsuan Yang: Online Object Tracking: A Benchmark. Computer Vision and Pattern Recognition (CVPR). pp.2411-2418 (2013)
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: Attention is All you Need. Neural Information Processing Systems (NIPS). pp.5998-6008 (2017)