
2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE 2021)

An Improved Detection Method of Human Target at Sea Based on Yolov3

Dongjin Li
a) Institute of Systems Engineering, Academy of Military Sciences, PLA, Beijing, China
b) Tianjin University, Tianjin, China
dongjindddj@163.com

Liu Yu
Microelectronics Institute, Tianjin University, Tianjin, China

Wang Jin
Institute of Systems Engineering, Academy of Military Sciences, Beijing, China

Rufei Zhang
AI Research and Development Center, Beijing Institute of Control and Electronic Technology, Beijing, China

Jiang Feng
AI Research and Development Center, Beijing Institute of Control and Electronic Technology, Beijing, China

Niu Fu*
Institute of Systems Engineering, Academy of Military Sciences, Beijing, China
niufu@vip.sina.com

Abstract—In searching and rescuing missions, it is common that the area to be searched is large while the target to be searched is small. Combining object detection technology, this paper proposes a method for searching for drowning people. First, we build a dataset that contains a large number of human targets at sea. Then, we improve the Yolov3 algorithm: in the feature extraction network, we use residual modules with a channel attention mechanism; in the feature fusion network, we add a bottom-up structure to the FPN structure; for the loss function, we use the CIoU loss; finally, for the anchor box settings, we apply a linear transformation to the anchor boxes generated by the clustering algorithm. The detection accuracy of the improved algorithm for human targets at sea is 72.17%, which shows a good detection effect.

Keywords—deep learning; object detection; human target; searching and rescuing; detection method

I. INTRODUCTION

At present, searching and rescuing missions mainly depend on human vision to find drowning people. However, in large-scale searches, the search efficiency and accuracy are low because human targets at sea are small. In addition, the emotion and state of the searchers also affect the efficiency and accuracy of searching, making it difficult to find human targets quickly. With the development of computer vision, object detection technology has attracted much attention and has long been a hot research topic. Applying this technology to the searching-and-rescuing scenario can improve search efficiency.

Object detection methods are generally divided into two types: traditional object detection methods and object detection methods based on deep learning. Representative traditional algorithms include Haar+SVM [1], HOG+SVM [2], and Shapelet+AdaBoost [3]. Traditional methods require manually designed features, which places very strict requirements on practitioners.

Object detection methods based on deep learning extract features automatically through a convolutional neural network, and the semantic information of these features is richer. They are less easily affected by complex environments and have better robustness and higher accuracy. Deep-learning-based methods can be divided into two-stage and one-stage detection algorithms. Representative two-stage algorithms include R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6]. This type of algorithm first extracts candidate regions coarsely, generating regions where targets may exist; these regions are then refined to produce the final locations and classifications. Two-stage algorithms have high accuracy, but their detection speed is slow. One-stage detection algorithms obtain the location and category of the final target directly from the input image; representative examples include YOLO [7] and SSD [8]. This type of algorithm is faster, but its accuracy is poorer than that of two-stage algorithms. Most subsequent algorithms are improvements on the above.

II. ALGORITHM DESCRIPTION

Yolov3 [9] is a representative deep-learning-based object detection algorithm. Because of its high accuracy and fast detection speed, we choose to build on Yolov3: we improve the feature extraction network, the feature fusion network, and the loss function.

The structure of the improved detection network is shown in Fig. 1.

Figure 1. The structure of the improved detection network

In the following, we introduce each part of this structure.

A. Feature Extraction Network

In the Yolov3 algorithm, the features extracted by the network have an important impact on the detection results. The improved feature extraction network structure is shown in Fig. 2.

Figure 2. The improved feature extraction network

The improved feature extraction network consists of 15 residual modules. Each residual module uses the channel attention mechanism [10]. The mechanism works as follows: first, it uses a 1×1 convolution to reduce the number of channels and obtain the global information of the channels. Then it uses a 1×1 convolution to recover the channel number to the input channel number and applies the sigmoid function to the result. Finally, the mechanism multiplies the obtained result with the original input to get the final output.
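To make the mechanism concrete, below is a minimal PyTorch sketch of such a channel attention block. The squeeze step with global average pooling follows the SE-Net design of [10] and is our assumption, since the text above names only the two 1×1 convolutions and the sigmoid; the `reduction` ratio is likewise a hypothetical hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze spatial information with global
    average pooling, excite with two 1x1 convolutions, rescale the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global information per channel
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # reduce channels
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # recover channels
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        # reweight the channels of the original input
        return x * self.fc(self.pool(x))
```

In each residual module, a block like this would reweight the channels of the residual branch before the shortcut addition.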
The structure also uses depthwise separable convolution instead of ordinary convolution, which greatly reduces the computational cost while maintaining the detection accuracy.
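The replacement can be sketched as follows; the 3×3 kernel size and the BatchNorm/LeakyReLU pairing are assumptions based on common Yolov3 practice, not details stated in the paper.

```python
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    """A 3x3 depthwise convolution (one filter per channel) followed by a
    1x1 pointwise convolution, replacing one ordinary 3x3 convolution."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),  # pointwise: mix channels
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )
```

For a 3×3 convolution, this factorization reduces the multiply-accumulate cost by roughly a factor of 8 to 9 at typical channel widths.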
Finally, we take the outputs of the 5th Res Block (C1), the 11th Res Block (C2), and the 15th Res Block (C3) as the inputs of the feature fusion network.

B. Feature Fusion Network

In a convolutional neural network, the bottom layers are the shallower part of the network: the feature maps obtained there are large and carry rich detailed information, while their semantic information is weak. The top layers are the deeper part of the network: the feature maps obtained there are small, and their features are abstract with rich semantic information. The feature fusion network enhances the information of each layer by fusing features from different layers, which is very important for improving detection performance, especially for small objects. The improved feature fusion network is shown in Fig. 3.

Figure 3. The improved feature fusion network

This structure adds a bottom-up path to the FPN structure. First, the feature map P1 is convolved to obtain the feature map A1. A1 is then downsampled, and the results are fused with the feature maps P2 and P3 to obtain the feature maps A2 and A3, which carry the fused information and enhance the detail information of each layer. Finally, these feature maps are used for detection.
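A minimal sketch of this bottom-up path is given below. The channel widths, the strided convolutions for downsampling, and the element-wise addition for fusion are all our assumptions; the paper does not specify these operators.

```python
import torch
import torch.nn as nn

class BottomUpPath(nn.Module):
    """Propagate detail information from the large map P1 up to P2 and P3,
    producing the fused maps A1, A2, A3 used by the detection heads."""
    def __init__(self, c1=128, c2=256, c3=512):
        super().__init__()
        self.conv1 = nn.Conv2d(c1, c1, 3, padding=1)             # P1 -> A1
        self.down12 = nn.Conv2d(c1, c2, 3, stride=2, padding=1)  # A1 down to P2 size
        self.down23 = nn.Conv2d(c2, c3, 3, stride=2, padding=1)  # A2 down to P3 size

    def forward(self, p1, p2, p3):
        a1 = self.conv1(p1)
        a2 = p2 + self.down12(a1)  # fuse detail into the middle level
        a3 = p3 + self.down23(a2)  # fuse detail into the top level
        return a1, a2, a3
```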
C. Loss function

In this paper, we use the CIoU loss [11] to represent the loss of the target box. The CIoU loss function is calculated as:

L_{CIoU} = 1 - IoU + \frac{\rho^2(c, c^{gt})}{d^2} + \alpha v    (1)

In the above formula, c and c^{gt} are the center points of the prediction box and the ground-truth box, \rho denotes the Euclidean distance between the two points, and d is the diagonal length of the smallest rectangular area containing both boxes. \alpha v is a penalty term that takes the aspect ratios of the prediction box and the ground-truth box into account. \alpha and v are calculated as follows:

\alpha = \frac{v}{(1 - IoU) + v}    (2)

v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2    (3)

In the above two formulas, w^{gt} and h^{gt} are the width and height of the ground-truth box, and w and h are the width and height of the prediction box.
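Eqs. (1)-(3) translate directly into code. The sketch below assumes boxes in (x1, y1, x2, y2) corner format; the small `eps` terms guard against division by zero and are not part of the paper's formulas.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss per Eqs. (1)-(3); pred and target are (..., 4) tensors."""
    # intersection and union for IoU
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2(c, c_gt)
    cxp = (pred[..., 0] + pred[..., 2]) / 2
    cyp = (pred[..., 1] + pred[..., 3]) / 2
    cxt = (target[..., 0] + target[..., 2]) / 2
    cyt = (target[..., 1] + target[..., 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # squared diagonal d^2 of the smallest box enclosing both boxes
    d2 = (torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])) ** 2 \
       + (torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])) ** 2 + eps

    # aspect-ratio penalty v (Eq. 3) and its weight alpha (Eq. 2)
    wp = pred[..., 2] - pred[..., 0]
    hp = pred[..., 3] - pred[..., 1]
    wt = target[..., 2] - target[..., 0]
    ht = target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / d2 + alpha * v  # Eq. (1)
```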

III. EXPERIMENT AND RESULT ANALYSIS

In this paper, we use the Ubuntu 16.04 operating system and two Titan V graphics cards for model training. The deep learning framework is PyTorch 1.2, and the Python version is 3.7.

A. Dataset

The dataset used in this experiment is a seaside human dataset made by our team. It was built by collecting images, processing them, and labeling the human targets in them. Fig. 4 and Fig. 5 show some samples from the dataset.

Figure 4. An image of the dataset

Figure 5. Other images of the dataset

The dataset includes 6079 images and 528422 labeled boxes. The human targets are divided into 4 categories: sea person (Seap), uncertain sea person (Unseap), land person (Landp), and uncertain land person (Unlandp). A sea person is a human target at sea; an uncertain sea person is an uncertain human target at sea; a land person is a human target on land; and an uncertain land person is an uncertain human target on land. The statistics of the different types of labeled boxes are shown in Fig. 6.

Figure 6. The number of different types of labeled boxes

The number of labeled boxes of human targets at sea is 226994, which lays a foundation for training the human target detection model at sea.

B. Anchor box setting

In this paper, we use the K-means++ algorithm to generate anchor boxes; the obtained anchor boxes are (3,7), (5,10), (4,19), (6,15), (9,17), (7,25), (11,25), (8,39), (11,49).
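A minimal sketch of this clustering step is shown below, using scikit-learn's k-means++ initialization on the labeled boxes' (width, height) pairs. The Euclidean distance is an assumption: the paper names only the K-means++ algorithm, and YOLO-style pipelines often use a 1 - IoU distance instead.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchors_from_boxes(wh, k=9):
    """Cluster (width, height) pairs of the labeled boxes into k anchors.

    wh: float array of shape (N, 2) with one row per labeled box.
    """
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_.round().astype(int)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort anchors by area
```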
However, among the anchor boxes generated by the K-means++ algorithm, some have similar sizes, such as (4,19), (6,15), and (9,17). Anchor boxes with similar sizes add useless computation to the object box regression. In this paper, we therefore apply a linear transformation to the generated anchor boxes. The transformation formulas are as follows:

x'_1 = 0.5 x_1    (4)

x'_9 = 2 x_9    (5)

x'_i = \frac{x_i - x_1}{x_9 - x_1} (x'_9 - x'_1) + x'_1    (6)

y'_i = \frac{x'_i y_i}{x_i}    (7)

In the above transformation formulas, the anchor boxes are indexed from 1 to 9 in ascending order; x_i and y_i are the width and height of the i-th anchor box before the transformation, and x'_i and y'_i are the width and height after the transformation. The anchor boxes obtained by this transformation are (1,3), (3,7), (2,10), (4,11), (8,16), (6,21), (11,25), (7,35), (11,49).
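As a sketch, Eqs. (4)-(7) amount to stretching the range of anchor widths and rescaling each height to preserve the box's aspect ratio. Note that Eqs. (4)-(7) are our reconstruction of badly garbled typography in the source, so the exact form (in particular Eq. (7)) should be treated as an assumption, and this sketch may not reproduce the authors' transformed anchors exactly.

```python
import numpy as np

def spread_anchors(anchors):
    """Apply the linear transformation of Eqs. (4)-(7) to clustered anchors.

    anchors: (9, 2) array of (width, height) pairs sorted from smallest
    to largest, as produced by the clustering step.
    """
    x = anchors[:, 0].astype(float)
    y = anchors[:, 1].astype(float)
    x1p = 0.5 * x[0]                                      # Eq. (4): shrink smallest width
    x9p = 2.0 * x[-1]                                     # Eq. (5): grow largest width
    xp = (x - x[0]) / (x[-1] - x[0]) * (x9p - x1p) + x1p  # Eq. (6): linear remap
    yp = xp * y / x                                       # Eq. (7): keep aspect ratio
    return np.stack([xp, yp], axis=1).round().astype(int)
```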

C. Analysis of results

Since the purpose of this paper is to propose a detection technology for drowning people, and considering the effect of small target detection, the detection accuracy of the sea person category at an IoU threshold of 0.25 is used as the evaluation index. The experimental results are shown in Table I.

TABLE I. EXPERIMENT RESULTS

Model     Improved FEN  Improved FFN  CIoU  New anchors  Seap AP25 (%)
Yolov3                                                   59.22
Yolov3-1  ✓                                              60.84
Yolov3-2  ✓             ✓                                64.04
Yolov3-3  ✓             ✓             ✓                  69.16
Yolov3-4  ✓             ✓             ✓     ✓            72.17

Yolov3 uses the anchor boxes generated by the K-means++ clustering algorithm, and its detection accuracy for human targets at sea reaches 59.22%. Yolov3-1 improves on Yolov3 by using the improved feature extraction network, raising the detection accuracy from 59.22% to 60.84%, an increase of 1.62 percentage points. Yolov3-2 improves on Yolov3-1 by using the improved feature fusion network, raising the accuracy from 60.84% to 64.04%, an increase of 3.2 percentage points.

Yolov3-3 improves on Yolov3-2 by using the CIoU loss function, raising the accuracy from 64.04% to 69.16%, an increase of 5.12 percentage points. Yolov3-4 improves on Yolov3-3 by using the anchors generated by the linear transformation, raising the accuracy from 69.16% to 72.17%, an increase of 3.01 percentage points. In summary, the detection accuracy of the final improved algorithm is 72.17%, which is nearly 13 percentage points higher than that of Yolov3.

In addition to describing the detection results with metrics, the effectiveness of the technology can be shown directly on detected images. Fig. 7 and Fig. 8 show the detection results for two different pictures.

Figure 7. The detection result of a picture

Figure 8. The detection result of another picture

In these two pictures, the human targets at sea are surrounded by red boxes. It can be seen that most of the human targets at sea have been detected and are correctly enclosed by the prediction boxes, which demonstrates the effectiveness of the detection technology proposed in this paper. However, the technology still has shortcomings: for example, some human targets are not detected in Fig. 8, and only one target can be detected when two targets are very close.

IV. CONCLUSION

At present, it is difficult to search for and rescue drowning people in sea accidents. To address this problem, we apply object detection technology to the search for drowning people, which provides a new search method. In this paper, we first introduce the seaside human dataset made by ourselves, and then improve the Yolov3 algorithm according to the small scale and weak features of drowning people. Finally, we propose a method for detecting human targets at sea and verify its effectiveness through experiments on the seaside human target dataset. Although this technology shows certain effectiveness, it still has many shortcomings, such as inaccurate predictions, and our subsequent research will continue to optimize it.

ACKNOWLEDGMENT

At the end of this article, I would like to thank some people who are important to me. First of all, I would like to thank my teacher Niu Fu, who has given me a lot of help in my study and life. I would also like to express my thanks to my partners at the Beijing Institute of Control and Electronic Technology, who have helped me a lot. In addition, I would like to thank my parents; it is with their support that I can study at ease. Finally, I would like to sincerely thank the teachers who have worked so hard to review this article!
REFERENCES

[1] C. Papageorgiou and T. Poggio, "A trainable system for object detection," International Journal of Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, June 20-25, 2005.
[3] P. Sabzmeydani and G. Mori, "Detecting pedestrians by learning shapelet features," in IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, 2007, pp. 1-8.
[4] R. Girshick, J. Donahue, T. Darrell, et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," 2013, pp. 580-587.
[5] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision, Santiago, 2015, pp. 1440-1448.
[6] S. Ren, K. He, R. Girshick, et al., "Faster R-CNN: towards real-time object detection with region proposal networks," in International Conference on Neural Information Processing Systems, MIT Press, USA, 2015, pp. 91-99.
[7] J. Redmon, S. Divvala, R. Girshick, et al., "You only look once: unified, real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2015, pp. 779-788.
[8] W. Liu, D. Anguelov, D. Erhan, et al., "SSD: single shot multibox detector," in Computer Vision - ECCV 2016, Springer International Publishing, 2016.
[9] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," arXiv:1804.02767, 2018.
[10] J. Hu, L. Shen, et al., "Squeeze-and-excitation networks," IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[11] Z. Zheng, P. Wang, et al., "Distance-IoU loss: faster and better learning for bounding box regression," IEEE Conference on Computer Vision and Pattern Recognition, 2019.
