
Article
Improved YOLOv4 Marine Target Detection Combined
with CBAM
Huixuan Fu, Guoqing Song and Yuchao Wang *

College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China;
fuhuixuan@hrbeu.edu.cn (H.F.); sgq@hrbeu.edu.cn (G.S.)
* Correspondence: wangyuchao@hrbeu.edu.cn

Abstract: Marine target detection technology plays an important role in sea surface monitoring, sea
area management, ship collision avoidance, and other fields. Traditional marine target detection
algorithms cannot meet the requirements of accuracy and speed. This article uses the advantages of
deep learning in big data feature learning to propose a YOLOv4 marine target detection method
fused with a convolutional attention module. A marine target detection dataset was collected
and produced, and marine targets were divided into ten categories, including speedboat, warship,
passenger ship, cargo ship, sailboat, tugboat, and kayak. Aiming at the problem of insufficient
detection accuracy of YOLOv4 on the self-built marine target dataset, a convolutional attention module is
added to the YOLOv4 network to increase the weight of useful features while suppressing the weight
of invalid features, thereby improving detection accuracy. The experimental results show that the improved
YOLOv4 has higher detection accuracy than the original YOLOv4 and gives better detection results
for small targets, multiple targets, and overlapping targets. The detection speed meets real-time
requirements, verifying the effectiveness of the improved algorithm.

 Keywords: deep learning; marine targets; YOLOv4; CBAM; target detection


Citation: Fu, H.; Song, G.; Wang, Y. Improved YOLOv4 Marine Target Detection Combined with CBAM. Symmetry 2021, 13, 623. https://doi.org/10.3390/sym13040623

Academic Editor: Calogero Vetro

Received: 18 March 2021; Accepted: 2 April 2021; Published: 8 April 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Marine targets are an important carrier of marine resource development and economic activities, so their accurate monitoring has become increasingly important. Carrying out automated research on marine target monitoring is of great significance for strengthening the management of sea areas, and whether illegal targets can be detected and located in time is the focus of marine target monitoring.

Among the traditional marine target detection algorithms, Fefilatyev et al. used a camera system mounted on a buoy to quickly detect ship targets, and proposed a new type of ocean surveillance algorithm for deep-sea visualization [1]. In the context of ship detection, a new horizon detection scheme for complex sea areas has been developed. Shi W. et al. [2] effectively suppressed the noise of the background image and detected the ship target by combining morphological filtering with multi-structural elements and improved median filtering. The influence of sea clutter on ship target detection was eliminated by using connected domain calculation. Chen Z. et al. proposed a ship target detection algorithm for marine video surveillance [3], whose purpose is to reduce the impact of clutter in the background and improve the accuracy of ship target detection. In the proposed detector, the main steps of background modeling, model training update, and foreground segmentation are all based on the Gaussian Mixture Model (GMM). This algorithm not only improves the accuracy of the target, but also greatly reduces the probability of false alarms and reduces the impact of dynamic scene changes. Although traditional methods have achieved good results, in the face of complex and changeable sea environments with a lot of noise interference, traditional detection algorithms have problems such as low detection accuracy and poor robustness. Therefore, traditional methods have great limitations in practical applications.


In recent years, deep learning technology [4] has been increasingly used in target
detection tasks and video surveillance [5], autonomous driving [6], face recognition [7], etc.
The target detection algorithm based on deep learning relies on big data to obtain better
detection results than traditional methods, and has stronger robustness.
Target detection algorithms are mainly divided into three categories: target detection based on
region proposals, target detection based on regression, and anchor-free target detection.
Region-proposal-based detection is also called two-stage target detection, and regression-based
detection is called one-stage target detection. Two-stage detection
algorithms mainly include SPP-Net [8], Fast R-CNN [9], Faster R-CNN [10], and R-FCN [11].
The one-stage target detection algorithm does not require the step of generating region
proposals. It can directly obtain the position and category information of the target, and
the detection speed is greatly improved. One-stage detection algorithms mainly include
SSD [12], YOLO [13], YOLOv2 [14], YOLOv3 [15], and YOLOv4 [16]. Algorithms based
on anchor-free target detection mainly include CornerNet [17] and CenterNet [18], and no
longer use a priori box.
Many researchers have applied deep learning technology to marine target detection.
Gu D. proposed a maritime ship recognition algorithm based on Faster R-CNN [19], which
added a dropout layer after the first fully connected layer of the region proposal network
to improve accuracy. Qi L. et al. proposed an improved target detection algorithm based
on Faster R-CNN [20], which enhanced the useful information of ship images through
image down-sampling, and used scene narrowing technology to improve detection speed.
Zou Y et al. proposed an improved SSD algorithm based on the MobilenetV2 convolutional
neural network for target detection and recognition in ship images [21]. In order to
improve the accuracy and robustness of ship target detection, Huang H et al. proposed an
improved target detection algorithm based on YOLOv3 [22]. First, guided filtering and gray
enhancement were used to preprocess the input image; then k-means++ clustering
was used to obtain the prior boxes; finally, feature redundancy was reduced by removing part of
the convolution operations and adding a skip connection mechanism, improving the accuracy of
ship identification while ensuring real-time performance.
Chen X et al. proposed a framework based on integrated YOLO to detect ships from
ocean images [23]. This method can accurately detect ships and successfully identify ships’
historical behavior. Huang Z et al. proposed an improved YOLOv3 network [24] for
intelligent detection and classification of ship images/videos.
This paper provides a method for improving the detection accuracy of small, multiple,
and overlapping marine targets. The contributions of this paper are described as follows.
Aiming at the problem of insufficient detection accuracy of YOLOv4 in the self-built marine
target data set, an improved YOLOv4 algorithm combined with the convolutional block
attention module (CBAM) network is proposed. By adding a CBAM module to each of
the three branches at the end of the feature fusion network, the weights of the channel
features and spatial features of the feature map are allocated to increase the weight of
useful features while suppressing the weight of invalid features, improving the accuracy of
marine target detection.

2. YOLOv4 Target Detection Algorithm


YOLOv4 introduces path aggregation network (PANet), spatial pyramid pooling (SPP),
mosaic data enhancement, Mish activation function, self-adversarial training, CmBN, and
many other techniques to greatly improve the detection accuracy. The backbone network
uses CSPDarknet53, which integrates the Cross Stage Partial Network (CSPNet), reduces
the amount of calculation while maintaining accuracy, achieving a perfect combination of
speed and accuracy. Figure 1 shows the YOLOv4 network structure.
Figure 1. YOLOv4 network structure.
In Figure 1, the input image is sent to the backbone network to complete feature extraction, then through SPP and PANet to complete the fusion of feature maps of different scales, and, finally, output feature maps of three scales to predict the bounding box, category, and confidence; the head of YOLOv4 is consistent with YOLOv3.

2.1. Feature Extraction Network
YOLOv4 uses a new backbone network, CSPDarknet-53, for feature extraction of the input data. CSPDarknet-53 is an improvement of Darknet-53. Darknet-53 is composed of 5 large residual modules, each separately corresponding to 1, 2, 8, 8, and 4 small residual units. The residual module solves the problem of gradient disappearance caused by the continuous deepening of the network, greatly reduces network parameters, and makes it easier to train deeper convolutional neural networks. The network structure is shown in Figure 2.

Figure 2. CSPDarknet-53 network structure.
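To make the stage layout just described easier to follow, the short sketch below walks through the five residual stages with their 1, 2, 8, 8, and 4 small residual units and the stride-2 downsampling in front of each stage. The channel widths are the commonly used Darknet-53 values and are an assumption for illustration, not taken from the paper.

```python
# Hypothetical sketch of the CSPDarknet-53 stage layout described above.
# Channel widths follow the common Darknet-53 configuration and are assumptions here.
stages = [  # (number of small residual units, output channels)
    (1, 64),
    (2, 128),
    (8, 256),
    (8, 512),
    (4, 1024),
]

size = 416  # input resolution used in this paper
print(f"input: {size} x {size} x 3")
for i, (n_units, channels) in enumerate(stages, start=1):
    size //= 2  # each stage begins with a 3 x 3 convolution of stride 2
    print(f"stage {i}: {n_units} residual units -> {size} x {size} x {channels}")
# The last three stages (52 x 52, 26 x 26, 13 x 13) feed the feature fusion network.
```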


In Figure 2, the Convolution layer consists of a convolutional layer, a batch normalization layer, and a Mish activation function. Cross Stage Partial is a newly added cross-stage local network. The residual layer is a small residual unit. CSPDarknet-53 also removes the pooling layer and the fully connected layer, greatly reducing parameters and improving calculation speed. During training, the image is stretched and scaled to a size of 416 × 416, then sent to the convolutional neural network. After 5 times of 3 × 3/2 convolution (convolution kernel size is 3 × 3, step size is 2), the size is reduced to 13 × 13, and three scales of 52 × 52, 26 × 26, and 13 × 13 are selected as the sizes of the output feature maps. Different sizes of feature maps are used to detect different sizes of targets: small feature maps detect large targets, and large output feature maps detect small targets.
The CSP module in CSPDarknet-53 solves the problem of increased calculation and slower network speed due to redundant gradient information generated in the network deepening process. The structure of the CSP module is shown in Figure 3, and the structure of the small residual unit is shown in Figure 4.
Figure 3. Cross stage partial (CSP) module structure.

In Figure 3, the feature map of the base layer is divided into two parts, and then they are merged through a cross-stage hierarchical structure. The CSP module can achieve a richer gradient combination, greatly reduce the amount of calculation, and improve the speed and accuracy of inference.
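To make the structures in Figures 3 and 4 concrete, the following is a minimal PyTorch-style sketch of a small residual unit (two convolutions plus an additive identity shortcut) and a CSP block that splits the feature map into two parts, runs the residual units on one part, and concatenates it with the other. The Conv–BatchNorm–Mish basic block, the layer widths, and the exact split convolutions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvBNMish(nn.Module):
    """Convolution + batch normalization + Mish, as used in CSPDarknet-53."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    """Small residual unit (Figure 4): two convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvBNMish(channels, channels, k=1)
        self.conv2 = ConvBNMish(channels, channels, k=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))  # add the unit's input and output

class CSPBlock(nn.Module):
    """CSP block (Figure 3): split into two branches, run residual units on one,
    then concatenate the branches and fuse them with a final convolution."""
    def __init__(self, channels, n_units):
        super().__init__()
        half = channels // 2
        self.split_main = ConvBNMish(channels, half, k=1)   # branch with residual units
        self.split_short = ConvBNMish(channels, half, k=1)  # cross-stage shortcut branch
        self.blocks = nn.Sequential(*[ResUnit(half) for _ in range(n_units)])
        self.fuse = ConvBNMish(2 * half, channels, k=1)

    def forward(self, x):
        main = self.blocks(self.split_main(x))
        short = self.split_short(x)
        return self.fuse(torch.cat([main, short], dim=1))

# Quick shape check: a CSP block preserves the spatial size and channel count.
x = torch.randn(1, 256, 52, 52)
print(CSPBlock(256, n_units=8)(x).shape)  # torch.Size([1, 256, 52, 52])
```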
Figure 4. Small residual unit structure.

In Figure 4, there are more shortcuts in the small residual unit than in the normal structure. The shortcut connection is equivalent to directly transferring the input features to the output for identity mapping, adding the input of the previous layer and the output of the current layer.

2.2. Feature Fusion Network
After the feature extraction network extracts the relevant features, the feature fusion network is required to fuse the extracted features to improve the detection ability of the model. The YOLOv4 feature fusion network includes PAN and SPP. The function of the SPP module is to make the input of the convolutional neural network not restricted by a fixed size, and it can increase the receptive field and effectively separate important context features, while not reducing the running speed of the network. The SPP module is located after the feature extraction network CSPDarknet-53, and the SPP network structure is shown in Figure 5.
Figure 5. Spatial pyramid pooling (SPP) network structure.
In Figure 5, the SPP network uses four different scales of maximum pooling to process the input feature maps. The pooling core sizes are 1 × 1, 5 × 5, 9 × 9, and 13 × 13, where 1 × 1 is equivalent to no processing; the four feature maps are then subjected to a concat operation. The maximum pooling adopts a padding operation, the moving step is 1, and the size of the feature map does not change after the pooling layer.
After SPP, YOLOv4 uses PANet instead of the feature pyramid in YOLOv3 as the method of parameter aggregation. PANet adds a bottom-up path augmentation structure after the top-down feature pyramid, which contains two PAN structures, and the PAN structure is modified. The original PAN structure uses a shortcut connection to fuse the down-sampled feature map with the deep feature map, and the number of channels of the output feature map remains unchanged. The modified PAN uses the concat operation to connect the two input feature maps and merge the channel numbers of the two feature maps. The top-down feature pyramid structure conveys strong semantic features, and the bottom-up path augmentation structure makes full use of shallow features to convey strong positioning features. PANet can therefore make full use of shallow features and, for different detector levels, fuse the features of different backbone layers to further improve feature extraction capabilities and detector performance.
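Before moving on to the prediction head, here is a minimal sketch of the SPP block as described above: parallel maximum pooling with stride 1 and padding so that the feature map size is unchanged, followed by concatenation along the channel axis. The 5 × 5, 9 × 9, and 13 × 13 kernel sizes follow the description after Figure 5; the channel count in the example is an arbitrary illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling as described above: the input is max-pooled with
    several kernel sizes (stride 1, padded so the spatial size is unchanged)
    and the results are concatenated with the original map along the channel axis."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # The 1 x 1 branch is the identity (no processing), per the paper.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)
print(SPP()(x).shape)  # torch.Size([1, 2048, 13, 13]): 4 branches, spatial size unchanged
```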
2.3. Predictive Network
YOLOv4 outputs three feature maps of different scales to predict the bounding box position information, corresponding category, and confidence of the target. YOLOv4 continues the basic idea of YOLOv3 bounding box prediction and adopts a prediction scheme based on the priori box. YOLOv4 bounding box prediction is shown in Figure 6.

Figure 6. YOLOv4 bounding box prediction.

In Figure 6, (cx, cy) are the coordinates of the upper left corner of the grid cell where the target center point is located, (pw, ph) is the width and height of the priori box, (bw, bh) is the width and height of the actual prediction box, and (σ(tx), σ(ty)) is the offset value predicted by the convolutional neural network. The position information of the bounding box is calculated by Formulas (1)–(5), where tw and th are also predicted by the convolutional network and (bx, by) are the coordinates of the center point of the actual prediction box. In the obtained feature map, the length and width of each grid cell are 1, so (cx, cy) = (1, 1) in Figure 6, and the sigmoid function is used to limit the predicted offset to between 0 and 1.

bx = σ(tx) + cx (1)
by = σ(ty) + cy (2)
bw = pw e^tw (3)
bh = ph e^th (4)
σ(x) = 1 / (1 + e^(−x)) (5)
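The decoding in Formulas (1)–(5) can be written out directly. The sketch below assumes a single cell with raw network outputs tx, ty, tw, th, grid-cell corner (cx, cy), and prior box (pw, ph), all expressed in grid units; it illustrates the formulas rather than reproducing the authors' code.

```python
import math

def sigmoid(x):
    # Formula (5)
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply Formulas (1)-(4): the offsets are squashed into (0, 1) and added to the
    grid-cell corner; the width and height scale the prior box exponentially."""
    bx = sigmoid(tx) + cx          # (1)
    by = sigmoid(ty) + cy          # (2)
    bw = pw * math.exp(tw)         # (3)
    bh = ph * math.exp(th)         # (4)
    return bx, by, bw, bh

# Example with made-up values: a prior box of 3.6 x 2.4 grid cells at cell corner (6, 7).
print(decode_box(0.2, -0.5, 0.1, 0.3, cx=6, cy=7, pw=3.6, ph=2.4))
```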
The loss function of YOLOv4 includes regression, confidence, and classification loss functions. Among them, the bounding box regression loss function uses CIOU to replace the mean square error loss, which makes the boundary regression faster and more accurate [25]. By minimizing the loss function between the predicted box and the real box, the network is trained and the weights are constantly updated. Confidence and classification loss still use cross entropy loss.

3. CBAM-Based YOLOv4 Network Structure Improvement
The attention mechanism generates a mask through the neural network, and the values in the mask represent the attention weights of different locations. Common attention mechanisms mainly include the channel attention mechanism, the spatial attention mechanism, and the mixed domain attention mechanism. The channel attention mechanism generates a mask for the channels of the input feature map, and different channels have corresponding attention weights to achieve channel-level distinction; the spatial attention mechanism generates a mask on the spatial positions of the input feature map, and different spatial regions have corresponding weights to realize the distinction of spatial regions; the hybrid attention mechanism introduces the channel attention mechanism and the spatial attention mechanism at the same time. In this paper, the mixed attention CBAM module is introduced to make the neural network pay more attention to the target area containing important information [26], suppress irrelevant information, and improve the overall accuracy of target detection.
CBAM is a high-efficiency, lightweight attention module that can be integrated into any convolutional neural network architecture and can be trained end-to-end with the basic network. The CBAM module structure is shown in Figure 7.

Figure 7. Convolutional block attention module (CBAM) module structure.
In Figure 7, the CBAM module is divided into a channel attention module and a spatial attention module. First, the input feature map is fed into the channel attention module, which outputs the corresponding attention map; the input feature map is then multiplied by this attention map. The result passes through the spatial attention module, which performs the same operation, and, finally, the output feature map is obtained. The mathematical expression is as follows:

F′ = Mc(F) ⊗ F
F″ = Ms(F′) ⊗ F′ (6)

where ⊗ represents element-wise multiplication, F is the input feature map, Mc(F) is the channel attention map output by the channel attention module, Ms(F′) is the spatial attention map output by the spatial attention module, and F″ is the feature map output by the CBAM.
3.1. Channel Attention Module
Each channel of the feature map represents a feature detector. Therefore, channel attention is used to focus on what features are meaningful. The structure of the channel attention module is shown in Figure 8.

Figure 8. Channel attention module structure.

In Figure 8, the input feature map F is first subjected to global maximum pooling and global average pooling based on width and height, and the pooled results are then passed through a multi-layer perceptron (MLP) with shared weights. The MLP contains a hidden layer, which is equivalent to two fully connected layers. The two outputs of the MLP are added pixel by pixel, and, finally, the channel attention map is obtained through the Sigmoid activation function. Its mathematical expression is:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
      = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max))) (7)

where σ is the Sigmoid activation function, W0 and W1 are the weights of the MLP, W0 ∈ R^(C/r×C), W1 ∈ R^(C×C/r), r is the dimensionality reduction factor, and r = 16 in this paper.
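As an illustration of Equation (7), the following is a minimal PyTorch sketch of the channel attention branch: global average and maximum pooling, a shared two-layer MLP with reduction factor r = 16, element-wise addition, and a Sigmoid. The ReLU between the two MLP layers and the use of 1 × 1 convolutions to implement W0 and W1 are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of Equation (7): a shared MLP over pooled channel descriptors."""
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared weights W0 (C -> C/r) and W1 (C/r -> C), implemented as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # MLP(AvgPool(F))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx)                            # Mc(F), shape N x C x 1 x 1

x = torch.randn(2, 256, 52, 52)
mc = ChannelAttention(256)(x)
print(mc.shape)        # torch.Size([2, 256, 1, 1])
print((x * mc).shape)  # the refined feature map F' keeps the input size
```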
3.2. Spatial Attention Module
After the channel attention module, the spatial attention module is used to focus on where the meaningful features come from. The structure of the spatial attention module is shown in Figure 9.

Figure 9. Spatial attention module structure.

In Figure 9, the spatial attention module takes F′ as the input feature map and applies channel-based global maximum pooling and global average pooling, respectively. It then merges the two feature maps F^s_avg and F^s_max to obtain a feature map with a channel number of 2, passes it through a 7 × 7 convolutional layer to reduce the channel number to 1, and finally obtains a spatial attention map Ms(F) through a Sigmoid activation function. Its mathematical expression is as follows:

Ms(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)]))
      = σ(f^(7×7)([F^s_avg; F^s_max])) (8)

where 7 × 7 represents the size of the convolution kernel.
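Similarly, Equation (8) can be sketched as follows: channel-wise average and maximum pooling, concatenation into a two-channel map, a 7 × 7 convolution down to one channel, and a Sigmoid. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of Equation (8): a 7 x 7 convolution over pooled channel maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)   # F_avg^s, shape N x 1 x H x W
        mx, _ = torch.max(x, dim=1, keepdim=True)  # F_max^s, shape N x 1 x H x W
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F), N x 1 x H x W

x = torch.randn(2, 256, 52, 52)
ms = SpatialAttention()(x)
print(ms.shape)        # torch.Size([2, 1, 52, 52])
print((x * ms).shape)  # the CBAM output F'' keeps the input size, here 2 x 256 x 52 x 52
```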

3.3. Improved YOLOv4 Algorithm
This paper adds a CBAM module to each of the three branches at the end of the YOLOv4 feature fusion network, aiming at the characteristics of marine targets: denseness, mutual occlusion, and many small targets. By integrating the CBAM module, the weights of the channel features and spatial features of the feature map are assigned to increase the weights of useful features while suppressing the weights of invalid features, paying more attention to target regions containing important information, suppressing irrelevant information, and improving the overall accuracy of target detection. The improved network structure is shown in Figure 10.

Figure 10. Improved YOLOv4 network structure combined with CBAM.

In Figure 10, assuming that the input image size is 416 × 416 × 3, take the first CBAM module as an example: the input feature map size is 52 × 52 × 256. After global maximum pooling and global average pooling, two feature maps of size 1 × 1 × 256 are obtained, which then pass through a multi-layer perceptron with shared weights. The dimensionality is reduced to 1 × 1 × 16 (the dimensionality reduction coefficient is 16) and then increased back to 1 × 1 × 256. The two feature maps are added, and the channel attention map with a size of 1 × 1 × 256 is obtained through the Sigmoid activation function. Multiplying the input feature map by the attention map gives an output of size 52 × 52 × 256. Next, the spatial attention module is entered: through channel-based global maximum pooling and global average pooling, two feature maps of size 52 × 52 × 1 are obtained, and their channels are combined to obtain a feature map of size 52 × 52 × 2. A 7 × 7 convolution is then used to reduce the number of channels to 1. Finally, the Sigmoid activation function is used to obtain a spatial attention map of size 52 × 52 × 1, and the input of the spatial attention module is multiplied by the spatial attention map to obtain an output feature map of size 52 × 52 × 256. The output feature map of the CBAM module is consistent with the input feature map.

4. Network Training and Experimental Results
4.1. Marine Target Data Set
The target detection dataset in this article consists of 3000 images of marine targets collected from the Internet. The data format is JPG. The marine targets are divided into 10 categories, namely speedboat, warship, passenger ship, cargo ship, sailboat, tugboat, kayak, boat, fighter plane, and buoy.
LabelImg is used to label the obtained data. The labeled data is divided into the training set, validation set, and test set according to the ratio of 5:1:1. The format of the dataset is produced according to the VOC dataset. Part of the marine target data is shown in Figure 11.

Figure 11. Part of the marine target dataset.

In order to further increase the generalization ability of the model and increase the diversity of samples, offline data augmentation is performed by mirroring, brightness adjustment, contrast adjustment, random cropping, etc., and the enhanced data is then added to the training dataset to complete the data expansion, finally giving 10,000 pictures.

4.2. Experimental Environment and Configuration
In order to further accelerate the network training speed, this experiment introduces transfer learning technology, loads the model pre-trained on the COCO dataset, and then trains it on the marine target dataset. The hardware environment and software version of the experiment are shown in Table 1.

Table 1. Hardware environment and software version.

Hardware and Software    Configuration
Desktop computer         Operating System: Ubuntu18.04; CPU: Intel i7-7700; RAM: 15.5 G; GPU: NVIDIA GTX 1080Ti (11 G video memory)
Software version         Pycharm2020 + CMake3.10.2 + Python 3.6.8 + CUDNN7.4.2 + Opencv3.4.3 + CUDA10.0.130

The parameters of the training network are shown in Table 2.
Table 2. Training network parameters.

Parameter                Value
Batch size               16
Momentum parameter       0.9
Learning rate            0.001
Number of iterations     20,000
Learning rate decay      0.1
Fixed image size         416 × 416

Among them, the learning rate decay of 0.1 means that the learning rate is reduced to one-tenth of the original after a certain number of iterations. In this experiment, the learning rate was changed at 16,000 iterations and 18,000 iterations.
The convergence curve of the loss function obtained after training the network is shown in Figure 12.

Figure 12. Improved YOLOv4 training loss function curve.
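As a hedged illustration of the schedule in Table 2 (initial learning rate 0.001 decayed by a factor of 0.1 at 16,000 and 18,000 iterations), the snippet below uses PyTorch's MultiStepLR with a placeholder model; the SGD optimizer and per-iteration stepping are assumptions, not the authors' exact training script.

```python
import torch

# Placeholder model so the optimizer has parameters to manage (illustrative only).
model = torch.nn.Conv2d(3, 32, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Decay the learning rate by 0.1 at 16,000 and 18,000 iterations (scheduler stepped per iteration).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16000, 18000], gamma=0.1)

for iteration in range(1, 20001):
    # ... forward pass, loss computation, and optimizer.step() would go here ...
    scheduler.step()
    if iteration in (16000, 18000):
        print(iteration, scheduler.get_last_lr())  # drops to roughly 1e-4, then 1e-5
```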

It can be seen from Figure 12 that, due to the loading of the pre-trained model, the loss value can be reduced to a lower value within a few iterations. It can also be seen that the loss value maintains a downward trend until it finally converges. The loss value drops to a relatively small value when the number of iterations is 4000, reaches a relatively stable level at 18,000 iterations, and, finally, drops to 1.0397. The overall effect of training is ideal.
4.3. Marine Target Detection Performance Comparison
In order to verify the effectiveness of the improved YOLOv4 network, a comparative experiment was conducted between the original YOLOv4 training model and the improved YOLOv4 network model. The original YOLOv4 training parameters are consistent with the improved YOLOv4 training parameters. The commonly used target detection evaluation metric mAP is used to compare the model before and after the improvement.
In the target detection task, the intersection over union (IOU), i.e., the ratio of the intersection and union between the prediction box and the ground truth box, is used to determine whether the target is successfully detected. For a certain type of target in the dataset, assuming the threshold is α, when the IOU of the prediction box and ground truth box is greater than α, the model prediction is considered correct; when the IOU of the prediction box and ground truth box is less than α, the model prediction is considered incorrect. The confusion matrix is shown in Table 3 below.

Table 3. Confusion matrix.

            Prediction
Real        Positive    Negative
True        TP          FN
False       FP          TN

In Table 3, TP represents the number of positive samples correctly predicted, FP represents the number of negative samples incorrectly predicted, FN is the number of positive samples incorrectly predicted, and TN represents the number of negative samples correctly predicted. The calculation formulas for precision and recall are as follows:

Precision = TP / (TP + FP) (9)
Recall = TP / (TP + FN) (10)
AP value is usually used as the evaluation index of target detection performance. AP
value is the area under the P-R curve, in which recall is taken as X-axis and precision as
Y-axis. AP represents the accuracy of the model in a certain category. mAP represents the
average of the accuracy of all categories and can measure the performance of the network
model in all categories. mAP50 represents the mAP value where the IOU of the prediction
box and ground truth box is greater than 0.5, and mAP75 represents the mAP value where
the IOU threshold is greater than 0.75. The calculation formula of mAP is as follows:

mAP = (1/N) Σ_{i=1}^{N} ∫_0^1 P dR (11)

where N represents the number of detected categories.
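The integral in Formula (11) is normally evaluated from a ranked list of detections. The sketch below builds a precision–recall curve from hypothetical true/false-positive flags and integrates it numerically; it is a simplified illustration of the AP computation, not the evaluation code used in the paper.

```python
import numpy as np

def average_precision(tp_flags, num_ground_truth):
    """Area under the precision-recall curve for one class (the integrand of Formula (11)).
    tp_flags: detections sorted by confidence, 1 if matched to a ground truth box
    (IOU above the threshold), 0 otherwise."""
    flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(flags)
    fp = np.cumsum(1 - flags)
    recall = tp / num_ground_truth          # Formula (10) over the ranked list
    precision = tp / (tp + fp)              # Formula (9) over the ranked list
    # Numerical integration of P dR over recall.
    return float(np.trapz(precision, recall))

# Hypothetical ranked detections for one class: 1 = true positive, 0 = false positive.
print(average_precision([1, 1, 0, 1, 0, 0, 1], num_ground_truth=5))
# mAP is the mean of the per-class AP values over all N detected categories.
```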


FPS (Frame Per Second) is used to evaluate the detection speed of the algorithm.
It represents the number of frames that can be processed per second. Models have different
processing speeds under different hardware configurations. Therefore, this article uses the
same hardware environment when comparing detection speed.
The data of the test set is sent to the trained target detection model, and different
thresholds are selected for experimental comparison. The model comparison before and
after the improvement of YOLOv4 is shown in Table 4.

Table 4. Comparison of original and improved YOLOv4 results.

Evaluation Algorithm     mAP50     mAP75     FPS     Model Volume
YOLOv4                   82.04%    66%       53.6    256.2 MB
Improved YOLOv4          84.06%    67.85%    50.4    262.4 MB

It can be seen from Table 4 that both mAP50 and mAP75 of the YOLOv4 combined with
the CBAM algorithm are improved, in which mAP50 is increased by 2.02%, and mAP75 is
increased by 1.85%. Because of the addition of three CBAM modules, the volume of the
model becomes larger, from 256.2 MB to 262.4 MB, resulting in a slight decrease in FPS
value from 53.6 to 50.4, but the speed still meets the real-time requirements. Figure 13
shows the test results before and after the improvement.
In Figure 13, the first column is the input pictures, the second column is the YOLOv4
detection results, and the third column is the improved YOLOv4 detection results. In the
first row, YOLOv4 mis-detected the tug as a warship, and the improved YOLOv4 success-
fully detected it as a tug; from the second, fifth, sixth, and seventh rows, it can be seen that
the improved YOLOv4 detects small targets more accurately than the original algorithm,
and more small targets are detected. Among them, the fifth row of ship targets has more
background environment interference. Improved YOLOv4 has stronger robustness and
detects more targets. In the seventh row, the improved YOLOv4 successfully detects the
small target of the occluded cargo ship; from the third and fourth rows, it can be seen that
the improved YOLOv4 can detect more mutually occluded targets when the ship targets
are dense and mutually occluded. The last line shows that when there is interference in the
background, the original algorithm detection box has a position shift, and the improved
algorithm detection box is more accurate.

Figure 13. Comparison of YOLOv4 and improved YOLOv4 combined with CBAM algorithm marine
target detection results. (a) Original picture; (b) YOLOv4 detection results; (c) Improved YOLOv4
detection results.

For dense targets, mutual occlusion targets, and small targets, the improved YOLOv4
network can effectively detect and reduce the missed detection rate. In addition, the
original YOLOv4 network will cause false detections when detecting targets. The im-
proved YOLOv4 network improves the target false detection situation. According to the
experimental results, the improved YOLOv4 combined with the CBAM target detection
algorithm proposed in this paper is more effective, improves the accuracy of target de-
tection, can basically meet the needs of marine target detection tasks, and has practical
application value.

5. Conclusions
Marine target detection technology is of great significance in the fields of sea surface
monitoring, sea area management, ship collision avoidance, etc. This paper focused on the
problem of insufficient detection accuracy of YOLOv4 on the self-built ship dataset. On the
basis of YOLOv4, the CBAM attention module is added to make the neural network pay
more attention to the target area that contains important information, suppress irrelevant
information, and improve detection accuracy. Experimental results show that the improved
YOLOv4 model has higher accuracy in target detection tasks than the original YOLOv4,
mAP50 is increased by 2.02%, mAP75 is increased by 1.85%, and the detection speed meets
the real-time requirements, which verifies the effectiveness of the improved algorithm. It
provides a theoretical reference for further practical applications.

Author Contributions: Conceptualization, H.F. and Y.W.; methodology, H.F. and G.S.; software, G.S.
and Y.W.; validation, H.F. and G.S.; writing—original draft preparation, G.S.; writing—review and
editing, H.F. and G.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant num-
ber 52071112; Fundamental Research Funds for the Central Universities, grant number 3072020CF0408.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

YOLO v4 You Only Look Once (version 4)


CBAM Convolutional Block Attention Module
GMM Gaussian Mixture Model
SPP Spatial Pyramid Pooling
CmBN Cross mini-Batch Normalization
Fast R-CNN Fast Region Convolutional Neural Networks
Faster R-CNN Faster Region Convolutional Neural Networks
R-FCN Region based Fully Convolutional Network
SSD Single Shot MultiBox Detector
PANet Path Aggregation Network
CSPDarknet53 Cross Stage Partial Darknet53
CSPNet Cross Stage Partial Network
IOU Intersection Over Union
CIOU Complete Intersection Over Union
MLP Multi-Layer Perceptron
VOC Visual Object Classes
COCO Common Objects in Context
AP Average Precision
mAP mean Average Precision
TP True Positive

FP False Positive
TN True Negative
FN False Negative
P-R curve Precision-Recall curve
FPS Frame Per Second

References
1. Fefilatyev, S.; Goldgof, D.; Shreve, M.; Lembke, C. Detection and tracking of ships in open sea with rapidly moving buoy-mounted
camera system. Ocean Eng. 2012, 54, 1–12. [CrossRef]
2. Shi, W.; An, B. Port ship detection method based on multi-structure morphology. Comput. Syst. Appl. 2016, 25, 283–287.
3. Chen, Z.; Yang, J.; Chen, Z.; Kang, Z. Ship Target Detection Algorithm for Maritime Surveillance Video Based on Gaussian
Mixture Model. In Proceedings of the IOP Conference, Bangkok, Thailand, 27–29 July 2018.
4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
5. Chen, J.; Li, K.; Deng, Q.; Li, K.; Yu, P.S. Distributed deep learning model for intelligent video surveillance systems with edge
computing. IEEE Trans. Ind. Inform. 2019. [CrossRef]
6. Xu, Y.; Wang, H.; Liu, X.; He, H.R.; Gu, Q.; Sun, W. Learning to See the Hidden Part of the Vehicle in the Autopilot Scene.
Electronics 2019, 8, 331. [CrossRef]
7. Goswami, G.; Ratha, N.; Agarwal, A.; Singh, R.; Vatsa, M. Unravelling Robustness of Deep Learning based Face Recognition
Against Adversarial Attacks. arXiv 2018, arXiv:1803.00401.
8. Purkait, P.; Zhao, C.; Zach, C. SPP-Net: Deep absolute pose regression with synthetic views. arXiv 2017, arXiv:1712.03452.
9. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18
December 2015; pp. 1440–1448.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans.
Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [CrossRef] [PubMed]
11. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409.
12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. Ssd: Single Shot Multibox Detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 27–30 June 2016; pp. 779–788.
14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
15. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
16. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
17. Law, H.; Deng, J. Cornernet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer
Vision, Munich, Germany, 8–14 September 2018; pp. 734–750.
18. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE
International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6569–6578.
19. Gu, D.; Xu, X.; Jin, X. Marine Ship Recognition Algorithm Based on Faster-RCNN. J. Image Signal Process. 2018, 7, 136–141.
[CrossRef]
20. Qi, L.; Li, B.; Chen, L.; Wang, L.; Dong, L.; Jia, X.; Huang, J.; Ge, C.; Xue, G.; Wang, D. Ship target detection algorithm based on
improved faster R-CNN. Electronics 2019, 8, 959. [CrossRef]
21. Zou, Y.; Zhao, L.; Qin, S.; Pan, M.; Li, Z. Ship Target Detection and Identification Based on SSD_MobilenetV2. In Proceedings of
the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June
2020; pp. 1676–1680.
22. Huang, H.; Sun, D.; Wang, R.; Zhu, C.; Liu, B. Ship Target Detection Based on Improved YOLO Network. Math. Probl. Eng. 2020,
2020, 1–10. [CrossRef]
23. Chen, X.; Qi, L.; Yang, Y.; Luo, Q.; Postolache, O.; Tang, J.; Wu, H. Video-based detection infrastructure enhancement for
automated ship recognition and behavior analysis. J. Adv. Transp. 2020, 2020, 1–12. [CrossRef]
24. Huang, Z.; Sui, B.; Wen, J.; Jiang, G. An intelligent ship image/video detection and classification method with improved regressive
deep convolutional neural network. Complexity 2020, 2020, 1–11. [CrossRef]
25. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression; AAAI:
New York, NY, USA, 2020; pp. 12993–13000.
26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference
on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
