
electronics

Article
A Lightweight Model for Real-Time Monitoring of Ships
Bowen Xing 1, *,† , Wei Wang 1,† , Jingyi Qian 2,3 , Chengwu Pan 4 and Qibo Le 4

1 College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China;
ww1585395980@163.com
2 College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China;
2180140@tongji.edu.cn
3 Shanghai Aerospace Electronics Co., Ltd., Shanghai 201800, China
4 Ningbo Communication Center, Ningbo 315800, China; nbpcw@163.com (C.P.); leqibo@126.com (Q.L.)
* Correspondence: bwxing@shou.edu.cn
† These authors contributed equally to this work.

Abstract: Real-time monitoring of ships is crucial for inland navigation management. Under complex
conditions, it is difficult to balance accuracy, real-time performance, and practicality in ship detection
and tracking. We propose a lightweight model, YOLOv8-FAS, to address this issue for real-time ship
detection and tracking. First, FasterNet and the attention mechanism are integrated and introduced
to achieve feature extraction simply and efficiently. Second, the lightweight GSConv convolution
method and a one-shot aggregation module are introduced to construct an efficient network neck
to enhance feature extraction and fusion. Furthermore, the loss function is improved based on ship
characteristics to make the model more suitable for ship datasets. Finally, the advanced Bytetrack
tracker is added to achieve the real-time detection and tracking of ship targets. Compared to the
YOLOv8 model, YOLOv8-FAS reduces computational complexity by 0.8 × 10⁹ FLOPs and
reduces model parameters by 20%, resulting in only 2.4 × 10⁶ parameters. The mAP-0.5 is improved
by 0.9%, reaching 98.50%, and the real-time object tracking precision of the model surpasses 88%.
The YOLOv8-FAS model combines light weight with high precision, and can accurately perform
ship detection and tracking tasks in real time. Moreover, it is suitable for deployment on hardware
resource-limited devices such as unmanned surface ships.

Keywords: ship monitoring; deep learning; lightweight model; real-time tracking; YOLOv8

1. Introduction

Inland waterway transportation is an essential component of the integrated transportation system, playing a significant role in promoting urban development, optimizing resource allocation, and facilitating communication and cooperation [1,2]. In urban inland waters or coastal environments such as harbors, there is a diverse range of ship types and a relatively dense distribution of ships [3]. While inland waterway transportation has its advantages, it presents formidable challenges to navigational supervision [4]. Tasks such as ship density statistics, ship behavior monitoring, accident investigation, and aiding navigation rely on ship monitoring. The core of ship monitoring is efficient detection and real-time tracking of ship objects.

The application of Convolutional Neural Networks (CNN) [5] in various object detection and tracking fields has become increasingly widespread. In recent years, researchers have introduced it into ship monitoring tasks. Object detection algorithms based on convolutional neural networks can automatically extract essential features from images through training, providing a way to break free of the limitations involved in manual extraction of unbalanced qualities and leading to more accurate and efficient detection results. Object tracking, as the subsequent task following detection, has gained popularity in engineering applications. Currently, deep learning-based object detection algorithms mainly fall into the categories of one-stage detection algorithms, represented by Single Shot MultiBox Detector (SSD) [6] and the YOLO series, and two-stage detection algorithms, represented by the
R-CNN and Faster R-CNN series [7]. Zhao et al. [8] employed pretrained networks for
feature extraction, reducing redundant feature mapping to enhance the Faster R-CNN de-
tection network and achieving promising detection results. However, two-stage algorithms
such as Faster R-CNN fail to meet real-time detection requirements. Zhang et al. [9] pro-
posed a lightweight model based on SSD for ship detection by introducing a bidirectional
feature fusion module and an attention mechanism to the model. While they achieved im-
proved detection results, the detection ability of small targets and dense targets remains an
area that needs to be strengthened. Building upon YOLOv3, Yang et al. [10] introduced the
K-means clustering algorithm and Soft-NMS algorithm and modified the output classifier,
improving the precision of ship detection, though at the cost of increased model
complexity and memory usage. They coupled these modifications with the DeepSORT
tracker to provide effective ship tracking [10].
While progress has been made in real-time ship monitoring using deep learning
methods, this task involves several further challenges under natural conditions. First,
real-time monitoring of ships is mostly based on SAR ship detectors [11], and the research
on ship detection and tracking in real situations needs to be deepened [12]. Second, the
surface conditions of water are complex, being characterized by various interference factors
such as river structures, buoys, wakes, and other obstacles, all of which impact detection
accuracy [13]. Furthermore, the similar appearance of different ships, together with large variations within each ship type, adds to
these challenges. Achieving a balanced trade-off between accuracy, speed, and computa-
tional cost while deploying ship monitoring models on devices with limited memory and
computing capabilities for practical applications remains a major challenge [14].
In this paper, we introduce a lightweight ship detection and tracking model to address
real-time ship monitoring challenges within natural water environments. This model is an
enhanced version of the YOLOv8n algorithm designed for deployment on devices with
limited memory and computational resources. It combines the improved YOLOv8 detector
with the advanced Bytetrack tracker to reduce the parameters and computational complex-
ity of the model while maintaining model performance. According to the characteristics of
the ship dataset, the loss function is refined to improve the prediction performance and
optimization effect of the model. Our contributions are detailed as follows:
• Inspired by FasterNet, we integrate simple and effective FasterNet blocks into the
backbone of YOLOv8n. Additionally, we fuse the attention mechanism into the
FasterNet block, enhancing the backbone’s lightweight nature and feature extraction
capabilities.
• We introduce a lightweight yet feature-rich neck network, and employ the lightweight
GSConv convolution approach as a substitute for conventional convolution modules.
Additionally, we replace the complex CSP module with a one-shot VoV-GSCSP aggrega-
tion module based on the GSConv design. Flexibly combining GSConv and VoV-GSCSP
achieves an improved balance between computational costs and the performance of the
feature fusion network.
• We introduce an IoU loss measure called MPDIoU based on the minimum point dis-
tance to address the limitations of existing loss functions, leading to faster convergence
speed and more accurate regression results.
• We collected and processed surveillance images of waterborne ships in order to create
a dataset designed explicitly for ship detection and tracking. This dataset includes
various types of ships, making it suitable for real-time ship monitoring tasks.

2. Related Works
2.1. Object Detection
As one of the representative examples of one-stage object detection algorithms [15],
the YOLO series [16] utilizes deep neural networks to identify and locate objects, offering
high operational speeds suitable for real-time monitoring and tracking tasks. The authors
of YOLOv5 have recently introduced a novel state-of-the-art model known as YOLOv8. The
specific architecture of YOLOv8 is shown in Figure 1. Building upon previous iterations
of the YOLO series, YOLOv8 incorporates several improvements that enhance detection
accuracy and speed. This makes it particularly well-suited to serve as a baseline for
ship detection.


Figure 1. YOLOv8 model structure.

The entire YOLOv8 network’s operation involves feature extraction, feature enhance-
ment, and prediction of object conditions corresponding to prior bounding boxes.
The backbone is the primary feature extraction network within YOLOv8, where input
images are initially processed to extract features. These extracted features are referred to as
feature layers, constituting a collection of characteristics from the input images. YOLOv8
leverages three effective feature layers within the backbone for constructing subsequent
network components. Compared to previous YOLO series algorithms, YOLOv8 employs
3 × 3 convolution kernels with a stride of two for initial feature extraction, sacrificing
receptive field while enhancing the model’s speed. The preprocessing of the CSP [17]
module involves replacing three successive convolutions with two convolutions, drawing
inspiration from the ELAN [18] architecture of YOLOv7. The specific implementation
method is to expand the number of channels for the first convolution to twice the original
number, then split the convolution results in half on the channels. This approach reduces
the number of convolutions and accelerates the network’s speed.
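As a rough illustration of this split-and-concatenate design, the PyTorch sketch below shows one possible reading of the description above; the class names, the SiLU activations, and the choice to concatenate every bottleneck output are illustrative assumptions rather than the official Ultralytics implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """3x3 -> 3x3 bottleneck with an optional residual add."""
    def __init__(self, c, add=True):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, 1, 1, bias=False)
        self.conv2 = nn.Conv2d(c, c, 3, 1, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(c), nn.BatchNorm2d(c)
        self.act = nn.SiLU()
        self.add = add

    def forward(self, x):
        y = self.act(self.bn2(self.conv2(self.act(self.bn1(self.conv1(x))))))
        return x + y if self.add else y

class CSPBlock(nn.Module):
    """First conv doubles the channels, the result is split in half on the
    channel dimension, one half passes through n bottlenecks, and all parts
    are concatenated before a final 1x1 conv."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.conv_in = nn.Conv2d(c_in, 2 * c_out, 1, 1, bias=False)
        self.blocks = nn.ModuleList(Bottleneck(c_out) for _ in range(n))
        self.conv_out = nn.Conv2d((2 + n) * c_out, c_out, 1, 1, bias=False)

    def forward(self, x):
        y = list(self.conv_in(x).chunk(2, dim=1))   # split on channels
        for block in self.blocks:
            y.append(block(y[-1]))                  # feed the latest output forward
        return self.conv_out(torch.cat(y, dim=1))
```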
The Feature Pyramid Network (FPN) [19] is the enhanced feature extraction network
in YOLOv8. The three effective feature layers obtained from the backbone in the main
section of YOLOv8 are fused in the FPN component. Feature fusion aims to combine
features from different scales to facilitate the extraction of more refined characteristics.
Within the FPN segment, the obtained effective feature layers are employed to further
extract the features. YOLOv8 continues to adopt the PANet structure [20], which involves
upsampling features for feature fusion and then downsampling them again to achieve
feature fusion.
The YOLO head serves as the classifier and regressor within YOLOv8. With the
contribution of the backbone and neck, the network obtains three enhanced and effective
feature layers. Each feature layer has the dimensions of width, height, and channel count.
If we consider the feature map as a collection of individual feature points, each feature
point acts as a prior point, eliminating the need for prior bounding boxes. Instead, each
prior point contains features equal to the number of channels. The role of the YOLO Head
is to determine whether an object is associated with each prior point by examining the
corresponding priors’ conditions. YOLOv8 transitions from the previous coupled head
design to a decoupled head design in which classification and regression are no longer
realized within the same 1 × 1 convolution layer.

The loss function of YOLOv8 comprises both regression and classification components.
The predicted category results for the priors are taken in the classification part; the cross-
entropy loss is calculated based on the true box category and the predicted category for
each prior. YOLOv8 employs the Distribution Focal (DF) loss [21] for the final regression
prediction, necessitating the inclusion of the DF loss in the regression section. The regression
loss in YOLOv8 comprises the CIoU loss [22] and the DF loss.

2.2. Lightweight Object Detection Models


In order to achieve effective detection results for ship detection models under con-
strained memory and computational resources, researchers have proposed a series of
lightweight object detection algorithms. The YOLO series [23–29], MobileNet [30], Ghost-
Net [31], and ShuffleNet [32] are all widely employed as lightweight object detection
models. MobileNets extensively employ 1 × 1 convolutions to fuse separately calculated
channel information. ShuffleNets introduce channel shuffling to facilitate mutual communi-
cation of channel information. GhostNets utilize half the standard convolution operations
to maintain inter-channel information exchange.
Recently, Li et al. [33] introduced a novel approach to reduce model complexity while
maintaining accuracy. The authors combined the concepts of MobileNet, GhostNet, and
ShuffleNet, resulting in a lightweight convolution called GSConv. As illustrated in Figure 2,
GSConv initially performs downsampling on the input through a standard convolution (SC),
followed by a depth-wise separable convolution (DSC). Subsequently, the SC and DSC outputs
are concatenated. Finally, a shuffle operation is applied to mix the SC information into the DSC channels.
The computational complexity of GSConv is approximately half that of SC while retaining
a similar learning capacity to SC.
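The following PyTorch sketch illustrates this SC + DSC + shuffle pattern under our reading of the description; the kernel sizes, the activation, and the shuffle implementation are assumptions and may differ from the reference GSConv code.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: a standard convolution (SC) produces half of the output
    channels, a depth-wise convolution stands in for the DSC branch on the other
    half, the two halves are concatenated, and a channel shuffle mixes them."""
    def __init__(self, c_in, c_out, k=3, s=2):
        super().__init__()
        c_half = c_out // 2                            # assumes an even c_out
        self.sc = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwc = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        x1 = self.sc(x)                                # SC branch (downsampling when s=2)
        x2 = self.dwc(x1)                              # depth-wise branch
        y = torch.cat((x1, x2), dim=1)                 # concatenate the two halves
        b, c, h, w = y.shape                           # channel shuffle: interleave the halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```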

Figure 2. GSConv.

In addition to object detection models, there are ongoing efforts to achieve fast neural
networks, which hold significant relevance for object detection models. Chen et al. [34]
introduced a new neural network, FasterNet, with a simple architecture that exhibits
remarkable speed; it proved to be highly effective for various visual tasks as well as
being hardware-friendly. The authors introduced a simple and rapid partial convolution,
PConv, to reduce redundant calculations and memory access, enabling better utilization
of computational capabilities on devices and more efficient spatial feature extraction.
However, the PConv operation can result in loss of information, affecting the accuracy
and generalization capability of the model. The optimization methods of FasterNet may
need to be adjusted and optimized for different tasks and datasets. As depicted in Figure 3,
PConv leverages redundancy within the feature map, applying general convolution (Conv)
to only a subset of input channels for spatial feature extraction while leaving the remaining
channels unchanged. The FasterNet architecture built upon PConv performs well, and
exhibits universal speed on different devices such as GPU, CPU, and ARM processors. It is
well-suited for real-time ship detection, ship tracking, and similar tasks.
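A minimal PyTorch sketch of this partial convolution idea follows; the 1/4 partial ratio and the split-then-concatenate layout are assumptions taken from the general description above.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution sketch: a regular 3x3 convolution is applied to the
    first 1/ratio of the channels only, while the remaining channels pass
    through untouched."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.c_conv = channels // ratio                 # channels that get convolved
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)    # untouched channels are concatenated back
```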

Figure 3. PConv.

2.3. Attention Mechanism


The attention mechanism can allocate computing resources to tasks that need to be
focused on in the case of limited computing power and quickly screen out more important
information related to the target task. In the target detection model, the introduction of
an attention mechanism can help the model to strengthen the extraction of features and
improve the performance of the detection network. Researchers have introduced various
attention mechanisms, primarily classified into channel attention, spatial attention, or
both. The Squeeze-and-Excitation (SE) attention mechanism models cross-dimensional
interactions to extract channel-wise attention [35]. However, the SE mechanism overlooks
critical spatial information in the image, limiting the improvement in model accuracy to an
extent. To address this concern, Woo et al. [36] proposed the Convolutional Block Attention
Module (CBAM), which establishes cross-channel and cross-spatial information and then
integrates cross-dimensional attention weights into the input features. The implementation
schematic of CBAM is illustrated in Figure 4.

Figure 4. CBAM.

The channel attention module is used to adjust the weights of each channel in the
feature map, aiding the network in selecting relevant feature channels. This keeps the
channel dimension unchanged while compressing the spatial dimensions. The input feature
layer initially undergoes global average pooling and global maximum pooling. A shared
fully connected layer processes the pooling results individually before being added together.
Finally, the sigmoid activation function is applied to obtain the weight for each channel in
the input feature layer, which is then multiplied by the original input feature layer. The
expression for channel attention is as follows:

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (1)

The spatial attention module is employed to adjust weights at different positions


in the feature map, aiding in selecting relevant feature regions. It maintains the spatial
dimension while compressing the channel dimension. The input feature layer computes
the maximum and average values for each feature point’s channel, then stacks these two
results. Convolution with a channel count of 1 is applied to adjust the channel count.
Finally, the sigmoid activation function is applied to obtain the weight for each feature
point in the input feature layer, which is then multiplied by the original input feature layer.
The expression for spatial attention is as follows:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) (2)


where F is the input feature map, AvgPool represents average pooling, MaxPool represents
maximum pooling, MLP is the shared fully connected layer module, σ is the activation
function sigmoid, and f^{7×7} represents a 7 × 7 convolution.
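A compact PyTorch sketch of Equations (1) and (2) is shown below; the reduction ratio of 16 and the use of 1 × 1 convolutions to implement the shared MLP are common choices and are assumptions here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Equation (1): shared MLP over global average- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Equation (2): 7x7 convolution over the channel-wise average and max maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat((avg, mx), dim=1)))

class CBAM(nn.Module):
    """Applies channel attention, then spatial attention, as weights on the input."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```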

2.4. Loss Function


The bounding box regression loss function is a crucial component of the object de-
tection loss function, and has a significant impact on the performance of object detection
models [37]. The original bounding box loss function for the YOLOv8 network is the CIoU
loss. The CIoU takes into account the overlap area, the distance between central points,
and the aspect ratio of the width and height between the prediction boxes and ground truth
boxes; its loss function is shown in Equation (3):

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv (3)
where IoU is the intersection ratio between the predicted box and the ground truth box, b
represents the center point of the predicted box, b^gt represents the center point of the
ground truth box, c represents the diagonal distance of the smallest rectangular box covering
both boxes, ρ represents the distance between the center points b and b^gt, α is the weight
function, and v describes the aspect ratio consistency, with α and v defined below.
α = v / ((1 − IoU) + v) (4)

v = (4/π²) (arctan(w^gt/h^gt) − arctan(w/h))² (5)
Most existing approaches represented by CIoU do not involve image dimensions,
leaving them unable to optimize cases in which the predicted boxes and ground truth boxes
share the same aspect ratio while having completely different width and height values.
As a result, we introduced a novel bounding box similarity comparison metric, MPDIoU,
based on the minimum point distance.

2.5. Multiple Object Tracking


Object tracking is a task in computer vision that involves real-time localization and
tracking of specific objects in video sequences. Multiple Object Tracking (MOT) is a task
in which objects such as ships, pedestrians, and cars are detected and assigned unique
IDs for trajectory tracking in video sequences without prior knowledge of the number of
targets. Different objects are assigned different IDs. MOT typically comprises a detector
module and a data association module. With advancements in object detection techniques,
’tracking-by-detection’ has emerged as one of the mainstream frameworks for MOT.
Zhang [22] proposed a multi-object tracking model called ByteTrack based on object
detection. This model retains all detected boxes and categorizes them into high-score
and low-score detection boxes. It performs tracking by associating each detection box
rather than just high-score ones. The ByteTrack model utilizes YOLOX as its detector
module. A simple and efficient data association method called BYTE is introduced in
the data association part. BYTE leverages the similarity between detection boxes and
tracking trajectories. It retains high-score detection results while removing the background
from low-score detection results, thereby uncovering genuine objects (e.g., challenging
samples such as occluded or blurred instances). This approach reduces missed detections
and enhances trajectory coherence. Specifically, BYTE first matches high-score boxes with
previous tracking trajectories and then matches low-score boxes with tracking trajectories
that are not initially matched with high-score boxes. BYTE creates a new tracking trajectory
for detection boxes without matched tracking trajectories that have sufficiently high scores.
To track trajectories without matched detection boxes, BYTE retains them for 30 frames and
attempts matching again when they reappear.
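The following self-contained Python sketch illustrates the BYTE association logic described above; the score thresholds and the greedy IoU matcher stand in for the Hungarian assignment used by the actual tracker, so this is a conceptual sketch rather than the ByteTrack implementation.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match(tracks, dets, thr=0.3):
    """Greedy IoU matching; the real BYTE step uses Hungarian assignment."""
    matches, used = [], set()
    for ti, trk in enumerate(tracks):
        best, best_iou = None, thr
        for di, det in enumerate(dets):
            if di in used:
                continue
            v = iou(trk["box"], det["box"])
            if v > best_iou:
                best, best_iou = di, v
        if best is not None:
            matches.append((ti, best))
            used.add(best)
    matched_t = {ti for ti, _ in matches}
    un_t = [i for i in range(len(tracks)) if i not in matched_t]
    un_d = [i for i in range(len(dets)) if i not in used]
    return matches, un_t, un_d

def byte_step(tracks, dets, next_id, high_thr=0.6, low_thr=0.1, max_lost=30):
    """One BYTE association step. Tracks are dicts {"id", "box", "lost"} and
    detections are dicts {"box", "score"}; all names are illustrative."""
    high = [d for d in dets if d["score"] >= high_thr]
    low = [d for d in dets if low_thr <= d["score"] < high_thr]
    # 1) match high-score boxes with the existing trajectories
    m1, un_t, un_h = greedy_match(tracks, high)
    for ti, di in m1:
        tracks[ti].update(box=high[di]["box"], lost=0)
    # 2) match the leftover trajectories with low-score boxes
    #    (recovers occluded or blurred ships instead of discarding them)
    m2, un_t2, _ = greedy_match([tracks[i] for i in un_t], low)
    for tj, di in m2:
        tracks[un_t[tj]].update(box=low[di]["box"], lost=0)
    # 3) unmatched high-score boxes start new trajectories
    for di in un_h:
        tracks.append({"id": next_id, "box": high[di]["box"], "lost": 0})
        next_id += 1
    # 4) trajectories with no detection are kept for up to max_lost frames
    for tj in un_t2:
        tracks[un_t[tj]]["lost"] += 1
    tracks[:] = [t for t in tracks if t["lost"] <= max_lost]
    return tracks, next_id
```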

3. Methodology
Our detector model is an enhancement of the YOLOv8 network. The YOLOv8 project
categorizes the network into five sizes based on different depth, width, and maximum
channel combinations (n, s, m, l, x). YOLOv8n was selected as the baseline with a minor
parameter count and a balanced detection performance. Building upon YOLOv8n, we
incorporated the ideas of FasterNet, attention mechanism, slim-neck, and MPDIoU to
create an optimized model called YOLOv8-FAS. The overall architecture of the YOLOv8-
FAS model is illustrated in Figure 5. YOLOv8-FAS enhances detection accuracy while
reducing parameter count and computational complexity. The detection result inputs
the ByteTrack tracker, which depends on the detection accuracy, ultimately realizing the
real-time monitoring of surface ships.

Figure 5. YOLOv8-FAS model structure.

3.1. Backbone
In this paper, we propose a lightweight yet solid feature extraction backbone. The
primary architecture is depicted in Figure 5; the backbone section is designed following
the feature extraction network structure of YOLOv8 while incorporating both the efficient
concept of FasterNet and the excellent feature extraction capabilities of the attention mech-
anism. Initially, YOLOv8-FAS employs ordinary 3 × 3 convolutional kernels with a stride
of two for initial feature extraction. After drawing inspiration from the ideas of FasterNet
and the attention mechanism, the original CSP module is modified, leading to the creation
of a novel CSP-A module. The CSP-A module is illustrated in Figure 6.

Figure 6. CSP-A module.

The CSP-A module ingeniously combines the FasterNet block and the CBAM attention
mechanism. The CSP-A module shows fewer parameters, lower computation, and higher
feature extraction efficiency than the traditional CSP module. Its specific implementation
involves doubling the channel count of the first convolutional layer, then splitting the
convolutional output in half along the channel dimension. One of the halves is fed into the
FasterNet block for processing. The concatenated output is subjected to the lightweight
CBAM attention mechanism, further enhancing the extraction of image features. The
specific structure and functioning of CBAM have been detailed in the second section of this
paper. The FasterNet block is an inverted residual block consisting of a PConv layer
and two 1 × 1 convolutional layers alongside batch normalization and ReLU activation
layers. The FasterNet block is visualized in Figure 7.
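A self-contained PyTorch sketch of this block follows; the 1/4 partial-convolution ratio and the 2× channel expansion are assumptions based on the general FasterNet design rather than values stated in the paper.

```python
import torch
import torch.nn as nn

class FasterNetBlock(nn.Module):
    """Sketch of the FasterNet block: a partial 3x3 convolution on a quarter of
    the channels (PConv), followed by two 1x1 convolutions with BN and ReLU,
    wrapped in a residual connection."""
    def __init__(self, channels, expand=2, ratio=4):
        super().__init__()
        self.c_part = channels // ratio
        self.pconv = nn.Conv2d(self.c_part, self.c_part, 3, 1, 1, bias=False)
        hidden = channels * expand
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False))

    def forward(self, x):
        # PConv: convolve only the first c_part channels, pass the rest through
        x1, x2 = torch.split(x, [self.c_part, x.size(1) - self.c_part], dim=1)
        y = torch.cat((self.pconv(x1), x2), dim=1)
        return x + self.mlp(y)           # residual (inverted bottleneck) connection
```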

Figure 7. FasterNet block.

3.2. Neck
In this paper, we improve the backbone section while retaining light weight and
enhancing the feature extraction capability of the neck section. The primary task of the
neck section is to fuse and further extract features from the three effective feature layers
obtained in the main section, as illustrated in the diagram above. We introduce GSConv
and a one-shot aggregation module known as VoV-GSCSP based on GSConv into the
neck section of YOLOv8. The structure of VoV-GSCSP is depicted in Figure 8. The fea-
ture maps received by the neck have the highest channel count and the smallest spatial
dimension, containing minimal redundant information. As a result, there is no need for
compression, making GSConv particularly effective for lightweight models. The flexible
combination of the GSConv and VoV-GSCSP modules accelerates model inference speed,
reduces computational costs, and concurrently enhances detection accuracy.
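The sketch below outlines one plausible VoV-GSCSP arrangement built on the GSConv class sketched in Section 2.2; the split ratio and the single ("one-shot") concatenation point follow the description in the text, while the exact channel layout is an assumption.

```python
import torch
import torch.nn as nn
# GSConv here refers to the class sketched in Section 2.2.

class GSBottleneck(nn.Module):
    """Two stride-1 GSConv layers applied in sequence."""
    def __init__(self, c):
        super().__init__()
        self.gs1 = GSConv(c, c, k=3, s=1)
        self.gs2 = GSConv(c, c, k=3, s=1)

    def forward(self, x):
        return self.gs2(self.gs1(x))

class VoVGSCSP(nn.Module):
    """One-shot aggregation sketch: half of the channels pass through the GS
    bottleneck, the other half through a plain 1x1 shortcut, and both parts are
    concatenated once before a final 1x1 fusion conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.reduce = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.shortcut = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.gsb = GSBottleneck(c_half)
        self.fuse = nn.Conv2d(2 * c_half, c_out, 1, bias=False)

    def forward(self, x):
        y1 = self.gsb(self.reduce(x))
        y2 = self.shortcut(x)
        return self.fuse(torch.cat((y1, y2), dim=1))
```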

Figure 8. VoV-GSCSP.

3.3. Loss Function


Computing the loss is a comparison between the predicted results of the network and
the ground truth. The loss function of our model in this paper is the same as that of YOLOv8,
consisting of regression and classification components. The regression component pertains
to the regression parameters of the feature points, which determine the position of the bounding
box predicted at each feature point. Considering the advantages and disadvantages of the
existing BBR loss functions and aligning with the practical tasks of ship object detection,
we incorporate MPDIoU [38] into our work, inspired by the geometric characteristics of
rectangular boxes. MPDIoU is a loss function for efficient and accurate bounding box
regression, and encompasses all relevant factors considered in existing loss functions.
MPDIoU simplifies the calculation process by minimizing the distances between the top
left and bottom right corners of the predicted bounding box and the annotated bounding
box, achieving accurate and efficient bounding box regression. The computation method
for MPDIoU is as follows:

d1² = (x1^B − x1^A)² + (y1^B − y1^A)² (6)

d2² = (x2^B − x2^A)² + (y2^B − y2^A)² (7)

MPDIoU = (A ∩ B)/(A ∪ B) − d1²/(w² + h²) − d2²/(w² + h²) (8)

where A and B are two arbitrary convex shapes, w and h represent the width and height
of the input image, respectively, (x1^A, y1^A) and (x2^A, y2^A) represent the top left and bottom
right point coordinates of A, respectively, and (x1^B, y1^B) and (x2^B, y2^B) represent the top left
and bottom right point coordinates of B, respectively.

3.4. Ship Tracking


Building on this well-optimized detection network, we proceeded with the selection
of a suitable tracker. Compared to classical methods such as Sort and DeepSort, ByteTrack
exhibits superior performance and offers a more streamlined solution in practical appli-
cations. The related work involving ByteTrack has been explained in the second part of
this paper. The model remains simple and fast, as ByteTrack solely employs a motion
model without relying on ReID features for appearance similarity calculations. However,
this implies that the tracking effectiveness relies heavily on the detection performance.
When the detector performs well, the tracking outcome is favorable. Therefore, leveraging
this paper’s optimized object detection algorithm, we substitute the original detector by
integrating the enhanced YOLOv8 with ByteTrack. This synergy gives rise to a lightweight
and efficient ship object detection and tracking model.

4. Experiments
In this section, we first introduce the evaluation metrics and experimental platform;
then, the datasets are introduced; finally, we conduct ablation experiments and demonstrate
the effectiveness and applicability of our model.

4.1. Evaluation Metrics


In order to clearly and objectively evaluate the effectiveness of algorithm improvement
in terms of model weight reduction, we selected FLOPs and the number of parameters to
evaluate model complexity and size. In terms of object detection accuracy, the evaluation
indicators we chose were the Precision (P), Recall (R), Mean Average Precision (mAP),
and P-R curve. The P-R curve uses the Recall and Precision as the horizontal and vertical
coordinates, and can directly reflect the global performance of the model.
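For reference, these quantities follow their standard definitions (a brief reminder, not restated in the original text):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_0^1 P(R)\,dR, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
```

where TP, FP, and FN denote true positives, false positives, and false negatives at a given IoU threshold, and N is the number of ship categories.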

4.2. Experimental Platform


The experiments were based on an Ubuntu 20.04 operating system, NVIDIA RTX
A4000 GPU, and Intel(R) Xeon(R) Silver 4210R CPU @ 2.39GHz. The deep learning frame-
work was Pytorch 1.12.1, and the programming language was Python 3.8.16. The GPU
acceleration library was CUDA 11.4. The number of iterations for model training was set to
300 and the batch size to 16. Optimization used the SGD optimizer with the momentum set
to 0.937.
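For illustration, a training run with these settings could be launched through an Ultralytics-style API as sketched below; the model and dataset YAML file names are hypothetical placeholders, and the actual YOLOv8-FAS code may use a different entry point.

```python
# Illustrative training call mirroring the settings above, assuming an
# Ultralytics-style entry point; "yolov8-fas.yaml" and "ships.yaml" are
# hypothetical placeholder file names.
from ultralytics import YOLO

model = YOLO("yolov8-fas.yaml")          # hypothetical YOLOv8-FAS model definition
model.train(
    data="ships.yaml",                   # hypothetical dataset description file
    epochs=300,                          # 300 training iterations, as above
    batch=16,                            # batch size 16
    imgsz=640,                           # 640 x 640 inputs
    optimizer="SGD",
    momentum=0.937,
    device=0,                            # single NVIDIA RTX A4000 GPU
)
```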

4.3. Dataset
The experimental data used in the model training conducted in this paper were taken
from actual river ship data collected by fixed-point cameras on the shore. Referring to
the public dataset Seaships [39] and the evaluation index requirements, we framed and
labeled the videos and obtained 3824 ship pictures. The dataset contained four ship target
categories: passenger ship, yacht, bulk carrier, and general cargo ship. The characteristics
of the dataset are as follows:
(1) The picture backgrounds are highly complex and disturbed, including but not limited
to nearshore buildings;
(2) The size difference between the ship targets in the images is significant, and it is
difficult to identify small targets;
(3) Ships within the same category can differ considerably in appearance, with the bulk
carrier category being the most varied.
Figure 9 shows example images from the datasets.
We used LabelImg 1.8.6 software to label the datasets, in which the training set
accounted for 70%, the validation set for 10%, and the test set for 20%. We used a rectangular
frame to mark the ship objects. The label information included the corresponding picture
name, ship category, ship label frame position, etc., and was generated as an XML file in
PASCAL VOC format.
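A simple sketch of the 70/10/20 split is shown below; the directory layout and file naming are placeholders, not the authors' actual pipeline.

```python
import random
from pathlib import Path

# Illustrative 70/10/20 train/validation/test split of the labeled images.
random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))   # placeholder path
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```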

Figure 9. Dataset examples.

4.4. Ablation Experiments


We conducted ablation experiments to verify the improved effectiveness and reliability
of the detection model and to explore the specific contributions of each improvement
strategy to the optimization of the detection model. The models used in these experiments
were trained on private datasets, and the experimental conditions such as equipment,
experimental environment, and the number of iterations were kept the same in all
the experiments. We selected parameters and FLOPs as the measures of the size and
complexity of the model and mAP-0.5 to measure the accuracy of the detection model.
The experimental results are shown in Table 1, where ‘+’ represents the corresponding
improvement strategy of the model.

Table 1. Effects of different improvement operations.

Model        Backbone   Slim-neck   MPDIoU   Parameters /×10⁶   FLOPs /×10⁹   mAP0.5 /%
YOLOv8                                        3.0                8.1           97.6
YOLOv8-FA    +                                2.6                7.1           98.0
YOLOv8-S                +                     2.8                7.3           97.8
YOLOv8-I                            +         3.0                8.1           97.8
YOLOv8-FAS   +          +           +         2.4                6.3           98.5

The YOLOv8-FA model replaces the backbone of YOLOv8 with a feature extraction
network that combines FasterNet and a CBAM attention mechanism. The number of model
parameters is reduced by 0.4 × 10⁶, the FLOPs are reduced by 1.0 × 10⁹, and the mAP-0.5 is
increased by 0.4 percentage points. These results indicate that the enhancement strategy
applied to the backbone reduces the number of model parameters and computational
load while enhancing the feature extraction capabilities, resulting in improved detection
accuracy. By incorporating the slim-neck strategy in the feature fusion network, the
YOLOv8-S model is created. This reduces the number of model parameters by 0.2 × 10⁶ and
FLOPs by 0.8 × 10⁹ while increasing mAP-0.5 by 0.2 percentage points. Consequently, the
combination of GSConv and VoVGSCSP optimizes the lightweight design of the detection
network’s neck portion, resulting in more effective feature fusion and enhanced feature
extraction while reducing the model size and computational complexity. Finally, an IoU
loss algorithm called MPDIoU based on the minimum point distance is introduced into
YOLOv8, effectively improving model detection accuracy by 0.2%.

In summary, the results of the ablation experiment show that the multiple improve-
ments to the backbone and neck parts improve the model’s detection accuracy while
ensuring its light weight. On this basis, the accuracy of the model can be effectively im-
proved by further optimizing the loss function, and the inference effect is better as well.
These improvements culminate in the development of the final YOLOv8-FAS detection
model, which achieves both light weight and high detection accuracy.

4.5. Validation of the Improved Model


In order to verify the effectiveness of the YOLOv8-FAS algorithm proposed in this
paper, it was compared with the original YOLOv8 model. This experiment used the same
datasets for YOLOv8-FAS and the traditional YOLOv8 model, keeping the same parameters;
the input image size was 640 × 640, the number of epochs was 100, the batch size was
16, and the other parameters were the same as well. In order to measure the weight and
detection accuracy of the models more objectively, we examined and compared multiple
indicators. The detection results of the improved YOLOv8-FAS model and the original
model YOLOv8 on our datasets are shown in Table 2.

Table 2. Test results of the algorithm before and after improvement.

Detection Network   FLOPs       Parameters   Precision/%   Recall/%   mAP0.5/%   mAP0.5:0.95/%
YOLOv8              8.1 × 10⁹   3.0 × 10⁶    98.0          94.4       97.6       81.2
YOLOv8-FAS          6.3 × 10⁹   2.4 × 10⁶    98.4          95.8       98.5       84.9

From the data comparison in Table 2, it can be seen that the detection accuracy of the
YOLOv8-FAS model is dramatically improved compared with the traditional YOLOv8;
the mAP-0.5 is increased by 0.9 percentage points, and the mAP-0.5:0.95 is increased by
3.7 percentage points.
At the same time, the YOLOv8-FAS model further reduces the amount of calculation
and number of parameters of the lightweight YOLOv8n model. Compared with the
original model, the FLOPs of YOLOv8-FAS are reduced by 0.8 × 10⁹ and the number of
parameters by 20%. The light weight of the model is notable, as YOLOv8-FAS enhances the
accuracy of ship detection and meets the requirements for a lightweight design, making it
hardware-friendly and facilitating subsequent application of the detection results.
Figure 10 compares the P-R curves before and after the improvements to the YOLOv8
algorithm. The P-R curve can be used to reflect a model’s performance; P stands for
precision and R stands for recall. When P = R, the Break-Even Point (BEP) is reached. The
larger the area under the PR curve and the larger the value of the balance point, the better
the performance of the learner. The P-R curve of YOLOv8-FAS has a larger area enclosed
by the two coordinate axes, and its BEP is closer to the coordinate point (1,1). Based on
these comparisons, it can be concluded that the improved YOLOv8-FAS ship detection
model exhibits better overall system performance.

(a) YOLOv8 (b) YOLOv8-FAS

Figure 10. P-R curves of the model before and after improvement.

Figure 11 shows the real-time detection results of the YOLOv8 model for ships in
different situations before and after improvement, along with the predicted classes and confidence
scores of the bounding boxes. We set the IoU threshold to 0.7. Figure 11a,b
shows the detection results of YOLOv8 and YOLOv8-FAS for two ships of different types
with similar appearances, one of which is partially occluded. The results show that while
the original model can correctly identify the two ships, the position of the detection frame
is not accurate enough for the partially occluded ship. On the other hand, the improved
model can correctly identify the two ships and the positioning is accurate, demonstrating
a 7% increase in the confidence score for ship detection compared to the original model.
Figure 11c,d shows the detection results of YOLOv8 and YOLOv8-FAS for severely oc-
cluded ships. It can be seen that the original YOLOv8 model fails to detect the occluded
general cargo ship, while the improved YOLOv8-FAS model successfully identifies all ships,
providing accurate category and location information. Figure 11e,f shows the respective de-
tection results of YOLOv8 and YOLOv8-FAS for small and incomplete objects in the image.
Both models accurately identify the small bulk carrier and the incomplete general cargo
ship in the image; however, the original model suffers from false positives, misidentifying
a bridge as a yacht due to interference from coastal buildings in the background. A similar
problem exists in Figure 11g. The original YOLOv8 model is disturbed by the shoreline,
and a bulk carrier is falsely detected in the background. YOLOv8-FAS avoids both of these
problems; as shown in Figure 11h, it accurately identifies the number, type, and location
information of multiple ships, exhibiting a high confidence score for ship detection. The
sets of detection results in Figure 11 demonstrate that the proposed YOLOv8-FAS model
achieves light model weight with high detection accuracy under varied circumstances,
and that it can significantly reduce the rates of ship omissions and false positives. Overall,
YOLOv8-FAS exhibits superior system performance on the ship dataset.

(a) YOLOv8 (b) YOLOv8-FAS

(c) YOLOv8 (d) YOLOv8-FAS

(e) YOLOv8 (f) YOLOv8-FAS

(g) YOLOv8 (h) YOLOv8-FAS

Figure 11. Comparison of ship recognition images before and after model improvement.

In order to further assess the model’s performance, we conducted experiments on


the Seaships dataset [39], with the improved model obtaining a mean average precision of
98.9%. The specific results are shown in Table 3.

Table 3. Experiments on the Seaships public dataset.

Class                Precision/%   Recall/%   mAP0.5/%   mAP0.5:0.95/%
all                  97.6          97.5       98.9       78.8
ore carrier          100           96.8       99.4       81.0
passenger ship       94.4          96.7       97.9       74.2
general cargo ship   96.5          96.7       98.3       74.6
bulk cargo carrier   97.6          98.7       99.5       81.4
container ship       99.2          100        99.5       85.2
fishing boat         97.8          95.9       99.0       76.6

After optimizing the detection network, we fed the more accurate detection results into
the ByteTrack tracker, which relies on the detector’s accuracy, for real-time ship tracking
tests. We considered a variety of monitoring situations, including partial occlusion, scale
change, multiple targets, and camera movement. Selected test result images are shown in
Figure 12. The tracking results shown in the figure include the number of tracking targets,
the ID assigned to each target, the category of the tracking targets, and the confidence level.
The test results show that the frame rate when the model tracks the ships in the
video can exceed 60 frames per second (FPS). Compared with the video frame input
of 25 frames per second, the object tracking model proposed in this paper fully meets the
needs of real-time ship monitoring. Even in the cases of occlusion, interference from the
water surface, incomplete display of the ship, an uncertain number of ship types, etc., the
corresponding ships can be accurately positioned and tracked. The video tracking results
provide a real-time display of target IDs, ship categories, and ship tracking accuracy, with
the multiple object tracking precision (MOTP) exceeding 88%. MOTP is an evaluation
metric used to measure the accuracy of multiple object tracking algorithms. It calculates
the average distance error between all targets and their corresponding predicted positions
to assess the precision of the tracking. MOTP is typically computed over a video sequence,
analyzing and processing the positions of targets across consecutive frames to evaluate the
performance of the tracking algorithm. Nevertheless, there is room for improvement in


detecting and tracking ships. Considering both the real-time performance and accuracy of
tracking, this model is well-suited for real-time ship monitoring and tracking applications.
Although it performed well on the ship dataset, further evaluation and validation are
needed in order to determine its performance in additional scenarios.
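For reference, MOTP in the CLEAR MOT framework is commonly defined as follows (a standard definition, not taken from the original text):

```latex
MOTP = \frac{\sum_{t}\sum_{i} d_{t,i}}{\sum_{t} c_{t}}
```

where d_{t,i} is the bounding-box overlap (or, in distance-based variants, the localization error) between matched object i and its hypothesis in frame t, and c_t is the number of matches in frame t; in the overlap-based form, higher values indicate more accurate localization.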

Figure 12. Ship tracking.

In summary, our optimized YOLOv8-FAS model further reduces the number of param-
eters and calculations, making it very friendly to hardware devices with limited memory
and computing resources. Our proposed improvements increase the model’s detection
accuracy while reducing its weight and greatly reducing both its missed
detection rate and false positive rate. Integrating YOLOv8-FAS with the high-performance
ByteTrack tracker yields excellent tracking results, meeting the practical engineering de-
mands of real-time ship monitoring. Therefore, the proposed model holds significant
practical value in real-time ship monitoring tasks.

5. Conclusions
In this paper, we have proposed a lightweight approach called YOLOv8-FAS for
real-time ship monitoring. First, an efficient FasterNet module coupled with attention
mechanisms for feature extraction was integrated into the backbone network, achiev-
ing a lightweight initial model and enhanced feature extraction capabilities. Second, a
lightweight convolutional method called GSConv and a one-shot aggregation module
were introduced in the feature enhancement and fusion stage to construct an efficient neck
network, further enhancing detection speed and accuracy. In addition, we introduced
the MPDIoU, a loss function that uses the minimum point distance based on the geo-
metric characteristics of ships, which can lead to faster convergence and more accurate
regression results. Finally, the advanced tracker Bytetrack was introduced to accomplish
real-time ship detection and tracking tasks. Compared to the conventional lightweight
YOLOv8n detection network, YOLOv8-FAS reduces computational complexity by 0.8 × 10⁹ FLOPs
and model parameters by 20%, with only 2.4 × 10⁶ parameters. YOLOv8-FAS achieves a
detection precision of 98.50% in terms of mAP-0.5, an improvement of 0.9%, and achieves a
3.7% increase in mAP-0.5:0.95. The real-time frame rate for ship object tracking based on
detection surpasses 60 frames/s, significantly exceeding the typical video input frame rate
of 25 frames/s. It achieves real-time transmission of object IDs, ship types, positions, and
quantities, and maintains a multi-object tracking precision of over 88%. The verification
results using the datasets described in this paper show that YOLOv8-FAS has good overall
performance, achieving an effective balance between light weight and high precision. It
accurately performs the tasks of real-time ship detection and tracking, and can be deployed
on devices with limited memory and computational resources. In future research, we
intend to focus on further optimizing the object detection and tracking models to enhance
their simplicity, speed, and efficiency and to deploy them on resource-constrained devices
such as unmanned surface ships.
Author Contributions: Conceptualization, B.X. and W.W.; Methodology, B.X. and W.W.; formal
analysis, B.X., W.W. and J.Q.; data curation, C.P. and Q.L.; software, C.P.; writing—original draft
preparation, B.X. and W.W.; writing—review and editing, W.W.; supervision, B.X., J.Q. and C.P.;
project administration, B.X. and Q.L. All authors have read and agreed to the published version of
the manuscript.

Funding: This research was funded by the Shanghai Science and Technology Committee (STCSM)
Local Universities Capacity-Building Project (No. 22010502200).
Data Availability Statement: The data are available on request.
Acknowledgments: The authors would like to express their gratitude for the support of the Fishery
Engineering and Equipment Innovation Team of Shanghai High-Level Local University and Daishan
County Transportation Bureau.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Bauwens, J. Datasharing in Inland Navigation. In PIANC Smart Rivers 2022: Green Waterways and Sustainable Navigations; Springer
Nature: Singapore, 2023; pp. 1353–1356.
2. Wu, Z.; Woo, S.H.; Lai, P.L.; Chen, X. The economic impact of inland ports on regional development: Evidence from the Yangtze
River region. Transp. Policy 2022, 127, 80–91. [CrossRef]
3. Zhou, J.; Liu, W.; Wu, J. Strategies for High Quality Development of Smart Inland Shipping in Zhejiang Province Based on
“Four-Port Linkage”. In PIANC Smart Rivers 2022: Green Waterways and Sustainable Navigations; Springer Nature: Singapore, 2023;
pp. 1409–1418.
4. Zhang, J.; Wan, C.; He, A.; Zhang, D.; Soares, C.G. A two-stage black-spot identification model for inland waterway transportation.
Reliab. Eng. Syst. Saf. 2021, 213, 107677. [CrossRef]
5. Deo, N.; Trivedi, M.M. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1468–1476.
6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings
of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14;
Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans.
Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
8. Lin, Z.; Ji, K.; Leng, X.; Kuang, G. Squeeze and excitation rank faster R-CNN for ship detection in SAR images. IEEE Geosci.
Remote Sens. Lett. 2018, 16, 751–755. [CrossRef]
9. Zhang, X.; Wang, H.; Xu, C.; Lv, Y.; Fu, C.; Xiao, H.; He, Y. A Lightweight Feature Optimizing Network for Ship Detection in SAR
Image. IEEE Access 2019, 7, 141662–141678. [CrossRef]
10. Jie, Y.; Leonidas, L.; Mumtaz, F.; Ali, M. Ship Detection and Tracking in Inland Waterways Using Improved YOLOv3 and Deep
SORT. Symmetry 2021, 13, 308. [CrossRef]
11. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep learning for SAR ship detection: Past, present and future. Remote. Sens. 2022, 14, 2712.
[CrossRef]
12. Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023,
11, 696. [CrossRef]
13. Er, M.J.; Zhang, Y.; Chen, J.; Gao, W. Ship detection with deep learning: A survey. Artif. Intell. Rev. 2023, 56, 11825–11865.
[CrossRef]
14. Yun, J.; Jiang, D.; Liu, Y.; Sun, Y.; Tao, B.; Kong, J.; Tian, J.; Tong, X.; Xu, M.; Fang, Z. Real-time target detection method based on
lightweight convolutional neural network. Front. Bioeng. Biotechnol. 2022, 10, 861286. [CrossRef]
15. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7,
128837–128868. [CrossRef]
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Wang, C.Y.; Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability
of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA,
USA, 13–19 June 2020; pp. 390–391.
18. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing Network Design Strategies Through Gradient Path Analysis. arXiv 2022,
arXiv:2211.04800.
19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [CrossRef]
20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
21. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed
bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
22. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In
Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–
13000.

23. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. Fighting against COVID-19: A novel deep learning model based on
YOLO-v2 with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 2020, 65, 102600. [CrossRef] [PubMed]
24. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
25. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
26. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. Ultralytics/YOLOv5:
v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo 2022. Available online: https://ui.adsabs.harvard.edu/
abs/2022zndo...7347926J/abstract (accessed on 22 November 2022).
27. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022, arXiv:2209.02976.
28. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June
2023; pp. 7464–7475.
29. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Wey, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
31. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
32. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 6848–6856.
33. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for
autonomous vehicles. arXiv 2022, arXiv:2206.02424.
34. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural
Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
17–24 June 2023; pp. 12021–12031.
35. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
37. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
38. Siliang, M.; Yong, X. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662.
39. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans.
Multimed. 2018, 20, 2593–2604. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
