Article
A Lightweight Model for Real-Time Monitoring of Ships
Bowen Xing 1, *,† , Wei Wang 1,† , Jingyi Qian 2,3 , Chengwu Pan 4 and Qibo Le 4
1 College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China;
ww1585395980@163.com
2 College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China;
2180140@tongji.edu.cn
3 Shanghai Aerospace Electronics Co., Ltd., Shanghai 201800, China
4 Ningbo Communication Center, Ningbo 315800, China; nbpcw@163.com (C.P.); leqibo@126.com (Q.L.)
* Correspondence: bwxing@shou.edu.cn
† These authors contributed equally to this work.
Abstract: Real-time monitoring of ships is crucial for inland navigation management. Under complex
conditions, it is difficult to balance accuracy, real-time performance, and practicality in ship detection
and tracking. We propose a lightweight model, YOLOv8-FAS, to address this issue for real-time ship
detection and tracking. First, FasterNet and the attention mechanism are integrated and introduced
to achieve feature extraction simply and efficiently. Second, the lightweight GSConv convolution
method and a one-shot aggregation module are introduced to construct an efficient network neck
to enhance feature extraction and fusion. Furthermore, the loss function is improved based on ship
characteristics to make the model more suitable for ship datasets. Finally, the advanced ByteTrack
tracker is added to achieve the real-time detection and tracking of ship targets. Compared to the
YOLOv8 model, YOLOv8-FAS reduces computational complexity by 0.8 × 10⁹ in terms of FLOPs and
reduces model parameters by 20%, resulting in only 2.4 × 10⁶ parameters. The mAP-0.5 is improved
by 0.9%, reaching 98.50%, and the real-time object tracking precision of the model surpasses 88%.
The YOLOv8-FAS model combines light weight with high precision, and can accurately perform
ship detection and tracking tasks in real time. Moreover, it is suitable for deployment on hardware
resource-limited devices such as unmanned surface ships.
Keywords: ship monitoring; deep learning; lightweight model; real-time tracking; YOLOv8
2. Related Works
2.1. Object Detection
As one of the representative examples of one-stage object detection algorithms [15],
the YOLO series [16] utilizes deep neural networks to identify and locate objects, offering
high operational speeds suitable for real-time monitoring and tracking tasks. The authors
of YOLOv5 have recently introduced a novel state-of-the-art model known as YOLOv8. The
specific architecture of YOLOv8 is shown in Figure 1. Building upon previous iterations
of the YOLO series, YOLOv8 incorporates several improvements that enhance detection
accuracy and speed. This makes it particularly well-suited to serve as a baseline for
ship detection.
Figure 1. The overall architecture of the YOLOv8 network.
The entire YOLOv8 network’s operation involves feature extraction, feature enhance-
ment, and prediction of object conditions corresponding to prior bounding boxes.
The backbone is the primary feature extraction network within YOLOv8, where input
images are initially processed to extract features. These extracted features are referred to as
feature layers, constituting a collection of characteristics from the input images. YOLOv8
leverages three effective feature layers within the backbone for constructing subsequent
network components. Compared to previous YOLO series algorithms, YOLOv8 employs
3 × 3 convolution kernels with a stride of two for initial feature extraction, sacrificing
receptive field while enhancing the model’s speed. The preprocessing of the CSP [17]
module involves replacing three successive convolutions with two convolutions, drawing
inspiration from the ELAN [18] architecture of YOLOv7. The specific implementation
method is to expand the number of channels for the first convolution to twice the original
number, then split the convolution results in half on the channels. This approach reduces
the number of convolutions and accelerates the network’s speed.
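To make this split-channel design concrete, the following is a minimal PyTorch sketch of a CSP-style block built along these lines; the module names, channel widths, and SiLU activation are illustrative assumptions rather than the exact Ultralytics implementation, which also concatenates the intermediate bottleneck outputs.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution followed by batch normalization and SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPSplit(nn.Module):
    """CSP-style block: widen to twice the channels with one convolution,
    split the result in half along the channel dimension, run one half
    through bottleneck convolutions, and fuse by concatenation."""
    def __init__(self, c, n=1):
        super().__init__()
        self.cv1 = ConvBNAct(c, 2 * c)    # expand to twice the channels
        self.bottlenecks = nn.Sequential(*[ConvBNAct(c, c, k=3) for _ in range(n)])
        self.cv2 = ConvBNAct(2 * c, c)    # fuse back to c channels

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)    # split along the channel dimension
        return self.cv2(torch.cat([a, self.bottlenecks(b)], dim=1))
```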
The Feature Pyramid Network (FPN) [19] is the enhanced feature extraction network
in YOLOv8. The three effective feature layers obtained from the backbone in the main
section of YOLOv8 are fused in the FPN component. Feature fusion aims to combine
features from different scales to facilitate the extraction of more refined characteristics.
Within the FPN segment, the obtained effective feature layers are employed to further
extract the features. YOLOv8 continues to adopt the PANet structure [20], in which features
are first upsampled and fused in a top-down pass and then downsampled again for a second,
bottom-up round of fusion.
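A minimal sketch of this two-pass fusion is shown below, assuming three input feature layers where p3 has the largest spatial size and each level is exactly twice the resolution of the next; the plain convolutions and channel widths are illustrative stand-ins for the heavier fusion modules of the real network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PANetFusion(nn.Module):
    """PANet-style fusion sketch: a top-down pass (upsample + concat)
    followed by a bottom-up pass (downsample + concat)."""
    def __init__(self, c3=256, c4=512, c5=512):
        super().__init__()
        self.td4 = nn.Conv2d(c4 + c5, c4, 1)   # top-down fusion at level 4
        self.td3 = nn.Conv2d(c3 + c4, c3, 1)   # top-down fusion at level 3
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.bu4 = nn.Conv2d(c3 + c4, c4, 1)   # bottom-up fusion at level 4
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.bu5 = nn.Conv2d(c4 + c5, c5, 1)   # bottom-up fusion at level 5

    def forward(self, p3, p4, p5):
        t4 = self.td4(torch.cat([F.interpolate(p5, scale_factor=2), p4], 1))
        t3 = self.td3(torch.cat([F.interpolate(t4, scale_factor=2), p3], 1))
        b4 = self.bu4(torch.cat([self.down3(t3), t4], 1))
        b5 = self.bu5(torch.cat([self.down4(b4), p5], 1))
        return t3, b4, b5   # three enhanced feature layers for the head
```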
The YOLO head serves as the classifier and regressor within YOLOv8. With the
contribution of the backbone and neck, the network obtains three enhanced and effective
feature layers. Each feature layer has the dimensions of width, height, and channel count.
If we consider the feature map as a collection of individual feature points, each feature
point acts as a prior point, eliminating the need for prior bounding boxes. Instead, each
prior point contains features equal to the number of channels. The role of the YOLO Head
is to determine whether an object is associated with each prior point by examining the
corresponding priors’ conditions. YOLOv8 transitions from the previous coupled head
design to a decoupled head design in which classification and regression are no longer
realized within the same 1 × 1 convolution layer.
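A rough sketch of a decoupled head is given below; the branch depth, channel widths, and regression output size are assumptions for illustration rather than the exact YOLOv8 head.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Decoupled detection head sketch: classification and box regression
    are computed by separate convolutional branches instead of sharing a
    single 1x1 convolution."""
    def __init__(self, c_in, num_classes, reg_ch=64):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_classes, 1))   # class scores per prior point
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, reg_ch, 1))        # box regression per prior point

    def forward(self, feat):
        return self.cls_branch(feat), self.reg_branch(feat)
```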
The loss function of YOLOv8 comprises both regression and classification components.
In the classification part, the cross-entropy loss is calculated between the predicted category
and the true box category for each prior. YOLOv8 employs the Distribution Focal (DF) loss [21] for the final regression
prediction, necessitating the inclusion of the DF loss in the regression section. The regression
loss in YOLOv8 comprises the CIoU loss [22] and the DF loss.
Figure 2. GSConv.
In addition to object detection models, there are ongoing efforts to achieve fast neural
networks, which hold significant relevance for object detection models. Chen et al. [34]
introduced a new neural network, FasterNet, with a simple architecture that exhibits
remarkable speed; it proved to be highly effective for various visual tasks as well as
being hardware-friendly. The authors introduced a simple and rapid partial convolution,
PConv, to reduce redundant calculations and memory access, enabling better utilization
of computational capabilities on devices and more efficient spatial feature extraction.
However, the PConv operation can result in loss of information, affecting the accuracy
and generalization capability of the model; the optimization methods of FasterNet may
need to be adapted to different tasks and datasets. As depicted in Figure 3,
PConv leverages redundancy within the feature map, applying general convolution (Conv)
to only a subset of input channels for spatial feature extraction while leaving the remaining
channels unchanged. The FasterNet architecture built upon PConv performs well, and
exhibits universal speed on different devices such as GPU, CPU, and ARM processors. It is
well-suited for real-time ship detection, ship tracking, and similar tasks.
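The PConv idea can be sketched in a few lines of PyTorch; here the convolved fraction (one quarter of the channels, taken from the front of the tensor) follows FasterNet's common default, though the ratio and slicing scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution sketch: apply a regular 3x3 convolution to only
    the first 1/ratio of the input channels, leaving the remaining channels
    untouched to exploit redundancy in the feature map."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.c_conv = channels // ratio   # number of channels that get convolved
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)
```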
Figure 3. PConv.
Figure 4. CBAM.
The channel attention module is used to adjust the weights of each channel in the
feature map, aiding the network in selecting relevant feature channels. This keeps the
channel dimension unchanged while compressing the spatial dimensions. The input feature
layer initially undergoes global average pooling and global maximum pooling. A shared
fully connected layer processes the pooling results individually before being added together.
Finally, the sigmoid activation function is applied to obtain the weight for each channel in
the input feature layer, which is then multiplied by the original input feature layer. The
expression for channel attention is as follows:

$$ M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{1} $$
The spatial attention module complements this by weighting each spatial position, compressing
the channel dimension while keeping the spatial dimensions unchanged: the input feature layer
undergoes maximum pooling and average pooling along the channel dimension, and the concatenated
results are passed through a convolution layer. Finally, the sigmoid activation function is applied
to obtain the weight for each feature point in the input feature layer, which is then multiplied by
the original input feature layer. The expression for spatial attention is as follows:

$$ M_s(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big) \tag{2} $$

where σ denotes the sigmoid function and f^{7×7} is a convolution with a 7 × 7 kernel.
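A compact PyTorch sketch of both attention modules, following Equations (1) and (2); the reduction ratio of 16 and the 7 × 7 kernel are commonly used defaults and should be read as assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention sketch: a shared MLP (as 1x1 convolutions) processes
    the global average- and max-pooled descriptors, and the sigmoid of their
    sum weights each channel, as in Equation (1)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention sketch: concatenate the channel-wise mean and max
    maps, convolve, and apply a sigmoid to weight each feature point, as in
    Equation (2)."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```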
For bounding box regression, the CIoU loss is defined as

$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{3} $$

where IoU is the intersection ratio between the predicted box and the ground truth box, b represents
the center point of the predicted box, b^{gt} represents the center point of the ground truth box,
c represents the diagonal distance of the smallest rectangular box covering both boxes, ρ represents
the distance between the center points b and b^{gt}, α is the weight function, and v describes the
aspect ratio consistency, with α and v defined below:

$$ \alpha = \frac{v}{(1 - IoU) + v} \tag{4} $$

$$ v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{5} $$
Most existing approaches represented by CIoU do not involve image dimensions,
leaving them unable to optimize cases in which the predicted boxes and ground truth boxes
share the same aspect ratio while having completely different width and height values.
As a result, we introduced a novel bounding box similarity comparison metric, MPDIoU [38],
based on the minimum point distance.
ByteTrack associates detections with trajectories using a simple and effective matching strategy,
BYTE, which retains low-score detection boxes rather than discarding them and thereby enhances
trajectory coherence. Specifically, BYTE first matches high-score boxes with
previous tracking trajectories and then matches low-score boxes with tracking trajectories
that are not initially matched with high-score boxes. BYTE creates a new tracking trajectory
for detection boxes without matched tracking trajectories that have sufficiently high scores.
To track trajectories without matched detection boxes, BYTE retains them for 30 frames and
attempts matching again when they reappear.
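The two-stage matching can be summarized with a simplified, self-contained Python sketch; the greedy IoU matcher below stands in for the Kalman-filter motion prediction and Hungarian assignment used by the real tracker, and all names and thresholds are illustrative.

```python
from dataclasses import dataclass
from itertools import count

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

@dataclass
class Track:
    box: tuple
    tid: int
    lost: int = 0

_ids = count(1)

def greedy_match(tracks, boxes, thresh=0.3):
    """Greedy IoU matching (a stand-in for Hungarian assignment)."""
    pairs, used = [], set()
    for t in tracks:
        best, best_iou = None, thresh
        for j, b in enumerate(boxes):
            ov = iou(t.box, b)
            if j not in used and ov > best_iou:
                best, best_iou = j, ov
        if best is not None:
            used.add(best)
            pairs.append((t, boxes[best]))
    matched = {id(t) for t, _ in pairs}
    return (pairs,
            [t for t in tracks if id(t) not in matched],
            [b for j, b in enumerate(boxes) if j not in used])

def byte_step(tracks, detections, high=0.6, max_lost=30):
    """One BYTE association step: high-score boxes are matched to existing
    tracks first, then low-score boxes are matched to the leftover tracks;
    unmatched high-score boxes start new tracks, and unmatched tracks are
    retained for up to max_lost frames."""
    high_boxes = [b for b, s in detections if s >= high]
    low_boxes = [b for b, s in detections if s < high]
    m1, leftover, unmatched_high = greedy_match(tracks, high_boxes)
    m2, still_unmatched, _ = greedy_match(leftover, low_boxes)
    for t, b in m1 + m2:
        t.box, t.lost = b, 0                         # refresh matched trajectories
    for b in unmatched_high:
        tracks.append(Track(box=b, tid=next(_ids)))  # new track from a high-score box
    for t in still_unmatched:
        t.lost += 1                                  # keep lost tracks for a while
    return [t for t in tracks if t.lost <= max_lost]
```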
3. Methodology
Our detector model is an enhancement of the YOLOv8 network. The YOLOv8 project
categorizes the network into five sizes based on different depth, width, and maximum
channel combinations (n, s, m, l, x). YOLOv8n was selected as the baseline for its small
parameter count and balanced detection performance. Building upon YOLOv8n, we
incorporated the ideas of FasterNet, attention mechanism, slim-neck, and MPDIoU to
create an optimized model called YOLOv8-FAS. The overall architecture of the YOLOv8-
FAS model is illustrated in Figure 5. YOLOv8-FAS enhances detection accuracy while
reducing parameter count and computational complexity. The detection results are fed into
the ByteTrack tracker, whose performance depends on the detection accuracy, ultimately realizing
real-time monitoring of surface ships.
3.1. Backbone
In this paper, we propose a lightweight yet solid feature extraction backbone. The
primary architecture is depicted in Figure 5; the backbone section is designed following
the feature extraction network structure of YOLOv8 while incorporating both the efficient
concept of FasterNet and the excellent feature extraction capabilities of the attention mech-
anism. Initially, YOLOv8-FAS employs ordinary 3 × 3 convolutional kernels with a stride
of two for initial feature extraction. After drawing inspiration from the ideas of FasterNet
and the attention mechanism, the original CSP module is modified, leading to the creation
of a novel CSP-A module. The CSP-A module is illustrated in Figure 6.
The CSP-A module ingeniously combines the FasterNet block and the CBAM attention
mechanism. The CSP-A module shows fewer parameters, lower computation, and higher
feature extraction efficiency than the traditional CSP module. Its specific implementation
involves doubling the channel count of the first convolutional layer, then splitting the
convolutional output in half along the channel dimension. One of the halves is fed into the
FasterNet block for processing. The concatenated output is subjected to the lightweight
CBAM attention mechanism, further enhancing the extraction of image features. The
specific structure and functioning of CBAM have been detailed in the second section of this
paper. The FasterNet block comprises an inverted residual structure consisting of a PConv layer
and two 1 × 1 convolutional layers, alongside batch normalization and ReLU activation
layers. The FasterNet block is visualized in Figure 7.
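A minimal PyTorch sketch of such a block is given below, assuming an expansion factor of two for the 1 × 1 convolutions and a one-quarter partial-convolution ratio (both illustrative).

```python
import torch
import torch.nn as nn

class FasterNetBlock(nn.Module):
    """FasterNet block sketch: a PConv layer followed by two 1x1 convolutions
    with batch normalization and ReLU in between, wrapped in a residual
    (inverted-residual style) connection."""
    def __init__(self, c, expand=2, pconv_ratio=4):
        super().__init__()
        self.c_part = c // pconv_ratio
        self.pconv = nn.Conv2d(self.c_part, self.c_part, 3, 1, 1, bias=False)
        self.pw = nn.Sequential(
            nn.Conv2d(c, expand * c, 1, bias=False),
            nn.BatchNorm2d(expand * c),
            nn.ReLU(),
            nn.Conv2d(expand * c, c, 1, bias=False))

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_part, x.size(1) - self.c_part], dim=1)
        y = torch.cat([self.pconv(x1), x2], dim=1)   # partial convolution step
        return x + self.pw(y)                        # residual connection
```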
3.2. Neck
In this paper, we improve the backbone section while retaining light weight and
enhancing the feature extraction capability of the neck section. The primary task of the
neck section is to fuse and further extract features from the three effective feature layers
obtained in the main section, as illustrated in the diagram above. We introduce GSConv
and a one-shot aggregation module known as VoV-GSCSP based on GSConv into the
neck section of YOLOv8. The structure of VoV-GSCSP is depicted in Figure 8. The fea-
ture maps received by the neck have the highest channel count and the smallest spatial
dimension, containing minimal redundant information. As a result, there is no need for
compression, making GSConv particularly effective for lightweight models. The flexible
combination of the GSConv and VoV-GSCSP modules accelerates model inference speed,
reduces computational costs, and concurrently enhances detection accuracy.
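As a sketch of the GSConv idea from the slim-neck paper [33], the block below produces half of the output channels with a dense convolution, completes them with a cheap depth-wise convolution, and channel-shuffles the concatenated result so the two halves mix; the kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """GSConv sketch: dense convolution for half the output channels, a
    depth-wise convolution for the other half, then concatenation and a
    channel shuffle that interleaves the two halves."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False)
        self.cheap = nn.Conv2d(c_half, c_half, 5, 1, 2,
                               groups=c_half, bias=False)  # depth-wise

    def forward(self, x):
        a = self.dense(x)
        y = torch.cat([a, self.cheap(a)], dim=1)
        b, c, h, w = y.shape                               # channel shuffle
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```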
Figure 8. VoV-GSCSP.
MPDIoU is defined through the squared distances between the corresponding corners of the two boxes:

$$ d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2 \tag{6} $$

$$ d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2 \tag{7} $$

$$ MPDIoU = \frac{A \cap B}{A \cup B} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2} \tag{8} $$

where A and B are two arbitrary convex shapes, w and h represent the width and height of the
input image, respectively, (x_1^A, y_1^A) and (x_2^A, y_2^A) represent the top-left and bottom-right
corner coordinates of A, respectively, and (x_1^B, y_1^B) and (x_2^B, y_2^B) represent the top-left
and bottom-right corner coordinates of B, respectively.
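The metric translates directly into a PyTorch sketch following Equations (6)-(8); the (x1, y1, x2, y2) box layout and the final loss form 1 − MPDIoU are assumptions consistent with the description above.

```python
import torch

def mpdiou(pred, gt, img_w, img_h, eps=1e-9):
    """MPDIoU per Eqs. (6)-(8): IoU minus the normalized squared distances
    between corresponding top-left and bottom-right corners.
    pred and gt are (N, 4) tensors of (x1, y1, x2, y2) boxes."""
    ix1 = torch.max(pred[:, 0], gt[:, 0])
    iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2])
    iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    d1 = (pred[:, 0] - gt[:, 0]) ** 2 + (pred[:, 1] - gt[:, 1]) ** 2  # top-left, Eq. (6)
    d2 = (pred[:, 2] - gt[:, 2]) ** 2 + (pred[:, 3] - gt[:, 3]) ** 2  # bottom-right, Eq. (7)
    norm = img_w ** 2 + img_h ** 2
    return iou - d1 / norm - d2 / norm   # loss would be 1 - mpdiou(...)
```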
ByteTrack performs data association using only the outputs of the detection
model, without relying on ReID features for appearance similarity calculations. However,
this implies that the tracking effectiveness relies heavily on the detection performance.
When the detector performs well, the tracking outcome is favorable. Therefore, leveraging
this paper’s optimized object detection algorithm, we substitute the original detector by
integrating the enhanced YOLOv8 with ByteTrack. This synergy gives rise to a lightweight
and efficient ship object detection and tracking model.
4. Experiments
In this section, we first introduce the evaluation metrics and experimental platform;
then, the datasets are introduced; finally, we conduct ablation experiments and demonstrate
the effectiveness and applicability of our model.
4.3. Dataset
The experimental data used in the model training conducted in this paper were taken
from actual river ship data collected by fixed-point cameras on the shore. Referring to
the public dataset Seaships [39] and the evaluation index requirements, we framed and
labeled the videos and obtained 3824 ship pictures. The dataset contained four ship target
categories: passenger ship, yacht, bulk carrier, and general cargo ship. The characteristics
of the dataset are as follows:
(1) The picture backgrounds are highly complex and disturbed, including but not limited
to nearshore buildings;
(2) The size difference between the ship targets in the images is significant, and it is
difficult to identify small targets;
(3) Ships of similar categories can differ considerably in appearance, with the bulk carrier being the
most complex.
Figure 9 shows example images from the datasets.
We used LabelImg 1.8.6 software to label the datasets, in which the training set
accounted for 70%, the validation set for 10%, and the test set for 20%. We used a rectangular
frame to mark the ship objects. The label information included the corresponding picture
name, ship category, ship label frame position, etc., and was generated as an XML file in
PASCAL VOC format.
The YOLOv8-FA model replaces the backbone of YOLOv8 with a feature extraction
network that combines FasterNet and a CBAM attention mechanism. The number of model
parameters is reduced by 0.4 × 10⁶, the FLOPs are reduced by 1.0 × 10⁹, and the mAP-0.5 is
increased by 0.4 percentage points. These results indicate that the enhancement strategy
applied to the backbone reduces the number of model parameters and computational
load while enhancing the feature extraction capabilities, resulting in improved detection
accuracy. By incorporating the slim-neck strategy in the feature fusion network, the
YOLOv8-S model is created. This reduces the number of model parameters by 0.2 × 10⁶ and
FLOPs by 0.8 × 10⁹ while increasing mAP-0.5 by 0.2 percentage points. Consequently, the
combination of GSConv and VoV-GSCSP optimizes the lightweight design of the detection
network's neck portion, resulting in more effective feature fusion and enhanced feature
extraction while reducing model size and computational complexity. Finally, an IoU
loss algorithm called MPDIoU based on the minimum point distance is introduced into
YOLOv8, effectively improving model detection accuracy by 0.2%.
In summary, the results of the ablation experiment show that the multiple improve-
ments to the backbone and neck parts improve the model’s detection accuracy while
ensuring its light weight. On this basis, the accuracy of the model can be effectively im-
proved by further optimizing the loss function, and the inference effect is better as well.
These improvements culminate in the development of the final YOLOv8-FAS detection
model, which achieves both light weight and high detection accuracy.
From the data comparison in Table 2, it can be seen that the detection accuracy of the
YOLOv8-FAS model is markedly improved compared with the baseline YOLOv8;
the mAP-0.5 is increased by 0.9 percentage points, and the mAP-0.5:0.95 is increased by
3.7 percentage points.
At the same time, the YOLOv8-FAS model further reduces the amount of calculation
and number of parameters of the lightweight YOLOv8n model. Compared with the
original model, the FLOPs of YOLOv8-FAS are reduced by 0.8 × 10⁹ and the number of
parameters by 20%. The light weight of the model is notable, as YOLOv8-FAS enhances the
accuracy of ship detection and meets the requirements for a lightweight design, making it
hardware-friendly and facilitating subsequent application of the detection results.
Figure 10 compares the P-R curves before and after the improvements to the YOLOv8
algorithm. The P-R curve can be used to reflect a model’s performance; P stands for
precision and R stands for recall. When P = R, the Break-Even Point (BEP) is reached. The
larger the area under the PR curve and the larger the value of the balance point, the better
the performance of the learner. The P-R curve of YOLOv8-FAS has a larger area enclosed
by the two coordinate axes, and its BEP is closer to the coordinate point (1,1). Based on
these comparisons, it can be concluded that the improved YOLOv8-FAS ship detection
model exhibits better overall system performance.
Figure 10. P-R curves of the model before and after improvement.
Figure 11 shows the real-time detection results for ships in different situations before and
after the improvement of the YOLOv8 model, including the predicted bounding boxes with
their categories and confidence scores. We set the IoU threshold to 0.7. Figure 11a,b
shows the detection results of YOLOv8 and YOLOv8-FAS for two ships of different types
with similar appearances, one of which is partially occluded. The results show that while
the original model can correctly identify the two ships, the position of the detection frame
is not accurate enough for the partially occluded ship. On the other hand, the improved
model can correctly identify the two ships and the positioning is accurate, demonstrating
a 7% increase in the confidence score for ship detection compared to the original model.
Figure 11c,d shows the detection results of YOLOv8 and YOLOv8-FAS for severely oc-
cluded ships. It can be seen that the original YOLOv8 model fails to detect the occluded
general cargo ship, while the improved YOLOv8-FAS model successfully identifies all ships,
providing accurate category and location information. Figure 11e,f shows the respective de-
tection results of YOLOv8 and YOLOv8-FAS for small and incomplete objects in the image.
Both models accurately identify the small bulk carrier and the incomplete general cargo
ship in the image; however, the original model suffers from false positives, misidentifying
a bridge as a yacht due to interference from coastal buildings in the background. A similar
problem exists in Figure 11g, where the original YOLOv8 model is disturbed by the shoreline
and misidentifies a bulk carrier in the background. YOLOv8-FAS avoids both of these
problems; as shown in Figure 11h, it accurately identifies the number, type, and location
information of multiple ships, exhibiting a high confidence score for ship detection. The
sets of detection results in Figure 11 demonstrate that the proposed YOLOv8-FAS model
achieves light model weight with high detection accuracy under varied circumstances,
and that it can significantly reduce the rates of ship omissions and false positives. Overall,
YOLOv8-FAS exhibits superior system performance on the ship dataset.
Figure 11. Comparison of ship recognition images before and after model improvement.
After optimizing the detection network, we fed the more accurate detection results into
the ByteTrack tracker, which relies on the detector’s accuracy, for real-time ship tracking
tests. We considered a variety of monitoring situations, including partial occlusion, scale
change, multiple targets, and camera movement. Selected test result images are shown in
Figure 12. The tracking results shown in the figure include the number of tracking targets,
the ID assigned to each target, the category of the tracking targets, and the confidence level.
The test results show that the frame rate (FPS) achieved when the model tracks ships in
video can reach more than 60 frames per second. Compared with the video frame input
of 25 frames per second, the object tracking model proposed in this paper fully meets the
needs of real-time ship monitoring. Even in cases of occlusion, interference from the
water surface, incomplete display of the ship, or an uncertain number and variety of ship types, the
corresponding ships can be accurately positioned and tracked. The video tracking results
provide a real-time display of target IDs, ship categories, and ship tracking accuracy, with
the multiple object tracking precision (MOTP) exceeding 88%. MOTP is an evaluation
metric used to measure the accuracy of multiple object tracking algorithms. It calculates
the average distance error between all targets and their corresponding predicted positions
to assess the precision of the tracking. MOTP is typically computed over a video sequence,
analyzing and processing the positions of targets across consecutive frames to evaluate the
overall tracking precision.
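Computed as the text describes, a distance-based MOTP reduces to a few lines; note that the percentage figure reported above corresponds to the overlap-based variant of the metric, so the sketch below is illustrative only.

```python
import math

def motp(matches):
    """Distance-based MOTP sketch: the mean positional error over all matched
    (ground-truth, prediction) pairs accumulated across a video sequence.
    Each element of `matches` is a ((gx, gy), (px, py)) pair of centers."""
    if not matches:
        return 0.0
    return sum(math.dist(g, p) for g, p in matches) / len(matches)
```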
In summary, our optimized YOLOv8-FAS model further reduces the number of param-
eters and calculations, making it very friendly to hardware devices with limited memory
and computing resources. Our proposed improvements increase the model’s detection
accuracy while reducing its weight and greatly reducing both its missed
detection rate and false positive rate. Integrating YOLOv8-FAS with the high-performance
ByteTrack tracker yields excellent tracking results, meeting the practical engineering de-
mands of real-time ship monitoring. Therefore, the proposed model holds significant
practical value in real-time ship monitoring tasks.
5. Conclusions
In this paper, we have proposed a lightweight approach called YOLOv8-FAS for
real-time ship monitoring. First, an efficient FasterNet module coupled with attention
mechanisms for feature extraction was integrated into the backbone network, achiev-
ing a lightweight initial model and enhanced feature extraction capabilities. Second, a
lightweight convolutional method called GSConv and a one-shot aggregation module
were introduced in the feature enhancement and fusion stage to construct an efficient neck
network, further enhancing detection speed and accuracy. In addition, we introduced
the MPDIoU, a loss function that uses the minimum point distance based on the geo-
metric characteristics of ships, which can lead to faster convergence and more accurate
regression results. Finally, the advanced tracker ByteTrack was introduced to accomplish
real-time ship detection and tracking tasks. Compared to the conventional lightweight
YOLOv8n detection network, YOLOv8-FAS reduces computational complexity by 0.8 × 10⁹ FLOPs
and model parameters by 20%, with only 2.4 × 10⁶ parameters. YOLOv8-FAS achieves a
detection precision of 98.50% in terms of mAP-0.5, an improvement of 0.9%, and achieves a
3.7% increase in mAP-0.5:0.95. The real-time frame rate for ship object tracking based on
detection surpasses 60 frames/s, significantly exceeding the typical video input frame rate
of 25 frames/s. It achieves real-time transmission of object IDs, ship types, positions, and
quantities, and maintains a multi-object tracking precision of over 88%. The verification
results using the datasets described in this paper show that YOLOv8-FAS has good overall
performance, achieving an effective balance between light weight and high precision. It
accurately performs the tasks of real-time ship detection and tracking, and can be deployed
on devices with limited memory and computational resources. In future research, we
intend to focus on further optimizing the object detection and tracking models to enhance
their simplicity, speed, and efficiency and to deploy them on resource-constrained devices
such as unmanned surface ships.
Author Contributions: Conceptualization, B.X. and W.W.; Methodology, B.X. and W.W.; formal
analysis, B.X., W.W. and J.Q.; data curation, C.P. and Q.L.; software, C.P.; writing—original draft
preparation, B.X. and W.W.; writing—review and editing, W.W.; supervision, B.X., J.Q. and C.P.;
project administration, B.X. and Q.L. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by the Shanghai Science and Technology Committee (STCSM)
Local Universities Capacity-Building Project (No. 22010502200).
Data Availability Statement: The data are available on request.
Acknowledgments: The authors would like to express their gratitude for the support of the Fishery
Engineering and Equipment Innovation Team of Shanghai High-Level Local University and Daishan
County Transportation Bureau.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Bauwens, J. Datasharing in Inland Navigation. In PIANC Smart Rivers 2022: Green Waterways and Sustainable Navigations; Springer
Nature: Singapore, 2023; pp. 1353–1356.
2. Wu, Z.; Woo, S.H.; Lai, P.L.; Chen, X. The economic impact of inland ports on regional development: Evidence from the Yangtze
River region. Transp. Policy 2022, 127, 80–91. [CrossRef]
3. Zhou, J.; Liu, W.; Wu, J. Strategies for High Quality Development of Smart Inland Shipping in Zhejiang Province Based on
“Four-Port Linkage”. In PIANC Smart Rivers 2022: Green Waterways and Sustainable Navigations; Springer Nature: Singapore, 2023;
pp. 1409–1418.
4. Zhang, J.; Wan, C.; He, A.; Zhang, D.; Soares, C.G. A two-stage black-spot identification model for inland waterway transportation.
Reliab. Eng. Syst. Saf. 2021, 213, 107677. [CrossRef]
5. Deo, N.; Trivedi, M.M. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1468–1476.
6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings
of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14;
Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans.
Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
8. Lin, Z.; Ji, K.; Leng, X.; Kuang, G. Squeeze and excitation rank faster R-CNN for ship detection in SAR images. IEEE Geosci.
Remote Sens. Lett. 2018, 16, 751–755. [CrossRef]
9. Zhang, X.; Wang, H.; Xu, C.; Lv, Y.; Fu, C.; Xiao, H.; He, Y. A Lightweight Feature Optimizing Network for Ship Detection in SAR
Image. IEEE Access 2019, 7, 141662–141678. [CrossRef]
10. Jie, Y.; Leonidas, L.; Mumtaz, F.; Ali, M. Ship Detection and Tracking in Inland Waterways Using Improved YOLOv3 and Deep
SORT. Symmetry 2021, 13, 308. [CrossRef]
11. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep learning for SAR ship detection: Past, present and future. Remote. Sens. 2022, 14, 2712.
[CrossRef]
12. Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023,
11, 696. [CrossRef]
13. Er, M.J.; Zhang, Y.; Chen, J.; Gao, W. Ship detection with deep learning: A survey. Artif. Intell. Rev. 2023, 56, 11825–11865.
[CrossRef]
14. Yun, J.; Jiang, D.; Liu, Y.; Sun, Y.; Tao, B.; Kong, J.; Tian, J.; Tong, X.; Xu, M.; Fang, Z. Real-time target detection method based on
lightweight convolutional neural network. Front. Bioeng. Biotechnol. 2022, 10, 861286. [CrossRef]
15. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7,
128837–128868. [CrossRef]
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Wang, C.Y.; Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability
of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA,
USA, 13–19 June 2020; pp. 390–391.
18. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing Network Design Strategies Through Gradient Path Analysis. arXiv 2022,
arXiv:2211.04800.
19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [CrossRef]
20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
21. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed
bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
22. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In
Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–
13000.
23. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. Fighting against COVID-19: A novel deep learning model based on
YOLO-v2 with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 2020, 65, 102600. [CrossRef] [PubMed]
24. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
25. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
26. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al.
Ultralytics/YOLOv5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo. 2022. Available online: https://ui.adsabs.harvard.edu/
abs/2022zndo...7347926J/abstract (accessed on 22 November 2022).
27. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022, arXiv:2209.02976.
28. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June
2023; pp. 7464–7475.
29. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Wey, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
31. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
32. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 6848–6856.
33. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for
autonomous vehicles. arXiv 2022, arXiv:2206.02424.
34. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural
Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
17–24 June 2023; pp. 12021–12031.
35. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
37. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
38. Siliang, M.; Yong, X. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662.
39. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans.
Multimed. 2018, 20, 2593–2604. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.