
Improving Detection Capabilities of YOLOv8-n for Small Objects in Remote Sensing Imagery: Towards Better Precision with Simplified Model Complexity

Ruihan Bai
Hohai University
Feng Shen (shenfeng1023@163.com)
Suzhou University of Science and Technology
Mingkang Wang
Tongji University
Jiahui Lu
Tongji University
Zhiping Zhang
Tongji University

Research Article

Keywords: Object detection, YOLOv8-n, Grad-CAM

Posted Date: June 22nd, 2023

DOI: https://doi.org/10.21203/rs.3.rs-3085871/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.


Improving Detection Capabilities of YOLOv8-n for Small Objects in Remote Sensing Imagery: Towards Better Precision with Simplified Model Complexity

Ruihan Bai1, Feng Shen2,*, Mingkang Wang3, Jiahui Lu3, Zhiping Zhang3

1 School of Civil and Transportation Engineering, Hohai University, Nanjing, 210000, Jiangsu, China
2 School of Civil Engineering, Suzhou University of Science and Technology, Suzhou, 215011, Jiangsu, China
3 School of Civil Engineering, Tongji University, Shanghai, 200000, China

*Feng Shen, E-mail: shenfeng1023@163.com

Abstract. This study presents a comprehensive analysis and improvement of the YOLOv8-n algorithm for object detection, focusing on the integration of Wasserstein Distance Loss, FasterNext, and Context Aggregation strategies. Through a detailed ablation study, each strategy was systematically evaluated both individually and in combination to assess its contribution to the model's performance. The results indicate that each strategy enhances the model's performance in its own way, and that integrating all three significantly increases mAP while reducing model complexity. Grad-CAM visualizations further substantiate the improved model's capacity to extract and focus on key object features. Compared with existing models such as YOLOv5-n, YOLOv5-s, YOLOX-n, YOLOX-s, and YOLOv7-tiny, the improved YOLOv8-n model achieves an optimal balance between accuracy and model complexity, outperforming the other models in terms of accuracy, complexity, and inference speed. Further image inference tests validate the model's performance, showcasing its superior detection capabilities.

Keywords: Object detection, YOLOv8-n, Grad-CAM

1 Introduction

Remote sensing technology acquires image data of the earth's surface through remote equipment such as unmanned aerial vehicles and satellites. Due to its ability to provide extensive and continuous coverage of terrestrial information, it is widely used in various fields, including climate change research, military reconnaissance, urban planning, and disaster monitoring. Object detection, which entails recognizing and locating specific objects (such as vehicles, buildings, and airplanes) in complex remote sensing images, is one of the critical tasks in remote sensing image analysis. However, compared to ordinary natural scene images, remote sensing images possess some unique characteristics, such as large-scale variations and complex backgrounds, which pose new challenges to target detection algorithms [1]. Firstly, the same target in remote sensing images may appear at different scales and orientations, requiring target detection models to possess strong scale invariance. Secondly, remote sensing images usually have a high resolution, which means that the targets within an image typically occupy only a small part, while most of the area is occupied by a complex background. Detection algorithms face significant challenges due to the diminished resolution of such targets, which restricts the availability of visual information and hampers the extraction of distinctive features. Furthermore, small objects are exceptionally vulnerable to environmental disturbances, intensifying the difficulty in accurately identifying them.

Although traditional machine learning methods have achieved certain results in target detection of remote sensing images, they usually require manual design and selection of features, which largely limits their performance and application scope. With the development of deep learning, researchers have started to utilize models such as Convolutional Neural Networks (CNNs) for automatic feature learning in target detection within remote sensing images [2]. However, due to the unique nature of remote sensing images, such as large-scale variations, complex backgrounds, and small targets, directly applying deep learning models struggles to achieve satisfactory performance [3]. To solve the problem of small target detection in remote sensing datasets, many researchers have improved the Faster R-CNN and YOLO series of network models. Chen et al. [1] proposed a deep learning-based approach for identifying small objects. Their study overviews several critical aspects of small object identification: multi-scale representation, contextual information, and super-resolution. They also present modern datasets designed explicitly for small object detection, and further investigate current small-object detection systems, focusing on modifications and optimizations that enhance detection efficiency compared to conventional object recognition technologies. Some researchers improved the detection of small targets through rotated prediction boxes and rotation detectors, such as ReDet proposed by Han et al. [4], oriented bounding boxes proposed by Zand et al. [5], and box boundary-aware vectors proposed by Yu et al. [6]. The YOLO-Z model [7] adopts operations such as replacing the original feature fusion structure with Bi-FPN and enlarging the Neck layer so that the features of the middle and shallow layers are well integrated; however, it is unsuitable for scenes with significant changes in object size.

Ren et al. [8] addressed the challenge of using remote sensing technology to identify tiny objects in optical imaging. They developed an improved version of the Faster R-CNN approach that leverages enhanced techniques, ensuring comprehensive visualization of all identified items by employing a similar architecture that avoids connections and generates a single high-resolution, high-level feature map. Liu et al. [9] proposed a multi-block SSD method incorporating sub-layers to detect and expand local context information. The test results compared the performance of the multi-block SSD with a conventional SSD; the presented algorithm increased the detection rate of small objects by 23.2%. Bosquet et al. [10] introduced STDnet, a ConvNet approach for identifying tiny objects smaller than 16 × 16 pixels based on regional concepts. STDnet utilizes an additional visual attention process called RCN to select the most probable candidate areas, which include one or more small items and their surrounding context. This enhances accuracy while conserving memory and increasing the frame rate. The study also incorporates automated k-means anchoring, which improves upon traditional heuristics. Zheng et al. [11] introduced a novel HyNet framework for large-scale target recognition in MSR remote sensing imaging. The framework represents a significant advancement in the study of scale-invariant functions. One notable feature of HyNet is the incorporation of display zoom functions, which utilize pyramid-shaped detection areas. These display zoom functions enable more precise object detection across multiple scales in MSR remote sensing images, offering new research possibilities in this domain. Li et al. [12] utilized generative adversarial learning to map the features of small low-resolution objects into equivalent features resembling those of high-resolution objects, aiming to achieve detection performance for smaller-sized objects comparable to that observed for larger-sized objects. Ji et al. [13] introduced an attention module in the neck of YOLOv4 and a Soft-CIoU loss function to improve the detection accuracy of small objects. Zhou et al. [14] utilize the high-speed and accurate YOLOv5-S as the foundation, incorporating the Contextual Transformer (CoT) module to optimize the residual neural network within the backbone feature extraction network. Wu et al. [15] integrate a local fully convolutional network (FCN) and YOLOv5 to enhance small target detection in remote sensing images; the approach surpasses other algorithms, especially in detecting small objects such as tennis courts, vehicles, and storage tanks.

In small object detection, most existing research is based on earlier models such as YOLO and Fast R-CNN. While these studies have somewhat improved small object detection, they still have several shortcomings. (1) Technological obsolescence: Deep learning is advancing at an extraordinary rate, with new models, frameworks, and technologies constantly emerging. Although the models used in these studies were cutting-edge at the time, recent advancements offer superior performance, higher efficiency, and more accurate results. Thus, persisting in improving older models may overlook these new developments, thereby constraining the efficacy of the enhancement efforts. (2) Performance limitations: The improvements might be constrained by the inherent performance limitations of the earlier models on which they are based. These models may have inherent disadvantages, such as large parameter counts and slow inference speed, thereby restricting the effectiveness of improvements in practical applications. In response to the above shortcomings, this paper shifts the research focus to newer models such as YOLOv8. YOLOv8, as the latest model, has significantly improved in various aspects, including its ability to handle small objects. By conducting improvement research based on YOLOv8, we may further enhance the performance of small object detection. Thus, this research makes three improvements to the YOLOv8-n model for small object detection based on remote sensing data. While enhancing the model's accuracy, these modifications also make the model more lightweight and improve its inference speed. The improvements include the following:
• Use the Wasserstein Distance loss function instead of the CIoU loss function to evaluate the similarity between predicted and ground-truth object bounding boxes, ensuring consistent sensitivity across objects of varying sizes.
• Because feature maps show high similarity across different channels, replace the backbone's last feature extraction module, C2f, with the lightweight FasterNext module to reduce model complexity and improve computational efficiency.
• Incorporate the Context Aggregation module into the C2f module to improve the model's capability to learn contextual information.

2 Method

2.1 Fundamentals of the YOLOv8 model

Fig. 1 YOLOv8 model

The backbone of YOLOv8 primarily comprises the C2f module, which is inspired by the ELAN module. The architecture of the C2f module integrates two parallel gradient flow branches, facilitating a more robust flow of gradient information. Additionally, YOLOv8 uses the Spatial Pyramid Pooling Fusion (SPPF) module, a characteristic also found in architectures such as YOLOv5. This module's ability to extract contextual information from images at varying scales significantly enhances the model's generalization capabilities.

In YOLOv8's architectural design, the model neck uses Path Aggregation Network (PAN) principles. A direct comparison between YOLOv8 and its predecessor, YOLOv5, shows that YOLOv8 implements two critical modifications within the Feature Pyramid Network (FPN): the convolutional structures in the up-sampling phase are removed, and the C3 module is replaced with the C2f module. Finally, YOLOv8 adopts a "Decoupled-Head" design for its detection head. This concept aims to reduce the degree of coupling between the different tasks associated with object detection, such as classification and localization.
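For readers who want to inspect this architecture directly, the structure described above (C2f backbone, SPPF, PAN-style neck, decoupled head) can be instantiated with the open-source ultralytics package; the snippet below is only a minimal sketch and assumes that package is available in the environment.

# Minimal sketch: build YOLOv8-n from its configuration and print a layer summary.
# Assumes the open-source ultralytics package is installed (pip install ultralytics).
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")  # nano variant: C2f backbone, SPPF, PAN neck, decoupled head
model.info()                  # reports the layer list, parameter count, and GFLOPs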

2.2 Improved algorithm


2.2.1 Loss function

The regression loss in YOLOv8 for object detection bounding boxes has two parts. The first part is the Distribution Focal Loss (DFL), which uses cross-entropy as an optimization mechanism to concentrate the network's predictive distribution closer to the label values. The other part measures the Intersection over Union (IoU) between the predicted and true bounding boxes using the CIoU loss. The final regression loss is the weighted sum of the two parts using specific weight coefficients. The CIoU loss is calculated by

L_{CIoU} = 1 - \frac{B \cap B^{gt}}{B \cup B^{gt}} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\nu    (1)

\alpha = \frac{\nu}{1 - IoU + \nu}    (2)

\nu = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}    (3)

where b and b^{gt} denote the center points of the predicted and actual bounding boxes, respectively; the Euclidean distance between these two center points is represented by ρ; c signifies the diagonal length of the smallest rectangle enclosing both the predicted and actual boxes; and the heights and widths of the predicted and actual bounding boxes are represented by h, h^{gt}, w, and w^{gt}, respectively.

Fig. 2 The sensitivity analysis of CIoU on different-sized objects

As depicted in Figure 2, the CIoU loss function displays varied sensitivity towards objects of different sizes. Small objects, having fewer image pixels than regular-sized ones, show substantial fluctuations in their CIoU value under minor positional changes of the predicted bounding box, whereas the same positional changes for regular-sized objects produce only minimal variations in the CIoU value. This sensitivity of the CIoU value for small objects causes their labels to flip easily between positive and negative during sample assignment, producing positive and negative samples with similar features. This condition complicates the convergence of the object detection network during training.
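As a reference implementation of equations (1)-(3), the PyTorch sketch below computes the CIoU loss for boxes given in the (cx, cy, w, h) format used in this section; it is a simplified standalone version written for illustration, not the exact routine inside YOLOv8.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # CIoU loss of equations (1)-(3); pred and target are (..., 4) tensors in (cx, cy, w, h) format.
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    tx1, ty1 = target[..., 0] - target[..., 2] / 2, target[..., 1] - target[..., 3] / 2
    tx2, ty2 = target[..., 0] + target[..., 2] / 2, target[..., 1] + target[..., 3] / 2

    # IoU term of equation (1)
    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter + eps
    iou = inter / union

    # squared center distance rho^2 and squared diagonal c^2 of the smallest enclosing box
    rho2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    c2 = (torch.max(px2, tx2) - torch.min(px1, tx1)) ** 2 + \
         (torch.max(py2, ty2) - torch.min(py1, ty1)) ** 2 + eps

    # aspect-ratio consistency term, equations (2) and (3)
    v = (4 / math.pi ** 2) * (torch.atan(target[..., 2] / target[..., 3]) -
                              torch.atan(pred[..., 2] / pred[..., 3])) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v  # equation (1)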

In this paper, the CIoU used by YOLOv8 is replaced with the Normalized Gaussian Wasserstein Distance [16]. The idea is to convert the bounding box of the object into a two-dimensional Gaussian distribution and then measure the similarity between the two distributions using the Wasserstein distance. More specifically, for a bounding box R = (cx, cy, w, h), where cx, cy, w, and h denote the bounding box's center coordinates, width, and height, respectively, the equation of its inscribed ellipse is

\frac{(x - c_{x})^{2}}{(w/2)^{2}} + \frac{(y - c_{y})^{2}}{(h/2)^{2}} = 1    (4)

where (cx, cy) denotes the coordinate of the ellipse's center, while w/2 and h/2 are the semi-axis lengths along the x and y axes, respectively.

The probability density function of a two-dimensional Gaussian distribution can be expressed as

f(X \mid \mu, \Sigma) = \frac{1}{2\pi|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^{T}\Sigma^{-1}(X - \mu)\right)    (5)

where X is the position variable, μ is the mean vector, and Σ is the covariance matrix.

Because equation (4) can be converted into the form of equation (6), the inscribed ellipse is a density contour of the two-dimensional Gaussian distribution N(μ, Σ), with μ and Σ given in equation (7):

(X - \mu)^{T}\Sigma^{-1}(X - \mu) = 1    (6)

\mu = \begin{bmatrix} c_{x} \\ c_{y} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} w^{2}/4 & 0 \\ 0 & h^{2}/4 \end{bmatrix}    (7)

For two Gaussian distributions N1(m1, Σ1) and N2(m2, Σ2) modeling two bounding boxes, the Wasserstein distance between them can be defined as

W_{2}^{2}(N_{1}, N_{2}) = \left\| \left[c_{x1}, c_{y1}, \tfrac{w_{1}}{2}, \tfrac{h_{1}}{2}\right]^{T} - \left[c_{x2}, c_{y2}, \tfrac{w_{2}}{2}, \tfrac{h_{2}}{2}\right]^{T} \right\|_{F}^{2}    (8)

where ||·||_F is the Frobenius norm.

The research in [16] suggests that the Normalized Gaussian Wasserstein Distance loss function maintains consistent sensitivity across objects of different sizes, exhibiting smooth value changes for smaller objects. This method can provide a more detailed depiction of the object's spatial distribution, thus improving object detection and localization effectiveness in numerous computer vision tasks.
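To make equations (7) and (8) concrete, the sketch below converts (cx, cy, w, h) boxes into their Gaussian parameters and evaluates the Wasserstein distance; the exponential normalization into a similarity score follows the formulation in [16], but the function names and the default constant C are our own illustrative assumptions (C is a dataset-dependent hyperparameter).

import torch

def wasserstein_distance_sq(pred, target):
    # Squared 2-Wasserstein distance of equation (8); pred/target are (..., 4) tensors in (cx, cy, w, h).
    # Each box is modeled as a 2D Gaussian with mean (cx, cy) and covariance diag(w^2/4, h^2/4),
    # as in equations (6)-(7), so the distance reduces to a difference of (cx, cy, w/2, h/2) vectors.
    p = torch.stack([pred[..., 0], pred[..., 1], pred[..., 2] / 2, pred[..., 3] / 2], dim=-1)
    t = torch.stack([target[..., 0], target[..., 1], target[..., 2] / 2, target[..., 3] / 2], dim=-1)
    return ((p - t) ** 2).sum(dim=-1)

def nwd_loss(pred, target, c=12.8):
    # Normalized Gaussian Wasserstein Distance loss in the spirit of [16]:
    # the distance is mapped to a (0, 1] similarity via exp(-W2 / C), and the loss is 1 - similarity.
    # The constant c is a dataset-dependent hyperparameter (12.8 here is only a placeholder).
    similarity = torch.exp(-torch.sqrt(wasserstein_distance_sq(pred, target)) / c)
    return 1 - similarity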
2.2.2 FasterNext

In practical applications, object detection models are often required to operate in environments with constrained computational resources, such as embedded systems and mobile devices. Consequently, there is a pressing need for efficient, lightweight models that satisfy real-time performance and resource constraints. Thus, this paper also investigates making YOLOv8 more lightweight. Many research efforts have found that feature maps show high commonality or similarity across different channels [17]. As an example, Figure 3 visualizes image features extracted from the backbone of the YOLOv8 model: the feature maps share clear similarities among different channels (red, purple, and orange boxes).

Fig. 3 Visualization of feature maps in an intermediate layer of the YOLOv8 model

Partial Convolution (PConv) [18] is an innovative convolution operation designed to efficiently extract spatial features while minimizing redundancy in computation and memory access. Unlike other convolution operations (traditional convolution utilizes all input feature map channels, while depthwise convolution partitions the input feature map into multiple groups, each subject to an independent convolution operation), PConv selectively convolves only a subset of channels, leaving the rest intact (Figure 4). The PConv operation lessens the redundancy of channel feature information during computation, thereby boosting computational efficiency. Furthermore, PConv retains pertinent information, enhancing feature representation capabilities without compromising computational efficiency.

Fig. 4 The comparison of different convolution modules

The FasterNet framework utilized in this study is constructed based on PConv (Figure 5). To optimally exploit the information available in all channels, the FasterNet architecture supplements the PConv with a pointwise convolution (PW-Conv). The effective receptive field on the input feature map visually resembles a T-shaped convolution, which, unlike the standard convolution that processes all patch positions evenly, places more emphasis on the central location. The research in [18] affirms the value of the T-shaped receptive field by quantifying the importance of each position via the Frobenius norm, under the assumption that a position with a larger Frobenius norm is typically more significant; a comprehensive analysis of each filter in a pre-trained ResNet18 confirms that the center position is most commonly the most impactful. Overall, the FasterNet network with the T-shaped receptive field principle centers its attention on the central position and facilitates a more comprehensive understanding of contextual information within the input image. The methodology helps identify which positions are more critical for feature extraction, thereby further optimizing the convolutional operations. In this study, the last C2f module in the backbone of the YOLOv8 model is replaced with a FasterNet module.

Fig. 5 Overall architecture of our FasterNet
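The PyTorch sketch below illustrates the PConv idea and the PConv + PW-Conv pattern described above; it follows the block structure reported in [18], but the class names, the 1/4 partial ratio, and the expansion factor are illustrative assumptions rather than the exact module used in our model.

import torch
import torch.nn as nn

class PConv(nn.Module):
    # Partial convolution: a 3x3 conv is applied to the first dim // n_div channels only;
    # the remaining channels are passed through untouched.
    def __init__(self, dim, n_div=4):
        super().__init__()
        self.dim_conv = dim // n_div
        self.dim_untouched = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

class FasterNetBlock(nn.Module):
    # PConv followed by two pointwise convolutions (the PW-Conv of Fig. 5), with a residual connection.
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.pwconv = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x):
        return x + self.pwconv(self.pconv(x))

# quick shape check: the block preserves the feature map shape
out = FasterNetBlock(256)(torch.randn(1, 256, 20, 20))  # -> torch.Size([1, 256, 20, 20])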

2.2.3 Context aggregation

Fig. 6 The architecture of the Context Aggregation block and the improved C2f module

In the analysis of remote sensing images, we observe that objects often occupy only a tiny portion of the image area, leaving large areas of background information that may not be relevant. The conventional design does not discern between object and background information and may inadvertently incorporate excessive uninformative background features. To address this challenge, this paper proposes augmenting the C2f module with the Context Aggregation block [19], which is specifically designed to evaluate the informativeness of each pixel in the image. Figure 6(a) shows the structure of the Context Aggregation block, which embodies a two-branch architecture: one branch is dedicated to computing global context information, and the other is dedicated to local feature extraction.

These branches are fused via a weighted summation operation controlled by a gating mechanism. The gating mechanism is adaptively weighted based on the input features, preserving valuable global information while diminishing irrelevant content. Specifically, the operations within the Context Aggregation block are as follows (a simplified code sketch is given at the end of this subsection):
• Global context information extraction: a convolutional layer paired with a global average pooling layer extracts global context information, enhancing the model's understanding of the surroundings of a target.
• Local feature extraction: local features are derived through another convolutional layer.
• Gating mechanism: a gating signal is computed using a combination of convolutional layers and an activation function. This signal controls the weighted summation of the global context information and the local features.
• Weighted summation: the gating signal is incorporated into the weighted summation of the global context information and the local features, yielding the final fused feature representation.

Overall, the underlying principle of the Context Aggregation block is straightforward: if a pixel's features are deemed sufficiently informative, there is little need to aggregate features from other spatial positions. The methodology maintains an adaptive balance between incorporating essential global context and preserving distinct local features, thereby enhancing the model's ability to discern features within a larger context while retaining detailed local differences. As depicted in Figure 6(b), the enhancement presented in this study incorporates a context module following the C2f module, improving the model's capability to learn contextual information.
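The following PyTorch sketch mirrors the four operations listed above (global context via global average pooling, local features via a convolution, a sigmoid gate, and a weighted summation); the exact layer layout of the block in [19] may differ, so the class name and layer choices here are illustrative assumptions.

import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    # Simplified gated fusion of global context and local features (cf. Fig. 6(a)).
    def __init__(self, channels):
        super().__init__()
        self.global_conv = nn.Conv2d(channels, channels, 1)   # global branch: 1x1 conv ...
        self.gap = nn.AdaptiveAvgPool2d(1)                    # ... followed by global average pooling
        self.local_conv = nn.Conv2d(channels, channels, 1)    # local branch: 1x1 conv at full resolution
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())  # per-pixel gate in [0, 1]

    def forward(self, x):
        g = self.gap(self.global_conv(x))   # global context, shape (B, C, 1, 1), broadcast over positions
        l = self.local_conv(x)              # local features, shape (B, C, H, W)
        w = self.gate(x)                    # adaptive gating signal, shape (B, C, H, W)
        return x + w * g + (1 - w) * l      # gated weighted summation with a residual path

# the block preserves the feature map shape
y = ContextAggregation(128)(torch.randn(1, 128, 40, 40))  # -> torch.Size([1, 128, 40, 40])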
3.1 Data introduction

This paper utilized a combined experimental dataset comprising several open-source collections, including the NWPU VHR-10 remote sensing images, the RSOD remote sensing images, and the VisDrone UAV images. As the primary component of the experimental dataset, the NWPU VHR-10 dataset contains ten categories: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, ground tracks, ports, bridges, and vehicles. Despite this diversity, the NWPU VHR-10 dataset is somewhat limited in size, with approximately 600 samples. The aircraft and storage tank classes from the RSOD dataset and the vehicle class from the VisDrone dataset were therefore incorporated into the NWPU VHR-10 dataset to enrich the experimental data, expanding the final dataset to approximately 1500 samples. The datasets utilized in this paper are depicted in Figure 7.

Fig. 7 Visualization of partial experimental datasets

3.2 Evaluation metric

The Precision-Recall (PR) curves exhibit the interaction between precision and recall across different confidence thresholds. The AP value is derived from the area under the precision-recall curve, and the mean Average Precision (mAP) is calculated as the average of the AP values over all categories:

AP = \int_{0}^{1} p(r)\,dr    (9)

mAP = \frac{1}{m}\sum_{i=1}^{m} AP_{i}    (10)

Frames Per Second (FPS) is a critical performance metric for object detection algorithms, quantifying the number of image frames the algorithm can process per second. FPS is calculated as the reciprocal of the time T taken to process one frame (in seconds):

FPS = \frac{1}{T}    (11)
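For reference, the sketch below computes AP by numerically integrating a precision-recall curve, averages the per-class AP values into mAP, and derives FPS from a measured per-frame time, mirroring equations (9)-(11); the all-point interpolation used here is one common convention, and the function names are our own.

import numpy as np

def average_precision(recall, precision):
    # Equation (9): area under the precision-recall curve (all-point interpolation).
    # recall is assumed to be sorted in increasing order.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce a monotonically decreasing envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    # Equation (10): mAP is the mean of the per-class AP values.
    return float(np.mean(ap_per_class))

def fps(seconds_per_frame):
    # Equation (11): frames per second is the reciprocal of the per-frame processing time.
    return 1.0 / seconds_per_frame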
3.3 Experimental configuration

The research was carried out in an experimental environment consisting of Ubuntu 18.04, Python 3.8.13, CUDA 11.4, cuDNN 8.2.2, and PyTorch 1.10.2. The hardware infrastructure incorporated an RTX-3090 GPU with 8 GB of memory; the CPU was an Intel(R) Core(TM) i7-6500M operating at a frequency of 3.20 GHz.

4 Experimental results and analysis

4.1 Ablation experiments

4.1.1 Experimental scheme design

In this section, the performance of the proposed improvements is assessed systematically via ablation studies on the experimental datasets. The datasets were apportioned into training, validation, and testing sets following a 7:1:2 ratio. In addition, the fairness of the comparative experiments was ensured by setting the division of the datasets and the experimental parameters uniformly, including the input image size (640), the number of training epochs (100), the choice of optimizer (Adam), and the batch size (45).
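For reproducibility, these uniform settings can be expressed as a training call in the ultralytics framework, as in the minimal sketch below; the dataset configuration path is a placeholder (the 7:1:2 split is assumed to be prepared in advance), and the call shows the baseline YOLOv8-n rather than our modified network.

from ultralytics import YOLO

# Placeholder dataset config describing the combined NWPU VHR-10 / RSOD / VisDrone data
# with its 7:1:2 train/val/test split; the file name is an assumption, not from the paper.
DATA_CFG = "remote_sensing.yaml"

model = YOLO("yolov8n.yaml")      # baseline YOLOv8-n; the improved model would use a modified config
model.train(
    data=DATA_CFG,
    imgsz=640,                    # input image size
    epochs=100,                   # training epochs
    batch=45,                     # batch size
    optimizer="Adam",             # optimizer choice
)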

Table 1 Test results after different improvement strategy combinations

YOLOv8-n algorithm models   Wasserstein Distance Loss   FasterNext   Context Aggregation
Experiment 1                —                           —            —
Experiment 2                √                           —            —
Experiment 3                —                           √            —
Experiment 4                —                           —            √
Experiment 5                √                           √            —
Experiment 6                √                           —            √
Experiment 7                —                           √            √
Experiment 8                √                           √            √

Table 1 summarizes the YOLOv8-n models with different improvement strategies: Wasserstein Distance Loss, FasterNext, and Context Aggregation. Each experiment in the table indicates which improvement strategies were implemented. Experiment 1 served as the baseline, where no improvement strategy was applied, providing a reference for evaluating the subsequent experiments. As shown in Table 2, the individual application of each strategy displayed a noteworthy increase in mAP over the baseline model (Experiment 1), highlighting the positive impact of these strategies. An insightful finding is the simultaneous improvement in mAP and reduction in model complexity exhibited by the integration of Wasserstein Distance Loss and FasterNext (Experiment 5), outperforming the scenarios where these strategies are implemented separately. However, the combinations of Wasserstein Distance Loss or FasterNext with Context Aggregation (Experiments 6 and 7) did not concurrently improve mAP while decreasing model complexity, suggesting potential interference among strategies that may undermine their benefits. Notably, the holistic application of all three strategies (Experiment 8) yielded the maximum mAP, with a model complexity comparable to the baseline.

Table 2 Comparison of overall detection performance

Model          mAP     Parameters   GFLOPs
Experiment 1   0.866   3007598      8.1
Experiment 2   0.876   3007598      8.1
Experiment 3   0.870   2721902      7.9
Experiment 4   0.872   3200750      8.4
Experiment 5   0.881   2721902      7.9
Experiment 6   0.864   3200750      8.4
Experiment 7   0.851   2848620      8.1
Experiment 8   0.889   2848620      8.1

4.1.2 Grad-CAM Visualization

Fig. 8 Grad-CAM visualization results

Gradient-weighted Class Activation Mapping (Grad-CAM) [20] is a tool for understanding how object detection models make predictions. It shows which areas in an image were crucial for the model's prediction, allowing researchers to better understand the model's decision-making process by highlighting these key areas within objects. In this study, Grad-CAM is used to show how the modifications improved the ability of the YOLOv8-n backbone to extract image features and focus on the right areas. According to the Grad-CAM visualizations (Figure 8), the improved YOLOv8-n model is well suited to highlighting important areas within objects and drawing out their key features. This is particularly useful when dealing with remote sensing images, which often contain targets of different scales and have complex backgrounds.
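A generic Grad-CAM computation in the spirit of [20] is sketched below: it hooks a chosen backbone layer, back-propagates a scalar score, and weights the activations by the spatial mean of their gradients. The paper does not state which layer or which detection score was used, so the score function and the hook placement here are simplifying assumptions; "model" denotes the underlying torch module of the detector.

import torch

def grad_cam(model, layer, image, score_fn=lambda out: out.sum()):
    # Generic Grad-CAM: weight the activations of `layer` by the spatial mean of their
    # gradients with respect to a scalar score, then keep the positive part.
    # score_fn is a simplification; the paper does not specify the exact target that is back-propagated.
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, inp, out: acts.append(out))
    h2 = layer.register_full_backward_hook(lambda m, gin, gout: grads.append(gout[0]))

    model.zero_grad()
    out = model(image)                     # forward pass through the detector
    if isinstance(out, (list, tuple)):     # some detectors return several output heads
        out = out[0]
    score_fn(out).backward()               # back-propagate a scalar score
    h1.remove(); h2.remove()

    a, g = acts[0], grads[0]                                # (1, C, H, W) activations and gradients
    weights = g.mean(dim=(2, 3), keepdim=True)              # channel-wise importance weights
    cam = torch.relu((weights * a).sum(dim=1)).squeeze(0)   # (H, W) class-activation map
    return cam / (cam.max() + 1e-8)                         # normalized to [0, 1] for overlaying on the image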

4.2 Comparison with other models

In this section, the improved YOLOv8-n model is compared with other models, namely YOLOv5-n, YOLOv5-s, YOLOX-n, YOLOX-s, and YOLOv7-tiny. Figure 9 presents the Precision-Recall (PR) curves of the different object detection algorithms, with the area under the curve (AUC) signifying the performance of each model. The PR curve of the improved YOLOv8-n model is observed to be closest to the upper-right corner, suggesting that it maintains the highest precision for a given recall compared to the other models. This performance suggests that the improvements incorporated into the YOLOv8-n model enhanced its object detection capabilities.

Fig. 9 The comparison of PR curves with different models

The performance of the different models is analyzed from three aspects: the accuracy of the model, the complexity of the model, and the speed of model inference. All experimental results are shown in Table 3.

Table 3 Comparison with other detection models

Object detection algorithm       mAP     Parameters   GFLOPs   FPS
YOLOv5-s                         0.857   7037095      15.8     57.14
YOLOv5-n                         0.818   1772695      4.2      60.98
YOLOX-s                          0.883   8044109      21.6     51.81
YOLOX-n                          0.807   2016061      5.6      53.76
YOLOv7-tiny                      0.872   6031950      13.1     52.36
YOLOv8-n                         0.866   3007598      8.1      56.18
Improved YOLOv8-n (this paper)   0.889   2848620      8.1      58.14

(1) Model accuracy – Parameters

In terms of the model's accuracy and complexity, the detection accuracy of the YOLOv5-n and YOLOv5-s models is relatively low. The YOLOX-s model yields a high mAP value of 0.883, showing excellent object detection performance; however, it comes with the largest parameter count of 8044109, suggesting that its excellent precision comes at the cost of high complexity. For YOLOX-n, despite a parameter reduction to 2016061, the mAP drops to 0.807, the lowest in the comparison. The YOLOv7-tiny model attains a good mAP of 0.872 with a relatively moderate parameter count of 6031950, striking a balance between precision and model complexity. YOLOv8-n (the baseline) performs reasonably well, with an mAP of 0.866 and a significantly reduced parameter count of 3007598, proving its efficiency in object detection tasks. The improved YOLOv8-n model stands out in the comparison, boasting the highest mAP of 0.889 and thus superior detection accuracy. Remarkably, it achieves this with a further reduced parameter count of 2848620, which underlines our success in improving accuracy and efficiency. Overall, the comparison reaffirms the success of the modifications made to the YOLOv8-n model. As depicted in Figure 10, the model demonstrates an optimal balance, offering high precision and lower complexity, a significant advancement in object detection models.

Fig. 10 Comparison of detection accuracy and parameters from different models

(2) Model accuracy – FPS

Starting with the YOLOv5 series, the YOLOv5-s and YOLOv5-n models show relatively low detection accuracy, with mAPs of 0.857 and 0.818, respectively; despite exhibiting a high FPS, they are still lacking in accuracy. The YOLOX-s model achieves a high mAP value of 0.883, demonstrating excellent object detection performance. However, the superior accuracy comes at the expense of computational efficiency, as its FPS of 51.81 is the lowest among the models compared. Its counterpart, the YOLOX-n model, offers a faster FPS of 53.76 but experiences a significant drop in mAP to 0.807. The YOLOv7-tiny model balances accuracy and computational efficiency, providing an acceptable mAP of 0.872 and a moderate FPS of 52.36. The YOLOv8-n baseline performs adequately, with an mAP of 0.866 and an FPS of 56.18, demonstrating its proficiency in executing object detection tasks. The improved YOLOv8-n model offers an ideal equilibrium between exceptional accuracy and computational efficiency: it exhibits the highest mAP of 0.889, signifying superior detection accuracy, while maintaining an impressive FPS of 58.14. Figure 11 demonstrates our achievement in improving model accuracy and detection speed.

Fig. 11 Comparison of detection accuracy and speed from different models

4.3 Visualization

Fig. 12 Detection results of YOLOv8-n and improved YOLOv8-n model in multiple scenes
To effectively illustrate the improved algorithm's superiority, this paper performed inference on three distinct sets of images from the test dataset. All experiments were carried out under identical parameters and device conditions. The visual results of these experiments are shown in Figure 12: the baseline YOLOv8-n algorithm exhibits varying degrees of missed detections across different images, whereas the improved YOLOv8-n model significantly ameliorates this issue, improving object detection performance.

5 Conclusion

This study enhances the YOLOv8-n object detection algorithm by applying the Wasserstein Distance Loss, FasterNext, and Context Aggregation strategies. Grad-CAM visualizations of the model's decision-making process showed that the improved YOLOv8-n model can highlight crucial areas within objects and has enhanced feature extraction capabilities. Furthermore, the improved YOLOv8-n model was compared with existing models, namely YOLOv5-n, YOLOv5-s, YOLOX-n, YOLOX-s, and YOLOv7-tiny.

The improved model demonstrated a superior balance between detection accuracy and model complexity, outperforming the other models regarding mAP, model parameters, and FPS. Overall, the improved model displayed superior performance in precision, computational efficiency, and object detection capability. Future studies may explore more effective strategy combinations to further enhance the performance of object detection models.

Author contributions: Ruihan Bai wrote the main manuscript text and performed the experiments. Feng Shen gave guidance and suggestions for the model improvements and reviewed and edited the manuscript. Mingkang Wang summarized the methods of the existing related research. Jiahui Lu drew the diagrams (model structure and related improvements) and provided suggestions for improving the article's layout. Zhiping Zhang contributed to the checking, review, and editing of the article. All authors reviewed and revised the manuscript.

Funding: This study was supported by the Qinglan Project of Jiangsu Province and the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Availability of data and materials: The data supporting the conclusions of this article are included within the article. Any queries regarding these data may be directed to the corresponding author.

Declarations

Conflict of interest: The authors declare that they have no conflict of interest.

References

1. Chen, G., Wang, H., Chen, K., Li, Z., Song, Z., Liu, Y., Chen, W., Knoll, A.: A Survey of the Four Pillars for Small Object Detection: Multi-scale Representation, Contextual Information, Super-Resolution, and Region Proposal. IEEE T. Syst Man Cy-S. 52(2), 936–953 (2022)
2. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature Pyramid Networks for Object Detection. arXiv preprint arXiv:1612.03144 (2017)
3. Wang, J., Tao, X., Xu, M., Duan, Y., Lu, J.: Hierarchical Objectness Network for Region Proposal Generation and Object Detection. Pattern Recogn. 83, 260–272 (2018)
4. Han, J., Ding, J., Xue, N., Xia, G.-S.: ReDet: A Rotation-Equivariant Detector for Aerial Object Detection. arXiv preprint arXiv:2103.07733 (2021)
5. Zand, M., Etemad, A., Greenspan, M.: Oriented Bounding Boxes for Small and Freely Rotated Objects. IEEE T. Geosci Remote. 60, 1–15 (2022)
6. Yu, D., Xu, Q., Guo, H., Xu, J., Lu, J., Lin, Y., Liu, X.: Anchor-Free Arbitrary-Oriented Object Detector Using Box Boundary-Aware Vectors. IEEE J. Stars. 15, 1–1 (2022)
7. Benjumea, A., Teeti, I., Cuzzolin, F., Bradley, A.: YOLO-Z: Improving Small Object Detection in YOLOv5 for Autonomous Vehicles. arXiv preprint arXiv:2112.11798 (2023)
8. Ren, Y., Zhu, C., Xiao, S.: Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Appl Sci-Basel. 8(5), 813 (2018)
9. Liu, S., Huang, D., Wang, Y.: Receptive Field Block Net for Accurate and Fast Object Detection. arXiv preprint arXiv:1711.07767 (2018)
10. Bosquet, B., Mucientes, M., Brea, V. M.: STDnet: Exploiting High Resolution Feature Maps for Small Object Detection. Eng. Appl Artif Intel. 91, 103615 (2020)
11. Zheng, Z., Zhong, Y., Ma, A., Han, X., Zhao, J., Liu, Y., Zhang, L.: HyNet: Hyper-Scale Object Detection Network Framework for Multiple Spatial Resolution Remote Sensing Imagery. Isprs J. Photogramm. 166, 1–14 (2020)
12. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual Generative Adversarial Networks for Small Object Detection. arXiv preprint arXiv:1706.05274 (2017)
13. Ji, S., Ling, H., Han, F.: An improved algorithm for small object detection based on YOLO v4 and multiscale contextual information. Comput Electr Eng. 105, 108490 (2023)
14. Zhou, Q., Zhang, W., Li, R., Wang, J., Zhen, S., Niu, F.: Improved YOLOv5-S object detection method for optical remote sensing images based on contextual transformer. J. Electron Imaging. 31(4), 043049 (2022)
15. Wu, W., Liu, H., Li, L., Long, Y., Wang, X., Wang, Z., Li, J., Chang, Y.: Application of local fully Convolutional Neural Network combined with YOLO v5 algorithm in small target detection of remote sensing image. Plos One. 16(10), 0259283 (2021)
16. Wang, J., Xu, C., Yang, W., Yu, L.: A normalized Gaussian
Wasserstein distance for tiny object detection. arXiv
preprint arXiv:2110.13389 (2021)
17. Zhang, Q., Jiang, Z., Lu, Q., Han, J., Zeng, Z., Gao, S., Men,
A.: Split to be slim: An overlooked redundancy in vanilla
convolution. arXiv preprint arXiv:2006.12085 (2020)
18. Chen, J., Kao, S., He, H., Zhuo, W., Wen, S., Lee, C.H.,
Chan, S.H.: Run, Don't Walk: Chasing Higher FLOPS for
Faster Neural Networks. arXiv preprint arXiv:2303.03667
(2023)
19. Liu, Y., Li, H., Hu, C., Luo, S., Luo, Y., Chen, C.: Learning
to Aggregate Multi-Scale Context for Instance
Segmentation in Remote Sensing Images. arXiv preprint
arXiv:2111.11057 (2022)
20. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., Batra, D.: Grad-cam: Visual explanations from
deep networks via gradient-based localization. In
Proceedings of the IEEE international conference on
computer vision, pp 618-626. (2017)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
