Research Article
DOI: https://doi.org/10.21203/rs.3.rs-3085871/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Ruihan Bai1, Feng Shen2,*, Mingkang Wang3, Jiahui Lu3, Zhiping Zhang3
1 School of Civil and Transportation Engineering, Hohai University, Nanjing, 210000, Jiangsu, China
2 School of Civil Engineering, Suzhou University of Science and Technology, Suzhou, 215011, Jiangsu, China
3 School of Civil Engineering, Tongji University, Shanghai, 200000, China
Abstract. This study presents a comprehensive analysis and improvement of the YOLOv8-n algorithm for object detection, focusing on the integration of Wasserstein Distance Loss, FasterNext, and Context Aggravation strategies. Through a detailed ablation study, each strategy was systematically evaluated, individually and collectively, to assess its contribution to the model's performance. The results indicate that each strategy uniquely enhances the model's performance, significantly increasing mAP and reducing model complexity when all three are integrated. Visualizations through Grad-CAM further substantiate the improved model's capacity to extract and focus on key object features. Compared with existing models such as YOLOv5-n, YOLOv5-s, YOLOX-n, YOLOX-s, and YOLOv7-tiny, the improved YOLOv8-n model achieves an optimal balance between accuracy and model complexity, outperforming the other models in terms of accuracy, complexity, and inference speed. Further image inference tests validate the model's performance, showcasing its
Ren et al. [8] developed an improved version of the Faster R-CNN approach that leverages enhanced techniques. They ensure comprehensive visualization of all identified items by employing a similar architecture that avoids connections and generates a single high-resolution, high-level feature map. Liu et al. [9] proposed a multi-block SSD method incorporating sub-layers to detect and expand local context information. The test results compared the performance of multiple SSDs with conventional SSDs; the presented algorithm increased the detection rate of small objects by 23.2%. Bosquet et al. [10] introduced STDNet and ConvNet as approaches for identifying tiny objects smaller than 16 × 16 pixels based on regional concepts. STDNet utilizes an additional visual attention process called RCN to select the most probable candidate area, which includes one or more small items and their surrounding context. This enhances accuracy while conserving memory and increasing the frame rate. The study also incorporates automated k-means anchoring, which improves upon traditional heuristics. Zheng et al. [11] introduced a novel HyNet framework for large-scale target recognition in MSR remote sensing imaging. The framework represents a significant advancement in the study of scale-invariant functions. One notable feature of HyNet is the incorporation of display zoom functions, which utilize pyramid-shaped detection areas. These display zoom functions enable more precise object detection across multiple scales in MSR remote sensing images, offering new research possibilities in this domain. Li et al. [12] utilized generative adversarial learning to map the features of small low-resolution objects into equivalent features resembling those of high-resolution objects. The technique aims to achieve comparable detection performance for smaller-sized objects as observed for larger-sized objects. Ji et al. [13] introduced an attention module in the neck of YOLOv4 and the Soft-CIoU loss function to improve the detection accuracy of small objects. Zhou et al. [14] utilize the high-speed and accurate YOLOv5-S as the foundation, incorporating the Contextual Transformer (CoT) module to optimize the residual neural network within the backbone feature extraction network. Wu et al. [15] integrate a local FCN and YOLO v5 to enhance small target detection in remote sensing images. The approach surpasses other algorithms, especially in detecting small objects such as tennis courts, vehicles, and storage tanks.

In small object detection, most existing research is based on earlier models such as YOLO and Fast RCNN. While these studies have somewhat improved small object detection, they still have several shortcomings. (1) Technological obsolescence: deep learning is advancing at an extraordinary rate, with new models, frameworks, and technologies constantly emerging. Although the models of these studies were cutting-edge at the time, recent advancements offer superior performance, higher efficiency, and more accurate results. Thus, persisting in improving older models may overlook these new developments, thereby constraining the efficacy of the enhancement efforts. (2) Performance limitations: the improvements might be constrained by the inherent performance limitations of the earlier models they are based on. These models may inherently have disadvantages, such as large model parameters and slow inference speed, thereby restricting the effectiveness of improvements in practical applications. In response to the above shortcomings, this paper shifts the research focus to newer models such as YOLOv8. YOLOv8, as the latest model, has improved significantly in various aspects, including its ability to handle small objects. By conducting improvement research based on YOLOv8, we may further enhance the performance of small object detection. Thus, this research made three improvements to the YOLOv8-n model for small object detection based on remote sensing data. While enhancing the model's accuracy, these modifications also contribute to the model's lightweight design and improve its inference speed. The improvement points include the following:
• Use the Wasserstein Distance loss function instead of the CIoU loss function to evaluate the similarity between predicted and ground-truth object bounding boxes, ensuring consistent sensitivity across objects of varying sizes.
• Due to the high similarity across different channels, replace the backbone's last feature extraction module, C2f, with the lightweight FasterNext module to reduce model complexity and improve the computational efficiency of the model.
• Incorporate the Context module into the C2f module to improve the model's capability to learn contextual information.

2 Method

2.1 Fundamentals of the YOLOv8 model

Fig. 1 YOLOv8 model

The backbone of YOLOv8 primarily comprises the C2f module, inspired by the ELAN module. The architecture of the C2f module integrates two parallel gradient flow branches, facilitating a more robust gradient information flow. Additionally, YOLOv8 uses the Spatial Pyramid Pooling Fusion (SPPF) module, a characteristic also found in architectures such as YOLOv5. This module's ability to extract contextual information from images at varying scales significantly enhances the model's generalization capabilities. In YOLOv8's architectural design, the model neck uses Path Aggregation Network (PAN) principles. A direct comparison between YOLOv8 and its predecessor, YOLOv5, shows that YOLOv8 has implemented two critical modifications within the Feature Pyramid Network (FPN): the model removed convolutional structures during the up-sampling phase and strategically replaced the C3 module with the C2f module. In the detection head, YOLOv8 introduces a "Decoupled-Head" design. This novel concept aims to reduce the degree of coupling between the different tasks associated with object detection, such as classification and localization.
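The SPPF module mentioned above can be illustrated as three chained stride-1 max-pools whose outputs are concatenated with the input along the channel axis. The following numpy sketch shows only that pooling core; the actual SPPF also wraps the pooling between 1×1 convolutions, and all function names here are ours, not YOLOv8's:

```python
import numpy as np

def maxpool2d_same(x, k=5):
    """Stride-1 max pooling with 'same' padding (k odd); x has shape (C, H, W)."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sppf_pooling(x, k=5):
    """SPPF core: chaining three k x k pools emulates k, 2k-1, and 3k-2
    receptive fields; input and all three pooled maps are concatenated,
    giving 4x the channels."""
    p1 = maxpool2d_same(x, k)
    p2 = maxpool2d_same(p1, k)
    p3 = maxpool2d_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=0)
```

Chaining small pools instead of applying one large pool per scale is the efficiency trick that distinguishes SPPF from the older SPP block: the multi-scale result is the same, but each pool reuses the previous one's work.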
2.2.1 Loss function

The regression loss in YOLOv8 for object detection bounding boxes has two parts. The first portion is Distribution Focal Loss (DFL), which uses cross-entropy as an optimization mechanism to concentrate the network's predictive distribution closer to the label values. The other portion calculates the Intersection over Union (IoU) between the predicted and true bounding boxes via the CIoU loss:

L_CIoU = 1 − |B ∩ B^gt| / |B ∪ B^gt| + ρ²(b, b^gt) / c² + αv   (1)

where b and b^gt denote the center points of the predicted and actual bounding boxes … produce minimal variations in the CIoU value. The sensitivity of … sample assignment, leading to positive and negative samples with similar features. This condition complicates the convergence of the object detection network during training.

In this paper, the CIoU used by YOLOv8 is replaced with the Normalized Gaussian Wasserstein Distance [16]. The idea is to convert the bounding box of the object into a two-dimensional Gaussian distribution and then measure the similarity between the two distributions using the Wasserstein distance. More specifically, for a bounding box R = (cx, cy, w, h), cx, cy, w, and h denote the bounding box's center coordinates, width, and height, respectively. The equation of its inscribed ellipse is:

(x − cx)² / (w/2)² + (y − cy)² / (h/2)² = 1

where (cx, cy) denotes the coordinate of the ellipse's center, while (w/2, h/2) signify the lengths along the x and y axes, respectively.

The probability density function of a two-dimensional Gaussian distribution is:

f(X | μ, Σ) = exp(−(1/2)(X − μ)^T Σ⁻¹ (X − μ)) / (2π |Σ|^(1/2))

where X is the position variable, μ is the mean vector, and Σ is the covariance matrix. For two Gaussian distributions μ1 = N(m1, Σ1) and μ2 = N(m2, Σ2), the Wasserstein distance between μ1 and μ2 can be defined as:

W2²(μ1, μ2) = ||m1 − m2||² + ||Σ1^(1/2) − Σ2^(1/2)||_F²

where ||·||_F is the Frobenius norm.

The research suggests that the Normalized Gaussian Wasserstein Distance loss function maintains consistent sensitivity across objects of different sizes, exhibiting smooth
value changes for smaller objects [16]. This method can provide a more detailed depiction of the object's spatial distribution, thus improving object detection and localization effectiveness in numerous computer vision tasks.

2.2.2 FasterNext

In practical applications, object detection models are often required to operate in environments with constrained computational resources, such as embedded systems and mobile devices. Consequently, there is a pressing need for efficient, lightweight models that satisfy real-time performance and resource constraints. Thus, the lightweighting of YOLOv8 is investigated in this paper. Many research endeavors have found that feature maps show high commonality or similarity across different channels [17]. Here is an example of the image features extracted from the backbone part of the YOLOv8 model. As …

Fig. 4 The comparison of different convolution modules.

The FasterNet framework utilized in this study is constructed based on PConv (Figure 5). To optimally exploit the information available in all channels, the FasterNet architecture supplements the PConv with a pointwise convolution (PW-Conv). The effective receptive field on the input feature map visually resembles a T-shaped convolution, which, unlike the standard convolution that evenly processes patches, places more emphasis on the central location. The research affirms the value of the T-shaped receptive field by quantifying the importance of each position via calculation of the Frobenius norm. The assumption is that a position demonstrating a larger Frobenius norm is typically more significant, which is further validated by their comprehensive analysis of each filter in a pre-trained ResNet18, revealing the center position as most commonly being the most impactful [18]. Overall, the FasterNet network with the T-shaped receptive field principle centers its attention on the central position and facilitates a more comprehensive understanding of contextual information within the input image. The methodology aids in identifying which positions are more critical for feature extraction, thereby further optimizing the convolutional operations. In this study, the last C2f module in the backbone part of the YOLOv8 model is replaced with a FasterNet module.

2.2.3 Context Aggravation

The Context Aggregation block [19] is specifically designed to evaluate the informativeness of each pixel in the image. Figure 6(a) demonstrates the structure of the Context Aggregation block, which embodies a two-branch architecture: one branch is dedicated to computing global context information, and the other is dedicated to local feature extraction. These branches are fused
via a weighted summation operation controlled by a gating mechanism. The gating mechanism is adaptively weighted based on the input features, preserving valuable global information while diminishing irrelevant content. Specifically, the operations within the Context Aggregation Block are as follows:
• Global Context Information Extraction: A convolutional layer paired with a Global Average Pooling layer works together to extract global context information, enhancing the model's understanding of the surrounding context of a target.
• Local Feature Extraction: Local features are derived through another convolutional layer.
• Gating Mechanism: A gating signal is computed using a combination of convolutional layers and an activation function. This signal is then applied to the weighted summation operation between the global context information and local features.
• Weighted Summation: The gating signal is incorporated …

… the vehicle class from the VisDrone dataset were incorporated into the NWPU VHR-10 dataset to enrich the experimental datasets. Therefore, the final dataset was expanded to approximately 1500 samples. The datasets utilized in this paper are depicted in Figure 7.

3.2 Evaluation metric

The Precision-Recall (PR) curves exhibit the interaction between Precision and Recall across different confidence thresholds. The AP value is derived from the area under the precision-recall curve, and the Mean Average Precision (mAP) is calculated as the average of the AP values for all categories. The formula used to calculate AP is provided below:

AP = ∫₀¹ p(r) dr   (9)
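Equation (9) can be approximated numerically from a sampled precision-recall curve. The following pure-Python sketch uses the common all-point interpolation scheme; the paper does not specify its exact AP implementation, so treat this as an illustration rather than the authors' code:

```python
def average_precision(recalls, precisions):
    """Approximate AP = integral_0^1 p(r) dr from (recall, precision)
    pairs sorted by increasing recall, using all-point interpolation."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolated precision at recall r is the maximum precision
    # attained at any recall >= r (right-to-left running max).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Rectangle rule over the recall increments.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, a detector that keeps precision 1.0 up to recall 0.5 and finds nothing afterwards gets AP = 0.5, and mAP simply averages such per-class values.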
Table 1 Test results after different improvement strategy combinations

YOLOv8-n algorithm models | Wasserstein Distance Loss | FasterNext | Context Aggravation
Experiment 1 | — | — | —
Experiment 2 | √ | — | —
Experiment 3 | — | √ | —
Experiment 4 | — | — | √
Experiment 5 | √ | √ | —
Experiment 6 | √ | — | √
Experiment 7 | — | √ | √

Table 1 summarizes the YOLOv8-n models with different improvement strategies: Wasserstein Distance Loss, FasterNext, and Context Aggravation. Each experiment in the table indicates which improvement strategies were implemented. Experiment 1 served as the baseline, where no improvement strategy was applied, providing a reference for evaluating the subsequent experiments. As shown in Table 2, the individual application of each strategy displayed a noteworthy increment in mAP over the baseline model (Experiment 1), highlighting the positive impact of these strategies. An insightful finding is the simultaneous improvement in mAP and reduction in model complexity exhibited by the integration of Wasserstein Distance Loss and FasterNext (Experiment 5), outperforming the scenarios where …

4.1.2 Grad-CAM Visualization

Gradient-weighted Class Activation Mapping (Grad-CAM) [20] is a tool for understanding how object detection models make predictions. It shows which areas in an image were crucial for the model's prediction, allowing researchers to better understand the model's decision-making process by highlighting these key areas within objects. In this study, the researchers used this tool to show how the modifications improved the YOLOv8-n model backbone's ability to extract image features and focus on the right areas. According to the Grad-CAM visualizations (Figure 8), the improved YOLOv8-n model is well-suited to highlight important areas within objects and draw out their key features. This is particularly useful when dealing with remote sensing images, which often contain targets of different scales and have complex backgrounds.
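The Grad-CAM computation described above reduces to a few array operations once the activations and gradients of a chosen layer are available. A minimal numpy sketch (in practice both arrays come from a backbone layer via automatic differentiation, which is omitted here):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one image.
    activations: (C, H, W) feature maps of the chosen conv layer.
    gradients:   (C, H, W) gradients of the class score w.r.t. those maps."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: GAP of the gradients
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A^k -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for display
    return cam
```

The resulting (H, W) map is upsampled to the input resolution and overlaid on the image to produce visualizations like those in Figure 8.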
The performance of the different models is analyzed based on three aspects: the accuracy of the model, the complexity of the model, and the speed of model inference. All experimental results are shown in Table 3.

Table 3 Comparison with other detection models

Starting with the YOLOv5 series, the YOLOv5-s and YOLOv5-n models show relatively lower detection accuracy, with mAPs of 0.857 and 0.818, respectively. Despite exhibiting a high FPS, they are still lacking in model accuracy. The YOLOX-s model achieves a high mAP value of 0.883, demonstrating
To effectively illustrate the improved algorithm's superiority, this paper performed inference on three distinct sets of images from the test dataset. All experiments were carried out under homogeneous parameters and device conditions. The visible results of these experiments are shown in Figure 12: the YOLOv8-n algorithm exhibits varying degrees of missed detections across different images. However, the improved YOLOv8-n model significantly ameliorates this issue, improving object detection performance.

5 Conclusion

This study enhances the YOLOv8-n object detection algorithm by applying Wasserstein Distance Loss, FasterNext, and Context Aggravation strategies. The visualizations showed that the improved YOLOv8-n model could highlight crucial areas within objects and enhance its feature extraction capabilities, with Grad-CAM used to understand the model's decision-making process. Furthermore, the improved YOLOv8-n model was also compared with existing models such as YOLOv5-n, YOLOv5-s, YOLOX-n, YOLOX-s, and YOLOv7-tiny.

The improved model demonstrated a superior balance between detection accuracy and model complexity, outperforming the other models regarding mAP, model parameters, and FPS. Overall, the improved model displayed superior performance in precision, computational efficiency, and object detection capabilities. Future studies may explore more effective strategy combinations to enhance the performance of object detection models further.

Author contributions: Ruihan Bai wrote the main manuscript text and performed the experiments. Feng Shen gave guidance and suggestions for the model improvements in this article, and reviewed and edited the manuscript. Mingkang Wang summarized the methods of the existing related research. Jiahui Lu drew the diagrams (model structure and related improvements) in the article and provided suggestions for improving the article's layout. Zhiping Zhang contributed to the checking, review, and editing of the article. All authors reviewed and revised the manuscript.

Funding: This study was supported by the Qinglan Project of Jiangsu Province and the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Availability of data and materials: The data supporting the conclusions of this article are included within the article. Any queries regarding these data may be directed to the corresponding author.

Declarations

Conflict of interest: The authors declare that they have no conflict of interest.

References

1. Chen, G., Wang, H., Chen, K., Li, Z., Song, Z., Liu, Y., Chen, W., Knoll, A.: A Survey of the Four Pillars for Small Object Detection: Multi-scale Representation, Contextual Information, Super-Resolution, and Region Proposal. IEEE T. Syst Man Cy-S. 52(2), 936–953 (2022)
2. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature Pyramid Networks for Object Detection. arXiv preprint arXiv:1612.03144 (2017)
3. Wang, J., Tao, X., Xu, M., Duan, Y., Lu, J.: Hierarchical Objectness Network for Region Proposal Generation and Object Detection. Pattern Recogn. 83, 260–272 (2018)
4. Han, J., Ding, J., Xue, N., Xia, G.-S.: ReDet: A Rotation-Equivariant Detector for Aerial Object Detection. arXiv preprint arXiv:2103.07733 (2021)
5. Zand, M., Etemad, A., Greenspan, M.: Oriented Bounding Boxes for Small and Freely Rotated Objects. IEEE T. Geosci Remote. 60, 1–15 (2022)
6. Yu, D., Xu, Q., Guo, H., Xu, J., Lu, J., Lin, Y., Liu, X.: Anchor-Free Arbitrary-Oriented Object Detector Using Box Boundary-Aware Vectors. IEEE J. Stars. 15, 1–1 (2022)
7. Benjumea, A., Teeti, I., Cuzzolin, F., Bradley, A.: YOLO-Z: Improving Small Object Detection in YOLOv5 for Autonomous Vehicles. arXiv preprint arXiv:2112.11798 (2023)
8. Ren, Y., Zhu, C., Xiao, S.: Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Appl Sci-Basel. 8(5), 813 (2018)
9. Liu, S., Huang, D., Wang, Y.: Receptive Field Block Net for Accurate and Fast Object Detection. arXiv preprint arXiv:1711.07767 (2018)
10. Bosquet, B., Mucientes, M., Brea, V. M.: STDnet: Exploiting High Resolution Feature Maps for Small Object Detection. Eng. Appl Artif Intel. 91, 103615 (2020)
11. Zheng, Z., Zhong, Y., Ma, A., Han, X., Zhao, J., Liu, Y., Zhang, L.: HyNet: Hyper-Scale Object Detection Network Framework for Multiple Spatial Resolution Remote Sensing Imagery. Isprs J. Photogramm. 166, 1–14 (2020)
12. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual Generative Adversarial Networks for Small Object Detection. arXiv preprint arXiv:1706.05274 (2017)
13. Ji, S., Ling, H., Han, F.: An Improved Algorithm for Small Object Detection Based on YOLO v4 and Multiscale Contextual Information. Comput Electr Eng. 105, 108490 (2023)
14. Zhou, Q., Zhang, W., Li, R., Wang, J., Zhen, S., Niu, F.: Improved YOLOv5-S Object Detection Method for Optical Remote Sensing Images Based on Contextual Transformer. J. Electron Imaging. 31(4), 043049 (2022)
15. Wu, W., Liu, H., Li, L., Long, Y., Wang, X., Wang, Z., Li, J., Chang, Y.: Application of Local Fully Convolutional Neural Network Combined with YOLO v5 Algorithm in Small Target Detection of Remote Sensing Image. Plos One. 16(10), 0259283 (2021)
16. Wang, J., Xu, C., Yang, W., Yu, L.: A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv preprint arXiv:2110.13389 (2021)
17. Zhang, Q., Jiang, Z., Lu, Q., Han, J., Zeng, Z., Gao, S., Men, A.: Split to Be Slim: An Overlooked Redundancy in Vanilla Convolution. arXiv preprint arXiv:2006.12085 (2020)
18. Chen, J., Kao, S., He, H., Zhuo, W., Wen, S., Lee, C.H., Chan, S.H.: Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks. arXiv preprint arXiv:2303.03667 (2023)
19. Liu, Y., Li, H., Hu, C., Luo, S., Luo, Y., Chen, C.: Learning to Aggregate Multi-Scale Context for Instance Segmentation in Remote Sensing Images. arXiv preprint arXiv:2111.11057 (2022)
20. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)