Article
Revolutionizing Target Detection in Intelligent Traffic Systems:
YOLOv8-SnakeVision
Qi Liu, Yang Liu and Da Lin *
School of Mathematical Sciences, Inner Mongolia University, Hohhot 010021, China; lllmeqi77@gmail.com (Q.L.);
mathliuyang@imu.edu.cn (Y.L.)
* Correspondence: 111977331@imu.edu.cn
Abstract: Intelligent traffic systems represent one of the crucial domains in today’s world, aiming to
enhance traffic management efficiency and road safety. However, current intelligent traffic systems
still face various challenges, particularly in the realm of target detection. These challenges include
adapting to complex traffic scenarios and the lack of precise detection for multiple objects. To address
these issues, we propose an innovative approach known as YOLOv8-SnakeVision. This method
introduces Dynamic Snake Convolution, Context Aggregation Attention Mechanisms, and the Wise-
IoU strategy within the YOLOv8 framework to enhance target detection performance. Dynamic
Snake Convolution assists in accurately capturing complex object shapes and features, especially
in cases of target occlusion or overlap. The Context Aggregation Attention Mechanisms allow the
model to better focus on critical image regions and effectively integrate information, thus improving
its ability to recognize obscured targets, small objects, and complex patterns. The Wise-IoU strategy
combines dynamic non-monotonic focusing mechanisms, aiming to more precisely regress target
bounding boxes, particularly for low-quality examples. We validate our approach on the BDD100K
and NEXET datasets. Experimental results demonstrate that YOLOv8-SnakeVision excels in various
complex road traffic scenarios. It not only enhances small object detection but also strengthens
the ability to recognize multiple targets. This innovative method provides robust support for the
development of intelligent traffic systems and holds the promise of achieving further breakthroughs
in future applications.
Keywords: intelligent traffic systems; target detection; YOLOv8-SnakeVision; dynamic snake convolution; context aggregation attention mechanisms; wise-IoU

Citation: Liu, Q.; Liu, Y.; Lin, D. Revolutionizing Target Detection in Intelligent Traffic Systems: YOLOv8-SnakeVision. Electronics 2023, 12, 4970. https://doi.org/10.3390/electronics12244970

Academic Editor: Hamid Reza Karimi

Received: 31 October 2023; Revised: 6 December 2023; Accepted: 7 December 2023; Published: 12 December 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

In today's society, due to the continuous advancement of urbanization and rapid economic development, the number of motor vehicles is consistently increasing. This trend is leading to a more complex and congested road traffic environment [1–3]. This increased traffic complexity poses significant challenges to urban residents' mobility and escalates the risks of traffic accidents and violations [4,5]. In light of these issues, the development of intelligent traffic systems becomes particularly critical. Such systems not only have the potential to enhance traffic flow efficiency but also hold the promise of substantially improving road safety. At the core of intelligent traffic system technologies lies the accurate identification of motor vehicles, pedestrians, and other entities on the road, which is one of the most vital tasks [6,7]. Target detection technology, as a key component of intelligent traffic systems, provides an effective means to automatically recognize and track various entities in complex road traffic environments. In recent years, deep learning has gained widespread attention in the field of target detection due to its exceptional performance. Deep learning models not only achieve highly accurate target detection but also possess strong adaptability and excellent generalization capabilities, making them an ideal choice for addressing intricate road traffic challenges [8,9].
In the domain of object detection, despite the remarkable progress achieved through
deep learning techniques, several pivotal issues remain unresolved. Firstly, current object
detection algorithms still grapple with challenges when dealing with intricate scenarios
such as occlusions, variations in illumination, and changes in target scale [10,11]. Addi-
tionally, the robustness and versatility of the algorithms demand further enhancement to
ensure effective operation across diverse environments and tasks [12,13]. To address these
challenges, researchers have recently proposed a plethora of innovative object detection
algorithms, including DETR, EfficientDet, and YOLOv8. To begin with, the Transformer-
based DETR model offers a novel perspective for object detection [14]. This model suc-
cessfully integrates the self-attention mechanism of Transformers into object detection,
facilitating more efficient processing of contextual information within images. It circum-
vents the traditional anchor box design, directly outputting the location and category of
the target. Such a design promotes simplicity and robustness but requires more computa-
tional resources and extended training periods, potentially posing challenges for real-time
applications. Next, EfficientDet, which integrates EfficientNet as its backbone, embodies
an object detection model that achieves efficiency and precision in object detection tasks
through a compound scaling strategy [15]. Its adaptive nature ensures exemplary perfor-
mance across devices with varying computational capabilities. However, in certain specific
and complex scenarios, EfficientDet may require more in-depth fine-tuning to achieve
outstanding performance. Lastly, our attention turns to YOLOv8. As a fresh entrant in
the YOLO family, YOLOv8 further refines the detection speed and accuracy built upon
YOLOv5 [16]. Yet, in scenarios with overlapping multi-targets or when certain targets are
obscured, the absence of an attention mechanism might impede YOLOv8’s ability to recog-
nize and pinpoint the obstructed sections. Concurrently, traditional convolution operations
are similarly constrained in their capability to detect small targets and capture intricate
patterns and structural information within images. The bounding box loss function, being
an integral part of the object detection loss function, plays a crucial role in the object detec-
tion task. However, YOLOv8’s loss function overly emphasizes bounding box regression
for low-quality examples and lacks a dynamic non-monotonic focusing mechanism along
with a more judicious gradient gain allocation strategy [17]. These shortcomings evidently
jeopardize potential enhancements in the model’s detection performance.
Given the identified constraints of YOLOv8, this study introduces an advanced model
named YOLOv8-SnakeVision. Inspired by the segmentation of tubular structures based on
topological geometric constraints, we incorporated Dynamic Snake Convolution (DSConv)
into YOLOv8. This allows the model to adeptly capture slender and tortuous structural
attributes when handling obscured or overlapping targets, thereby offering heightened
precision in delineating their forms and characteristics, especially in scenarios with over-
lapping multiple targets. Furthermore, by integrating the Context Aggregation Attention
Mechanisms (CAAM), the model is better positioned to focus on pivotal segments of the
image, facilitating effective information amalgamation, thus bolstering the recognition of
obscured objects, small targets, and intricate patterns. Ultimately, we refined YOLOv8’s
loss function, adopting the Wise-IoU strategy. This approach amalgamates a dynamic
non-monotonic focusing mechanism with a judicious gradient gain allocation strategy,
aiming to more accurately regress bounding boxes, particularly for low-quality examples.
Through this strategy, the model emphasizes anchor boxes of average quality in the
training data, rather than solely accentuating high- or low-quality examples, effectively
enhancing the model’s detection and localization capabilities for targets.
The following are the three contributions of this paper:
• This paper introduces the YOLOv8-SnakeVision model, which exhibits significant
innovation in the field of object detection. By incorporating DSConv and CAAM on
top of YOLOv8, we are able to capture the shapes and features of complex objects more
accurately, especially in cases of multiple overlapping or occluded objects. This inno-
vation not only enriches the toolbox of object detection techniques but also enhances
the performance and usability of intelligent traffic systems.
• In scientific research, adaptability to multiple scenarios has always been a key con-
cern. Our study underscores the versatility of the YOLOv8-SnakeVision model across
various road traffic scenarios. The model can handle complex situations including
occlusions, overlapping objects, small targets, and intricate patterns, providing crucial
support for practical intelligent traffic applications. This multi-scenario adaptability
represents a significant contribution to this research, offering effective tools for tackling
complex traffic issues.
• Our research also highlights the Wise-IoU strategy employed in the YOLOv8-SnakeVision
model. This strategy combines a dynamic non-monotonic focusing mechanism, en-
abling more accurate regression of object-bounding boxes, particularly for low-quality
examples. The improvement in this loss function is expected to significantly en-
hance the performance of object detection algorithms, making them more reliable in
real-world road traffic environments.
Here is the structure of the remaining work. Section 2 introduces the related work in
road scene object detection. In Section 3, the principles of our approach will be elaborated.
Section 4 will describe our experimental process. Finally, Section 5 summarizes and
provides an outlook on future work.
2. Related Work
2.1. Research on Two-Stage Approaches in Object Detection
In the field of object detection, two-stage object detection algorithms are renowned
for their high accuracy. These algorithms divide the object detection task into two key
steps: region proposal and object classification. They excel in handling complex scenes and
small object detection. The earliest model, R-CNN (Region-based Convolutional Neural
Network) [18], employed selective search to generate region proposals, followed by feature
extraction and classification using convolutional neural networks (CNNs). While it demon-
strated excellent accuracy, it incurred high computational costs due to the need to process a
large number of region proposals independently [19]. SPPNet (Spatial Pyramid Pooling
Network) improved upon R-CNN by introducing spatial pyramid pooling, allowing for
variable-sized input images and reducing computational costs [20]. Fast R-CNN integrated
the region of interest (RoI) pooling layer directly into the network architecture, allowing all region proposals to share a single feature extraction pass and resulting in improved speed and accu-
racy. Faster R-CNN introduced a Region Proposal Network (RPN) that learned to generate
region proposals within the network, enabling end-to-end training and striking a balance
between high accuracy and speed [21]. Mask R-CNN extended Faster R-CNN by adding a
mask prediction branch, enabling instance segmentation in addition to object detection [22].
Sparse R-CNN introduced a sparse-aware learning framework, enhancing inference speed
by dynamically pruning unimportant regions during inference. Each two-stage object
detection model has its unique strengths and limitations, necessitating a careful choice
to meet specific requirements such as accuracy, speed, and computational resources in
different application scenarios.
Although two-stage object detection algorithms excel in various domains, including
road scene object detection, they also exhibit certain drawbacks [23]. Firstly, these algo-
rithms often come with high computational complexity due to their multi-step nature,
involving region proposal and object classification. This results in significant demands
for computational resources and time, limiting their application in real-time scenarios.
Secondly, some two-stage algorithms may experience instability when generating region
proposals, leading to inconsistent region selection and subsequently causing issues such
as false positives or missed detections, especially in scenarios with complex backgrounds
or multi-scale objects [24]. Additionally, these algorithms may not perform well in the
detection of small-sized objects, potentially leading to missed detections or inaccurate local-
ization, which poses a challenge in applications like detecting tiny objects [25,26]. Table 1
illustrates the advantages and disadvantages of two-stage object detection algorithms.
3. Method
3.1. Overview of Our Network
In this study, we propose an innovative object detection method known as YOLOv8-
SnakeVision to address the challenges in intelligent traffic systems. This method incorpo-
rates three key components into the YOLOv8 framework:
DSConv: We integrate DSConv into the backbone network of YOLOv8s for the fol-
lowing reasons. First, traditional convolutional kernels have fixed weights, resulting in
the same receptive field size when processing different regions of an image. However,
objects of different scales or deformations may correspond to different locations in feature
maps, requiring the model to adaptively adjust its receptive field. Second, DSConv closely
matches the size and shape of objects, making it more robust during sampling compared
to regular convolution. Lastly, small objects often have smaller sizes and varying shapes,
which traditional convolution may struggle to detect accurately. DSConv effectively en-
hances the detection performance of small objects. This innovation endows the backbone
network with adaptability, allowing it to better capture features of objects with different
sizes and shapes, thereby improving overall model performance.
CAAM: We introduce the CAAM into the neck of YOLOv8s. This choice is based on the
module’s ability to adaptively select and adjust channel and spatial weights in feature maps,
thus effectively capturing and representing crucial image features. Furthermore, CAAM
requires relatively lower computational overhead, making it computationally efficient
compared to other attention mechanisms. In the YOLOv8 network, the neck plays a critical
role in connecting the backbone network and prediction output heads. Due to its unique
bottom-up and top-down construction, it facilitates a comprehensive fusion of features
at different scales, laying the foundation for subsequent predictions. Hence, the network
structure in the neck significantly influences the algorithm’s performance.
Wise-IoU Loss: Training data often includes low-quality examples, where conventional
geometric metrics may overly penalize these instances, reducing model generalization. To
address this issue, we introduce the Wise-IoU loss, which dynamically adjusts bounding
box regression loss while reducing penalties for metrics like distance and aspect ratio. This
approach better considers multiple factors between predicted and ground truth boxes,
including IoU, position, size, and shape, resulting in improved detection accuracy.
By combining these three key components, the YOLOv8-SnakeVision method excels
in complex traffic scenarios. It not only enhances the detection of small objects but also
improves the recognition of multiple targets. This innovative approach provides robust
support for the development of intelligent traffic systems, with the potential for further
breakthroughs in future applications. It aims to enhance urban traffic safety and efficiency,
reduce traffic accidents, and alleviate congestion, offering increased convenience and safety
for future intelligent traffic systems. Figure 1 illustrates the overall network architecture of
YOLOv8-SnakeVision.
Figure 2. Left: illustration of the coordinate calculation of DSConv. Right: the receptive field of DSConv.
To grant the convolution kernel more flexibility in focusing on the target’s intricate
geometric features, we introduce a deformation offset, ∆. However, allowing the model to
learn these offsets freely might lead the receptive field astray from the target. To address
this, DSConv adjusts the convolution kernel in both x-axis and y-axis directions.
For the x-axis direction, the coordinates are:
$$K_{i \pm c} = \begin{cases} (x_{i+c},\, y_{i+c}) = \left(x_i + c,\; y_i + \sum\nolimits_{i}^{i+c} \Delta y\right), \\ (x_{i-c},\, y_{i-c}) = \left(x_i - c,\; y_i + \sum\nolimits_{i}^{i-c} \Delta y\right), \end{cases} \tag{2}$$
$$K = \sum_{K'} B(K', K) \cdot K' \tag{4}$$
Here, B is the bilinear interpolation kernel, which can be decomposed into two one-dimensional interpolation kernels:

$$B(K', K) = b(K'_x, K_x) \cdot b(K'_y, K_y) \tag{5}$$
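To make the sampling concrete, below is a minimal PyTorch sketch of the x-axis branch described by Eqs. (2)–(4). It is an illustrative re-implementation rather than the authors' released code: the module name `SnakeConvX`, the tanh-bounded offsets, and the 1 × 1 aggregation are our own simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnakeConvX(nn.Module):
    """Sketch of the x-axis branch of DSConv (Eqs. (2) and (4))."""

    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        self.k = kernel_size
        # One learned y-offset Δy per kernel position and pixel (Eq. (2)).
        self.offset_head = nn.Conv2d(channels, kernel_size, 3, padding=1)
        # 1x1 aggregation over the k sampled feature maps.
        self.aggregate = nn.Conv2d(channels * kernel_size, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        half = self.k // 2
        # Bounded offsets keep the receptive field from straying off the target.
        dy = torch.tanh(self.offset_head(x))                  # (b, k, h, w)
        # Accumulate offsets outward from the kernel centre so that
        # y_{i±c} = y_i + Σ Δy as in Eq. (2); the centre gets offset 0.
        dy_cum = torch.cumsum(dy, dim=1)
        dy_cum = dy_cum - dy_cum[:, half:half + 1]

        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=x.dtype, device=x.device),
            torch.arange(w, dtype=x.dtype, device=x.device),
            indexing="ij",
        )
        samples = []
        for j in range(self.k):
            c_shift = float(j - half)                         # horizontal shift c
            gx = (xs + c_shift).clamp(0, w - 1).unsqueeze(0).expand(b, h, w)
            gy = (ys.unsqueeze(0) + dy_cum[:, j]).clamp(0, h - 1)
            # Normalise to [-1, 1]; grid_sample applies the bilinear kernel B
            # of Eq. (4) at the fractional coordinates.
            grid = torch.stack([gx * 2 / (w - 1) - 1, gy * 2 / (h - 1) - 1], dim=-1)
            samples.append(F.grid_sample(x, grid, align_corners=True))
        return self.aggregate(torch.cat(samples, dim=1))
```

The y-axis branch is symmetric, swapping the roles of the two coordinates and learning x-offsets instead.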
Figure 3. The semantic embedding boundary features generated by the multi-scale boundary module
are subjected to CAAM to achieve more accurate contextual aggregation.
$$F(i, j) = \frac{\exp\left(B_i \cdot A_j^{T}\right)}{\sum_{p=1}^{N} \exp\left(B_i \cdot A_p^{T}\right)} \tag{6}$$
where F(i, j) represents the association between the i-th position of boundary feature map B and the j-th position of semantic feature map A.
Using the obtained boundary semantic similarity, we further compute the enhanced
feature D, given by the formula:
$$D_j = A1_j + \sum_{i=1}^{N} F(i, j) \cdot A2_i \tag{7}$$
Considering that boundary regions of the same class are typically assigned higher
weights, this formula can be approximated as:
$$D_j = A1_j + \frac{F(\mathrm{boundary}, j)}{Z} \cdot A2_i \tag{8}$$
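As a concrete illustration, the following PyTorch sketch computes the attention of Eq. (6) and the aggregation of Eq. (7). The 1 × 1 projection layers and the use of the raw semantic map as the A1 identity path are our own assumptions, not details specified in the paper.

```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    """Sketch of boundary-guided context aggregation (Eqs. (6)-(7))."""

    def __init__(self, channels: int, embed_dim: int = 64):
        super().__init__()
        # 1x1 projections producing B (boundary) and A, A2 (semantic); assumed.
        self.proj_b = nn.Conv2d(channels, embed_dim, 1)
        self.proj_a = nn.Conv2d(channels, embed_dim, 1)
        self.proj_a2 = nn.Conv2d(channels, channels, 1)

    def forward(self, boundary: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        b, c, h, w = semantic.shape
        B = self.proj_b(boundary).flatten(2).transpose(1, 2)    # (b, n, d)
        A = self.proj_a(semantic).flatten(2).transpose(1, 2)    # (b, n, d)
        A2 = self.proj_a2(semantic).flatten(2).transpose(1, 2)  # (b, n, c)
        # Eq. (6): F(i, j) = exp(B_i · A_j^T) / Σ_p exp(B_i · A_p^T)
        attn = torch.softmax(B @ A.transpose(1, 2), dim=-1)     # (b, n, n)
        # Eq. (7): D_j = A1_j + Σ_i F(i, j) · A2_i, with A1 as the identity path.
        A1 = semantic.flatten(2).transpose(1, 2)                # (b, n, c)
        D = A1 + attn.transpose(1, 2) @ A2
        return D.transpose(1, 2).reshape(b, c, h, w)
```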
The significant role of CAAM in optimizing YOLOv8 cannot be underestimated.
Within YOLOv8, there may be a multitude of objects in the image, appearing in varying
scales, shapes, and positions. This necessitates the model’s ability to make full use of
contextual information within the image for more precise object detection and localization.
By introducing CAAM, YOLOv8 is better equipped to comprehend the relationships among
different objects within the image and their connections to the background during image
processing. This understanding of interrelatedness helps reduce false positives and false
negatives, thereby enhancing the accuracy and robustness of detection. Furthermore,
CAAM also enhances the detection of small-scale objects as it allows the model to better
leverage the contextual information of these objects in the image, thus improving detection
stability. Figure 3 illustrates the network architecture of CAAM.
3.4. Wise-IoU
Wise-IoU introduces a novel Bounding Box Regression (BBR) loss function, with its
core principle being the utilization of a dynamic non-monotonic focusing mechanism to en-
hance object localization performance [17]. This approach emphasizes handling low-quality
and average-quality samples, rather than solely relying on high-quality ones, offering a
more comprehensive training guidance for the model. Compared to the conventional IoU
loss function, WIoU’s primary innovation lies in its dynamic non-monotonic focusing
mechanism. This mechanism employs the degree of anomaly instead of IoU to assess the
quality of anchor boxes, determining the attention level for each sample. Such an evalua-
tion method enables a more accurate distinction among high-, medium-, and low-quality
samples, allocating appropriate gradients to them. This strategy not only reduces the
competitiveness of high-quality anchor boxes but also diminishes the detrimental gradient
generated by low-quality samples. Next, we introduce the mathematical derivation process
of Wise-IoU:
First, we considered a version of IoU based on the distance between the center points
of the anchor box and the ground truth box, called Distance IoU (DIoU). It is defined as
$$\mathrm{DIoU} = \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_{gt}^2 + H_{gt}^2} \tag{9}$$
where $x, y$ are the coordinates of the center point of the anchor box, while $x_{gt}, y_{gt}$ are the coordinates of the center point of the ground truth box. $W_{gt}$ and $H_{gt}$ represent the width and height of the ground truth box, respectively.
Further, to consider the offset between the center points of the anchor box and the
ground truth box, we introduced the Enhanced IoU (EIoU):
$$\mathrm{EIoU} = \mathrm{DIoU} + \frac{(x - x_{gt})^2}{W_{gt}^2} + \frac{(y - y_{gt})^2}{H_{gt}^2} \tag{10}$$
Additionally, in order to comprehensively consider the consistency of the box size and
the aspect ratio, we proposed the Complete IoU (CIoU).
Lastly, in order to further improve the quality assessment of the anchor box, we
proposed the third version of Wise-IoU, which allocates appropriate gradient gains for the
anchor box through a dynamic non-monotonic focal mechanism:
$$\mathcal{L}_{\mathrm{WIoUv3}} = r\,\mathcal{L}_{\mathrm{WIoUv1}}, \qquad r = \frac{\delta + e}{\delta} \tag{13}$$
where δ represents the degree of outlier, and e is a small constant to ensure that the
denominator is not zero.
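As a concrete reading of this derivation, the sketch below implements the centre-distance term of Eq. (9) and the focusing coefficient of Eq. (13) in PyTorch. The WIoU-v1 base term and the definition of the outlier degree δ are simplified stand-ins for the full formulation in [17], and all names are ours.

```python
import torch

def wise_iou_v3(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format."""
    # Plain IoU between predicted and ground-truth boxes.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Eq. (9): normalised squared distance between box centres.
    cx = (pred[:, 0] + pred[:, 2]) / 2
    cy = (pred[:, 1] + pred[:, 3]) / 2
    gx = (target[:, 0] + target[:, 2]) / 2
    gy = (target[:, 1] + target[:, 3]) / 2
    wg = target[:, 2] - target[:, 0]
    hg = target[:, 3] - target[:, 1]
    diou = ((cx - gx) ** 2 + (cy - gy) ** 2) / (wg ** 2 + hg ** 2 + eps)

    # Simplified WIoU-v1 base loss: a distance-attention term scaling the IoU loss.
    l_wiou_v1 = torch.exp(diou.detach()) * (1 - iou)

    # Eq. (13): dynamic non-monotonic focusing. We take the outlier degree
    # delta as the (detached) ratio of each sample's IoU loss to the batch mean.
    delta = ((1 - iou) / (1 - iou).mean().clamp(min=eps)).detach()
    r = (delta + eps) / delta.clamp(min=eps)  # eps plays the role of e in Eq. (13)
    return (r * l_wiou_v1).mean()
```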
Integrating Wise-IoU into YOLOv8 holds significant importance. YOLOv8 is renowned
for its speed and accuracy, and it can benefit even more from the refined gradient updates
provided by Wise-IoU. During the training phase, the dynamic non-monotonic focal mech-
anism of Wise-IoU ensures that anchor boxes receive appropriate gradient gains. This
mechanism helps reduce common issues faced when training deep neural networks for ob-
ject detection, such as gradient vanishing or gradient explosion. Moreover, by emphasizing
the quality of bounding box predictions, Wise-IoU assists YOLOv8 in handling complex
scenes with overlapping objects of varying sizes and aspect ratios. Integrating Wise-IoU
into YOLOv8 not only enhances the model’s performance but also sets a new benchmark
for future object detection frameworks.
4. Experiment
In the experimental section, we first introduced the characteristics of the two major
datasets, BDD100K and NEXET. Subsequently, we described the experimental environment
setup and detailed procedures, including data preprocessing, model selection, model train-
ing, and evaluation methods. Ultimately, we delved deeply into the performance results of
the model and validated the effectiveness of the algorithm through ablation studies.
4.1. Materials
4.1.1. Dataset
BDD100K is a large-scale driving scene dataset released by the Berkeley AI Re-
search Laboratory (BAIR) [44]. This dataset provides annotations for object detection
on 100,000 images, including vehicles, pedestrians, traffic signs, and more. To enhance
the dataset’s diversity, each image is annotated multiple times, each with different angles,
scales, transformations, and occlusions to simulate real-world visual scenes.
The NEXET dataset is released by the Nexar company, mainly for training and validat-
ing autonomous driving algorithms [45]. It comprises 500,000 images for vehicle detection,
with five categories. All images in the dataset come with detailed annotations, including
bounding boxes for vehicles, pedestrians, traffic signs, and other targets.
The emergence of datasets like BDD100K and NEXET provides valuable data support
for the development of intelligent transportation systems. While BDD100K is designed
to emulate real-world visual scenarios, NEXET offers a wealth of detailed annotations,
enabling training and validation of autonomous driving algorithms in more realistic and
challenging environments. These two datasets provide abundant training data for deep
learning models, assisting researchers and developers in further refining and optimizing
object detection techniques to adapt to the increasingly complex road traffic environment.
The experiments were conducted in an environment configured for deep learning tasks; these detailed configuration specifications are crucial for ensuring the accuracy and reproducibility of our experimental results.
Specifically, we initialized the model using the widely acknowledged pre-trained weights
from the COCO dataset. Subsequently, this model was meticulously fine-tuned on our
dedicated dataset. Throughout the training process, we adopted a batch size of 32 with an
initial learning rate of 0.01 and carried out a total of 1000 training epochs. Table 4 provides
an exhaustive list of the configuration parameters pertaining to YOLOv8-SnakeVision.
Moreover, we employed the PyTorch deep learning framework to facilitate the model’s
training, which furnishes comprehensive tools and libraries essential for constructing and
training neural networks. Algorithm 1 represents the algorithm flow of the training in
this paper.
Table 4. Configuration parameters of YOLOv8-SnakeVision.

| Parameters | Values |
| --- | --- |
| Epoch | 1000 |
| Learning rate | 0.01 |
| Image size | 640 |
| Batch size | 32 |
| Number of images | 25,000 |
| Layers | 168 |
| Parameters | 4,151,904 |
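As a minimal sketch of how this configuration maps onto the Ultralytics YOLOv8 training API, consider the call below. The dataset YAML path is a placeholder, and the DSConv/CAAM/Wise-IoU modifications are assumed to already be registered in the model definition; only the hyperparameters come from Table 4.

```python
from ultralytics import YOLO

# COCO-pretrained weights, then fine-tuning on the traffic dataset.
model = YOLO("yolov8n.pt")
model.train(
    data="bdd100k.yaml",  # placeholder dataset config
    epochs=1000,          # Table 4: Epoch
    batch=32,             # Table 4: Batch size
    imgsz=640,            # Table 4: Image size
    lr0=0.01,             # Table 4: initial learning rate
)
```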
Figure 4 displays the overall results of the model proposed in this study for road de-
tection. By comprehensively considering various performance metrics, our model demon-
strates outstanding performance in various aspects. Particularly noteworthy is that in
complex road scenarios, our model not only achieves high-precision object detection but
also maintains stable performance when dealing with blurry backgrounds and objects of
varying scales.
To evaluate the model, we first consider Accuracy, which measures the proportion of correctly classified samples:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$
where TP denotes the number of true positives, TN denotes the number of true negatives,
FP is the number of false positives, and FN is the number of false negatives.
Building upon accuracy, Precision specifically quantifies the model’s performance
concerning false positives:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$
where TP denotes the number of true positives and FP is the number of false positives.
Another essential metric, Recall, measures the model’s ability to correctly detect
positive instances:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$
where TP denotes the number of true positives and FN is the number of false negatives.
To strike a balance between precision and recall, the F1 Score acts as their harmonic mean:
$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{17}$$
where Precision is the precision of the model and Recall is the recall of the model.
Lastly, to account for multi-class detection tasks, the mean Average Precision (mAP)
calculates the mean of average precisions across all classes:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{18}$$
where $AP_i$ is the average precision for the $i$-th class and $N$ is the total number of classes.
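These five metrics translate directly into code. The following Python transcription of Eqs. (14)–(18) assumes the confusion-matrix counts and per-class average precisions have already been computed upstream.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)        # Eq. (14)

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)                          # Eq. (15)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)                          # Eq. (16)

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)                     # Eq. (17)

def mean_average_precision(ap_per_class: list[float]) -> float:
    return sum(ap_per_class) / len(ap_per_class)   # Eq. (18)
```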
4.3. Results
As shown in Table 5 and Figure 5, the performance comparison experiments between
YOLOv8-SnakeVision and several other well-known object detection algorithms on two
benchmark datasets, BDD100K and NEXET, are presented. The evaluation metrics em-
ployed include mAP50 (Mean Average Precision at IoU = 0.5), mAP50-95 (Mean Average
Precision across IoU thresholds from 0.5 to 0.95), APs (Average Precision for small objects),
and APm (Average Precision for medium-sized objects). First and foremost, YOLOv8-
SnakeVision exhibits significant superiority over other algorithms on the BDD100K dataset.
It outperforms other algorithms with remarkable scores in mAP50, mAP50-95, APs, and
APm, reaching 0.63, 0.44, 0.19, and 0.37, respectively, far exceeding the performance of
other algorithms. Furthermore, the same algorithm excels on the NEXET dataset as well. It
achieves outstanding scores in mAP50, mAP50-95, APs, and APm, with values of 0.69, 0.50,
0.27, and 0.43, respectively, once again outpacing other algorithms. It is worth noting that
YOLOv8-SnakeVision particularly excels in small object detection (APs) and medium-sized
object detection (APm), demonstrating its exceptional performance in complex scenarios.
This comparison vividly underscores the outstanding performance of YOLOv8-SnakeVision
in the field of object detection. In summary, YOLOv8-SnakeVision demonstrates excep-
tional performance in object detection within traffic scenes, whether it be in the detection of
vehicles, pedestrians, traffic signs, or bicycles, establishing a clear advantage. It contributes
significantly to achieving safer and more efficient urban transportation.
Table 5. Comparative experimental results between YOLOv8-SnakeVision and other object detection algorithms. Bold represents the best result.

| Model | BDD100K mAP50 | mAP50-95 | APs | APm | NEXET mAP50 | mAP50-95 | APs | APm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5n [35] | 0.51 | 0.284 | 0.095 | 0.279 | 0.628 | 0.435 | 0.106 | 0.365 |
| SSD [46] | 0.528 | 0.279 | 0.098 | 0.283 | 0.632 | 0.462 | 0.11 | 0.382 |
| YOLOv7-Tiny [47] | 0.547 | 0.316 | 0.112 | 0.298 | 0.643 | 0.453 | 0.125 | 0.379 |
| EfficientDet [15] | 0.535 | 0.355 | 0.115 | 0.295 | 0.655 | 0.458 | 0.126 | 0.382 |
| VitDet [48] | 0.545 | 0.345 | 0.135 | 0.312 | 0.645 | 0.488 | 0.133 | 0.425 |
| RTMet [49] | 0.618 | 0.375 | 0.125 | 0.325 | 0.643 | 0.476 | 0.145 | 0.425 |
| YOLOv8n [16] | 0.620 | 0.388 | 0.110 | 0.335 | 0.672 | 0.482 | 0.189 | **0.433** |
| YOLOv8-SnakeVision | **0.635** | **0.449** | **0.195** | **0.371** | **0.687** | **0.495** | **0.266** | 0.432 |
As shown in Table 6, YOLOv8-SnakeVision leads on the BDD100K dataset with a Precision of 0.743 and an F1 Score of 0.611, and it performs exceptionally well in Frames Per Second (FPS), processing image frames at speeds of up to 68 FPS, showcasing outstanding performance in real-time applications. On
the NEXET dataset, YOLOv8-SnakeVision similarly leads in Precision, achieving a high
Precision score of 0.753. Although its Recall value is relatively lower, with a recall rate
of 0.531, it still falls within a relatively high range, indicating its capability to effectively
identify relevant target instances. Additionally, an F1 Score of 0.625 shows that the model
maintains a reasonable balance between precision and recall on this dataset. Similar to the
BDD100K dataset, YOLOv8-SnakeVision also performs well in Frames Per Second (FPS) on
the NEXET dataset, achieving speeds of 67 FPS, making it suitable for real-time applications
with high demands.
Table 6. Comparative experimental results between YOLOv8-SnakeVision and other object detection algorithms. Bold represents the best result.

| Model | BDD100K Precision | Recall | F1 Score | FPS | NEXET Precision | Recall | F1 Score | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5n [35] | 0.582 | 0.512 | 0.546 | 45 | 0.633 | 0.530 | 0.578 | 52 |
| SSD [46] | 0.611 | **0.525** | 0.567 | 52 | 0.631 | 0.512 | 0.565 | 53 |
| YOLOv7-Tiny [47] | 0.511 | 0.514 | 0.512 | 48 | 0.608 | 0.513 | 0.557 | 62 |
| EfficientDet [15] | 0.532 | **0.525** | 0.529 | 55 | 0.665 | 0.508 | 0.577 | 54 |
| VitDet [48] | 0.723 | 0.512 | 0.601 | 60 | 0.705 | 0.529 | 0.604 | 55 |
| RTMet [49] | 0.715 | 0.507 | 0.595 | 59 | 0.734 | 0.506 | 0.602 | 60 |
| YOLOv8n [16] | 0.724 | 0.515 | 0.604 | 57 | 0.722 | **0.532** | 0.614 | 62 |
| YOLOv8-SnakeVision | **0.743** | 0.520 | **0.611** | **68** | **0.753** | 0.531 | **0.625** | **67** |
Table 7. Comparison of Model Parameters (PARAMS) and Floating Point Operations (FLOPs) on BDD100K and NEXET datasets.

| Model | BDD100K PARAMS | FLOPs | NEXET PARAMS | FLOPs |
| --- | --- | --- | --- | --- |
| YOLOv5n [35] | 2.85 M | 4.5 B | 2.33 M | 4.1 B |
| SSD [46] | 1.77 M | 3.2 B | 1.45 M | 3.2 B |
| YOLOv7-Tiny [47] | 5.12 M | 9.5 B | 5.07 M | 9.2 B |
| EfficientDet [15] | 6.53 M | 9.5 B | 6.03 M | 8.7 B |
| VitDet [48] | 13.13 M | 19.8 B | 12.98 M | 18.6 B |
| RTMet [49] | 11.95 M | 17.5 B | 10.7 M | 15.5 B |
| YOLOv8n [16] | 4.35 M | 8.5 B | 4.15 M | 8.3 B |
| YOLOv8-SnakeVision | 4.15 M | 8.2 B | 4.12 M | 8.1 B |
As Table 7 shows, YOLOv8-SnakeVision requires fewer parameters and FLOPs than the YOLOv8n baseline yet maintains its remarkable performance, providing an efficient solution for target detection in various applications.
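As a hedged illustration of how complexity figures such as those in Table 7 can be reproduced, the sketch below counts parameters with plain PyTorch and estimates FLOPs with the third-party thop profiler; the paper does not state its measurement tool, so both the profiler and the 1 × 3 × 640 × 640 input (the image size from Table 4) are our assumptions.

```python
import torch
from thop import profile  # third-party profiler: pip install thop

def complexity(model: torch.nn.Module) -> tuple[int, float]:
    # Total trainable and non-trainable parameter count (PARAMS column).
    params = sum(p.numel() for p in model.parameters())
    dummy = torch.randn(1, 3, 640, 640)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    # FLOPs are conventionally reported as twice the multiply-accumulate count.
    return params, 2 * macs
```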
5. Conclusions
Our experimental results have fully demonstrated the effectiveness and innovation of
the YOLOv8-SnakeVision model in the field of object detection for intelligent traffic systems.
By introducing DSConv, CAAM, and the Wise-IoU strategy into the YOLOv8 framework,
we have successfully achieved significant improvements in accurately capturing complex
object shapes and features, even in scenarios involving occlusion, overlap, small objects,
and intricate patterns. The adaptability of this model to diverse road traffic situations
makes it a valuable addition to the toolbox of intelligent traffic systems.
However, we must also acknowledge that the model still has some limitations. Firstly,
target detection under extreme weather conditions remains a challenge, and this is one of
the issues that need to be addressed in the future. In this regard, we suggest that future
research focus on the use of multi-modal sensor fusion methods to enhance the model’s
robustness in target detection under conditions such as rain, snow, and heavy fog. By
integrating information from multiple sources such as radar, infrared, and cameras, the
model can have a more comprehensive perception of the environment, thereby improving
the robustness of target detection. Secondly, issues such as changes in the orientation of
objects, viewing angles, and lighting conditions may significantly impact detection accuracy.
For this reason, future research could explore advanced visual attention mechanisms
and lighting invariance techniques. Attention mechanisms help the model focus more
specifically on critical areas, improving its ability to perceive details of the target. At
the same time, lighting invariance techniques can assist the model in better adapting
to different lighting conditions, mitigating the negative impact of these variations on
detection performance. In conclusion, this research is of great practical significance for
improving urban traffic safety and efficiency, reducing traffic accidents, and alleviating
traffic congestion. We believe that through continuous efforts and research, we can further
advance the development of intelligent traffic systems, bringing more convenience and
safety to future urban transportation.
Author Contributions: Q.L. contributed to the conception and design of the study and completed
various sections of the manuscript. Y.L. conducted the statistical analysis, and D.L. revised the
manuscript, read, and approved the submitted version. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zhu, Y.; Yan, W.Q. Traffic sign recognition based on deep learning. Multimed. Tools Appl. 2022, 81, 17779–17791. [CrossRef]
2. Du, W.; Chen, L.; Wang, H.; Shan, Z.; Zhou, Z.; Li, W.; Wang, Y. Deciphering urban traffic impacts on air quality by deep learning
and emission inventory. J. Environ. Sci. 2023, 124, 745–757. [CrossRef]
3. Fakhrurroja, H.; Pramesti, D.; Hidayatullah, A.R.; Fashihullisan, A.A.; Bangkit, H.; Ismail, N. Automated License Plate Detection
and Recognition using YOLOv8 and OCR With Tello Drone Camera. In Proceedings of the 2023 International Conference on
Computer, Control, Informatics and its Applications (IC3INA), Bandung, Indonesia, 4–5 October 2023; IEEE: Piscataway, NJ,
USA, 2023; pp. 206–211.
4. Yang, Z.; Zhang, W.; Feng, J. Predicting multiple types of traffic accident severity with explanations: A multi-task deep learning
framework. Saf. Sci. 2022, 146, 105522. [CrossRef]
5. Hameed, A.; Violos, J.; Leivadeas, A. A deep learning approach for IoT traffic multi-classification in a smart-city scenario. IEEE
Access 2022, 10, 21193–21210. [CrossRef]
6. Babbar, S.; Bedi, J. Real-time traffic, accident, and potholes detection by deep learning techniques: A modern approach for traffic
management. Neural Comput. Appl. 2023, 35, 19465–19479. [CrossRef]
7. Zhang, Y.; Zhao, T.; Gao, S.; Raubal, M. Incorporating multimodal context information into traffic speed forecasting through
graph deep learning. Int. J. Geogr. Inf. Sci. 2023, 37, 1909–1935. [CrossRef]
8. Sattar, K.; Chikh Oughali, F.; Assi, K.; Ratrout, N.; Jamal, A.; Masiur Rahman, S. Transparent deep machine learning framework
for predicting traffic crash severity. Neural Comput. Appl. 2023, 35, 1535–1547. [CrossRef]
9. Bisio, I.; Garibotto, C.; Haleem, H.; Lavagetto, F.; Sciarrone, A. A systematic review of drone based road traffic monitoring system.
IEEE Access 2022, 10, 101537–101555. [CrossRef]
10. Ortataş, F.N.; Kaya, M. Performance Evaluation of YOLOv5, YOLOv7, and YOLOv8 Models in Traffic Sign Detection. In
Proceedings of the 2023 8th International Conference on Computer Science and Engineering (UBMK), Burdur, Turkiye, 13–15
September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 151–156.
11. Huangfu, Z.; Li, S. Lightweight You Only Look Once v8: An Upgraded You Only Look Once v8 Algorithm for Small Object
Identification in Unmanned Aerial Vehicle Images. Appl. Sci. 2023, 13, 12369. [CrossRef]
12. Shokri, D.; Larouche, C.; Homayouni, S. A Comparative Analysis of Multi-Label Deep Learning Classifiers for Real-Time Vehicle
Detection to Support Intelligent Transportation Systems. Smart Cities 2023, 6, 2982–3004. [CrossRef]
13. Iftikhar, S.; Asim, M.; Zhang, Z.; Muthanna, A.; Chen, J.; El-Affendi, M.; Sedik, A.; Abd El-Latif, A.A. Target Detection and
Recognition for Traffic Congestion in Smart Cities Using Deep Learning-Enabled UAVs: A Review and Analysis. Appl. Sci. 2023,
13, 3995. [CrossRef]
14. Wei, H.; Zhang, Q.; Qin, Y.; Li, X.; Qian, Y. YOLOF-F: You only look one-level feature fusion for traffic sign detection. Vis. Comput.
2023, 1–14. [CrossRef]
15. Gupta, M.; Miglani, H.; Deo, P.; Barhatte, A. Real-time traffic control and monitoring. e-Prime-Adv. Electr. Eng. Electron. Energy
2023, 5, 100211. [CrossRef]
16. Aboah, A.; Wang, B.; Bagci, U.; Adu-Gyamfi, Y. Real-time multi-class helmet violation detection using few-shot data sampling
technique and yolov8. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC,
Canada, 18–22 June 2023; pp. 5349–5357.
17. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023,
arXiv:2301.10051.
18. Mittal, U.; Chawla, P.; Tiwari, R. EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on
faster R-CNN and YOLO models. Neural Comput. Appl. 2023, 35, 4755–4774. [CrossRef]
19. Li, X.; Xie, Z.; Deng, X.; Wu, Y.; Pi, Y. Traffic sign detection based on improved faster R-CNN for autonomous driving. J.
Supercomput. 2022, 78, 7982–8002. [CrossRef]
20. Ghahremannezhad, H.; Shi, H.; Liu, C. Object Detection in Traffic Videos: A Survey. IEEE Trans. Intell. Transp. Syst. 2023, 24,
6780–6799. [CrossRef]
21. Arora, N.; Kumar, Y.; Karkra, R.; Kumar, M. Automatic vehicle detection system in different environment conditions using fast
R-CNN. Multimed. Tools Appl. 2022, 81, 18715–18735. [CrossRef]
22. Fang, S.; Zhang, B.; Hu, J. Improved mask R-CNN multi-target detection and segmentation for autonomous driving in complex
scenes. Sensors 2023, 23, 3853. [CrossRef]
23. Sun, Y.; Su, L.; Luo, Y.; Meng, H.; Li, W.; Zhang, Z.; Wang, P.; Zhang, W. Global Mask R-CNN for marine ship instance
segmentation. Neurocomputing 2022, 480, 257–270. [CrossRef]
24. He, D.; Qiu, Y.; Miao, J.; Zou, Z.; Li, K.; Ren, C.; Shen, G. Improved Mask R-CNN for obstacle detection of rail transit. Measurement
2022, 190, 110728. [CrossRef]
25. Qiu, Z.; Bai, H.; Chen, T. Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones
2023, 7, 117. [CrossRef]
26. Zhang, Y.; Sun, Y.; Wang, Z.; Jiang, Y. YOLOv7-RAR for Urban Vehicle Detection. Sensors 2023, 23, 1801. [CrossRef] [PubMed]
27. Othmani, M. A vehicle detection and tracking method for traffic video based on faster R-CNN. Multimed. Tools Appl. 2022,
81, 28347–28365. [CrossRef]
28. Varesko, L.; Oreski, G. Performance comparison of novel object detection models on traffic data. In Proceedings of the 2023 8th
International Conference on Machine Learning Technologies, Stockholm, Sweden, 10–12 March 2023; pp. 177–184.
29. Soylu, E.; Soylu, T. A performance comparison of YOLOv8 models for traffic sign detection in the Robotaxi-full scale autonomous
vehicle competition. Multimed. Tools Appl. 2023, 1–31. [CrossRef]
30. Chen, J.; Hong, H.; Song, B.; Guo, J.; Chen, C.; Xu, J. MDCT: Multi-Kernel Dilated Convolution and Transformer for One-Stage
Object Detection of Remote Sensing Images. Remote Sens. 2023, 15, 371. [CrossRef]
31. Zou, H.; Zhan, H.; Zhang, L. Neural Network Based on Multi-Scale Saliency Fusion for Traffic Signs Detection. Sustainability
2022, 14, 16491. [CrossRef]
32. Taouqi, I.; Klilou, A.; Chaji, K.; Arsalane, A. Yolov2 Implementation and Optimization for Moroccan Traffic Sign Detection.
In Proceedings of the International Conference on Artificial Intelligence and Smart Environment, Errachidia, Morocco, 24–26
November 2022; Springer: Cham, Switzerland, 2022; pp. 837–843.
33. Guillermo, M.; Francisco, K.; Concepcion, R.; Fernando, A.; Bandala, A.; Vicerra, R.R.; Dadios, E. A Comparative Study on
Satellite Image Analysis for Road Traffic Detection using YOLOv3-SPP, Keras RetinaNet and Full Convolutional Network. In
Proceedings of the 2023 8th International Conference on Business and Industrial Research (ICBIR), Bangkok, Thailand, 18–19 May
2023; IEEE: Piscataway, NJ, USA, 2023; pp. 578–584.
34. Li, Y.; Li, J.; Meng, P. Attention-YOLOV4: A real-time and high-accurate traffic sign detection algorithm. Multimed. Tools Appl.
2023, 82, 7567–7582. [CrossRef]
35. Chen, X. Traffic Lights Detection Method Based on the Improved YOLOv5 Network. In Proceedings of the 2022 IEEE 4th
International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 12–14 October 2022; IEEE:
Piscataway, NJ, USA, 2022; pp. 1111–1114.
36. Tarun, R.; Esther, B.P. Traffic Anomaly Alert Model to Assist ADAS Feature based on Road Sign Detection in Edge Devices.
In Proceedings of the 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC),
Coimbatore, India, 6–8 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 824–828.
37. Krishnendhu, S.; Mohandas, P. SAD: Sensor-based Anomaly Detection System for Smart Junctions. IEEE Sens. J. 2023, 23,
20368–20378.
38. Xia, J.; Li, M.; Liu, W.; Chen, X. DSRA-DETR: An Improved DETR for Multiscale Traffic Sign Detection. Sustainability 2023,
15, 10862. [CrossRef]
39. Liu, X.; Zhang, B.; Liu, N. CAST-YOLO: An Improved YOLO Based on a Cross-Attention Strategy Transformer for Foggy Weather
Adaptive Detection. Appl. Sci. 2023, 13, 1176. [CrossRef]
40. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular
Structure Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6
October 2023; pp. 6070–6079.
41. He, L.; Wang, M. SliceSamp: A Promising Downsampling Alternative for Retaining Information in a Neural Network. Appl. Sci.
2023, 13, 11657. [CrossRef]
42. Liu, Z.; Li, J.; Song, R.; Wu, C.; Liu, W.; Li, Z.; Li, Y. Edge Guided Context Aggregation Network for Semantic Segmentation of
Remote Sensing Imagery. Remote Sens. 2022, 14, 1353. [CrossRef]
43. Ma, H.; Yang, H.; Huang, D. Boundary guided context aggregation for semantic segmentation. arXiv 2021, arXiv:2110.14587.
44. Huang, K.; Lertniphonphan, K.; Chen, F.; Li, J.; Wang, Z. Multi-Object Tracking by Self-Supervised Learning Appearance Model.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June
2023; pp. 3162–3168.
45. Unal, D.; Catak, F.O.; Houkan, M.T.; Mudassir, M.; Hammoudeh, M. Towards robust autonomous driving systems through
adversarial test set generation. ISA Trans. 2023, 132, 69–79. [CrossRef]
46. Chen, Z.; Guo, H.; Yang, J.; Jiao, H.; Feng, Z.; Chen, L.; Gao, T. Fast vehicle detection algorithm in traffic scene based on improved
SSD. Measurement 2022, 201, 111655. [CrossRef]
47. Li, S.; Wang, S.; Wang, P. A small object detection algorithm for traffic signs based on improved YOLOv7. Sensors 2023, 23, 7145.
[CrossRef] [PubMed]
48. Fang, Z.; Zhang, T.; Fan, X. A ViTDet based dual-source fusion object detection method of UAV. In Proceedings of the 2022
International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), Xi’an, China, 28–30 October
2022; IEEE: Piscataway, NJ, USA, 2022; pp. 628–633.
49. Chen, S.; Sun, P.; Song, Y.; Luo, P. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19830–19843.