
sensors

Article
VV-YOLO: A Vehicle View Object Detection Model Based on
Improved YOLOv4
Yinan Wang 1, Yingzhou Guan 1, Hanxu Liu 1, Lisheng Jin 2, Xinwei Li 2, Baicang Guo 2 and Zhe Zhang 2,*

1 China FAW Corporation Limited, Global R&D Center, Changchun 130013, China
2 School of Vehicle and Energy, Yanshan University, Qinhuangdao 066000, China
* Correspondence: zhangzhe@stumail.ysu.edu.cn

Abstract: Vehicle view object detection technology is the key to the environment perception modules
of autonomous vehicles, which is crucial for driving safety. In view of the characteristics of complex
scenes, such as dim light, occlusion, and long distance, an improved YOLOv4-based vehicle view
object detection model, VV-YOLO, is proposed in this paper. The VV-YOLO model adopts an
anchor-based implementation. For anchor box clustering, an improved K-means++ algorithm is
used to reduce the instability of clustering results caused by the random selection of cluster
centers, so that the model can obtain reasonable initial anchor boxes. Firstly, the CA-PAN network
was designed by adding a coordinate attention mechanism to the neck network of the VV-YOLO
model, realizing multidimensional modeling of image feature channel relationships and improving
the extraction of complex image features. Secondly, in order to ensure sufficient model training, the
loss function of the VV-YOLO model was reconstructed based on the focal loss function, which
alleviates the training imbalance caused by the unbalanced distribution of the training data. Finally,
the KITTI dataset was selected as the test set to conduct the index quantification experiment. The
results showed that the precision and average precision of the VV-YOLO model were 90.68% and
80.01%, respectively, which were 6.88% and 3.44% higher than those of the YOLOv4 model, and the
model’s calculation time on the same hardware platform did not increase significantly. In addition to
testing on the KITTI dataset, we also selected the BDD100K dataset and typical complex traffic scene
data collected in the field to conduct a visual comparison test of the results, and then the validity and
robustness of the VV-YOLO model were verified.
Keywords: object detection; deep learning; vehicle view; YOLOv4; network optimization
1. Introduction
As a key technology that can effectively alleviate typical traffic problems and improve
traffic safety, intelligent transportation systems have been developed extensively around the
world [1,2]. The large-scale application of autonomous driving technology has become
an inevitable choice for the development of modern transportation [3]. Environmental
awareness technology is the key to realizing autonomous driving and the basis for subsequent
path planning and decision control of autonomous vehicles. As an important branch
of environmental perception technology, object detection from the vehicle perspective is
tasked with predicting the position, size, and category of objects in the area of interest
in front of the vehicle [4], which directly affects the performance of the environmental
perception system of autonomous vehicles.

In terms of sensors used for vehicle-mounted visual-angle object detection, visual
sensors have become the most widely used sensors for object detection due to their ability to obtain
abundant traffic information, low cost, easy installation, and high stability [5–7]. With the
continuous development of hardware systems, such as graphics cards and computing units,
object detection based on deep learning is the mainstream of current research [8,9]. With
its advantages of high robustness and good portability, object detection of four-wheeled
vehicles, two-wheeled vehicles, and pedestrians has been realized in many scenes.
In the field of object detection, deep learning-based object detection models can
be divided into two categories, single-stage and two-stage, according to their implementation logic.
The two-stage object detection model is usually composed of two parts: region of interest
generation and candidate box regression. The R-CNN series [10–13] model, R-FCN [14],
SPP [15], and other structures are the representatives of the two-stage object detection
model. The two-stage object detection model has made a great breakthrough in precision
performance, but it is difficult to use in embedded platforms with insufficient computing
power, such as roadside units and domain controllers, which also promotes the birth of
the single-stage object detection model. The single-stage object detection model treats
the object detection task as a regression problem. By designing the network structure
of the end-to-end mode, the feature extraction of the input image is carried out directly,
and the prediction results are output. Early single-stage object detection models mainly
include YOLO [16] and SSD [17]. Such models have great advantages in inference speed,
but their detection precision is lower than that of the two-stage model. Due to this, the
balance between detection precision and inference speed has become the focus of single-
stage object detection model research and achieved rapid development in recent years.
Excellent models, such as RetinaNet [18], YOLOv4 [19], CornerNet [20], and YOLOv7 [21],
have emerged.
Table 1 shows the representative work in the field of vehicle-view object detection in
recent years. Although these studies can solve the problem of object detection in complex
vehicle-view scenes to a certain extent, they usually need to introduce additional large
modules, such as the GAN [22] network and its variants, or just study a single object, such
as pedestrians or vehicles. However, an autonomous vehicle needs to attend to three
object classes—four-wheeled vehicles, two-wheeled vehicles and pedestrians—from the onboard
perspective at the same time, and the computing power of its onboard platform is limited,
so precision and real-time performance are difficult to balance simultaneously.

Table 1. Summary of the literature survey on vehicle view object detection models.

Year | Title | Method | Limitation | Reference
2021 | Vehicle Detection and Tracking in Adverse Weather Using a Deep Learning Framework | A visual enhancement mechanism was proposed and introduced into the YOLOv3 model to realize vehicle detection in snowy, foggy, and other scenarios. | Larger modules are introduced, and only vehicle objects are considered. | [23]
2021 | GAN-Based Day-to-Night Image Style Transfer for Nighttime Vehicle Detection | The AugGAN network was proposed to enhance vehicle targets in dark-light images, and the data generated by this strategy were used to train Faster R-CNN and YOLO, which improved the performance of the object detection model under dark-light conditions. | GAN networks are introduced, multiple models need to be trained, and only vehicle objects are considered. | [24]
2022 | SA-YOLOv3: An Efficient and Accurate Object Detector Using Self-Attention Mechanism for Autonomous Driving | The SA-YOLOv3 model is proposed, in which dilated convolution and a self-attention module (SAM) are introduced into YOLOv3, and the GIoU loss function is introduced during training. | There are few test scenarios to validate the model. | [25]
2022 | Feature Calibration Network for Occluded Pedestrian Detection | A fusion module of SA and FC features is designed, and FC-Net is further proposed to realize pedestrian detection in occlusion scenes. | Only pedestrian targets are considered, and there are few verification scenarios. | [26]
2023 | R-YOLO: A Robust Object Detector in Adverse Weather | QTNet and FCNet adaptive networks were proposed to learn image features without labels and applied to YOLOv3, YOLOv5 and YOLOX, which improved the precision of object detection in foggy scenarios. | With the introduction of additional large networks, multiple models need to be trained. | [27]

Inspired by the above research results and remaining problems, this paper proposes a
vehicle view object detection model, VV-YOLO, based on improved YOLOv4. This model
adopts the end-to-end design idea and optimizes the YOLOv4 benchmark model from
three aspects: anchor frame clustering algorithm, loss function and neck network. Firstly,
the improved K-means++ [28] algorithm is used to achieve more accurate and stable anchor
box clustering on the experimental dataset, which is a prerequisite for an anchor-based detection
model to achieve strong performance. Secondly, the focal loss [18] function is introduced in
model training to improve the model's feature extraction for objects of interest in complex
scenes. Finally, combined with the coordinate attention module [29], the CA-PAN neck network
is proposed to model the channel relationships of image features, which greatly improves the
model's attention to the region of interest.

2. Related Works
2.1. Structure of the YOLOv4 Model
In 2020, Alexey Bochkovskiy et al. [30] improved YOLOv3 with a lot of clever optimiza-
tion ideas and then proposed YOLOv4. Figure 1 shows its network structure. The design
idea of YOLOv4 is consistent with that of YOLO. It is also a single-stage model, which can
be divided into three parts: backbone network, neck network and detection network. The
backbone network is called CSPDarkNet53 [19]. Different from the DarkNet53 [30] used in
YOLOv3, it uses a cross-stage hierarchical structure for network connection, which reduces
the amount of computation and ensures the feature extraction effect. The neck network of
YOLOv4 was constructed using the PAN [31] path aggregation network, which improved
the fusion effect of multilevel features compared to the FPN [32] feature pyramid network.
In addition, YOLOv4 also uses the SPP network in front of the neck network to enrich
the receptive field of image features. After the output features of the neck network are
obtained, the input features are decoded by the prediction head of three scales to realize
the perception of the large, medium and small-scale objects.
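As a quick sanity check on the three prediction scales mentioned above, the snippet below computes the grid sizes and raw output shapes for a 608 × 608 input, assuming the usual YOLO-style head layout of 3 anchors per cell and (4 box offsets + 1 objectness + num_classes) channels per anchor; the exact tensor layout of the authors' implementation may differ.

```python
def head_output_shapes(input_size=608, num_classes=3, anchors_per_cell=3, strides=(32, 16, 8)):
    """Grid sizes and output shapes for the three YOLO detection scales."""
    shapes = []
    for stride in strides:
        grid = input_size // stride                      # 19, 38, 76 for a 608x608 input
        channels = anchors_per_cell * (5 + num_classes)  # 4 box offsets + objectness + classes
        shapes.append((grid, grid, channels))
    return shapes

# Three classes (vehicle, pedestrian, cyclist), as used later in this paper.
for shape in head_output_shapes():
    print(shape)
```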
YOLOv4 still applies the prior box strategy and batch normalization from
YOLOv2 [33] to keep the model training parameters well regularized. Meanwhile, the
Mish [34] activation function was introduced in YOLOv4 to make the training gradient
descent smoother; compared with the ReLU [35] activation function, it reduces the possibility of
the loss falling into a local minimum. In addition, YOLOv4 also uses Mosaic [19]
data augmentation and DropBlock [36] regularization to reduce overfitting of the model.
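For reference, Mish has the closed form x · tanh(softplus(x)); the snippet below simply evaluates it for a few inputs as a minimal illustration (our own helper function, not tied to the VV-YOLO code base).

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)), a smooth alternative to ReLU."""
    return x * np.tanh(np.log1p(np.exp(x)))

# A few sample activations around zero; note the smooth, non-monotonic negative tail.
print(mish(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
```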

2.2. Loss Function of the YOLOv4 Model


The loss function of YOLOv4 is composed of regression loss, confidence loss and
classification loss. Different from the functions adopted by other YOLO models, YOLOv4
uses the CIoU [37] function to construct the intersection-over-union loss term. It uses
the diagonal distance of the minimum enclosing box to formulate a penalty strategy that further
reduces the false detection rate of small-scale objects. However, in the class loss term,
the cross-entropy function is still adopted.
Figure 1. YOLOv4 model structure.

$$
\begin{aligned}
L = {} & \lambda_{coord}\sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\,(2 - w_i \times h_i)(1 - CIoU) \\
& - \sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{C}_i \log(C_i) + (1-\hat{C}_i)\log(1-C_i)\right] \\
& - \lambda_{noobj}\sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{noobj}\left[\hat{C}_i \log(C_i) + (1-\hat{C}_i)\log(1-C_i)\right] \\
& - \sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\sum_{c\in classes}\left[\hat{p}_i(c)\log(p_i(c)) + (1-\hat{p}_i(c))\log(1-p_i(c))\right]
\end{aligned}
\tag{1}
$$

In Equation (1), K × K represents the mesh size, which can be 19 × 19, 38 × 38 or
76 × 76. M represents the detection dimension, whose value is 3. λ_coord represents the
positive sample weight coefficient, whose value is generally 1. The values of I_ij^obj and I_ij^noobj
are either 0 or 1 and are used to judge whether the sample is positive or negative. Ĉ_i and C_i
represent the ground-truth and predicted values, respectively. (2 − w_i × h_i) is used to penalize
smaller prediction boxes, where w_i and h_i indicate the width and height of the prediction box,
respectively. The CIoU's equation is shown below.

$$
CIoU = IoU - \frac{\rho^{2}(b, b^{gt})}{c^{2}} - \beta\nu
\tag{2}
$$

In Equation (2), ρ²(b, b^gt) represents the Euclidean distance between the center points
of the prediction box and the real box, and c represents the diagonal length of the
minimum closure region that can contain both the prediction box and the real box. β is the
parameter measuring the consistency of the aspect ratio, and ν is the tradeoff parameter.
Their calculation equations are shown in Equations (3) and (4), respectively.

$$
\beta = \frac{\nu}{1 - IoU + \nu}
\tag{3}
$$

$$
\nu = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}
\tag{4}
$$
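To make the CIoU regression term concrete, the sketch below computes IoU and the CIoU value of Equations (2)–(4) for two axis-aligned boxes given as (x1, y1, x2, y2). It is a minimal illustration, not the authors' implementation; the function name and box format are our own choices.

```python
import math

def ciou(box_pred, box_gt):
    """CIoU between two (x1, y1, x2, y2) boxes, following Equations (2)-(4)."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt

    # Intersection and union for the plain IoU term.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + 1e-9)

    # Squared center distance rho^2(b, b_gt) and squared diagonal c^2 of the enclosing box.
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    gcx, gcy = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    cx1, cy1 = min(px1, gx1), min(py1, gy1)
    cx2, cy2 = max(px2, gx2), max(py2, gy2)
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + 1e-9

    # Aspect-ratio consistency nu (Equation (4)) and tradeoff beta (Equation (3)).
    nu = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1 + 1e-9))
                               - math.atan((px2 - px1) / (py2 - py1 + 1e-9))) ** 2
    beta = nu / (1 - iou + nu + 1e-9)

    return iou - rho2 / c2 - beta * nu

# Example: a predicted box shifted relative to the ground truth; 1 - ciou(...) is the loss term.
print(ciou((50, 50, 150, 150), (60, 60, 160, 170)))
```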

2.3. Discussion on YOLOv4 Model Detection Performance

As an advanced single-stage object detection model, YOLOv4 has a great advantage
over two-stage object detection models in detection speed. It can achieve a balance
between precision and speed in conventional scenarios and meet the basic requirements
of an automatic driving system. Figure 2 shows a typical scene from the vehicle-mounted
perspective. As can be seen from the figure, complex situations, such as dark light, occlusion
and long distance, are prone to occur under the vehicle-mounted perspective, and multiple
types of traffic targets are often included. In the face of such scenarios, the YOLOv4 model's
ability to learn and extract effective features of the target is reduced, often resulting in
missed detections and false detections. It can be seen that the problem that urgently
needs to be solved is object detection under the unfavorable conditions of the vehicle-view
angle. Therefore, starting with the model structure and training strategy, this paper uses
targeted design to improve the image feature modeling ability of the YOLOv4 model,
improve the learning and extraction of effective features in occlusion, dark light and other
scenes, and proposes the vehicle-mounted perspective object detection model VV-YOLO.

Figure 2. Typical scene from vehicle view.

3. Materials and Methods
3.1. Improvements to the Anchor Box Clustering Algorithm

For an object detection model based on anchor box regression, the anchor box sizes
are usually set by a clustering algorithm, and the YOLOv4 model uses the K-means
clustering algorithm [38]. First, all the original anchor boxes are randomly selected from
the real boxes; then the anchor boxes are adjusted by comparing the IoU of each original
anchor box with the real boxes to obtain new anchor box sizes. These steps are repeated
until the anchor boxes no longer change. According to the position relationship between
the anchor box and the bounding box in Figure 3, the formula for calculating the IoU can
be obtained, as shown in Equation (5).

Figure 3. Illustration of the IoU calculation.

$$
IoU = \frac{|Anchor\ box \cap Bounding\ box|}{|Anchor\ box \cup Bounding\ box|}
\tag{5}
$$

The clustering effect of the anchor boxes of the YOLOv4 model depends on the random
setting of the original anchor boxes, which introduces great uncertainty, cannot guarantee the
clustering effect, and usually requires multiple experiments to obtain the optimal anchor
box sizes. In order to avoid the bias and instability caused by the random setting of points,
the VV-YOLO model uses an improved K-means++ clustering algorithm for the anchor box
coordinate setting of the experimental data, and its implementation logic is shown in Figure 4.

Figure 4. Improved K-means++ algorithm logic.

The essential difference between the improved K-means++ algorithm and the K-means
algorithm lies in the initialization of the anchor box sizes and the method of anchor box
selection. The former first randomly initializes one real box as the original anchor box;
then, the difference value between each real box and the current anchor box is calculated,
and the difference value calculation formula is shown in Equation (6).

$$
d(box, centroid) = 1 - IoU(box, centroid)
\tag{6}
$$

In Equation (6), box represents the current anchor box; centroid represents a data sample;
IoU represents the intersection over union of the data sample and the current anchor box.

After the difference values are calculated, a new sample is selected as the next anchor
box using the roulette method until all anchor boxes are selected. The principle of
selection is that samples that differ significantly from the previous anchor box have a
higher probability of being selected as the next anchor box. The following mathematical
explanation is given for it:

Suppose the minimum difference values of the N samples to the anchor boxes are
{D_1, D_2, D_3, ..., D_N}; Equation (7) is then used to calculate the sum of the minimum
differences of the N samples to the current anchor boxes. Then, a value r not exceeding Sum
is randomly selected, Equation (8) is applied iteratively over the samples, the calculation stops
when r is less than 0, and the sample at which it stops gives the new anchor box size.

$$
Sum = D_1 + D_2 + \ldots + D_N
\tag{7}
$$

$$
r = r - D_i
\tag{8}
$$

Figure 5 shows the comparison of the average results of multiple clustering runs of K-means,
K-means++ and the improved K-means++ on the KITTI dataset [39]. The abscissa represents
the number of iterations of the clustering algorithm, and the ordinate represents the average
intersection over union (IoU) of the obtained anchor boxes with all real boxes. Figure 6 shows the
anchor box clustering results of the improved K-means++ algorithm. The results in the
above figures show that the improved K-means++ algorithm can obtain a better clustering
effect, and its average intersection over union is 72%, which is better than the K-means and
K-means++ algorithms, which verifies its effectiveness.

Figure 5. The clustering effect of two clustering algorithms on the KITTI dataset.
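The sketch below illustrates the anchor clustering described above in NumPy: K-means++-style seeding with the 1 − IoU distance of Equation (6) and roulette selection of Equations (7)–(8), followed by the standard IoU-based K-means refinement used for YOLO anchors. Function names, k = 9 and the random data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) boxes and (w, h) anchors, both centered at the origin (Equation (5))."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def init_anchors_kmeanspp(boxes, k, rng):
    """Seeding with d = 1 - IoU and roulette selection (Equations (6)-(8))."""
    anchors = [boxes[rng.integers(len(boxes))]]           # first anchor: one random real box
    while len(anchors) < k:
        d = 1.0 - iou_wh(boxes, np.array(anchors))        # difference to every current anchor
        d_min = d.min(axis=1)                             # D_1 ... D_N: distance to nearest anchor
        r = rng.uniform(0, d_min.sum())                   # random value not exceeding Sum
        idx = np.searchsorted(np.cumsum(d_min), r)        # roulette: subtract D_i until r < 0
        anchors.append(boxes[idx])
    return np.array(anchors)

def cluster_anchors(boxes, k=9, iters=100, seed=0):
    """Standard IoU-based K-means refinement on top of the improved seeding."""
    rng = np.random.default_rng(seed)
    anchors = init_anchors_kmeanspp(boxes, k, rng)
    for _ in range(iters):
        assign = iou_wh(boxes, anchors).argmax(axis=1)    # assign each box to its best anchor
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else anchors[j]
                        for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

# Example with random box widths/heights standing in for KITTI labels.
boxes = np.random.default_rng(1).uniform(10, 300, size=(500, 2))
print(cluster_anchors(boxes, k=9))
```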

3.2. Optimization of the Model Loss Function Based on Sample Balance

For the definition of samples in the YOLOv4 model, the concepts of the four sample types
are explained as follows:
1. The essence of object detection in the YOLOv4 model is dense sampling: a large number of
prior boxes are generated in an image, and the real boxes are matched with some of the prior
boxes. A prior box that is successfully matched is a positive sample, and one that cannot be
matched is a negative sample.
2. Suppose there is a binary classification problem, and both Sample 1 and Sample 2 are in
Category 1. In the prediction results of the model, the probability that Sample 1 belongs to
Category 1 is 0.9, and the probability that Sample 2 belongs to Category 1 is 0.6; the former is
predicted more accurately and is an easy sample to classify, while the latter is predicted less
accurately and is a difficult sample to classify.

Figure 6. Cluster results of the improved K-means++ algorithm on the KITTI dataset.

For deep learning models, sample balance is very important. A large number of
negative samples will affect the model's judgment of positive samples and, in turn, the
accuracy of the model, and the dataset will inevitably have an imbalance of positive and
negative samples and of easy and difficult samples due to objective reasons. In order to alleviate
the sample imbalance caused by the distribution of the dataset, this paper uses the focal loss
function to reconstruct the loss function of the model and control the training weight of the samples.

From Equation (1) above, it can be seen that the confidence loss function of the
YOLOv4 model is constructed using the cross-entropy function, which can be simplified to
the following equation:

$$
L_{conf} = \sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\left[-\log(C_i)\right] + \sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{noobj}\left[-\log(1 - C_i)\right]
\tag{9}
$$

The confidence loss of YOLOv4 is reconstructed by using the focal loss function, and the
loss function of the VV-YOLO model is obtained, as shown in Equation (10).

$$
\begin{aligned}
L = {} & \lambda_{coord}\sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\,(2 - w_i \times h_i)(1 - CIoU) \\
& - \sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\left[\alpha_t (1 - C_i)^{\gamma}\log(C_i)\right] \\
& - \lambda_{noobj}\sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{noobj}\left[\alpha_t (1 - C_i)^{\gamma}\log(C_i)\right] \\
& - \sum_{i=0}^{K\times K}\sum_{j=0}^{M} I_{ij}^{obj}\sum_{c\in classes}\left[\hat{p}_i(c)\log(p_i(c)) + (1-\hat{p}_i(c))\log(1-p_i(c))\right]
\end{aligned}
\tag{10}
$$

In Equation (10), α_t is the balance factor, which is used to balance the positive and
negative sample weights; γ is the regulator, which is used to adjust the proportion of
difficult and easy sample loss. In particular, when γ is 0, Equation (10) reduces to the loss
function of the YOLOv4 model.

In order to verify the validity of α_t and γ in the loss function of the VV-YOLO model,
the following mathematical derivation is carried out in this section. To reduce the effect of
negative samples, a balance factor α_t is added to Equation (9); leaving aside the parameters
that do not affect the result, Equation (11) is obtained.

$$
CE(C_i) = -\alpha_t \log(C_i)
\tag{11}
$$

In Equation (11), α_t ranges from 0 to 1: α_t is α when the sample is positive and 1 − α
when the sample is negative, as shown in Equation (12). It can be seen that by setting the
value of α, it is possible to control the contribution of positive and negative samples to the
loss function.

$$
\alpha_t =
\begin{cases}
\alpha & \text{if the sample is positive} \\
1 - \alpha & \text{otherwise}
\end{cases}
\tag{12}
$$

To verify the effect of the regulator γ, a part of Equation (10) can be taken and rewritten
as the following equation:

$$
L_{fl} = -\hat{C}_i (1 - C_i)^{\gamma}\log(C_i) - (1 - \hat{C}_i)\,C_i^{\gamma}\log(1 - C_i)
\tag{13}
$$

In the training of deep learning models, the gradient descent method is used to search
for the optimal solution of the loss function. The gradient indicates the training weight of
different samples during the training process, and the gradient is related to the first-order
partial derivative of the loss function; therefore, taking the first-order partial derivative of
Equation (13) with respect to the variable C_i, we obtain Equation (14).

$$
\frac{\partial L_{fl}}{\partial C_i} = \hat{C}_i\,\gamma(1 - C_i)^{\gamma-1}\log(C_i) - \hat{C}_i\,\frac{(1 - C_i)^{\gamma}}{C_i} - (1 - \hat{C}_i)\,\gamma C_i^{\gamma-1}\log(1 - C_i) + (1 - \hat{C}_i)\,\frac{C_i^{\gamma}}{1 - C_i}
\tag{14}
$$

Suppose that there are two sample points where Ĉ_i is 0 and the values of C_i are 0.1
and 0.4, respectively. When γ is 0, that is, when the loss function is a cross-entropy function,
the values of the partial derivative are 1.11 and 1.66, respectively; when γ is 2, the
values of the partial derivative are 0.032 and 0.67, respectively. It can be seen that, after
setting a suitable value for γ, the ratio of hard-to-distinguish samples to easy-to-distinguish
samples is greatly increased, which increases the weight of difficult-to-distinguish samples
in network training and effectively alleviates the problem of insufficient training caused by
uneven data distribution.

3.3. Neck Network Design Based on Attention Mechanism

The attention mechanism in convolutional neural networks is a specific design that
simulates the human brain; it can be introduced into multiple tasks in the field of
computer vision and has the role of judging the importance of image features. The most
classic attention mechanism network is SENet [40], whose structure is shown in Figure 7.
It uses a global average pooling strategy and fully connected layers to establish an
interrelationship model between channels and effectively extract the importance of
different channels.

Figure 7. SENet model structure.

However, SENet only considers the importance of each channel by modeling channel
relationships, ignoring the influence of feature location information on feature extraction.
Considering the influence of the accuracy of feature position information on detection
accuracy, this paper chooses the coordinate attention network as the module introduced into
the neck network; its structure is shown in Figure 8. In order to build an interaction model
with accurate capture ability, each channel is coded along the horizontal and vertical
coordinates, respectively. The coding formulas are shown below.

$$
z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j)
\tag{15}
$$

$$
z_c^{h}(h) = \frac{1}{W}\sum_{0 \le i \le W} x_c(h, i)
\tag{16}
$$

$$
z_c^{w}(w) = \frac{1}{H}\sum_{0 \le j \le H} x_c(j, w)
\tag{17}
$$

In the above equations, x is the input. z_c^h(h) and z_c^w(w) are obtained by encoding each
channel along the horizontal and vertical coordinates using a pooling kernel of size (H, 1)
or size (W, 1). This parallel modeling structure allows the attention module to capture one
spatial direction while preserving precise location information in the other spatial direction,
which helps the network more accurately mine the objects of interest. After the location
information modeling is completed, the weights along the horizontal and vertical directions
are obtained through convolution operations and the sigmoid function. The calculation
formula for the output feature map is as follows:

$$
y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)
\tag{18}
$$

Figure 8. Coordinate attention module structure.

According to the analysis of the YOLOv4 model in the previous section, based on the
two existing improvements, a third improvement is proposed to address the problem of the
declining feature extraction ability of the model. The coordinate attention module is
introduced in the neck network of the YOLOv4 model, which improves the model's attention
to effective features by modeling the two spatial dimensions of the features and thereby
improves the image feature extraction ability of the model.

Considering that image features are transmitted differently in the backbone network
and the neck network, this paper expects the model to adaptively give more training
weight to effective features when the feature transfer mode changes, so as to reduce the
impact of invalid features on training. Therefore, the coordinate attention module is
inserted between the backbone network and the neck network, the CA-PAN neck network
is designed, and the VV-YOLO model shown in Figure 9 is finally formed.
Figure 9. VV-YOLO model structure.
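To make the focal reweighting of Section 3.2 (Equations (10)–(13)) concrete, the sketch below evaluates the focal confidence term for a small batch of objectness predictions. It is a minimal NumPy illustration under our own naming; α = 0.25 and γ = 2 are the commonly used focal loss defaults, not values reported in this paper.

```python
import numpy as np

def focal_confidence_loss(pred_conf, is_positive, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal confidence loss per prior box, in the spirit of Equations (10)-(13).

    pred_conf   : predicted objectness C_i in (0, 1)
    is_positive : 1 for matched (positive) priors, 0 for negatives
    """
    pred_conf = np.clip(pred_conf, eps, 1.0 - eps)
    # p_t is the probability assigned to the true label; alpha_t follows Equation (12).
    p_t = np.where(is_positive == 1, pred_conf, 1.0 - pred_conf)
    alpha_t = np.where(is_positive == 1, alpha, 1.0 - alpha)
    # Focal term: hard samples (small p_t) keep most of their weight, easy ones are damped.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Two negative samples (label 0): an easy one (C=0.1) and a harder one (C=0.4).
conf = np.array([0.1, 0.4])
labels = np.array([0, 0])
print(focal_confidence_loss(conf, labels, gamma=0.0))  # plain weighted cross-entropy
print(focal_confidence_loss(conf, labels, gamma=2.0))  # hard negative dominates the loss
```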
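The coordinate attention block of Section 3.3 (Equations (16)–(18)) can be sketched in a few lines of PyTorch. The class below is an illustrative re-implementation under our own naming and with an assumed channel-reduction ratio; the original coordinate attention design [29] and the exact way VV-YOLO wires such blocks into the CA-PAN neck may differ in detail.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate attention block in the spirit of Equations (16)-(18)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1): average over width, Eq. (16)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W): average over height, Eq. (17)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                             # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)         # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(y_h))            # directional weights g^h
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))
        return x * g_h * g_w                             # y_c(i, j) = x_c(i, j) * g^h(i) * g^w(j), Eq. (18)

# Example: attach the block to a neck feature map of shape (batch, channels, H, W).
feat = torch.randn(1, 256, 38, 38)
print(CoordinateAttention(256)(feat).shape)
```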


4. Results and Discussion
4.1. Test Dataset

The KITTI dataset [39], as the world's largest computer vision algorithm evaluation
dataset for unmanned driving scenarios, was jointly proposed by the Karlsruhe Institute of
Technology in Germany and the Toyota Institute of Technology in the United States in 2012.
The dataset can be used to evaluate multiple tasks in the computer vision field, including
object detection, object tracking, visual odometry, etc. The data used to evaluate the object
detection model in the KITTI dataset contain nearly 10,000 images in eight categories,
including car, van, truck, pedestrian, person (sitting), cyclist, tram and misc, marking more
than 200,000 objects in total. The data distribution is shown in Figure 10.

Figure 10. KITTI dataset data distribution.

Figure 11 shows the proportion of various objects in the object detection data. It can
be found that the number of car instances far exceeds that of other categories, accounting for
52%, which is a serious sample imbalance. From the point of view of model hyperparameter
tuning, a highly unbalanced data distribution will seriously affect the fitting effect. According to
the characteristics of traffic scenes from a vehicle's perspective and the objects of
interest studied in this paper, a Python script was written to merge the eight types of objects in
the KITTI dataset into vehicle, pedestrian and cyclist [41]. The Vehicle class is composed
of car, van, truck, tram and misc. The Pedestrian class consists of pedestrian and person
(sitting).

Figure 11. The proportion of various objects in the KITTI dataset.

4.2. Index of Evaluation

In order to evaluate different object detection algorithms reasonably and comprehensively,
it is usually necessary to quantify the performance of object detection algorithms
from the real-time and precision perspectives. Reasonable evaluation has important guiding
significance for selecting a suitable object detection algorithm in different scenarios.
For the object detection task from the vehicle view perspective, we focus on precision, recall,
average precision and real-time performance.

4.2.1. Precision and Recall

In the field of machine learning, there are usually the following four definitions for
positive and negative sample relationships. TP (True Positive) denotes a positive sample that is
correctly identified. FP (False Positive) denotes a negative sample that is incorrectly identified
as positive. FN (False Negative) denotes a positive sample that is incorrectly identified as
negative. TN (True Negative) denotes a negative sample that is correctly identified.

The confusion matrix of the classical evaluation system of machine learning can be
formed by arranging the above four positive and negative sample relations in matrix form,
as shown in Figure 12.

Figure 12. Confusion matrix structure.

According to the confusion matrix, the commonly used quantitative measures Precision and
Recall can be defined. The precision represents the proportion of correct predictions
of the model among all the results whose prediction is a positive sample. The
formula is shown in Equation (19).

$$
Precision = \frac{TP}{TP + FP}
\tag{19}
$$

Recall, also known as sensitivity, represents the proportion of correct model predictions
among all the results whose true value is a positive sample, as shown in Equation (20).

$$
Recall = \frac{TP}{TP + FN}
\tag{20}
$$

4.2.2. Average Precision

According to the above formulas, it can be seen that precision and recall are in tension:
if an improvement in one is pursued in isolation, the performance of the other index will often
be sacrificed. Therefore, in order to comprehensively evaluate the object detection algorithm
under different usage scenarios, the PR curve is introduced.

The vertical coordinate of the PR curve is the precision at different confidence thresholds of
the detection boxes, and the horizontal coordinate is the recall at the corresponding confidence
thresholds. The average precision is defined as the area under the PR curve, and its
formula is shown in Equation (21).

$$
AP = \int_{0}^{1} P(R)\,dR
\tag{21}
$$

When evaluating the object detection model, the average precision of each type of
object is averaged to get mAP. mAP is one of the most commonly used evaluation
metrics, and its value is between 0 and 1. Generally, the larger the mAP, the better the
performance of the object detection algorithm on the data. Its formula is shown in
Equation (22).

$$
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
\tag{22}
$$
4.3. VV-YOLO Model Training


Before model training, configuration files and super parameters need to be set. Config-
uration files mainly include category files and prior box files stored in txt file format. The
category file stores the name of the object to be trained, and the prior box file stores the
coordinates of the prior boxes of different sizes.
The hyperparameters of the model training in this paper are set as follows:
• Input image size: 608 × 608;
• Number of iterations: 300;
• Initial learning rate: 0.001;
• Optimizer: Adam;
In order to avoid the problem of indistinct feature extraction caused by overly random initial
weights during model training, the strategy of transfer learning was adopted during VV-YOLO
model training; that is, the pre-training model provided by the YOLOv4 developers was loaded
during training so as to obtain stable training effects. The change curves of the loss
function value and the training accuracy during model training are shown in Figures 13 and 14,
respectively. The loss function value and the training accuracy eventually converge
to about 0.015 and 0.88, achieving the ideal training effect.

Figure 13. Training loss curve of the VV-YOLO model.

Figure 14. Training average precision change curve of the VV-YOLO model.

4.4. Discussion
4.4.1. Discussion on Average Precision of VV-YOLO Model
The YOLOv4 model and the VV-YOLO model were tested on the KITTI dataset [39],
and the precision, recall and average precision results obtained are shown in the following
table. According to the results in Table 2, the average precision of the VV-YOLO model
is 80.01%, which is 3.44% higher than that of the YOLOv4 model. In terms of precision
and recall, the VV-YOLO model is lower than the YOLOv4 model only in the recall of the
pedestrian target, and it leads on the remaining indicators. Figure 15 shows the
average precision of the three types of objects for the two models, and the results show that
the VV-YOLO model is superior to the YOLOv4 model.

Table 2. Test results of the YOLOv4 model and the VV-YOLO model on the KITTI dataset.

Evaluation Indicators | YOLOv4 | VV-YOLO
Precision (Vehicle) | 95.01% | 96.87%
Precision (Cyclist) | 81.97% | 93.41%
Precision (Pedestrian) | 74.43% | 81.75%
Recall (Vehicle) | 80.79% | 82.21%
Recall (Cyclist) | 55.87% | 55.75%
Recall (Pedestrian) | 56.58% | 52.24%
Average precision | 76.57% | 80.01%

Figure 15. Schematic diagram of average precision: (a) YOLOv4; (b) VV-YOLO.

To verify the effectiveness of each improved module of VV-YOLO, multiple rounds
of ablation experiments were performed on the KITTI dataset, and the results are shown
in Table 3. From the results in the table, it can be concluded that the precision of the
proposed model is improved by 6.88% and the average precision is improved by 3.44%,
with only a slight increase in the number of parameters. Table 3 also shows the experimental
results of the comparison between the proposed model and a variety of advanced attention
mechanisms, which further proves the effectiveness of the improved module.

Table 3. Ablation experimental results of the VV-YOLO model on the KITTI dataset.

Test Model                              Precision   Recall    Average Precision
Baseline                                83.80%      64.41%    76.57%
+Improved K-means++                     89.83%      60.70%    77.49%
+Focal Loss                             90.24%      61.79%    78.79%
Attention mechanisms   +SENet           89.47%      62.99%    78.61%
                       +CBAM            89.83%      60.69%    78.49%
                       +ECA             89.66%      61.96%    78.48%
VV-YOLO                                 90.68%      63.40%    80.01%
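For reference, the "+Focal Loss" row corresponds to introducing the focal loss of Lin et al. [18] into the loss function; how it is combined with the other loss terms is described earlier in the paper. The sketch below shows only the standard binary form of the focal loss, with illustrative values for alpha and gamma.

```python
# Standard binary focal loss (Lin et al. [18]); alpha and gamma are illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Down-weights easy, well-classified samples so that training is not
    dominated by the abundant background examples."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```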

In addition, six mainstream object detection models were selected for comparative
testing, and Table 4 lists the precision, recall, and average precision of the VV-YOLO
model and these mainstream detectors. From the results in the table, it can be concluded
that the VV-YOLO model achieves the highest precision and average precision, and its
recall is exceeded only by that of Faster-RCNN.

Table 4. Comparative test results of the VV-YOLO model and mainstream object detection models.

Test Model      Precision   Recall    Average Precision
RetinaNet       90.43%      37.52%    66.38%
CenterNet       87.79%      34.01%    60.60%
YOLOv5          89.71%      61.08%    78.73%
Faster-RCNN     59.04%      76.54%    75.09%
SSD             77.59%      26.13%    37.99%
YOLOv3          77.75%      32.07%    47.26%
VV-YOLO         90.68%      63.40%    80.01%

4.4.2. Discussion on the Real-Time Performance of VV-YOLO Model


The weight size of the VV-YOLO model is 245.73 MB, only 1.29 MB larger than that of the
YOLOv4 model. On an NVIDIA GeForce RTX 3070 Laptop GPU, the VV-YOLO model and
seven mainstream object detection models were used to run inference on images from the
KITTI dataset; before inference, all test images were resized to the same resolution.
After 100 inference runs, the resulting inference times and frame rates are shown in
Table 5. The data transmission frame rate of an autonomous driving perception system is
usually 15 FPS, and it is generally accepted that an object detection model must exceed
25 FPS to meet the real-time requirements of the system. The inference time of the VV-YOLO
model is 37.19 ms, only about 0.7 ms more than that of the YOLOv4 model, and its inference
frame rate is 26.89 FPS. Compared with the YOLOv3 and YOLOv5 models, the inference
time of the VV-YOLO model is higher, but combined with the precision test results its
comprehensive performance is the best.

Table 5. Real-time comparison between the VV-YOLO model and the mainstream object detection models.

Test Model      Inference Time   Inference Frames (FPS)
RetinaNet       31.57 ms         31.67
YOLOv4          36.53 ms         27.37
CenterNet       16.49 ms         60.64
YOLOv5          26.65 ms         37.52
Faster-RCNN     62.47 ms         16.01
SSD             52.13 ms         19.18
YOLOv3          27.32 ms         36.60
VV-YOLO         37.19 ms         26.89
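As a quick consistency check on Table 5, the reported inference frame rate is simply the reciprocal of the per-image inference time; the short sketch below reproduces the VV-YOLO and YOLOv4 figures and compares them against the 25 FPS real-time threshold mentioned above.

```python
# Frame rate as the reciprocal of per-image inference time (values from Table 5).
inference_time_ms = {
    "YOLOv4": 36.53,
    "VV-YOLO": 37.19,
}

REALTIME_FPS = 25  # real-time threshold assumed in the discussion above

for name, t_ms in inference_time_ms.items():
    fps = 1000.0 / t_ms
    print(f"{name}: {fps:.2f} FPS, real-time: {fps >= REALTIME_FPS}")
# VV-YOLO: 1000 / 37.19 ms gives about 26.89 FPS, above the 25 FPS threshold.
```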

4.4.3. Visual Analysis of VV-YOLO Model Detection Results


Figure 16 shows the model inference heat maps of the YOLOv4 model and the VV-YOLO
model in multiple scenes from the vehicle-mounted perspective. The results in the figure
show that, compared with YOLOv4, VV-YOLO pays more attention to distant and occluded
objects. Figure 17 shows the detection results of YOLOv4 and VV-YOLO on the test data of
the KITTI dataset. It can be seen that VV-YOLO detects objects well even when they are
distant or occluded.

Figure 16. Object detection model inference heat map: (a) YOLOv4; (b) VV-YOLO.

In order to verify the generalization performance of the VV-YOLO model, this paper
also selected the BDD100K dataset and self-collected data from typical traffic scenes to
conduct a comparison test of detection results. The test results are shown in Figures 18
and 19. As can be seen from the results in the figures, the VV-YOLO model correctly detects
objects that the YOLOv4 model either falsely detects or misses. The positive performance of
the VV-YOLO model in actual scenarios is attributable to the specific design of the clustering
algorithm, network structure, and loss function in this paper.
Figure 17. Object detection results of KITTI dataset: (a) YOLOv4; (b) VV-YOLO.

Figure 18. Object detection results of BDD100K dataset: (a) YOLOv4; (b) VV-YOLO.
Figure 19. Object detection results of collected data: (a) YOLOv4; (b) VV-YOLO.

5. Conclusions
Based on the end-to-end design idea, this paper proposes a vehicle viewing angle
object detection model, VV-YOLO. Through the improved K-means++ clustering algorithm,
fast and stable anchor box generation is realized on the model data side. In the VV-YOLO
model training stage, the focal loss function is used to construct the model loss function,
which alleviates the training imbalance caused by the imbalanced data distribution. At the
same time, the coordinate attention mechanism is introduced into the model, and the
CA-PAN neck network is designed to improve the learning ability of the model for the
features of interest. In addition to the experiments on the experimental dataset, this study
also collected some real complex road scene data in China for detection and comparison
tests, and the visualization results confirmed the superiority of the VV-YOLO model. The
experimental results in this paper confirm that the improved VV-YOLO model can better
realize object detection from the vehicle perspective while taking into account both the
precision and the speed of model inference, which provides a new implementation idea for
the autonomous vehicle perception module and has good theoretical and engineering
practical significance.
Author Contributions: Conceptualization, Y.W.; methodology, H.L.; software, H.L. and B.G.; validation, Z.Z.; formal analysis, Y.W.; investigation, Y.G.; resources, L.J. and H.L.; data curation, X.L. and Y.W.; writing—original draft preparation, Y.G.; writing—review and editing, Z.Z.; visualization, Z.Z.; supervision, Y.W.; project administration, H.L.; funding acquisition, B.G. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported in part by the Major Scientific and Technological Special Projects in Jilin Province and Changchun City (20220301008GX), the National Natural Science Foundation of China (52072333, 52202503), the Hebei Natural Science Foundation (F2022203054), and the Science and Technology Project of Hebei Education Department (BJK2023026).

Data Availability Statement: Data and models are available from the corresponding author
upon request.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Saleemi, H.; Rehman, Z.U.; Khan, A.; Aziz, A. Effectiveness of Intelligent Transportation System: Case study of Lahore safe city.
Transp. Lett. 2022, 14, 898–908. [CrossRef]
2. Kenesei, Z.; Ásványi, K.; Kökény, L.; Jászberényi, M.; Miskolczi, M.; Gyulavári, T.; Syahrivar, J. Trust and perceived risk: How
different manifestations affect the adoption of autonomous vehicles. Transp. Res. Part A Policy Pract. 2022, 164, 379–393. [CrossRef]
3. Hosseini, P.; Jalayer, M.; Zhou, H.; Atiquzzaman, M. Overview of Intelligent Transportation System Safety Countermeasures for
Wrong-Way Driving. Transp. Res. Rec. 2022, 2676, 243–257. [CrossRef]
4. Zhang, H.; Bai, X.; Zhou, J.; Cheng, J.; Zhao, H. Object Detection via Structural Feature Selection and Shape Model. IEEE Trans.
Image Process. 2013, 22, 4984–4995. [CrossRef] [PubMed]
5. Rabah, M.; Rohan, A.; Talha, M.; Nam, K.-H.; Kim, S.H. Autonomous Vision-based Object Detection and Safe Landing for UAV.
Int. J. Control. Autom. Syst. 2018, 16, 3013–3025. [CrossRef]
6. Tian, Y.; Wang, K.; Wang, Y.; Tian, Y.; Wang, Z.; Wang, F.-Y. Adaptive and azimuth-aware fusion network of multimodal local
features for 3D object detection. Neurocomputing 2020, 411, 32–44. [CrossRef]
7. Shirmohammadi, S.; Ferrero, A. Camera as the Instrument: The Rising Trend of Vision Based Measurement. IEEE Instrum. Meas.
Mag. 2014, 17, 41–47. [CrossRef]
8. Noh, S.; Shim, D.; Jeon, M. Adaptive Sliding-Window Strategy for Vehicle Detection in Highway Environments. IEEE Trans. Intell.
Transp. Syst. 2016, 17, 323–335. [CrossRef]
9. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
11. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16
December 2015; pp. 1440–1448.
12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Machine Intell. 2017, 39, 1137–1149. [CrossRef]
13. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Machine Intell. 2020, 42, 386–397. [CrossRef]
[PubMed]
14. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016,
arXiv:1605.06409.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef] [PubMed]
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single Shot MultiBox Detector. In Proceedings of the
European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
18. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell.
2020, 42, 318–327. [CrossRef]
19. Bochkovskiy, A.; Wang, C.-Y.; Mark Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
20. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer
Vision, Munich, Germany, 8–14 September 2018; pp. 734–750.
21. Wang, C.-Y.; Bochkovskiy, A.; Mark Liao, H.-Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696v1.
22. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Networks. Commun. ACM 2020, 63, 139–144. [CrossRef]
23. Hassaballah, M.; Kenk, M.; Muhammad, K.; Minaee, S. Vehicle Detection and Tracking in Adverse Weather Using a Deep
Learning Framework. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4230–4242. [CrossRef]
24. Lin, C.-T.; Huang, S.-W.; Wu, Y.-Y.; Lai, S.-H. GAN-Based Day-to-Night Image Style Transfer for Nighttime Vehicle Detection. IEEE Trans. Intell. Transp. Syst. 2021, 22, 951–963. [CrossRef]
25. Tian, D.; Lin, C.; Zhou, J.; Duan, X.; Cao, D. SA-YOLOv3: An Efficient and Accurate Object Detector Using Self-Attention
Mechanism for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4099–4110. [CrossRef]
26. Zhang, T.; Ye, Q.; Zhang, B.; Liu, J.; Zhang, X.; Tian, Q. Feature Calibration Network for Occluded Pedestrian Detection. IEEE
Trans. Intell. Transp. Syst. 2022, 23, 4151–4163. [CrossRef]
Sensors 2023, 23, 3385 20 of 20

27. Wang, L.; Qin, H.; Zhou, X.; Lu, X.; Zhang, F. R-YOLO: A Robust Object Detector in Adverse Weather. IEEE Trans. Instrum. Meas.
2023, 72, 1–11. [CrossRef]
28. Arthur, D.; Vassilvitskii, S. k-means plus plus: The Advantages of Careful Seeding. In Proceedings of the ACM-SIAM Symposium
on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
29. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13708–13717.
30. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
31. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
32. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 936–944.
33. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 6517–6525.
34. Misra, D. Mish: A Self Regularized Non-Monotonic Activation Function. arXiv 2019, arXiv:1908.08681.
35. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
36. Ghiasi, G.; Lin, T.-Y.; Le, Q.V. DropBlock: A regularization method for convolutional networks. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 2–8 December 2018; pp. 10727–10737.
37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In
Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000.
38. Franti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. 2018, 48, 4743–4759. [CrossRef]
39. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
41. Cai, Y.; Wang, H.; Sotelo, M.A.; Li, Z. YOLOv4-5D: An Effective and Efficient Object Detector for Autonomous Driving. IEEE
Trans. Instrum. Meas. 2021, 70, 4503613. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
