

Article
Object Detection Algorithm for Surface Defects Based on a
Novel YOLOv3 Model
Ning Lv 1,2 , Jian Xiao 2 and Yujing Qiao 1, *

1 School of Mechanical Engineering, Yangzhou Polytechnic College, Yangzhou 225009, China; ning_lv@163.com
2 School of Automation, Harbin University of Science and Technology, Harbin 150080, China;
xiaojian0451@126.com
* Correspondence: qiaoyujing@sina.com

Abstract: The surface defects of industrial structural parts have the characteristics of a large-scale span and many small objects, so a novel YOLOv3 model, the YOLOv3-ALL algorithm, is proposed in this paper to solve the problem of precise defect detection. The K-means++ algorithm, combined with the intersection-over-union (IoU) as the distance measure for comparing prior boxes, is used for clustering, which improves the clustering effect. The convolutional block attention module (CBAM) is embedded in the network, thus improving the ability of the network to obtain key information in the image. By adding a fourth prediction scale, the detection capability of the YOLOv3 network for small-object defects is greatly improved. A loss function is designed, which adds the generalized intersection-over-union (GIoU) loss combined with the focal loss to address the shortcomings of the L2 loss and the class imbalance in the samples. Experiments on contour-defect detection for stamping parts show that the mean average precision (mAP) of the YOLOv3-ALL algorithm reaches 75.05% in defect detection, which is 25.16% higher than that of the YOLOv3 algorithm. The average detection time is 39 ms/sheet. This proves that the YOLOv3-ALL algorithm has good real-time detection efficiency and high detection accuracy.

 Keywords: defect detection; YOLOv3; object detection; K-means++; loss function
Citation: Lv, N.; Xiao, J.; Qiao, Y. Object Detection Algorithm for Surface Defects Based on a Novel YOLOv3 Model. Processes 2022, 10, 701. https://doi.org/10.3390/pr10040701
Academic Editor: Mohd Azlan Hussain
Received: 5 March 2022; Accepted: 2 April 2022; Published: 5 April 2022

1. Introduction
Surface-defect detection is an important research topic in the field of machine vision [1–3]. Classical methods usually adopt conventional processing algorithms or artificially designed features and classifiers. However, in a real and complex industrial environment, surface-defect detection often faces many challenges, such as small differences between the defect imaging and the background, low contrast, large changes in defect scale, various types of defects, and a large amount of noise in defect images. There can also be significant interference in defect imaging in the natural environment [4–8]. Currently, the classical methods tend to have limited use, and it is difficult to obtain good detection results with them. With the development of artificial intelligence technology, the research focus of surface-defect detection based on machine vision has shifted from classical image processing and machine learning methods to deep learning methods, which solve problems that cannot be solved by traditional methods in many industrial contexts [9–14]. If the problem of very precise defect-size detection is not considered, then, viewed as a computer vision task, the essence of defect detection is similar to object detection: determining what the defects are, where they are located, how many defects there are, and how they interact.
Before 2016, object-detection algorithms based on deep learning were mainly realized through exhaustive traversal and classification, such as DPM, R-CNN, etc. [15–17]. The more precisely these algorithms traverse the image, the more accurate the detector will be, but at huge time and space costs. The YOLO algorithm reformulates object recognition as a regression problem, which can directly predict boundary coordinates and class probabilities from image pixels. In other words, in the YOLO system, you only need to look at an image
once to predict what the object is and where it is, which is the same as the objective
for defect detection [18,19]. The YOLOv3 network is based on YOLO and draws on the essence of ResNet, DenseNet, and FPN; it can be regarded as a fusion of the most effective object-detection techniques in the industry at that time. Although the accuracy of YOLOv3 is slightly better than that of SSD, slightly worse than that of Faster R-CNN, and close to but below that of RetinaNet, its speed is at least two times faster than that of SSD, RetinaNet, and Faster R-CNN, so it has high application value in industry. Based
on the above analysis, aiming at the characteristics of a large-scale span of workpiece
surface defects and many small objects, this paper proposes a novel YOLOv3 model for
efficient detection of surface-defect objects, and takes stamping parts’ contour-defect-object
detection as an example to verify the superior performance of the proposed object-detection
algorithm [20,21].
The main contributions of our work are as follows: The K-means++ algorithm is introduced and optimized to improve the clustering effect of the ground-truth boxes of the marked data and to provide a good data-clustering foundation for YOLOv3 object detection. The CBAM is embedded into the YOLOv3 network to improve the detection ability of the network for small objects. The GIoU (generalized intersection-over-union) loss function is introduced; this improvement solves the problem that the loss value is 0 when there is no overlap between the prediction box and the ground-truth box, so that the network can carry out back-propagation to optimize its parameters. In order to further improve the detection of small-object defects, a fourth prediction scale is added to YOLOv3. This study provides a solid technical and theoretical basis for the rapid detection of defects in industrial on-line production.
Experiment design: the contour-defect detection of stamping parts is taken as an example to verify the performance of the proposed defect-detection algorithm.
The rest of this paper is organized as follows: Section 2 briefly discusses the YOLOv3
object detection algorithm. Section 3 elaborates the principle and method of the new
YOLOv3 object-detection algorithm. Section 4 takes the contour defect of stamping parts as
the research object to verify the performance of the proposed algorithm, and the corresponding conclusions are drawn from the experiments.

2. Object-Detection Algorithm Based on YOLOv3


As shown in Figure 1, the image input size of the YOLOv3 network is 416 × 416 × 3. The Darknetconv2D_BN_Leaky (DBL) component is the most used module in YOLOv3, which includes a convolution layer (Conv), batch normalization (BN), and the Leaky ReLU activation function. After that, the connected part is the residual block (Resblock_body), which is composed of zero padding, a DBL, and n residual units (Res_unit); the values of n are 1, 2, 8, 8, and 4. In each residual unit, two DBL components and a jump connection are used. The above parts constitute the backbone network for image feature extraction.
The sizes of the feature graphs output by the YOLOv3 network are 13 * 13, 26 * 26, and 52 * 52. When the center of a detection object falls in a grid cell, that grid cell is responsible for detecting the object. A feature graph divided into many grid cells is used to predict small objects, which enhances the detection ability for small objects, while the scales with fewer grid cells are responsible for predicting large objects. YOLOv3 draws on the network structure of feature pyramid networks (FPN) [20], as shown in Figure 2. The bottom-level features of the feature pyramid carry little semantic information; even if an object can be located correctly, it may be assigned to the wrong category. The high-level features carry rich semantic information; although objects can be classified correctly, small objects may not be located. Therefore, when predicting on the different scales independently, the features of the previous layer need to be fused through up-sampling to enhance the ability of the network to obtain the features.
Figure 1. YOLOv3 network structure.
Figure 2. Feature pyramid network schematic diagram.
The first-scale feature graph of YOLOv3 is obtained by passing the input through the Darknet-53 network, followed by 5 DBL layers, and then through one DBL and one convolution layer to acquire the final output. Each grid cell is able to predict three bounding boxes, and each bounding box predicts the central coordinates, the width and height of the bounding box, as well as the confidence degree, for a total of five parameters. Therefore, each feature graph can generate S × S × 3 × (4 + 1 + C) parameters, where S represents the size of the divided grid, 3 represents the number of bounding boxes predicted by each grid cell, 4 indicates the central coordinates and the width and height of the predicted bounding boxes, and C represents the total number of predicted categories. For the COCO dataset, there are 80 prediction categories, so the tensor parameter for each grid cell of the first scale is 255. The second-scale feature graph is convolved and up-sampled by the branch of the first scale, spliced and fused with the fourth residual block of Darknet-53 to obtain new features, and then the DBL layers and a convolution layer are used to obtain the feature graph with the final output of 26 * 26 * 255. The third-scale feature graph is convolved and up-sampled by the branch of the second scale, and spliced and fused with the third residual block of Darknet-53 to obtain new features. Then, the final output 52 * 52 * 255 feature graph is obtained through a DBL layer and a convolution layer. After the above process, the YOLOv3 algorithm completes the prediction of three feature scales.
the above process, the YOLOv3 algorithm completes the prediction of three feature scales.

3. Object-Detection Algorithm Based on Novel YOLOv3 Model


3.1. Optimization of K-Means++ Algorithm Clustering Prior Box
YOLOv3 uses a K-means algorithm to generate a prior box by clustering a marked-
ground-truth box, and then takes the size of the prior box as a reference to predict the
prediction box of an object. The K-means algorithm first needs to set K clustering centers.
For the selection of K, the prior method can be used to set the K value by knowing the
prior information of the whole dataset. YOLOv3 predicts feature graphs of three sizes, and
each grid of each size has three prior boxes, so the K value is set to 9. The object-detection
algorithm based on YOLOv3 requires a large amount of image data, which will generate
more marked-ground-truth boxes for clustering, and then more ground-truth boxes with a
large size gap will participate in clustering. The K-means algorithm is sensitive to outliers in large amounts of data and easily falls into locally optimal clustering. Therefore, the original K-means algorithm cannot cluster the prior boxes in the dataset well. In view of the problems existing in the K-means algorithm, this paper introduces the K-means++ algorithm
and optimizes the K-means++ algorithm to improve the clustering effect of annotated
ground-truth boxes in the data, and also provides a good data-clustering foundation for
YOLOv3 object detection.
The K-means++ algorithm mainly optimizes the way the initial cluster centers are selected. The basis for selecting the initial cluster centers is that the distance between
the selected points should be as large as possible, so as to ensure that the clustering center
will no longer gather in a certain area, which has a better effect on the global clustering
of data [22]. The K-means ++ algorithm first randomly selects a sample from the dataset
as the initial cluster center, then calculates the distance between other data points and the
selected cluster center. The calculation formula is as follows:

D(x_i) = argmin ||x_i − u_r||   (1)

where x_i represents each point in the dataset, u_r is the selected cluster center, and r is the number of the selected cluster center. The calculated distance is used to select new data points as the
cluster center. The selection basis is that the points with large D(x) have a high probability
of being selected as the cluster center. The calculation formula of probability is as follows:

P_x = D(x)² / ∑_{x∈X} D(x)²   (2)

where X is the dataset to be clustered.


The distance between the sample point calculated in the above process and the clus-
tering center is the Euclidean distance. In object detection, the purpose of clustering is to
make the generated anchor box and the annotated ground-truth box closer. The closer they
are, the better the clustering. Therefore, the intersection-over-union (IoU) is used to define
a new distance, as shown in Equation (3):

D ( xi ) = 1 − IoU ( xi , ur ) (3)

where xi represents each point in the dataset, ur is the selected cluster center, and r is the
number of the selected cluster center. IoU represents the IoU of the two parameters, and
the calculation formula is as follows:
IoU(x_i, u_r) = |x_i ∩ u_r| / |x_i ∪ u_r|   (4)

The above steps must be repeated to select a new cluster centroid until K cluster
centroids are selected, and then the selected K cluster centroids must be used to perform
the K-means clustering algorithm. At this point, the whole clustering process is completed.
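As a concrete illustration of Equations (1)–(4), the following sketch (our own assumption, not the authors' code) seeds K-means++ with the 1 − IoU distance over box widths and heights; only the (width, height) of each marked ground-truth box is clustered.

```python
import numpy as np

def iou_wh(box, centers):
    """IoU between one (w, h) box and an array of (w, h) centers (Equation (4))."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_pp_iou(boxes, k=9, iters=100, seed=0):
    """K-means++ seeding and clustering of prior boxes with d = 1 - IoU (Equations (1)-(3))."""
    rng = np.random.default_rng(seed)
    centers = [boxes[rng.integers(len(boxes))]]           # first center: a random sample
    while len(centers) < k:
        d = np.array([1.0 - iou_wh(b, np.array(centers)).max() for b in boxes])
        p = d ** 2 / np.sum(d ** 2)                        # Equation (2): selection probability
        centers.append(boxes[rng.choice(len(boxes), p=p)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):                                 # standard K-means refinement
        assign = np.array([np.argmax(iou_wh(b, centers)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers

# usage: boxes is an (N, 2) NumPy array of ground-truth (width, height) pairs in pixels
# priors = kmeans_pp_iou(boxes, k=9)
```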

3.2. Fused Attention Mechanism
The convolutional block attention module (CBAM) is a lightweight soft-attention module for convolutional networks [23], as shown in Figure 3, in which the channel attention module and the spatial attention module process the data in different dimensions, respectively. First, the channel attention module is used to compress the spatial dimension of the feature graph, and then the spatial attention module is used to compress the channel dimension. Finally, a convolution operation is used to ensure that the output data dimension is consistent with the input data dimension, so as to complete the construction of the attention mechanism.

Figure 3. CBAM network structure.

3.2.1. Channel Attention Module
In convolutional neural networks, some information is useless relative to the features, and the information weights can be re-assigned by learning. The basic idea of the channel attention mechanism is to increase the weight of the effective channels and decrease the weight of the invalid channels; these weights are the focus of the channel attention mechanism. For image-processing tasks, each layer of the convolutional neural network has multiple convolution kernels. The channel attention mechanism acts on the feature channels corresponding to each convolution kernel, and each channel has a different weight, so that the network pays more attention to the important features and weakens the unimportant features. As shown in Figure 4, the processing flow of the channel attention mechanism is as follows: First, the input feature graph is subjected to maximum pooling and average pooling operations, through which the loss of feature information can be reduced, and two spatial-information features can be obtained. Then, the two features are passed through a multi-layer perceptron with shared weights, and the output features of the multi-layer perceptron are added elementwise. After applying the activation function, the channel attention feature is output; the sigmoid activation function is selected here. The output channel attention feature and the input feature are multiplied elementwise, and the resulting feature is taken as the input feature of the spatial attention module. For the input feature F, the mathematical expression of the channel attention module is as follows:

Mc(F) = σ(MLP(Avgpool(F)) + MLP(Maxpool(F)))
      = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max)))   (5)

In Equation (5), Mc(F) represents the output of the channel attention module, σ is the sigmoid activation function, W0 and W1 are the weight coefficients of the multi-layer perceptron, and F^c_avg and F^c_max represent the output feature graphs after average pooling and maximum pooling, respectively.

Figure 4. Channel attention module.

3.2.2. Spatial Attention Module
The spatial attention module is used to compress the data on the channel dimension and focuses on obtaining the effective information in the channels. As shown in Figure 5, after the channel attention processing is completed, the generated feature graph and the input feature graph are multiplied elementwise as the input of the spatial attention module, and the input feature graph is pooled in the channel dimension by using the average pooling and maximum pooling methods to generate two feature graphs, which are combined into a new feature map. Because the dimension of this feature graph is different from that of the input feature graph, a convolution operation is needed to keep it consistent, and it is then activated through a sigmoid activation function to finally output the spatial attention feature graph. For the input feature F, the mathematical expression of the spatial attention module is shown in Equation (6):

Ms(F) = σ(f^{7×7}([Avgpool(F); Maxpool(F)]))
      = σ(f^{7×7}([F^s_avg; F^s_max]))   (6)

where Ms(F) represents the output of the spatial attention module, σ represents the sigmoid activation function, f^{7×7} represents the convolution kernel with a size of 7 × 7, and F^s_avg and F^s_max represent the feature graphs output after average pooling and maximum pooling, respectively.

Figure 5. Spatial attention module.

CBAM integrates channel attention and spatial attention, which not only ensures the acquisition of important channel information, but also ensures the acquisition of important information in characteristic areas. CBAM is widely used in residual network structures, and the YOLOv3 algorithm uses residual network structures to reduce gradient disappearance. Therefore, CBAM is embedded in the YOLOv3 network in this paper. As shown in Figure 6, the embedding of the attention mechanism can be completed by adding a CBAM module behind the residual unit.

Figure 6. CBAM added to the residual unit.
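As a concrete illustration of Equations (5) and (6), the following Keras sketch (our own illustrative assumption, not the authors' released code) builds a CBAM block from a channel attention sub-module and a 7 × 7 spatial attention sub-module; the reduction ratio of 16 for the shared MLP is an assumed value.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, reduction=16):
    """Equation (5): shared MLP over average- and max-pooled channel descriptors."""
    ch = x.shape[-1]
    w0 = layers.Dense(ch // reduction, activation="relu")   # W0
    w1 = layers.Dense(ch)                                   # W1
    avg = w1(w0(layers.GlobalAveragePooling2D()(x)))
    mx = w1(w0(layers.GlobalMaxPooling2D()(x)))
    scale = tf.sigmoid(avg + mx)[:, None, None, :]           # Mc(F)
    return x * scale

def spatial_attention(x):
    """Equation (6): 7x7 convolution over channel-wise average and max maps."""
    avg = tf.reduce_mean(x, axis=-1, keepdims=True)
    mx = tf.reduce_max(x, axis=-1, keepdims=True)
    concat = tf.concat([avg, mx], axis=-1)
    scale = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(concat)  # Ms(F)
    return x * scale

def cbam(x):
    """CBAM block as in Figure 3: channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(x))

# usage inside a residual unit (Figure 6): y = layers.Add()([shortcut, cbam(conv_branch)])
```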
3.3. Improving the Network Structure
The YOLOv3 draws on the principle of the feature pyramid structure for reference, integrates the low-level information and the high-level information after up-sampling [24], and makes independent predictions on three feature scales, namely 13 * 13, 26 * 26, and 52 * 52. Many industrial structures, such as the contour of stamping parts, have a large defect-scale span, including many small objects. For the three characteristic scales of YOLOv3, the amount of information provided by the grid is limited, and the grid is divided by each scale. The detection layer with eight-fold down-sampling, that is, 52 * 52, is the layer that detects the smallest objects, which leads to an inaccurate detection effect for small objects.
In order to further improve the detection ability of the YOLOv3 network for the small-object defects of industrial structures, this paper continues to use the FPN principle, as shown in Figure 7, to add a scale prediction and to set the features to four scales, namely, 13 * 13, 26 * 26, 52 * 52, and 104 * 104. The fourth scale can use the shallow-detail information and deep-feature-semantic information for defect-object detection, which not only improves the image-processing ability of the network, but also enhances the robustness of the network.

Figure 7. Improved feature fusion network.
Based on the FPN principle and the defect-image features of industrial structures, this paper proposes a YOLOv3 model with an improved network structure, as shown in Figure 8. The input image passes through the Darknet-53 backbone network without the fully connected layer, and feature graphs of four scales are output. The fifth residual block of the backbone network outputs a 32-fold down-sampled feature graph. After it passes through five DBL convolution layers, one 3 × 3 convolution kernel, and one 1 × 1 convolution kernel, the feature graph of the first scale, 13 * 13, is acquired. The output of the fifth residual block, after passing through the five DBL convolution layers, is up-sampled and spliced with the 16-fold down-sampled feature graph output by the fourth residual block in the backbone network to form the feature graph of the second scale, 26 * 26. The aforementioned 16-fold down-sampled feature graph after splicing is passed through five DBL convolution layers, and then the up-sampled feature graph is stitched with the eight-fold down-sampled feature graph output by the third residual block in the backbone network to form the feature graph of the third scale, 52 * 52. In the same way, the above spliced eight-fold down-sampled feature graph is passed through five DBL convolution layers, and then the up-sampled feature graph is stitched with the four-fold down-sampled feature graph output by the second residual block in the backbone network to form the feature graph of the fourth scale, 104 * 104. The attention mechanism is used in the network structure, and the CBAM module is embedded behind each residual unit, so that the improved network structure can make full use of the shallow-detail information and the high-level semantic information, and the detection ability of the network for small objects is improved.

Figure 8. Improved YOLOv3 network structure.
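The following Keras-style sketch (an illustrative assumption on our part, not the paper's implementation) shows how the fourth 104 × 104 prediction head described above could be attached by up-sampling the 52 × 52 branch and concatenating it with the four-fold down-sampled feature graph from the second residual block; `dbl`, the input tensors, and the channel counts are hypothetical placeholders.

```python
from tensorflow.keras import layers

def dbl(x, filters, kernel_size):
    """Hypothetical DBL block: Conv2D + BatchNormalization + LeakyReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def fourth_scale_head(branch_52, res2_feature, num_anchors=3, num_classes=5):
    """104 x 104 head: upsample the 52 x 52 branch, concatenate with the
    four-fold down-sampled backbone feature, then DBL*5 and the output conv (y4)."""
    x = dbl(branch_52, 64, 1)
    x = layers.UpSampling2D(2)(x)                 # 52 x 52 -> 104 x 104
    x = layers.Concatenate()([x, res2_feature])    # fuse shallow-detail information
    for filters, k in [(64, 1), (128, 3), (64, 1), (128, 3), (64, 1)]:
        x = dbl(x, filters, k)                     # DBL*5
    x = dbl(x, 128, 3)
    return layers.Conv2D(num_anchors * (num_classes + 5), 1)(x)
```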
3.4. Optimizing the Loss Function
The loss function plays an important role in object detection. Whether or not the design of the loss function is reasonable is directly related to the training time and detection accuracy of the model [25]. At present, there is no general loss function, and the selection of the loss function requires a comprehensive consideration of factors such as the machine learning algorithm, the model convergence time, and the confidence of the prediction results.
The L2 loss function is used in YOLOv3 to calculate the loss of the predicted box coordinates. The basic idea of the L2 loss function is to reduce the sum of the squares of the differences between the two sets of values as much as possible. Its mathematical expression is shown in Equation (7):

L2(ŷ, y) = ∑_{i=0}^{m} (y^(i) − ŷ^(i))²   (7)

where y^(i) represents the true value, ŷ^(i) represents the estimated value, and m represents the total number of samples.
The robustness of the L2 loss function is poor, mainly because the errors are squared, which makes the model sensitive to samples with large errors and sacrifices more sample values with small errors during adjustment. The IoU loss function was proposed for object detection, and its evaluation idea is to calculate the intersection ratio between the prediction box and the ground-truth box. The larger the value, the higher the degree of coincidence between the two boxes, and the more similar they are. The mathematical expression for the evaluation of the IoU loss function is as follows:

IoU = |A ∩ B| / |A ∪ B|   (8)

In the formula, A and B represent the predicted box and the ground-truth box, respectively. The intersection ratio reflects the detection effect of the predicted box relative to the ground-truth box, but the IoU loss function is insensitive to scale.
Table 1 demonstrates that when the L2 loss values are the same, the IoU loss values are different, and the GIoU loss values are also very different. It can be clearly seen from the positions of the boxes in Figure 9 that the IoU loss values in Figure 9a are significantly less than the IoU loss values in Figure 9c. In Figure 9, the green frames represent the ground-truth boxes, and the red frames represent the prediction boxes. L2, IoU, and GIoU are used to calculate the loss.

Table 1. The loss values of L2, IoU loss, and GIoU loss.

Number/Loss   L2 Loss   IoU Loss   GIoU Loss
a             19.69     0.16       0.07
b             19.69     0.18       0.18
c             19.69     0.29       0.29

Figure 9. The comparison of L2 loss, IoU loss, and GIoU loss. (a) The loss values of L2. (b) The loss values of IoU. (c) The loss values of GIoU.

The IoU loss function also has a certain shortcoming in that it cannot accurately reflect the degree of coincidence of the two regions. Figure 10a indicates that the real defect and the predicted defect are close, and Figure 10b indicates that the real defect and the predicted defect are far away from one another. When the prediction box does not coincide with the ground-truth box, the resulting loss is 0, and the network cannot perform parameter learning.

Figure 10. Comparison of the distance between the prediction box and the ground-truth frame. (a) The real defect is close to the predicted defect. (b) The real defect is far from the predicted defect.

In view of the above problems, we introduce the GIoU (generalized intersection-over-union) loss function [26]. This improved method can solve the problem of the loss function value being 0 when there is no overlap between the prediction box and the ground-truth box, so that the network can carry out back-propagation to optimize the parameters. The GIoU is calculated as follows:

GIoU = IoU − |C \ (A ∪ B)| / |C|   (9)

where A and B represent the prediction box and the ground-truth box, respectively, and C represents the smallest bounding rectangle of the prediction box and the ground-truth box, as shown by the black box in Figure 11. IoU represents the intersection-over-union of the prediction box and the ground-truth box.

Figure 11. The smallest bounding rectangle C of GIoU.

It can be determined from Equation (9) that the value range of GIoU is (−1, 1]. GIoU comprehensively considers the overlapping area and the nonoverlapping area, which can reflect the degree of coincidence between the ground-truth box and the predicted box. The shorter the distance between the two boxes, the closer the value of GIoU is to 1, so this paper defines the GIoU loss function as follows:

GIoU loss = 1 − GIoU   (10)

As can be seen from Equation (10), when the GIoU of the prediction box and the ground-truth box is larger, the loss value is smaller, and the network model will perform parameter optimization in the direction of model convergence.
In the dataset of this study, the samples used are stamping parts, which have many small defects. In an image, the areas with defects are used as positive samples, and the areas without defects are used as negative samples. In the actual feature-graph output, there are few positive-sample candidate boxes containing defective objects. This generates a large number of negative-sample candidate boxes, resulting in a class-imbalance problem, which will cause the model to fail to learn effective information. YOLOv3 adopts cross-entropy as the classification loss function. The cross-entropy loss is sensitive to class imbalance, meaning that when the classification samples are unbalanced, the trained model will be biased toward the categories with more samples in the training set, so that the multi-class detection effect of the model is not good.
In view of the above problems, this paper introduces the focal loss function to improve
the YOLOv3 confidence loss to solve the problem of class imbalance. The Focal loss formula
is as follows:

Focal loss = −α(1 − p)^γ log(p),          y = 1
Focal loss = −(1 − α) p^γ log(1 − p),     y = 0      (11)
where α is the weight coefficient, which is responsible for adjusting the balance of positive and negative samples; γ is a hyperparameter, which is responsible for adjusting the balance between samples that are difficult and easy to classify; and y indicates whether the label is a real (positive) one. A large number of experiments show that when α is 0.5 and γ is 2, the model training effect is the best.
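A minimal sketch of Equation (11) (our own illustration, using the α = 0.5 and γ = 2 values stated above; not the authors' code) is:

```python
import tensorflow as tf

def focal_loss(y_true, p, alpha=0.5, gamma=2.0, eps=1e-7):
    """Binary focal loss of Equation (11); y_true is 0/1, p is the predicted probability."""
    p = tf.clip_by_value(p, eps, 1.0 - eps)
    pos = -alpha * tf.pow(1.0 - p, gamma) * tf.math.log(p)          # y = 1 branch
    neg = -(1.0 - alpha) * tf.pow(p, gamma) * tf.math.log(1.0 - p)  # y = 0 branch
    return tf.where(tf.equal(y_true, 1.0), pos, neg)
```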
In summary, an improved scheme integrating the GIoU loss and the focal loss is given for the existing problems. The final loss function of the improved YOLOv3 model in this paper is as follows:

Loss = ∑_{i=0}^{S²} ∑_{j=0}^{B} I_ij^obj (1 − GIoU) × (2 − w_i × h_i)
     − ∑_{i=0}^{S²} ∑_{j=0}^{B} I_ij^obj [Ĉ_i α(1 − C_i)^γ log(C_i) + (1 − Ĉ_i)(1 − α)(C_i)^γ log(1 − C_i)]
     − λ_noobj ∑_{i=0}^{S²} ∑_{j=0}^{B} I_ij^noobj [Ĉ_i α(1 − C_i)^γ log(C_i) + (1 − Ĉ_i)(1 − α)(C_i)^γ log(1 − C_i)]
     − ∑_{i=0}^{S²} I_i^obj ∑_{c∈classes} [P̂_i(c) log(P_i(c)) + (1 − P̂_i(c)) log(1 − P_i(c))]   (12)
where S is the size of the grid into which the image is meshed, B is the number of bounding boxes predicted by each grid cell, and λ_noobj is the confidence loss weight when the grid cell does not contain an object. I_ij^obj and I_ij^noobj are control terms indicating whether the jth bounding box of the ith grid cell is responsible for the detection of the current object. When the center point of the object falls in the ith grid cell, and the jth bounding box of the ith grid cell has the largest intersection ratio with the ground-truth box, then I_ij^obj is 1 and I_ij^noobj is 0; otherwise, I_ij^obj is 0 and I_ij^noobj is 1. I_i^obj indicates whether the ith grid cell contains an object. w_i and h_i represent the width and height of the ground-truth box, respectively. α is the weight coefficient and γ is a hyperparameter. C_i represents the confidence level of the predicted bounding box, and Ĉ_i represents the confidence level of the ground-truth box. P_i(c) represents the class probability of the predicted bounding box, and P̂_i(c) represents the class probability of the ground-truth box.
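To make Equations (9), (10), and (12) concrete, the sketch below (an illustrative assumption, not the released implementation) computes the GIoU loss for axis-aligned boxes given as (x1, y1, x2, y2) and indicates, in comments, how the box, confidence, and class terms could be combined; the masks, the (2 − w × h) weight, and the focal_loss helper from the previous sketch are assumptions.

```python
import tensorflow as tf

def giou_loss(box_pred, box_true, eps=1e-7):
    """Equations (9)-(10) for boxes in (x1, y1, x2, y2) format."""
    x1 = tf.maximum(box_pred[..., 0], box_true[..., 0])
    y1 = tf.maximum(box_pred[..., 1], box_true[..., 1])
    x2 = tf.minimum(box_pred[..., 2], box_true[..., 2])
    y2 = tf.minimum(box_pred[..., 3], box_true[..., 3])
    inter = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area_p = (box_pred[..., 2] - box_pred[..., 0]) * (box_pred[..., 3] - box_pred[..., 1])
    area_t = (box_true[..., 2] - box_true[..., 0]) * (box_true[..., 3] - box_true[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)                                   # Equation (8)
    cx1 = tf.minimum(box_pred[..., 0], box_true[..., 0])          # smallest enclosing box C
    cy1 = tf.minimum(box_pred[..., 1], box_true[..., 1])
    cx2 = tf.maximum(box_pred[..., 2], box_true[..., 2])
    cy2 = tf.maximum(box_pred[..., 3], box_true[..., 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / (area_c + eps)                # Equation (9)
    return 1.0 - giou                                             # Equation (10)

# Assumed composition of Equation (12) for one scale: obj_mask / noobj_mask select the
# responsible boxes, wh_weight = 2 - w * h emphasises small objects, and lambda_noobj
# weights the background confidence term.
# total = tf.reduce_sum(obj_mask * wh_weight * giou_loss(pred_box, true_box)) \
#       + tf.reduce_sum(obj_mask * focal_loss(1.0, pred_conf)) \
#       + lambda_noobj * tf.reduce_sum(noobj_mask * focal_loss(0.0, pred_conf)) \
#       + tf.reduce_sum(obj_mask * tf.keras.losses.binary_crossentropy(true_cls, pred_cls))
```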

4. Experimental Results and Analysis


4.1. Experimental Environment and Model Parameters
The dataset used in this paper contains 3500 contour-defect images of stamping parts,
which are divided into five types of defects. The dataset is divided into training sets,
validation sets, and test sets in an 8:1:1 ratio. The K-means++ algorithm optimized in this
paper is used to cluster prior-bounding boxes, and then the Tensorflow framework is used
to build the improved object-detection algorithm. The adaptive moment estimation (Adam)
optimizer is used to calculate the gradient to update the parameters in the network. Adam
can improve the robustness of the model parameters. Finally, the detection experiment is
carried out in the test set of this paper and the results are analyzed.
The experimental environment of the algorithm in this paper includes a hardware
environment and a software environment. The program is written in the Python language
and accelerated by CUDA. The specific-hardware-environment configuration is shown
in Table 2. The hyperparameter setting values required for the network training process
are shown in Table 3, and the set hyperparameters are suitable for all the target-detection
algorithms that need to be compared in this paper. The size of the input image used in the
experiment is 416 * 416.

Table 2. The configuration of the experimental environment.

Category Model/Version
CPU Intel Xeon CPU E5-2678 v3 @ 2.50 GHz
GPU NVIDIA GeForce RTX 2080 Ti 11 G
RAM Samsung RECC DDR4 16 G
SSD Samsung SSD 860EVO 512 G
Operating system Ubuntu 18.04
CUDA version CUDA10.0
cuDNN cuDNN 7.6
Programming language Python 3.7
Table 3. Hyperparameter setting.

Hyperparameter Name          Parameter Value
Learning rate                0.0001
Batch size                   16
Weight decay coefficient     0.0005
The number of iterations     100,000
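The following sketch (our own hedged illustration; `build_yolov3_all`, `yolov3_all_loss`, and the dataset pipeline are hypothetical names) shows how the 8:1:1 split and the Table 3 settings could be wired up in TensorFlow:

```python
import random
import tensorflow as tf

def split_811(paths, seed=42):
    """8:1:1 train/validation/test split of the annotated image paths."""
    paths = sorted(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    return paths[: int(0.8 * n)], paths[int(0.8 * n): int(0.9 * n)], paths[int(0.9 * n):]

# Table 3 hyperparameters
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)    # learning rate 0.0001
BATCH_SIZE, ITERATIONS, WEIGHT_DECAY = 16, 100_000, 5e-4    # decay applied as an L2 regularizer

# Hypothetical training loop (model builder, loss, and dataset are placeholders):
# model = build_yolov3_all(input_shape=(416, 416, 3), num_classes=5, l2=WEIGHT_DECAY)
# for images, targets in train_ds.batch(BATCH_SIZE).repeat().take(ITERATIONS):
#     with tf.GradientTape() as tape:
#         loss = yolov3_all_loss(model(images, training=True), targets)   # Equation (12)
#     optimizer.apply_gradients(zip(tape.gradient(loss, model.trainable_variables),
#                                   model.trainable_variables))
```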

4.2. Analysis of Experimental Results
In order to evaluate the optimized K-means++ algorithm, four K-means-series algorithms are used for experiments on the dataset in this study. The receptive fields corresponding to the three feature-graph scales of YOLOv3 are large, medium, and small, respectively. The more finely the meshes are divided, the smaller the receptive field. The cluster centers are visualized, and Figure 12 is a clustering-effect diagram, wherein Figure 12a shows the K-means algorithm with the Euclidean distance as the measurement method, Figure 12b shows the K-means++ algorithm with the Euclidean distance as the measurement method, Figure 12c shows the K-means algorithm using the IoU as the measurement method, and Figure 12d shows the K-means++ algorithm using the IoU as the measurement method. The blue points in the figure are the ground-truth boxes marked in the dataset, and the red points denote the cluster centers. The abscissa is the width of the cluster box, and the ordinate is the height of the cluster box, in pixels. Table 4 shows the specific values corresponding to the cluster centers in Figure 12. The numbers a, b, c, and d correspond to the above four schemes.

Figure 12. Clustering effect of different algorithms: (a) K-means algorithm with Euclidean distance; (b) K-means++ algorithm with Euclidean distance; (c) K-means algorithm with IoU; (d) K-means++ algorithm with IoU.
Table 4. Cluster prior boxes.

Number of Figure    Feature Graph    Receptive Field    Prior Box
13 ∗ 13 Big (40 × 36) (68 × 61) (212 × 58)
a 26 ∗ 26 Middle (54 × 229) (114 × 122) (228 × 135)
52 ∗ 52 Small (162 × 268) (293 × 197) (358 × 374)
13 ∗ 13 Big (55 × 51) (122 × 116) (100 × 229)
b 26 ∗ 26 Middle (255 × 93) (208 × 190) (132 × 357)
52 ∗ 52 Small (330 × 183) (260 × 327) (377 × 373)
13 ∗ 13 Big (54 × 49) (113 × 116) (240 × 90)
c 26 ∗ 26 Middle (92 × 242) (194 × 186) (331 × 149)
52 ∗ 52 Small (162 × 358) (294 × 240) (361 × 372)
13 ∗ 13 Big (59 × 46) (99 × 97) (265 × 54)
d 26 ∗ 26 Middle (69 × 271) (167 × 159) (303 × 123)
52 ∗ 52 Small (168 × 322) (285 × 207) (350 × 357)

In this study, the contour coefficient is used to evaluate the quantitative effect of the
above experiments, and the contour silhouette diagram is drawn, as shown in Figure 13.
Figure 13a shows the K-means algorithm with Euclidean distance as the measurement
method, Figure 13b shows the K-means++ algorithm with Euclidean distance as the mea-
surement method, Figure 13c shows the K-means algorithm with the IoU as the mea-
surement method, and Figure 13d shows the K-means++ algorithm with the IoU as the
measurement method. The abscissa in the figure is the value of the contour coefficient; the larger the contour coefficient, the more appropriate the clustering. The ordinate represents the clustering category, meaning the nine clustering prior boxes that YOLOv3 needs to choose. The silhouette image in the figure is drawn according to the contour coefficients of all the sample points; the greater the vertical height a silhouette occupies, the more samples there are in the corresponding category. The red dotted line in the figure represents the average contour coefficient of the clustering. When more of the silhouettes in the figure extend beyond the red dotted line in horizontal width, the clustering can be considered suitable; the larger the average contour coefficient, the better the clustering effect.
It can be seen from Figure 13 that most of the values of the four silhouette images are
positive, so the clustering is effective. It can be seen from Table 5 that the clustering average
contour coefficients are all greater than 0.4. For the dataset in this paper, the K-means++
algorithm is better than the K-means algorithm under the same measurement method, and the clustering algorithm using IoU as the measurement method produces a better clustering effect than the one using Euclidean distance; the vertical heights of the silhouette images obtained with IoU are also more uniform. On the premise that the samples of various
defect types in the dataset are balanced, it is also reasonable for the data to be clustered
more evenly. The improved K-means++ algorithm, meaning the K-means++ algorithm
that uses IoU as the unit of measurement, has the highest average silhouette coefficient of
0.48, so this paper adopts the K-means++ algorithm combined with IoU as the clustering
algorithm of the prior box.
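The contour (silhouette) coefficients reported in Table 5 can be reproduced, under the same 1 − IoU distance, from a precomputed distance matrix, for example with scikit-learn. The sketch below is illustrative only, not the authors' evaluation code; iou_matrix and contour_coefficients are assumed helper names, and labels is assumed to be the cluster assignment produced by the prior-box clustering.

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

def iou_matrix(boxes):
    """Pairwise IoU of (w, h) boxes anchored at the origin."""
    w, h = boxes[:, 0][:, None], boxes[:, 1][:, None]
    inter = np.minimum(w, w.T) * np.minimum(h, h.T)
    union = w * h + (w * h).T - inter
    return inter / union

def contour_coefficients(boxes, labels):
    """Average and per-sample contour (silhouette) coefficients under the 1 - IoU distance."""
    dist = 1.0 - iou_matrix(boxes)                                   # zero on the diagonal
    avg = silhouette_score(dist, labels, metric="precomputed")       # value listed in Table 5
    per_sample = silhouette_samples(dist, labels, metric="precomputed")  # values drawn in Figure 13
    return avg, per_sample

# Example (hypothetical): labels obtained from the K-means++/IoU clustering of the prior boxes
# avg, per_sample = contour_coefficients(boxes, labels)   # avg around 0.48 for scheme (d) in Table 5
```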
The improvement effect in the model-training process can be evaluated by observing
the convergence speed of different algorithm models. The faster the loss value decreases, the
faster the model converges. In Figure 14, YOLOv3-CBAM represents the YOLOv3 algorithm
embedded in CBAM, YOLOv3-Loss represents the YOLOv3 algorithm that improves the
loss function, YOLOv3-4L represents the YOLOv3 algorithm that adds the fourth scale,
and YOLOv3-ALL is the YOLOv3 algorithm fusing all the above improvement points. The
YOLOv3-ALL model proposed in this paper basically converges after 90,000 iterations. The
loss value at 100,000 iterations is 2.78, while the loss value of the unimproved YOLOv3
model is 3.74, and it can also be seen from the image that the loss value of the YOLOv3-ALL
model decreases faster, indicating that the YOLOv3-ALL model converges more easily and
the curve fluctuation amplitude is smaller, while the unimproved YOLOv3 model curve fluctuation is larger. This also shows that the improved loss function designed in this study is more appropriate, so that the predicted value approaches the true value faster.
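As a minimal sketch of the two components of the improved loss, assuming corner-format boxes and typical focal-loss defaults (α = 0.25, γ = 2.0, which are not necessarily the settings used in this paper): the GIoU term replaces an L2-style box-regression loss, and the focal term down-weights easy negatives to counter class imbalance. This is illustrative PyTorch-style code, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, target, eps=1e-7):
    """1 - GIoU for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest box enclosing both the prediction and the ground truth
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c_area = cw * ch + eps
    return 1.0 - (iou - (c_area - union) / c_area)

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss on objectness/class logits; down-weights easy, abundant negatives."""
    ce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * labels + (1 - p) * (1 - labels)
    alpha_t = alpha * labels + (1 - alpha) * (1 - labels)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```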

Figure 13. The contour silhouette images of the clustering algorithms. (a) K-means algorithm with Euclidean distance; (b) K-means++ algorithm with Euclidean distance; (c) K-means algorithm with IoU; (d) K-means++ algorithm with IoU.

Table 5. Comparison of clustering effects.

Figure   Clustering Method   Measurement Method   Number of Clustering Centers   Value of Average Contour Coefficient
a        K-means             Euclidean distance   9                              0.41
b        K-means++           Euclidean distance   9                              0.42
c        K-means             IoU                  9                              0.43
d        K-means++           IoU                  9                              0.48

Figure 14. Comparison of four loss function curves. (a) The loss function values of YOLOv3 and YOLOv3-ALL; (b) The loss function values of YOLOv3-ALL, YOLOv3-CBAM, YOLOv3-4L, and YOLOv3-Loss.
Comparing the data in Table 6, it can be seen that, in the detection of the five kinds of stamping-contour defects, the AP values of the algorithm in this paper for pit, patches, scratches, crazing, and concave defects are 94.85%, 82.58%, 78.11%, 68.54%, and 51.15%, respectively, while the AP values of the unimproved YOLOv3 model for the five types of defects are 76.94%, 65.46%, 54.74%, 30.32%, and 21.97%, respectively. The YOLOv3-ALL model in this paper performs better than the unimproved YOLOv3 model, and the detection accuracy for pit defects is up to 94.85%, which also shows that the improvement of the network structure in this paper improves the network's ability to detect small objects.

Table 6. Experimental performance evaluation.

Algorithm Name   Pit (AP/%)   Patches (AP/%)   Scratches (AP/%)   Crazing (AP/%)   Concave (AP/%)   mAP (%)
YOLOv3           76.94        65.46            54.74              30.32            21.97            49.89
YOLOv3-CBAM      80.36        67.39            56.16              31.23            23.41            51.71
YOLOv3-Loss      85.21        71.46            72.36              43.45            42.53            63.00
YOLOv3-4L        84.69        75.23            73.63              51.16            40.19            64.98
YOLOv3-ALL       94.85        82.58            78.11              68.54            51.15            75.05

As can be seen from Table 6, the mAP value of the YOLOv3-ALL algorithm reaches 75.05%, while the mAP value of the unimproved YOLOv3 algorithm is 49.89%. In general, the mAP value of the novel YOLOv3 algorithm increases by 25.16%, which shows that the algorithm in this paper is better than the traditional YOLOv3 algorithm in all categories of contour-defect detection for stamping parts. Among these types of defects, cracks and pits are typically small and difficult-to-detect objects; since the algorithm in this paper adds feature-graph detection at the fourth scale, the detection effect is significantly improved. This obvious improvement also shows that the algorithm in this paper has superior performance in the feature detection of the various defect categories. As can be seen from Table 7, the average detection time of the YOLOv3-ALL algorithm in this paper is 39 ms, which is 7 ms slower than the traditional YOLOv3 algorithm, but it can also meet the real-time requirements in an industrial environment.

Table 7. Time comparison of detection by different models.

Algorithm Type   Test Pictures (piece)   Total Time Taken (ms)   Average Detection Time (ms)
YOLOv3           300                     9713                    32
YOLOv3-ALL       300                     11,628                  39
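For reference, the AP values in Table 6 are the per-class areas under the precision–recall curve, and mAP is their mean. The short sketch below is illustrative only (the all-point interpolation is an assumption, not necessarily the evaluation protocol used here), but it reproduces the reported 75.05% mAP from the per-class AP values.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # make the precision envelope non-increasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of the per-class AP values, e.g. for the YOLOv3-ALL row of Table 6:
ap_all = {"pit": 0.9485, "patches": 0.8258, "scratches": 0.7811, "crazing": 0.6854, "concave": 0.5115}
print(round(100 * sum(ap_all.values()) / len(ap_all), 2))  # -> 75.05, the reported mAP
```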

The detection results of the test set using the YOLOv3-ALL algorithm and the YOLOv3 al-
gorithm, respectively, are shown in Figures 15 and 16. In the two figures, Figures 15a and 16a
show the stamping-part contour-defect sample with pit defects, and are marked with cyan
boxes and labeled pit; Figures 15b and 16b represent the stamping-part contour-defect
sample with patch defects, and are marked with dark blue boxes and labeled patch;
Figures 15c and 16c represent the stamping-part contour-defect sample with scratch de-
fects, and are marked with green boxes and labeled scratches; Figures 15d and 16d show
a sample of stamping-contour defects with crack defects and are marked with orange
boxes and labeled crazing; and Figure 15e shows the sample of stamping-contour defects
with concave defects, and is marked with pink boxes and labeled concave. It can be seen from the two sets of figures that both algorithms can detect various types of defects, but the YOLOv3-ALL algorithm in this paper performs better in the test results for the various defects, and its detection accuracy is higher than that of YOLOv3. Furthermore, it also has a better detection effect when there are multiple stamping-contour defects in the same sample.

(Detection confidences visible in Figure 15: pit 0.94; patches 0.88, 0.86; scratches 0.85, 0.58; crazing 0.74, 0.32; concave 0.50.)
Figure 15. The test results of YOLOv3-ALL. (a) The defects of pit. (b) The defects of patch. (c) The defects of scratch. (d) The defects of crack. (e) The defects of concave.
(Detection confidences visible in Figure 16: pit 0.65; patches 0.71, 0.59; scratches 0.38; crazing 0.31; concave 0.36.)
Figure 16. The test results of YOLOv3. (a) The defects of pit. (b) The defects of patch. (c) The defects of scratch. (d) The defects of crack. (e) The defects of concave.

5. Conclusions
In order to achieve efficient and accurate detection of contour defects for industrial
parts, we propose a novel YOLOv3 defect-detection algorithm, YOLOv3-ALL, and tested it on a test dataset. Experimental results show that this algorithm has good real-time detection efficiency and high detection accuracy, and provides theoretical and technical support for on-line defect detection of industrial structural parts. Due to the limitations of the experimental and industrial environments, the types and numbers of stamping-contour defects collected in this study are limited, so the algorithm in this paper still has room for improvement in detection accuracy. Our experiments were all carried
out on a PC, and the trained models took up a lot of disk space. In the future, knowledge
distillation or model pruning can be considered to compress the models so that they can be
applied to embedded devices.

Author Contributions: Conceptualization, N.L., Y.Q. and J.X.; methodology, N.L., J.X. and Y.Q.;
software, Y.Q. and J.X.; formal analysis, N.L., J.X. and Y.Q.; investigation, N.L., J.X. and Y.Q.; resources,
N.L. and Y.Q.; writing—original draft preparation, J.X. and Y.Q.; writing—review and editing, N.L.,
J.X. and Y.Q.; project administration, N.L. and Y.Q.; funding acquisition, N.L. and Y.Q. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Yangzhou City, 2021. Leading talents of “Green Yang
Jinfeng project” (Innovation in Colleges and Universities) (Grant no. YZLYJFJH2021CX044) and the
Science and Technology Talent support project of Jiangsu Province, China (Grant no. FZ20211137).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available from the authors upon request.
Conflicts of Interest: The authors declare no conflict of interest.

