You are on page 1of 5

An improved helmet detection method for YOLOv3

on an unbalanced dataset
Rui Geng Yixuan Ma Wanhong Huang
School of Software Technology School of Economics School of Software Technology
Dalian University of Technology Beijing International Studies University Dalian University of Technology
Dalian, China Beijing, China Dalian, China
gengruigr@qq.com yixuanma9@outlook.com mannho dut@yahoo.co.jp
arXiv:2011.04214v2 [cs.CV] 1 Dec 2020

Abstract—The YOLOv3 target detection algorithm is widely data sets. The full source code of the project is available at
used in industry due to its high speed and high accuracy, but https://github.com/ridin/YOLOv3-on-an-unbalanced-dataset.
it has some limitations, such as the accuracy degradation of
unbalanced datasets. The YOLOv3 target detection algorithm is II. R ELEVANT RESEARCH
based on a Gaussian fuzzy data augmentation approach to pre-
process the data set and improve the YOLOv3 target detection A. Region CNN and YOLO
algorithm. Through the efficient pre-processing, the confidence Silva et al. [11] used the Circle Hough Transform (CHT)
level of YOLOv3 is generally improved by 0.01-0.02 without and Histogram of Oriented Gradient (HOG) descriptors to
changing the recognition speed of YOLOv3, and the processed
images also perform better in image localization due to effective extract image features, and a multilayer perceptron machine
feature fusion, which is more in line with the requirement of (Multi-layer Perceptron, MLP) to classify the target. This
recognition speed and accuracy in production. method works well for single-worn detection, but it is poorly
Index Terms—YOLOv3, Unbalanced dataset, MXNet, Gaus- effective for multi-worn detection and cannot be applied to
sian Blur multi-person images. In recent years, deep learning-based
target detection techniques have developed rapidly, which are
I. I NTRODUCTION
mainly divided into two types, namely the two-stage method
As an effective protective tool, the safety helmet has been based on region suggestion and the one-stage detection method
applied to all kinds of production sites, but due to the reason without region suggestion.
that the supervision and audit are not in place, accidents caused The two-stage target detection algorithm is mainly RCNN
by not wearing the safety helmet occur from time to time. (Regions with Convolutional Neural Network) columns, such
Therefore, the realization of helmet wearing detection is of as RCNN, Fast RCNN (Fast Regions with Convolutional Neu-
great significance for the protection of the life safety of the ral Network feature), and Faster-RCNN, with Faster-RCNN
staff at the production site. having the best detection performance.
In helmet detection, the main use is the target detection One-stage target detection algorithm Among them, the SSD
algorithm [1], [2]. In recent years, the target detection algo- (Single Shot MultiBox Detector) and YOLO algorithms are
rithm has become an important research hotspot in the field representative, such as YOLO, YOLOv2, and YOLOv3, with
of computer vision research. There are two types of target YOLOv3 being the most effective. Based on YOLOv3, an
detection methods based on deep learning: one is R-CNN image pyramid pattern is used to acquire feature maps at
target detection algorithm based on candidate regions [3]– different scales, and multi-scale training is used to increase
[6], which needs to generate candidate regions first, and then the adaptability of the model and improve the efficiency and
do classification and regression operations on the candidate accuracy of helmet wearing detection. Both traditional and
regions; the other is YOLO [13]–[16], [19] (You Only Look deep learning-based methods have achieved relatively good
Once) and SSD algorithm [2], which uses only one CNN results in helmet wearing detection, but there are problems of
network to directly predict the category and location of differ- reduced accuracy and difficulty in improving the upper limit
ent targets. Compared with the R-CNN algorithm, YOLO can of accuracy for unbalanced datasets [8], [12], [18].
achieve real-time detection speed, but the accuracy is lower.
In order to improve the accuracy of YOLO, Redmon et al. B. YOLOv3
[21] proposed YOLOv2 [20] and YOLOv3 [14], [21], which The YOLOv3 network uses Darknet53 as the feature ex-
improve the prediction accuracy while maintaining the speed traction network, in YOLOv3, there is only a convolutional
advantage, especially for the identification of small objects. layer, most of them are 3*3 convolution, and compared to
However, it is found that the accuracy of YOLOv3 decreases YOLOv2, the pooling layer is canceled, and the size of the
when the data set is unbalanced and irregular, and the work in output feature map can be controlled by adjusting the step
this paper is to make improvements on the basis of YOLOv3 size of the convolutional layer. YOLOv3 borrows the idea of
so that it can perform better on irregular and unbalanced a ”pyramid feature map” (FPN) and uses small feature maps
to detect large objects, while large feature maps are used
to detect large objects. The reason for this approach is that
the lower-level network features have more precise location
characteristics, while the higher-level network features have
richer semantic information.
The feature map used for prediction by YOLOv3 has both
accurate location information of low-level network feature map
and rich semantic information of high-level network feature
map, which enables the prediction layer of YOLOv3 to both
locate the object accurately and identify the category of the
object correctly. The output dimension of the feature map is
N ∗ N ∗ [3 ∗ (4 + 1 + M )], where N ∗ N is the number
Fig. 1. Image in the dataset
of output grid, and each grid has 3 Anchor boxes, each box
has a 4-dimensional prediction value box tx , ty , tw , th , and
1-dimensional object prediction box confidence level and M- image recognition, we use image enhancement processing
dimensional object category number. In category prediction, methods to process the relevant images.
YOLOv3 uses multiple logical classifiers instead of softmax to
classify each box and uses a binary cross-entropy loss function
to predict the categories during training.
III. O UR APPROACH
We first perform feature extraction to determine the distri-
bution and mathematical characteristics of the dataset; then we
build YOLOv3 on Mxnet for training on PC; then we use a
test set of 689 images with a size of 95.4 MB provided by
the competition, and finally, we export the test results to a txt
file.
A. Feature extraction statistics (unbalanced dataset)
Based on the features of the dataset, we can obtain relevant Fig. 2. Ratio of hat to person
information that will provide better support in building neural
network training. For feature extraction, the steps are as B. Pre-processing and training
follows.
In the training, we processed the process as shown in the
• Calculate the proportion of each target in the original
figure below, and now we will explain our processing in steps.
image.
• MXNet gets the data and reads it into the VOC dataset.
• Calculate the average length of the target.
• Test on validation dataset, set nms threshold and topk
• Calculate the average width.
constraint.
• Calculate the average area.
• Training pipeline: Download the data, decompress the
• Calculate the average percentage of the target.
data, and divide the data into different directories for
After the above process, we get the following description: training according to tags.
the dataset consists mainly of people wearing helmets and a • Cross entropy as a loss function.
wide variety of backgrounds. The size of the dataset is 1.1 GB • Set target matrix and center matrix.
and there are 7581 images, examples of which are as follows. • Circuit training.
After processing, we found that there is an imbalance in
the dataset. An unbalanced dataset refers to a dataset with C. Read the data set using the method provided by MXNet for
extremely unbalanced sample sizes for each category. Take the reading VOC datasets
binary classification problem as an example, e.g., the number Many deep learning platforms have computer vision APIs
of samples in positive categories is much larger than the that provide common vision models and their pre-training
number of samples in negative categories. The treatment of weights, which greatly facilitate our learning and experimen-
unbalanced data sets is divided into two aspects: tation. . We used the GluonCV framework. For GluonCV, we
• From the perspective of data, the main method is sam- just need to prepare the general format (.rec) dataset and feel
pling, which is divided into under-sampling and oversam- free to experiment with the visual model in model zoo. So a
pling and some corresponding improvement methods. first task is to convert our own dataset to the MXNet preferred
• From the perspective of algorithms, the algorithms are .rec binary.
optimized by considering the differences in the cost The result after reading the common image is to form
of different misclassification situations. In the field of a directory structure is to separate the image and txt form
of the label file into two folders, named images, and labels TABLE II
respectively. in the root directory to generate a RecDataSet G AUSSIAN VALUE
folder, which stores yolo.lst, yolo.idx, yolo.rec three files; then a1 *r1 a2 *r2 a3 *r3
converted to VOC Our directory structure is to put all the a4 *r4 a5 *r5 a6 *r6
images and labeled xml files with the same name in the same a7 *r7 a8 *r8 a9 *r9
folder, then run the data read-in code of the project; finally,
generate a folder of RecDataSet in the root directory, where
the voc.lst, voc.idx, and voc.rec files are stored. weight value to rn (n=1,2,...9), then the actual processed effect
is as follows:
D. Image processing Then the Gaussian value of the center point is 9 values
For the unbalanced dataset, we use Gaussian blurring means summed, which in turn fills the image with new grayscale
for image enhancement. The process is rough as follows: build values to get the blurred image. Using the blurred image,
an image loader and apply image transformers to the loader we train to better circumvent the effects of imbalance in the
for image enhancement. This is the way to deal with the data dataset.
imbalance problem. Gaussian blur is a processing technique E. Testing on the validation dataset, setting nms threshold and
used to eliminate noise, reduce the level of detail, and blur topk constraint
the image. Gaussian blur is also a template for the weighted This step is to ensure the consistency and accuracy of
average method and uses a positive distribution as a template the subsequent training images. Set nms thresh=0.45, nms -
for pixel mapping work. The formula is as follows: topk=400 to validate; after validation, crop the image size,
N dimensional space: then complete the operation of split ground truths, and finally
1 2
use eval metric.update() to evaluate the cut image and update
/(2σ 2 )
G(r) = √ N
e−r (1) parameters.
2πσ 2
F. Training pipeline: the data is divided into different direc-
Two-dimensional space: tories according to labels for training.
1 −(u2 +v2 )/(2σ2 ) Loading MXNet’s Yolo model for training, with lr set to
G(u, v) = e (2)
2πσ 2 0.001, epoch set to 10, the optimization function selected
Since the closer the image is to the object being processed, SGD, and the activation function selected sigmoid. Loading of
the greater its effect on that pixel, we use a weighted average the MXNet model for predictions, with the predictions stored
when Gaussian blurring. Blurring can be understood as each in txt (format: ”Label Confidence Left Top Right Button”).
pixel taking the average value of the surrounding pixels. The Use arg parse to provide command-line parameter settings, as
middle point takes the average of the surrounding points, and follows.
it becomes the average. Numerically, this is a smoothing. On • Create the ArgumentParser() object.

a graphic, it is equivalent to creating a blurring effect, where • Call the add argument() method to add an argument.

the midpoints lose detail. In actual image processing, a nine- • Use parse args() to parse the added parameters.

pixel grid is formed around each center point. And the center The constitutive neural network is shown
position is taken as (0,0), then the other positions are shown
as follows:

TABLE I
P OSITION

(-1,1) (0,1) (1,1)


(-1,0) (0,0) (1,0)
(-1,-1) (0,-1) (1,-1)

Based on this position information, calculate the weight


matrix needs to set the weight value α by itself, get the weight
matrix, calculate the weighted average of these 9 points, let
their weight sum is equal to 1. For the actual image of the
grayscale value, the image as each point multiplied by its
weight value, add these 9 values, is the value of the Gaussian Fig. 3. Neural networks
blur of the central point. Repeat this process for all points to
get the Gaussian blurred image. If the original image is a color Using cross-entropy as a category classification, define the
image, you can do a Gaussian blur for each of the three RGB SigmoidBinaryCrossEntropyLoss() method, using the follow-
channels. Set the grayscale value of each grid to an and the ing equation
max(x, 0) - x * z + log(1 + exp(-abs(x)))) TABLE III
YOLOV 3
G. Loss function definition
YOLOv3
In YOLOv3, Loss is divided into three sections. category Label Confidence Left Top Right Bottom
hat 0.981 598.40 5.74 718.82 143.52
• The error introduced by the xywh part, i.e., the loss
hat 0.965 106.98 74.34 188.41 177.79
introduced by the bb ox, is calculated using the following hat 0.968 314.11 60.32 395.96 158.78
mathematical method.
|A ∩ B | TABLE IV
IoU = (3) I MPROVED YOLOV 3
|A ∪ B |
Using Glou to calculate lb ox: Improved YOLOv3
category Label Confidence Left Top Right Bottom
hat 0.999 598.08 5.79 718.31 143.43
|Ac − U | hat 0.995 106.91 74.47 188.49 177.79
GIoU = IoU − (4)
|Ac | hat 0.977 314.10 60.35 395.93 158.73

where Ac represents the area of the smallest closed area


of the two boxes, that is, the area of the smallest box that
contains both the predicted and real boxes. To validate the effectiveness of the improved method pro-
• The error brought by the confidence level, i.e., the loss posed in this paper, it is further validated on 689 test data.
brought by obj is divided into Center and Scale. lo bj This section realizes the whole test results on YOLOv3 and
represents the probability of whether the bounding box improved YOLOv3 at the same time. The detection results of
contains objects or not. In the implementation, obj loss different improved models on the experimental data are shown
can be specified by arc, and we use default mode: using in the table.
BCEWithLogitsLoss, obj loss is calculated separately For the 689 test set data, we used YOLOv3 and the
from cls loss. improved YOLOv3 for the helmet wearing detection. It can
• The error caused by the class, that is, the loss caused be observed that the improved YOLOv3 can detect helmet
by the class. here called lc ls, using default mode, using wearers well compared to YOLOv3, and can detect more
BCEWithLogitsLoss to calculate the class loss. targets with accurate locations and complete bounding boxes.
The final loop training gets the model.
TABLE V
T EST RESULTS OVERALL TEST RESULTS FOR BOTH MODELS
To detect the improvement of the Gaussian fuzzy optimiza- Model Label Confidence
tion method on the recognition accuracy of the unbalanced YOLOv3 0.966
dataset, we introduce the comparison of the recognition accu- Improved YOLOv3 0.982
racy of the same image before and after the Gaussian fuzzy
optimization. The results are presented as a text document for It can be seen that the confidence level, which is used as
each image, with one detection per column in the document, a metric to evaluate the accuracy, has a general improvement
e.g., ”Label Confidence, Left, Top, Right, Bottom”. If a picture of 0.01-0.02 for helmet identification after the Gaussian fuzzy
has more than one detection, the use of more than one line optimization mentioned in this paper, which is important to
out. Here is a picture as an example, in the local PC test, to improve the ultimate accuracy of YOLOv3 in this state; at the
show the test results, use the following picture. same time, after the Gaussian fuzzy processing, due to the
effective feature fusion, the low-level feature map contains
semantic information and the high-level feature map contains
more detailed information, which improves the recall rate of
small-sized helmets and the accuracy of large-sized helmet
localization.

R ELATED WORK
For Gaussian blur, Jan Flusser et al. [9] introduced the
notion of a native image as the normative form of all Gaussian
blur-equivalent images. The native image is defined in the
spectral domain by projection operators, and they proved that
the moments of the native image are the normative form of all
Fig. 4. Image in the test set Gaussian blur-equivalent images. They prove that the moments
of the image is invariant to Gaussian blur and derive recursive
The following results were obtained. formulas for the direct computation of the image, without
actually constructing the image itself [9]. In this paper, we R EFERENCES
draw on the idea of image and use recursion in Gaussian [1] Girshick Ross,Donahue Jeff,Darrell Trevor,Malik Jitendra. Region-
blurring to ensure that the pixels of the original image are Based Convolutional Networks for Accurate Object Detection and
invariant to the information after the operation of the Gaussian Segmentation.[J]. IEEE transactions on pattern analysis and machine
intelligence,2016,38(1).
matrix. [2] Haodong Pan,Jue Jiang,Guangfeng Chen. TDFSSD: Top-Down Feature
Meanwhile, Niranjan D. Narvekar et al. [10] propose a Fusion Single Shot MultiBox Detector[J]. Signal Processing: Image
reference-free image blurring metric based on human studies Communication,2020,89.
[3] Hao Ye,Geoffrey Ye Li,Biing-Hwang Juang. Power of Deep Learning for
of the fuzzy perception of different contrast values. The Channel Estimation and Signal Detection in OFDM Systems.[J]. IEEE
metric uses a probabilistic model to estimate the probability Wireless Commun. Letters,2018,7(1).
of detecting blur for each edge in an image and then pools the [4] Ionut C. Duta,Bogdan Ionescu,Kiyoharu Aizawa,Nicu Sebe. Simple, Ef-
ficient and Effective Encodings of Local Deep Features for Video Action
information by calculating the cumulative probability of blur Recognition[P]. International Conference on Multimedia Retrieval,2017.
detection (CPBD) [10]. This approach inspired us to adopt an [5] Hao Zhai. Research on Image Recognition Based on Deep Learning
image fuzziness metric to deal with imbalances in the dataset. Technology[A]. Proceedings of 2016 4th International Conference on
Advanced Materials and Information Technology Processing(AMITP
We introduce a fuzzy metric for processing raw images, an 2016)[C]. Computer Science and Electronic Technology International
approach that ensures that the neural network’s perception Society ,2016:5.
of the image is not lost when balancing the classification [6] Wangpeng He,Zhe Huang,Zhifei Wei,Cheng Li,Baolong Guo. TF-
YOLO: An Improved Incremental Network for Real-Time Object De-
problem and still retains enough information that can be used tection[J]. Applied Sciences,2019,9(16).
for classification. [7] Yang Yang,Hongmin Deng. GC-YOLOv3: You Only Look Once with
For YOLOv3, Fan Wu et al. [15], [22] propose a YOLOv3- Global Context Block[J]. Electronics,2020,9(8).
[8] Chang-Yu Cao,Jia-Chun Zheng,Yi-Qi Huang,Jing Liu,Cheng-Fu Yang.
based full regression deep neural network architecture that Investigation of a Promoted You Only Look Once Algorithm and Its
takes advantage of Densenet’s advantages in model parameters Application in Traffic Flow Monitoring[J]. Applied Sciences,2019,9(17).
and technical cost to replace the backbone of the YOLOv3 net- [9] Flusser J, Farokhi S, Höschl C, et al. Recognition of images degraded by
Gaussian blur[J]. IEEE transactions on Image Processing, 2015, 25(2):
work for feature extraction [7], [14], [17], [21], [22], resulting 790-806.
in the so-called YOLO-Densebackbone convolutional neural [10] Narvekar N D, Karam L J. A no-reference image blur metric based on
network. In this paper, we draw on the ideas of Fan Wu et the cumulative probability of blur detection (CPBD)[J]. IEEE Transac-
tions on Image Processing, 2011, 20(9): 2678-2683.
al. [15], [22] to optimize the construction of the convolutional [11] Chen F, Ma J. An empirical identification method of Gaussian blur pa-
layer, the cross-entropy, and the selection of the loss function. rameter for image deblurring[J]. IEEE Transactions on signal processing,
2009, 57(7): 2467-2478.
I MPROVEMENT M ETHODS [12] Gedraite E S, Hadad M. Investigation on the effect of a Gaussian Blur in
image filtering and segmentation[C]//Proceedings ELMAR-2011. IEEE,
For the helmet detection problem, there are several tech- 2011: 393-396.
niques we can use to continue to improve accuracy. For [13] Peng Q, Luo W, Hong G, et al. Pedestrian detection for transformer
substation based on gaussian mixture model and YOLO[C]//2016 8th
example, for problems where there is not enough light in the international conference on intelligent human-machine systems and
production situation and the target occupies a small frame, we cybernetics (IHMSC). IEEE, 2016, 2: 562-565.
can improve algorithms such as YOLOv3 or Faster RCNN [14] Choi J, Chun D, Kim H, et al. Gaussian yolov3: An accurate and
fast object detector using localization uncertainty for autonomous driv-
for processing such as feature fusion, multiscale detection, ing[C]//Proceedings of the IEEE International Conference on Computer
and increasing the number of anchor points. At the same Vision. 2019: 502-511.
time, we can also introduce more deep learning techniques in [15] Wu Y, Meng Z, Palaiahnakote S, et al. Compressing YOLO Network by
Compressive Sensing[C]//2017 4th IAPR Asian Conference on Pattern
algorithms such as YOLOv3 or Faster RCNN, such as adding Recognition (ACPR). IEEE, 2017: 19-24.
multilayer convolutional feature fusion techniques in building [16] Molchanov V V, Vishnyakov B V, Vizilter Y V, et al. Pedestrian
neural networks, and introducing online difficult sample min- detection in video surveillance using fully convolutional YOLO neu-
ral network[C]//Automated Visual Inspection and Machine Vision II.
ing methods in target detection to enhance the robustness of International Society for Optics and Photonics, 2017, 10334: 103340Q.
small and occluded targets in different environments, which [17] Tian Y, Yang G, Wang Z, et al. Apple detection during different growth
are the directions of our future in-depth research. stages in orchards using the improved YOLO-V3 model[J]. Computers
and electronics in agriculture, 2019, 157: 417-426.
[18] Luo J, Wu X, Luo Y, et al. Real-world image datasets for federated
C ONCLUSION learning[J]. arXiv preprint arXiv:1910.11089, 2019.
The direct use of YOLOv3 can no longer adapt to the [19] Morera Á, Sánchez Á, Moreno A B, et al. SSD vs. YOLO for detection
of outdoor urban advertising panels under multiple variabilities[J].
real production situation for target detection requirements. Sensors, 2020, 20(16): 4587.
For the unbalanced helmet dataset, we use data augmentation [20] Laroca R, Zanlorensi L A, Gonçalves G R, et al. An efficient and layout-
to process the data and then train the model with YOLOv3 independent automatic license plate recognition system based on the
YOLO detector[J]. arXiv preprint arXiv:1909.01754, 2019.
algorithm, without affecting the recognition speed, and finally [21] Redmon J, Farhadi A. Yolov3: An incremental improvement[J]. arXiv
get a 0.01-0.02 increase in the confidence level. The image preprint arXiv:1804.02767, 2018.
detection and classification problems similar to helmet detec- [22] Wu F, Jin G, Gao M, et al. Helmet detection based on improved
yolo v3 deep model[C]//2019 IEEE 16th International Conference on
tion can draw on the results in this thesis to complete the Networking, Sensing and Control (ICNSC). IEEE, 2019: 363-368.
further optimization of the recognition effect, which is of great
significance for the rapid image detection and recognition of
actual production sites.

You might also like