
TS-YOLO: An Efficient YOLO Network for Multi-scale Object Detection

Wang Yang1, Ding Bo1, Li Su Tong1

1. College of Information Engineering, Yangzhou University, Yangzhou, China
2716973680@qq.com, dingbo@yzu.edu.cn, tongflower@126.com
Corresponding Author: Ding Bo, Email: dingbo@yzu.edu.cn

Abstract—To solve the problem that You Only Look Once (YOLO) v4 still misses detections in multi-scale object detection, we propose a novel deep convolutional network structure, TS-YOLO, with three spatial pyramid pooling (SPP) modules in YOLOv4. SPP plays an important role in multi-scale object detection, as it can extract more semantic information in complex scenes. In this paper, we add two more SPP modules and redesign the pooling core sizes in SPP on the basis of YOLOv4. We trained on the PASCAL VOC data set, and the experimental results show that TS-YOLO not only detects more objects but also achieves 2.21% higher accuracy than the original YOLOv4, demonstrating the excellent performance of our model in multi-scale object detection.

Keywords—deep learning; missing detection; multi-scale object detection; TS-YOLO; SPP; YOLOv4

I. INTRODUCTION

With the rapid development of deep learning in recent years, object detection has become an indispensable task in many popular fields such as automatic driving, medical diagnosis, and robotics. In many application scenarios, object detection algorithms based on convolutional neural networks (CNN) [1, 2] have great advantages in detection accuracy and speed that traditional object detection cannot match. Currently, CNN-based object detection algorithms mainly fall into two-stage algorithms and single-stage algorithms. The two-stage algorithms are mainly represented by R-CNN [3], Fast R-CNN [4], and Faster R-CNN [5]. As the name implies, they work in two steps: first, a Region Proposal Network (RPN) extracts object proposals; second, the detection layers predict the locations and categories of the objects. Compared with the two-stage algorithms, single-stage algorithms such as SSD [6], YOLO [7], YOLOv2 [8], YOLOv3 [9] and YOLOv4 [10] obtain predictions directly without an RPN, so their detection speed is better than that of the two-stage algorithms.

As a newer algorithm in the YOLO series, YOLOv4 [10] has achieved the best balance between detection speed and accuracy. In 2016, Joseph Redmon et al. proposed YOLO [7], whose core idea is to transform target detection into a regression problem that predicts the location and category of each object directly. However, its detection results are poor when multiple objects of the same class are close together. Joseph Redmon proposed YOLOv2 [8] in 2017 on the basis of YOLOv1; it uses Darknet19 as the feature extraction network, computes anchor sizes with the K-means algorithm to improve the recall rate, and connects shallow and deep features to improve accuracy. In 2018, Joseph Redmon et al. proposed YOLOv3 [9], which replaces Darknet19 with Darknet53 as the feature extraction network and uses a feature pyramid network to achieve multi-scale detection, maintaining real-time performance while improving detection accuracy. The newer YOLOv4 [10] network was proposed by Bochkovskiy et al. in 2020; it evaluates a range of training tricks and adopts the best combination. As a result, YOLOv4 greatly improves detection speed and accuracy. However, in scenes with multi-scale objects, YOLOv4 is not accurate for some objects; for example, missing detections occur when there are many objects in the input image.

In order to obtain an efficient network for multi-scale object detection, TS-YOLO, derived from YOLOv4 [10], is proposed in this paper. Motivated by the fact that SPP can extract more semantic information in multi-scale object detection, we make two improvements on YOLOv4. Firstly, we insert two new SPP [11] modules between PANet and the YOLO Heads. Secondly, we redesign the pooling core sizes to achieve better detection results. The remainder of this paper is organized as follows: Section I presents the differences between two-stage and single-stage algorithms, the history of the YOLO series, and the reason for choosing YOLOv4 as the baseline. Section II introduces the overview of YOLOv4, CSPDarknet53, and the Feature Pyramid Network. Section III presents the design of TS-YOLO and the improved SPP modules. Section IV covers the comparative experiments and results. Finally, Section V draws the conclusion and discusses future work.

II. RELATED WORK

A. Overview of YOLOv4

YOLOv4 is the newer model in the YOLO series. It first extracts features from the input image through the backbone network CSPDarknet53 and outputs feature maps at three scales (19×19, 38×38, and 76×76), which are sent to the detection network. The detection network regresses these three feature maps and applies the non-maximum suppression (NMS) algorithm, deleting prediction boxes with lower confidence and keeping the boxes with higher confidence as the target detection boxes, thereby obtaining the category and location of each target as the final detection result. On the basis of YOLOv3, YOLOv4 makes improvements to data processing, the backbone network, network training, the activation function and so on, so that it achieves the best balance between detection speed and accuracy.
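The NMS filtering step described above can be sketched as follows (a minimal NumPy sketch, not the authors' implementation; boxes are assumed to be given as (x1, y1, x2, y2) arrays with one confidence score per box):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes that overlap it beyond iou_threshold, and repeat."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]   # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]                 # current highest-confidence box
        keep.append(i)
        # Intersection of box i with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes whose overlap with box i exceeds the threshold.
        order = order[1:][iou <= iou_threshold]
    return keep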
B. The backbone network of YOLOv4

CSPDarknet53, the backbone network of YOLOv4, is an improved version of Darknet53 with five additional CSP modules. The CSP [12] module divides the feature mapping of the base layer into two parts and then merges them through a cross-stage hierarchy, which reduces computation while maintaining accuracy. YOLOv4 uses the Mish activation function only in the backbone network; compared with the traditional ReLU function, Mish gives the network better generalization and accuracy. The remaining parts use the Leaky ReLU activation function, and the DropBlock regularization method randomly drops neurons. This alleviates the vanishing-gradient problem and enhances the learning ability of the model, while the number of network parameters and the cost of model training are reduced.
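For reference, Mish has the closed form Mish(x) = x·tanh(softplus(x)); the following one-liner is an illustrative PyTorch sketch, not the authors' code:

import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic,
    # unlike ReLU, which zeroes out all negative inputs.
    return x * torch.tanh(F.softplus(x))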
C. Feature Pyramid Network

Early target detection algorithms, such as SSD [6], YOLO [7], and YOLOv2 [8], predict detection boxes on a single feature layer. Although this is simple, it easily loses the information extracted by other layers, and the scale of a single feature layer can hardly cover all objects. Lin et al. [13] proposed the Feature Pyramid Network (FPN), which uses the multi-scale feature layers in a deep convolutional network to build an architecture that transfers high-level semantic information down to the low-level layers. It integrates high-level semantic information into feature layers of all scales, which improves the expressive ability of the network. The FPN structure was later applied to a variety of general feature extraction networks and achieved significant improvements in recognition accuracy. Fig.1 shows the structure of FPN.

Fig.1. The structure of FPN

YOLOv4 uses the FPN [13] module and the PAN [14] module at the same time. Unlike the single FPN module of YOLOv3, YOLOv4 adds a bottom-up feature pyramid after the FPN module, containing two PAN structures. While the FPN layer conveys strong semantic features from top to bottom, the added bottom-up pyramid conveys strong positioning features from bottom to top. The improved structure can thus aggregate parameters across different levels of detectors, further improving the feature extraction capability of the detector and enhancing its ability to detect objects.
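To make the top-down pathway concrete, here is a minimal PyTorch sketch of FPN-style fusion following Lin et al. [13]; the channel widths and the factor-of-two upsampling are illustrative assumptions, not values taken from YOLOv4:

import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Top-down pathway with lateral connections: each backbone level is
    projected to a common width, then enriched with upsampled semantics
    from the level above (each level assumed 2x the resolution below)."""
    def __init__(self, in_channels=(256, 512, 1024), width=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, width, kernel_size=1) for c in in_channels])

    def forward(self, c3, c4, c5):
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # High-level semantics flow down into the high-resolution maps.
        return p3, p4, p5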
III. METHOD

A. Design of TS-YOLO

The network structure of YOLOv4 has three main parts: backbone network, neck, and head. On the basis of the original YOLOv4 network, this paper adds two new SPP modules after PANet and redesigns the pooling core sizes, which improves the feature extraction ability on the input image and yields better results for multi-scale object detection. Fig.2 shows the structure of TS-YOLO.

Fig.2. The structure of TS-YOLO (architecture diagram omitted: the 416×416×3 input passes through CSPDarknet53, one SPP module sits after the backbone as in YOLOv4, and the two new modules SPP1 and SPP2 are inserted between PANet and the YOLO Heads)

B. Improved SPP modules

Early CNN models, such as LeNet [15] and AlexNet [16], require fixed-size image inputs. Training or test images of other sizes generally need to be cropped or warped before they can be fed to the CNN model. However, the processed image may suffer from two problems: first, the cropped area may not cover the complete target; second, warping distorts the object and loses position information. Both problems can reduce the recognition accuracy of the network and the robustness of the model.

He et al. [11] proposed the SPP structure to remove the dependence of CNN models on fixed-size inputs. The SPP structure has three main features, as shown below.

1) The SPP structure can generate an output of a specified size for an input of any size;
2) The SPP structure can pool the input image at multiple scales and is robust to the distortion of objects;
3) The SPP structure can pool feature maps extracted at any scale.

The SPP structure is widely used in the field of object detection. Taking the classic two-stage object detection algorithm R-CNN as an example: the network generates a large number of candidate zones during the region proposal phase, and each candidate zone is cropped or warped into a fixed-size input image. The network then uses the CNN model to process this input into a fixed-size feature vector, which enters a fully-connected layer for classification and regression. Every candidate zone must be sent through the CNN model to produce its feature vector, and the large number of candidate zones inevitably makes the algorithm inefficient. After introducing the SPP structure to optimize the R-CNN algorithm, the network sends the input image through the CNN model once to obtain an overall feature map, and then establishes the mapping relationship between each candidate zone and the input features, from which the feature vectors of all candidate zones can be obtained. Finally, it performs the same classification and regression on these feature vectors. With the SPP structure, it is only necessary to establish a mapping between the candidate zone and the input features, without repeating the forward computation of the CNN model; the detection speed increases by 24 to 120 times compared with R-CNN, and higher detection accuracy is achieved.

On the basis of the existing SPP module in YOLOv4, two new SPP modules are assembled between PANet and the YOLO Heads in this paper. In addition, we redesign the pooling core sizes to achieve better detection results. While the pooling layers perform multi-scale pooling and information fusion on the input feature layer, they also greatly enlarge the receptive field of the network, so that the low-level detail information and high-level semantic information of the input image are extracted more thoroughly. The three SPP modules used in this paper, denoted SPP, SPP1, and SPP2, are shown in Fig.3 in order.

Fig.3. The structures of the three SPP modules (diagram omitted: each module applies a convolution followed by three parallel max-pooling branches whose outputs are concatenated; the pooling core sizes are 5×5, 9×9, 13×13 for SPP, 3×3, 7×7, 11×11 for SPP1, and 7×7, 11×11, 15×15 for SPP2)
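Using the pooling core sizes listed in Fig.3, the three modules can be sketched in PyTorch as follows. Stride-1 max pooling with "same" padding keeps the spatial size; concatenating the identity branch alongside the pooled branches follows the usual YOLOv4 SPP convention and is an assumption here:

import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Parallel max pooling at several core sizes, concatenated with the
    input, enlarging the receptive field without changing resolution.
    (The 1x1 convolution that precedes each module in Fig.3 is omitted.)"""
    def __init__(self, pool_sizes):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
             for k in pool_sizes])

    def forward(self, x):
        # Identity branch plus one pooled branch per core size.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

spp  = SPPBlock((5, 9, 13))    # original YOLOv4 SPP module
spp1 = SPPBlock((3, 7, 11))    # new module between PANet and a YOLO Head
spp2 = SPPBlock((7, 11, 15))   # new module between PANet and a YOLO Head

Each module concatenates four branches, so the channel count quadruples; in a typical implementation it is reduced again by the convolutions that follow.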

IV. EXPERIMENT

A. Experiment Preparation

The platform used in this experiment is Windows 10, with an Intel(R) Core(TM) i7-10750 CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 2060 GPU with 8 GB of video memory. The experiment is carried out under the PyTorch framework, and the program is implemented in Python. The data set used in the experiment is the PASCAL VOC data set: we trained on PASCAL VOC 2007 and tested on PASCAL VOC 2012. We ran comparative experiments between YOLOv4 without any training tricks and TS-YOLO.

B. Evaluation index

In this paper, mAP [17] and the log-average miss rate (Lamr) are used to evaluate and compare the target detection networks. Average Precision (AP) represents the average precision of detection for each class, and mAP is the mean of the per-class AP values, which evaluates the overall precision of the model. For the missed detections of the model in the test, we use Lamr: the more objects a model misses, the higher its Lamr.
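As a sketch of how these two indices are computed (illustrative code with placeholder values, not the paper's results; the Lamr definition below follows the common nine-point log-spaced FPPI sampling, which is an assumption here):

import numpy as np

# Placeholder per-class AP values; mAP is simply their mean.
ap_per_class = {"person": 0.78, "car": 0.81, "dog": 0.74}
mAP = np.mean(list(ap_per_class.values()))

def log_average_miss_rate(miss_rate, fppi):
    """Lamr as commonly defined: average the log miss rate sampled at
    nine FPPI points evenly spaced in log-space over [1e-2, 1e0]."""
    refs = np.logspace(-2.0, 0.0, num=9)
    # Simplified nearest-point lookup of the operating curve.
    sampled = [miss_rate[np.argmin(np.abs(fppi - r))] for r in refs]
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))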

C. Experiment results

Through experiments on the PASCAL VOC data set, we can compare the performance of YOLOv4 and TS-YOLO. The comparison results between the two networks are shown in Fig.4.

Fig.4. The Lamr and mAP of the two models. The two pictures in the first row are the results of YOLOv4 and the two pictures in the second row are the results of TS-YOLO.

Fig.5. Visual results of the two models. The top three pictures are the results of YOLOv4, and the bottom three pictures are the results of TS-YOLO.

As shown in Fig.4, the mAP of TS-YOLO has increased by 2.21% and the AP of many categories has also improved. In addition, the comparison of the missed detection rates between the two models clearly shows that TS-YOLO has reduced the Lamr for most categories. To show the difference between the two models more vividly, we selected several pictures of multi-object scenes from the validation set for testing; the results are shown in Fig.5. The comparison of the three sets of pictures clearly shows that, after training and testing on the PASCAL VOC data set, the TS-YOLO model improves precision and detects more objects in multi-scale scenes.

V. CONCLUSION

Based on YOLOv4, TS-YOLO is proposed for multi-scale object detection. We add two new SPP modules between PANet and the YOLO Heads in YOLOv4 and redesign the pooling core sizes in SPP. We have carried out comparative experiments between YOLOv4 and TS-YOLO on the PASCAL VOC data set. The experimental results show that TS-YOLO has 2.21% higher detection accuracy than YOLOv4; in addition, TS-YOLO can detect more objects in multi-scale scenes. In future work, we expect to apply TS-YOLO to practical multi-scale object detection tasks such as pedestrian detection and vehicle flow detection.

REFERENCES

[1] Goodfellow I, Bengio Y, Courville A. Deep Learning[M]. Cambridge: MIT Press, 2016: 326-366.
[2] Lee M, Lee J, Kim J, et al. The sparsity and activation analysis of compressed CNN networks in a HW CNN accelerator model[C]//2019 International SoC Design Conference (ISOCC). 2019: 255-256.
[3] Girshick R, Donahue J, Darrell T, et al. Region-based convolutional networks for accurate object detection and segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(1): 142-158.
[4] Girshick R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440-1448.
[5] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. arXiv preprint arXiv:1506.01497, 2015.
[6] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector[C]//European Conference on Computer Vision. Springer, Cham, 2016: 21-37.
[7] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.
[8] Redmon J, Farhadi A. YOLO9000: Better, faster, stronger[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7263-7271.
[9] Redmon J, Farhadi A. YOLOv3: An incremental improvement[J]. arXiv preprint arXiv:1804.02767, 2018.
[10] Bochkovskiy A, Wang C Y, Liao H Y M. YOLOv4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
[11] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916.
[12] Wang C Y, Liao H Y M, Wu Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 390-391.
[13] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2117-2125.
[14] Liu S, Qi L, Qin H, et al. Path aggregation network for instance segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8759-8768.
[15] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[16] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
[17] Li K, Huang Z, Cheng Y, et al. A maximal figure-of-merit learning approach to maximizing mean average precision with deep neural network based classifiers[C]//2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014: 4503-4507, doi: 10.1109/ICASSP.2014.6854454.
