
Real-Time Object Detection in UAV Vision Based on Neural Processing Units

Ming Liu1, Linbo Tang1*, Zongya Li1
1. Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing 100081, China
249676161@qq.com, tanglinbo@bit.edu.cn, 842782996@qq.com
*Corresponding Author: Linbo Tang, Email: tanglinbo@bit.edu.cn

Abstract—With the reduction of Unmanned Aerial Vehicle (UAV) hardware costs and the development of deep learning algorithms, real-time object detection applied in UAV vision offers great advantages in many fields. However, the limited energy budget and computing power of the embedded devices used in drones, together with the variable object scales and complex backgrounds in UAV imagery, restrict the application of object detection on drones. In this paper, we optimize the generation of anchor boxes, introduce a new module that enlarges the receptive field to improve the detection of small targets, and use adaptively spatial feature fusion in the feature pyramid to strengthen the fusion of multi-scale features. Finally, we prune the model to make it lighter and faster, achieving an Average Precision (AP) of 89.7% on UAV car aerial images at 35.7 FPS when running on Neural Processing Units (NPUs), which demonstrates the feasibility of efficient intelligent object detection in hardware-constrained environments.

Keywords—UAV aerial images; car detection; embedded hardware; real-time processing

I. INTRODUCTION

In recent years, UAVs have played an irreplaceable role in many military and civilian fields by virtue of their light weight, maneuverability, flexible movement, and low energy consumption. Real-time object detection is an important task in UAV applications and has been well explored. With the rapid development of computer vision and artificial intelligence technology, among the many object detection algorithms, methods based on deep learning have been widely adopted because they require no feature engineering and achieve higher precision.

Object detection algorithms based on deep learning are mainly divided into two-stage methods and one-stage methods. Typical algorithms of the former are R-CNN and its improved successors Fast R-CNN and Faster R-CNN; the detection speed of these algorithms is too slow to meet real-time requirements. One-stage algorithms do not need to extract candidate regions first, so their speed is greatly improved; typical examples include the YOLO series and SSD.

However, the usual deep-learning-based object detection methods require a large amount of computation and large storage space for intermediate results. UAVs also have a limited load capacity; for example, the maximum payload of the Inspire 2 professional UAV produced by DJI (Dajiang) is 810 g. A UAV platform for car detection in aerial images can therefore only use a battery-powered embedded system with limited computing resources. The limitations of hardware cost and transmission delay remain the primary obstacles for object detection tasks in UAV vision.

To sum up, the difficulties of deploying deep neural network algorithms for effective object detection on UAVs focus on two aspects [6]:

(1) UAV vision has special mission scenarios and object characteristics. Unlike mainstream object detection, where scenes are captured from a conventional horizontal viewpoint and objects are larger with more distinctive features, drone footage is complex, the targets to be inspected are small, and their features are ambiguous. These are not the task types that mainstream network designs target.

(2) The computing resources of UAV platforms are limited, which makes it difficult to meet the needs of real-time detection.

II. PROPOSED OBJECT DETECTION ALGORITHM

A. The framework of the YOLOv3 algorithm

We use YOLOv3 as the reference model because of its high precision and because it is easy to deploy to embedded platforms such as NPUs.

YOLOv3 is an algorithm proposed in recent years [1]. It uses the more powerful feature extraction network Darknet-53. As shown in Figure 1, thanks to the introduction of residual structures, Darknet-53 deepens the network to 53 layers and further improves its feature extraction capability. In addition, YOLOv3 borrows the anchor mechanism from Faster R-CNN and introduces the feature pyramid network (FPN) structure [2], performing detection on three feature scales of 13×13, 26×26, and 52×52, which greatly improves the detection of small targets by the YOLO series.
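To make the structure in Figure 1 concrete, the sketch below shows a Darknet-53-style residual unit in PyTorch. This is a minimal illustration: the class names and channel arithmetic are our assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Basic Conv + BatchNorm + LeakyReLU block (CBL) used throughout Darknet-53."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualUnit(nn.Module):
    """Darknet-53 residual unit: 1x1 bottleneck, 3x3 conv, identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = ConvBNLeaky(channels, channels // 2, 1)
        self.expand = ConvBNLeaky(channels // 2, channels, 3)

    def forward(self, x):
        # The shortcut is what lets the 53-layer network train stably.
        return x + self.expand(self.reduce(x))
```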

Fig. 1. YOLOv3 network structure

However, the network structure of YOLOv3 is still complicated, its detection speed is correspondingly limited, and it cannot run in real time on edge devices with restricted computing power such as drones.

B. Optimize the parameters of the anchor

Figure 2 (left) shows the sample distribution obtained from the sample widths and heights of the general COCO and VOC data sets, while Figure 2 (right) shows the sample distribution of the UAV aerial image data set used in this paper. We can see clearly that the sample sizes of the aerial data set vary greatly and most targets have smaller scales, so it is necessary to re-select appropriate anchor parameters to make better predictions on the aerial data set.

Fig. 2. Bounding box and anchor distribution of the general data set (left) and the UAV aerial image data set (right)

The usual way to re-select appropriate anchor parameters is to cluster the training set with the K-means method, executed as a separate program. In our method, we embed this function into the training process and adaptively calculate the best anchor box values for different training sets during each training run.

In the object detection task, the purpose of clustering is to make the IOU value between the anchor boxes and the ground truth as large as possible, and Avg IOU is calculated by the following formula:

\mathrm{Avg\ IOU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IOU}(box_i, centroid_i) \quad (1)

Among them, box_i is the target box of the i-th sample label, and centroid_i is the cluster center assigned to it.

Table I compares the Avg IoU of the traditional K-means clustering algorithm and the improved method on the data set of this paper.

TABLE I. THE COMPARISON OF DIFFERENT ANCHOR GENERATION METHODS

Method             Clusters    Avg IoU
K-means            9           0.73
Improved method    9           0.75

The test results show that the improved method for optimizing anchor box parameters performs better on the aerial data set used in this paper.
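As a sketch of the clustering step described above, the following Python routine computes anchors and the Avg IoU of Eq. (1). Our assumptions: boxes are given as (width, height) pairs and 1 − IoU serves as the K-means distance, as is standard for YOLO-style anchor selection.

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (w, h) pairs, assuming boxes share a common top-left corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs using 1 - IoU as the distance metric."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = wh_iou(boxes, centroids).argmax(axis=1)   # nearest centroid = highest IoU
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    avg_iou = wh_iou(boxes, centroids).max(axis=1).mean()  # Avg IOU of Eq. (1)
    return centroids[np.argsort(centroids.prod(axis=1))], avg_iou
```

Because this routine is cheap, it can be re-run at the start of each training session, which is how the anchors adapt automatically to whatever training set is supplied.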
C. Context augmentation module

Object detection requires context information, especially for small objects, which are the main components of aerial data sets. Therefore, we propose an improved receptive field module (IRFM), shown in Figure 3, which merges multi-scale dilated convolutions [3] and sets a trainable parameter W to perform weighted fusion with the shortcut layer outputs.

Fig. 3. The structure of IRFM

The IRFM increases the receptive field and improves the detection accuracy of small targets with almost no additional computational cost.
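Since the exact layer configuration of IRFM is only given in Figure 3, the sketch below is our reading of the description: parallel dilated convolutions at several rates, merged and then fused with the shortcut through a trainable weight W. The branch count, dilation rates, and channel sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class IRFM(nn.Module):
    """Improved receptive field module (illustrative sketch).

    Parallel 3x3 convolutions with increasing dilation rates enlarge the
    receptive field; a trainable scalar W balances the merged multi-scale
    branches against the shortcut input.
    """
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        branch_ch = channels // len(rates)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, branch_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.LeakyReLU(0.1),
            ) for r in rates)
        self.merge = nn.Conv2d(branch_ch * len(rates), channels, 1)
        self.w = nn.Parameter(torch.tensor(1.0))  # trainable fusion weight W

    def forward(self, x):
        multi_scale = self.merge(torch.cat([b(x) for b in self.branches], dim=1))
        return x + self.w * multi_scale  # weighted fusion with the shortcut
```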
D. Adaptively Multi-scale Spatial Feature Fusion

YOLOv3 uses the feature pyramid network (FPN) for contextual information fusion, merging adjacent feature maps of different granularity from top to bottom, which greatly improves the multi-scale expression ability of the network. However, the concatenation-based fusion between different feature maps cannot make full use of features at different scales. We therefore adopt the adaptively spatial feature fusion (ASFF) method [4] to enhance the multi-scale expression ability of the network, which is crucial for target detection tasks in UAV vision.

Fig. 4. Overall network structure

As shown in Figure 1 and Figure 4, {C3, C4, C5} denote the feature levels obtained after the input image is down-sampled by {8, 16, 32} times, {L1, L2, L3} denote the feature levels generated by IRFM and FPN, and {P1, P2, P3} are the feature levels generated by the adaptively spatial feature fusion module (ASFFM). X^{(n,m)}_{i,j} is defined as the feature from the nth (n ∈ {1, 2, 3}) level resized to the mth (m ∈ {1, 2, 3}) level at position (i, j), and Y^m_{i,j} is defined as the output of the mth level of the ASFFM at position (i, j). The outputs of the ASFFM are then:

Y_{i,j}^{m} = a_{i,j}^{m} \cdot X_{i,j}^{(1,m)} + b_{i,j}^{m} \cdot X_{i,j}^{(2,m)} + c_{i,j}^{m} \cdot X_{i,j}^{(3,m)} \quad (2)

In the above formula, the adaptive parameters a^m, b^m, and c^m are obtained by 1×1 convolutions on the resized {L1, L2, L3} feature maps. The resulting maps are concatenated and passed through a softmax so that each weight lies in [0, 1] and the three sum to 1:

[a^{m}, b^{m}, c^{m}] = \mathrm{Softmax}(F) \quad (3)

Taking a^m as an example, a^m is calculated by the following formula, where λ^m_a, λ^m_b, λ^m_c are the control maps produced by the 1×1 convolutions; b^m and c^m follow the same principle:

a_{i,j}^{m} = \frac{e^{\lambda_{a,ij}^{m}}}{e^{\lambda_{a,ij}^{m}} + e^{\lambda_{b,ij}^{m}} + e^{\lambda_{c,ij}^{m}}} \quad (4)

In this way, the features of all layers of the FPN are fully fused under the guidance of adaptive weights, and {P1, P2, P3} are used as the final output of the entire network.
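A minimal PyTorch sketch of the ASFFM fusion in Eqs. (2)-(4) follows, under the assumption that the three inputs have already been resized to a common resolution; the class name and the single-channel control maps are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFM(nn.Module):
    """Adaptively spatial feature fusion for one output level m (sketch).

    Each (already resized) input level passes through a 1x1 conv producing a
    one-channel control map; a softmax across the three maps yields per-pixel
    weights a, b, c that sum to 1 (Eqs. 3-4), and the output is their
    weighted sum (Eq. 2).
    """
    def __init__(self, channels):
        super().__init__()
        self.controls = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(3))

    def forward(self, x1, x2, x3):
        # Lambda control maps, one per input level, stacked on the channel axis.
        lam = torch.cat([ctrl(x) for ctrl, x in zip(self.controls, (x1, x2, x3))], dim=1)
        w = F.softmax(lam, dim=1)                      # per-pixel weights a, b, c
        a, b, c = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return a * x1 + b * x2 + c * x3                # Eq. (2)
```

In the full network, x1, x2, x3 would be L1, L2, L3 resized to level m's resolution, and one such module is instantiated per output level P1, P2, P3.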
E. Model compression

The CNN network is still complicated, and each iteration generates tens of millions of weights. Meanwhile, not all convolutional channels contribute to training, so the network contains many redundant channels that have very little impact on the prediction results, reducing computational efficiency. In this paper we choose a method that performs channel pruning [5] and layer pruning simultaneously, compressing the width and the depth of the model respectively. The model compression process and its specific steps are shown in Figure 5.

Fig. 5. Model compression process

• Perform sparse training on the network model to determine the importance of the channels to be clipped.
• Determine the channels to prune according to the calculated criterion, and start pruning.
• For layer pruning, evaluate the CBL block preceding each shortcut layer, sort the layers by the highest gamma value in each, and prune the layers with the smallest values.
• Calculate a global threshold and use it to control the pruning process.
• Introduce a local safety threshold to determine whether pruning is excessive; its value is set to a percentile of all values in a specific pruning layer. When a trainable scale factor falls below this threshold, stop pruning, preventing excessive iterative pruning from causing permanent damage to the model.
• Fine-tune the parameters of the pruned model to complete the pruning.
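The channel selection described above follows the network-slimming idea of [5]: batch-norm scale factors (gamma) learned under L1 sparsity act as channel importance scores. Below is a hedged Python sketch of the threshold computation; the pruning ratio and safety percentile are assumed values, not this paper's exact settings.

```python
import numpy as np
import torch.nn as nn

def bn_gammas(model):
    """Collect |gamma| of every BatchNorm layer; after sparse (L1) training,
    these scale factors indicate channel importance, as in network slimming [5]."""
    return [m.weight.detach().abs().cpu().numpy()
            for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

def pruning_masks(model, prune_ratio=0.7, safety_percentile=90):
    """Per-layer keep masks from a global threshold capped by a local safety threshold."""
    gammas = bn_gammas(model)
    flat = np.sort(np.concatenate(gammas))
    global_thr = flat[int(len(flat) * prune_ratio)]  # global threshold over all channels
    masks = []
    for g in gammas:
        # Local safety threshold: cap the cut at this layer's 90th percentile,
        # so the strongest channels of every layer always survive.
        local_thr = min(global_thr, np.percentile(g, safety_percentile))
        masks.append(g >= local_thr)
    return masks  # True = keep channel; fine-tune the model after pruning
```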

III. EXPERIMENT AND ANALYSIS

A. Introduction of NPU hardware platform and algorithm embedded implementation

UAV platforms face the problem of limited hardware resources. Among the existing intelligent-computing front-end solutions, embedded GPU projects have a short implementation cycle, but chip power consumption is high and independent control is difficult; FPGAs can be customized, but development costs are high; high-performance NPUs offer both low cost and low power consumption.

In this paper, we use the T710 development board to build an embedded airborne target recognition system. The T710 was released in 2019 by ZiGuangZhanRui (UNISOC); it provides an AI computing power of 3.2 TOPS and a comprehensive computing power of 4.2 TOPS. Compared with traditional large GPU chips used as AI computing units, it achieves an energy efficiency of more than 2.5 TOPS/W. The T710 is also equipped with an eight-core high-performance processor combining four Cortex-A75 cores and four smaller Cortex-A55 cores running at up to 2.0 GHz. We chose this board for its low power consumption, its accelerated computation of convolutional networks, and its support for the localization development of chips. Figure 6 shows the overall design of the target detection system.

Fig. 6. The integral system design

Here are the specific steps:

• Train and test on a high-performance computer to verify the algorithm and obtain a high-performance model for the data set.
• Replace the operators that are not supported by the NPU. In this case, the nearest-neighbor interpolation upsampling operator is replaced by a deconvolution operator, and the coordinate decoding process of the network, which the NPU also does not support, is removed.
• Use the model quantization and transformation tool. In this case, the uds710 imaging AI NDK offered by ZiGuangZhanRui, installed in an Ubuntu 16.04 environment, quantizes the model to 8-bit width and transforms it into an executable file that the NPU can recognize.
• The processor of the T710 runs Ubuntu, in which we call its API to accelerate the network computation and use C++ to supplement the coordinate decoding and Non-Maximum Suppression (NMS) processes, completing the end-to-end computing pipeline.
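The decoding and NMS steps moved off the NPU are implemented in C++ on the device; the Python sketch below shows the equivalent post-processing logic, a standard greedy IoU-based NMS, under the assumption of boxes given as (x1, y1, x2, y2) rows with confidence scores.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """Greedy Non-Maximum Suppression over (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]          # process highest-confidence boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box against the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thr]   # drop boxes overlapping the kept one
    return keep
```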
B. Data set

This article focuses on car targets in aerial images. In order to verify the performance of the research results on aerial-perspective images, we construct an aerial vehicle data set covering multiple aerial perspectives, multiple vehicle types, and multiple ground-object backgrounds, comprising 12840 images in total. Among them, 3456 photos were taken by ourselves, and the rest are a mixture of unlabeled public data sets and VisDrone 2019. We randomly select 70% of the entire data set as the training set and the remaining 30% as the test set. All models in this paper are trained on this training set and evaluated on the test set.

Fig. 7. UAV aerial images of our data set

The width and height distribution of the samples in the data set is shown in Figure 8.

Fig. 8. The distribution of objects

From Figure 8 we can read off the properties of the objects in the data set. Most targets have a width of about 50 pixels and a height of 20-30 pixels, while the aspect ratio is concentrated between 0.4 and 0.6, indicating that most targets are small and elongated, with height smaller than width.

C. Evaluation measurement

The effect of target detection is determined by both the classification accuracy and the positioning accuracy of the predicted boxes; target detection is therefore both a classification problem and a regression problem. The comprehensive evaluation index of a target detection algorithm is usually the average precision.

Average Precision (AP) is defined as the area under the precision-recall (PR) curve and measures the average classification accuracy of one class in the data set. The calculation formula is:

AP = \int_{0}^{1} P(R)\, dR \quad (5)
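In practice, Eq. (5) is evaluated numerically from a finite PR curve. A small sketch follows, under the assumption of the all-point interpolation commonly used for VOC-style AP@0.5; the exact interpolation scheme used in this paper is not stated.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve (Eq. 5), all-point interpolation.

    `recall` must be sorted ascending; `precision` is first replaced by its
    monotonically non-increasing envelope, as in the standard VOC protocol.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Enforce a non-increasing precision envelope from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    changed = np.where(r[1:] != r[:-1])[0]          # points where recall increases
    return float(np.sum((r[changed + 1] - r[changed]) * p[changed + 1]))
```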
D. Experimental result

This article designs three sets of experiments to train and test the three improvement points, and finally combines all three improvements into a comprehensive experiment for training and testing:

Experiment 1. Train and test YOLOv3 with the optimized anchor boxes. Most targets in the data set are small. This paper performs clustering and adaptive anchor box calculation on the data set; three anchor boxes are still allocated for each scale. The calculation results are shown in Table II, and the resulting AP in Table III:

TABLE II. ANCHOR BOXES ON DIFFERENT FEATURE MAPS

Feature map    Anchor boxes (K-means)        Anchor boxes (improved method)
16×16          (64,91) (102,59) (153,128)    (64,86) (105,58) (151,127)
32×32          (38,34) (40,58) (66,36)       (43,22) (41,49) (68,35)
64×64          (15,36) (40,21) (23,44)       (11,9) (21,16) (26,30)

TABLE III. AP CALCULATED BY DIFFERENT METHODS

Anchor generation method    AP@0.5 (car)
K-means                     0.835
Ours                        0.846

Experiment 2. Train and test YOLOv3 with IRFM. The function of the IRFM structure is to increase the receptive field of the feature map. Generally speaking, shallow features have smaller receptive fields and deep features have larger ones. YOLOv3 fuses deep features with shallow features through upsampling. This article first feeds shallow features into IRFM and then merges them with deep features. Table IV analyzes the effect of inserting IRFM after feature maps of different scales.

TABLE IV. AP CALCULATED BY DIFFERENT LOCATIONS OF IRFM

IRFM location      AP@0.5 (car)
After C3           0.847
After C4           0.863
After C5           0.865
After C4 and C5    0.871

From Table IV we find that, compared with YOLOv3, AP barely increases when IRFM is inserted after C3, while inserting IRFM after C4 and C5 increases AP by 2.5%. This experiment determines the appropriate location of IRFM and demonstrates its effectiveness.

Experiment 3. Train and test the improved multi-scale feature fusion YOLOv3 on the data set, combining Experiment 1 and Experiment 2 to obtain our best model. The test results are shown in Table V:

TABLE V. AP CALCULATED BY DIFFERENT SETS OF EXPERIMENTS

Method                                  AP@0.5 (car)
YOLOv3                                  0.835
YOLOv3 with optimized anchors           0.846
YOLOv3 with IRFM after C4 and C5        0.871
YOLOv3 with ASFFM                       0.884
YOLOv3-ours (with all improvements)     0.909
YOLOv3-ours (after compression)         0.897

Table V shows that our proposed object detection algorithm based on YOLOv3 improves AP by 7.4% compared with YOLOv3 on this data set, and the AP after compression is still about 6.2% higher than YOLOv3.

Comprehensive experiment. Comparing several groups of algorithms on the same data set, Table VI shows the results of YOLOv3-ours, YOLOv3, YOLOv3-SPP, and YOLOv4 in accuracy, model size, and inference time. The size of our model is reduced by 86.5% compared with YOLOv3, and the YOLOv3-ours model has clear advantages in both detection accuracy and model size.

TABLE VI. COMPARISON OF DIFFERENT ALGORITHMS

Method         Input size    AP@0.5 (car)    Time/ms (NPU)           Size/MB
YOLOv3         512           0.835           150                     246.3
YOLOv3-SPP     512           0.841           153                     250.6
YOLOv4         512           0.877           not supported by NPU    256.7
YOLOv3-ours    512           0.897           28                      36.9

IV. CONCLUSIONS

In this paper, we propose a new object detection algorithm for UAV vision based on NPUs, which realizes real-time object detection in resource-constrained embedded environments. Based on YOLOv3, the algorithm introduces a new IRFM that enlarges the receptive field to obtain richer context information; we also use adaptively spatial feature fusion in the feature pyramid to strengthen the fusion of multi-scale features. In addition, we optimize the generation of anchor boxes, which proves to achieve better performance on the UAV car aerial image set. Finally, we prune the model to make it lighter and faster. Experiments on our UAV data set show that the new algorithm runs successfully on the domestic NPU hardware platform while retaining decent AP, and its detection speed fully meets the needs of real-time detection.

REFERENCES

[1] Redmon J, Farhadi A, "YOLOv3: An Incremental Improvement," arXiv e-prints, 2018.
[2] Lin T Y, Dollar P, Girshick R, et al, "Feature Pyramid Networks for Object Detection," IEEE Computer Society, 2017.
[3] Liu S, Di H, Wang Y, "Receptive Field Block Net for Accurate and Fast Object Detection," 2017.
[4] Liu S, Huang D, Wang Y, "Learning Spatial Fusion for Single-Shot Object Detection," 2019.
[5] Zhuang L, Li J, Shen Z, et al, "Learning Efficient Convolutional Networks through Network Slimming," IEEE, 2017.
[6] Zhang Wei, Zhang Xingtao, Wang Xueli, Chen Yunfang, Li Yanchao, "DS-YOLO: A real-time small target detection algorithm deployed on UAV terminal," Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2021, 41(01): 86-98. DOI: 10.14132/j.cnki.1673-5439.2021.01.011.
