
A Complete Multi-CPU/FPGA-based Design and Prototyping Methodology for Autonomous Vehicles: Multiple Object Detection and Recognition Case Study

Q. Cabanes, B. Senouci
Graduate Engineering School ECE-Paris, France
37 Quai de Grenelle, 75015 Paris
quentin.cabanes@ece.fr

A. Ramdane-Cherif
University of Versailles Saint-Quentin-en-Yvelines, LISV Laboratory
Velizy, France
rca@lisv.uvsq.fr

Abstract—Embedded smart systems are Hardware/Software (HW/SW) architectures integrated in new autonomous vehicles in order to increase their smartness. A key example of such applications is camera-based automatic parking systems. In this paper we introduce a fast prototyping perspective within a complete design methodology for these embedded smart systems; one of our main objectives is to reduce development and prototyping time compared to the usual simulation approaches. Based on our previous work [1], a supervised machine learning approach, we propose a HW/SW algorithm implementation for object detection and recognition around autonomous vehicles. We validate our real-time approach via a quick prototype on top of a Multi-CPU/FPGA platform (ZYNQ). The main contribution of this work is the definition of a complete design methodology for smart embedded vehicle applications, organized in four main parts: specification & native software, hardware acceleration, machine learning software, and the real embedded system prototype. Toward a full automation of our methodology, several steps are already automated and presented in this work. Our hardware acceleration of the point cloud-based data processing tasks is 300 times faster than a pure software implementation.

I. INTRODUCTION

During the last decade, technological advances in electronics have enabled automobile manufacturers to integrate new circuit systems into their vehicles in order to improve the smartness of cars. Embedded intelligence for vehicles can be a solution for topics such as road safety and self-driving. In automated vehicles, the more data are processed, the more accurate the decisions are, so situations like traffic regulation or accident avoidance can be managed more efficiently. Smart software for object detection and recognition is a typical example of these new embedded systems integrated in present automobiles.

Basically, behind these smart embedded applications, a HW/SW architecture solution is built in order to perform real-time object detection. The autonomous car processes data coming from several sensors and classifies them to detect the surrounding objects. In this work, our HW/SW co-design application takes as input a set of data coming from a 3D sensor. These data are converted to a Point Cloud, which is processed using a HW/SW algorithm. Once the Point Cloud is processed, a learning approach is performed; although it is not the focus of this paper, it will be the continuation of this work. Our approach is implemented on top of a real embedded architecture based on a Multi-CPU/FPGA platform.

This paper is organized as follows: Section II is dedicated to the state of the art and related works on object detection and recognition systems and co-design methodologies. Section III describes the design methodology. Section IV gives information about the implementation of the application. Section V is about our experimentation, Section VI presents the results, Section VII is the discussion, and Section VIII concludes the paper.

II. APPROACHES AND RELATED WORKS

To our knowledge, the state-of-the-art methods developed recently for object detection and recognition in smart cars usually rely on 3D sensors, because of the large possibilities implied by a 3D mapping of the environment around an autonomous car. We want to ease the deployment of object detection and recognition systems without neglecting real-time constraints and power management, so that they can easily be embedded in complex systems such as smart cars. Thus, the objective of our approach is to combine object classification techniques with a design that makes use of hardware acceleration technology.

First, we were interested in the primal robotic approach of obstacle detection, such as the Stanley robot [2], in order to make a first processing of the environment. Such a system may be interesting for navigation management, but it lacks the power of classification. Kidono et al. [3] classified pedestrians from 3D sensors; we consider this a first step toward our detection and recognition system, but the absence of a timing study was a motivation for further exploration. Navarro et al. [4] realized a prototype of a fully autonomous vehicle system integrating a machine learning algorithm for pedestrian detection. This approach is close to our goal, but we want to propose an FPGA-based HW/SW processing system with a generic methodology better suited to embedded systems.

Finally, in order to design a HW/SW co-design prototype that uses hardware acceleration to speed up our software from [1], we studied existing methodologies [5][6] and improved on them with generic and automated steps to implement an FPGA application with a fast prototyping method.



III. DESIGN METHODOLOGY

Our approach to smart embedded system design and validation for automobiles is oriented toward a platform-based design using a reconfigurable Multi-CPU/FPGA board. The design flow (Figure 1) describes a co-design machine learning system in which the preliminary data processing tasks are hardware accelerated and the learning system is pure software. Some steps of this design flow have been automated to accelerate the development, deployment and validation of the HW/SW co-design application, such as the configuration of the prototype platform and the deployment of the hardware co-design application [9].

A. Specification & Native Software

The Specification & Native Software part covers the development of the data processing software on a native platform in order to validate it. Once the software has been profiled and its results meet the requirements, the application source code becomes the input of the next step, the Hardware Acceleration.

B. Hardware Acceleration

The Hardware Acceleration part gives a boost to the data processing software. The native software source code is converted to a register-transfer level (RTL) abstraction with a High-Level Synthesis (HLS) tool, and is then integrated into a hardware description language (HDL) project. If the simulation results meet the requirements, the final bitstream generated from the HDL project is deployed on the prototype (a minimal code sketch of such an HLS entry point is given at the end of this section).

C. Machine Learning Software

The Machine Learning Software part covers the development of the learning system, which represents the smartness of the system. This part does not differ from traditional embedded machine learning development. Once finished and compiled, the software is deployed on the prototype.

D. Prototype

The Prototype part is the platform on which the hardware acceleration bitstream and the machine learning software are deployed and tested. The prototype configuration and deployment are fully automated [9].
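As an illustration of the Hardware Acceleration entry point of Section III-B, the sketch below shows the general shape of a streaming kernel written for a Vivado HLS-style flow. It is a minimal example under stated assumptions, not the actual IP of this work: the function name, the pc_point layout and the chosen pragmas are illustrative.

// Hypothetical HLS streaming kernel sketch (not the exact IP of this work).
// A native loop over an array is rewritten as a loop over an AXI-Stream,
// which is the form the HLS tool converts to RTL in our flow.
#include <hls_stream.h>

struct pc_point {          // one 3D point; layout assumed for illustration
    float x, y, z;
};

void threshold_kernel(hls::stream<pc_point> &in,
                      hls::stream<pc_point> &out,
                      int n_points, float z_min) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=n_points
#pragma HLS INTERFACE s_axilite port=z_min
    // FIFO access: each point is read exactly once, in arrival order.
    for (int i = 0; i < n_points; i++) {
#pragma HLS PIPELINE II=1
        pc_point p = in.read();
        if (p.z >= z_min)   // keep only points above a height threshold
            out.write(p);
    }
}

The same restructuring pattern (array access replaced by in-order stream reads) is what drives the algorithm modifications described in Section IV.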

[Figure 1 shows the four-part design flow: Specification & Native Software (specification, algorithm development, debug, compilation & linking, profiling, validation), Machine Learning Software (development, compilation & linking, debug, profiling, training, validation), Hardware Acceleration (HLS software, RTL synthesis, HDL integration, bitstream generation, validation) and Prototype (ramdisk image, U-Boot, FSBL, kernel compilation, generic DeviceTree, merge of boot and kernel).]

Fig. 1: Design flow diagram for autonomous vehicle application

IV. MULTIPLE OBJECT DETECTION AND RECOGNITION CASE STUDY

In this section, the methodology above (see Figure 1) is applied to the case study. The Specification & Native Software and the Machine Learning Software parts were already completed in our previous work [1], and the Prototype part is fully automated, so this section mostly details the Hardware Acceleration.

After the validation of the native software, the data processing tasks (Segmentation and Box Slicing [1]) are converted to hardware accelerators with an HLS tool. A new constraint appeared when working with an embedded FPGA architecture: data access management. Whereas data can be accessed anytime and anywhere in RAM (Random Access Memory) when the data processing tasks run as software, access becomes First In First Out (FIFO) on our Multi-CPU/FPGA platform, because data are streamed from the RAM to the FPGA over the AMBA (Advanced Microcontroller Bus Architecture) interconnect using the AXI (Advanced eXtensible Interface) protocol. Because of this constraint, the initial software algorithms needed modifications.

A. Segmentation

When the Segmentation algorithm is refined into a hardware component, its tasks need to run as two steps in a row, rather than as concurrent tasks, because of the FIFO data access. Those two tasks are: the occupancy grid and the point filtering. The occupancy grid creates binary cells based on the Z-axis mean, to detect whether the points in a cell are relevant. The point filtering discards all points which fall in inactive cells.

For the Occupancy Grid (Algorithm 1), since data are streamed, when a point is received the cell it belongs to is identified and the Z-axis statistics of that cell are updated. Once all points are received, the mean of each cell is compared to the threshold to compute the occupancy grid. In this work, the threshold, width and height data, as well as the implicit number of points, are all sent from the CPU. The Point Cloud is then streamed point by point to the occupancy grid IP (Intellectual Property). For each point received, the X and Y coordinates are normalized between 0 and the grid width/height. The 0.5 offset in the algorithm is due to the X and Y range of the original points, $\{X \in \mathbb{R} \mid -\frac{width}{2} \le X \le \frac{width}{2}\}$ and $\{Y \in \mathbb{R} \mid -\frac{height}{2} \le Y \le \frac{height}{2}\}$. Once normalized, each X and Y is truncated and associated to a cell. For the corresponding cell, the Z values are summed, the point counter is incremented, and the minimum Z value of the cell is stored for later. When all points have been received from the CPU, each cell is computed: the mean minus the stored minimum Z is calculated in order to find the mean height of the object above the lowest point. This mean is then compared to the threshold, and the result over all cells, called the occupancy grid, is sent to the next IP: the point filtering.

Algorithm 1: The occupancy grid algorithm from the hardware segmentation task
    threshold ← threshold configuration;
    width ← point cloud map width;
    height ← point cloud map height;
    repeat
        point ← receive point from CPU;
        x ← trunc[(point.x / width + 0.5) * grid_width];
        y ← trunc[(point.y / height + 0.5) * grid_height];
        if point.z < cell(x, y).z_min then
            cell(x, y).z_min ← point.z;
        end
        cell(x, y).z_sum ← cell(x, y).z_sum + point.z;
        cell(x, y).z_count ← cell(x, y).z_count + 1;
    until no more points received;
    for y ← 0 to grid_height do
        for x ← 0 to grid_width do
            mean ← cell(x, y).z_sum / cell(x, y).z_count − cell(x, y).z_min;
            if mean > threshold then
                Send cell(x, y) active to next IP;
            else
                Send cell(x, y) inactive to next IP;
            end
        end
    end

The second part of the segmentation task is the Point Filtering (Algorithm 2). Each point from the Point Cloud is compared to the occupancy grid: if its cell is active the point is kept, otherwise it is discarded. As in the previous algorithm, the width and height data, as well as the implicit number of points, are sent from the CPU. Each point received is scaled to the same range as in the occupancy grid part and mapped to its cell. If the cell is active, the point is sent back to the CPU so it can be used for the next task, the box slicing.

Algorithm 2: The point filtering algorithm from the hardware segmentation task
    width ← point cloud map width;
    height ← point cloud map height;
    occupancy_grid ← receive occupancy grid from previous IP;
    repeat
        point ← receive point from CPU;
        x ← trunc[(point.x / width + 0.5) * grid_width];
        y ← trunc[(point.y / height + 0.5) * grid_height];
        if cell(x, y) is active then
            Send point to CPU;
        end
    until no more points received;

With those two tasks forming the segmentation, the points of the Point Cloud are sent twice over the AMBA interconnect.
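For readability, the following is a plain-software restatement of the two segmentation steps in C++. It mirrors Algorithms 1 and 2 but makes no claim about the exact HLS source of this work; the Point and Cell structures and the boundary clamping are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Point { float x, y, z; };   // assumed point layout, for illustration
struct Cell {
    float z_sum   = 0.0f;
    float z_min   = std::numeric_limits<float>::max();
    int   z_count = 0;
    bool  active  = false;
};

// Algorithm 1: build the occupancy grid from the point cloud.
// grid must hold gw * gh cells, indexed row by row.
void occupancy_grid(const std::vector<Point>& cloud, std::vector<Cell>& grid,
                    int gw, int gh, float width, float height, float threshold) {
    for (const Point& p : cloud) {
        // Normalize [-width/2, +width/2] to [0, gw): the 0.5 offset of Algorithm 1.
        // Clamped so the upper map boundary stays inside the grid.
        int x = std::min(gw - 1, (int)std::trunc((p.x / width  + 0.5f) * gw));
        int y = std::min(gh - 1, (int)std::trunc((p.y / height + 0.5f) * gh));
        Cell& c = grid[y * gw + x];
        c.z_min = std::min(c.z_min, p.z);
        c.z_sum += p.z;
        c.z_count++;
    }
    // A cell is active when its mean height above its lowest point exceeds the threshold.
    for (Cell& c : grid)
        c.active = (c.z_count > 0) && (c.z_sum / c.z_count - c.z_min > threshold);
}

// Algorithm 2: keep only the points that fall in active cells.
std::vector<Point> point_filtering(const std::vector<Point>& cloud,
                                   const std::vector<Cell>& grid,
                                   int gw, int gh, float width, float height) {
    std::vector<Point> kept;
    for (const Point& p : cloud) {
        int x = std::min(gw - 1, (int)std::trunc((p.x / width  + 0.5f) * gw));
        int y = std::min(gh - 1, (int)std::trunc((p.y / height + 0.5f) * gh));
        if (grid[y * gw + x].active) kept.push_back(p);
    }
    return kept;
}

With the Table I configuration (40 m x 40 m map, 120 x 120 grid), a point at (0, 0) maps to cell (60, 60), since (0/40 + 0.5) * 120 = 60.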

B. Box slicing

For the Box Slicing task, the data streaming revealed a hierarchical problem. In the software algorithm from our previous work, boxes slid from step to step with a double for-loop. With data coming in FIFO order, exactly reproducing this behavior would require the whole point cloud to be sent for each box, which would congest the AMBA interconnect. So the initial task needs to be reversed: a box is assigned to a point whenever the point is received by the IP, so that the point cloud is transferred only once. For the explanation of this part, only the X dimension is observed and boxes are reduced to 2D, but everything written here applies to every other space dimension. The constraint for this task is the overlapping of boxes, which depends on their width and step size. For this work, the step width was defined as $\{step_{width} \in \mathbb{R} \mid 0 < step_{width} \le box_{width}\}$; hence the hierarchical problem: the overlapping depends on the step width. Two kinds of overlapping are defined in this work. We define the "simple overlapping" as the overlap case where $step_{width} > \frac{box_{width}}{2}$ (see Figure 2).

Fig. 2: Box simple overlapping hierarchical problem. The blue zone represents no box overlapping, the red zone represents two boxes overlapping.

\[ box_{ID} = \mathrm{trunc}\!\left(\frac{X}{step_{width}}\right) \tag{1} \]

\[ \mathrm{Overlap}(X) = \begin{cases} 1, & \text{if } X - box_{ID} \cdot step_{width} < box_{width} - step_{width} \\ 0, & \text{otherwise} \end{cases} \tag{2} \]

In the case of the "simple overlapping", points are first mapped to the corresponding box identifier (ID) (Equation 1). Once the box is mapped, the position of the point inside this box is processed in order to find whether the point lies in the overlapping zone (Equation 2).
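To make Equations 1 and 2 concrete, here is a worked example with hypothetical values (not the Table I configuration): $box_{width} = 1$ m and $step_{width} = 0.6$ m, a simple-overlapping case since $0.6 > 0.5$. For a point at $X = 1.25$ m:

\[ box_{ID} = \mathrm{trunc}\!\left(\frac{1.25}{0.6}\right) = 2, \qquad 1.25 - 2 \cdot 0.6 = 0.05 < box_{width} - step_{width} = 0.4 \;\Rightarrow\; \mathrm{Overlap}(X) = 1, \]

so the point is transmitted for boxes 2 and 1, which indeed both cover $X = 1.25$: box 2 spans $[1.2, 2.2]$ and box 1 spans $[0.6, 1.6]$.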

There is an exception to the overlapping zone for the first and last box of a row: since no overlapping zone exists there as defined in Equation 2, if $box_{ID}$ is 0 or $box_{maxID}$, no box overlapping is possible. Points are always matched to the most advanced $box_{ID}$ and, if the position of the point in the $step_{width}$ grid falls in the overlapping zone, the point is also matched to the previous box ID, which is $box_{ID} - 1$.

Then there is the "multiple overlapping" (Figure 3), which we define as the overlap case where $step_{width} \le \frac{box_{width}}{2}$. This problem is a bit more complex, because boxes always overlap: there is no "no-overlapping" zone between overlaps as in the "simple overlapping" problem.

Fig. 3: Box multiple overlapping hierarchical problem. The blue zone represents no box overlapping, the red zone represents two boxes overlapping and the green zone represents three boxes overlapping.

\[ \mathrm{Overlap}(X) = \begin{cases} box_{ID}, & \text{if } X_{ID} < \frac{box_{width}}{step_{width}} \\ box_{maxID} - box_{ID}, & \text{if } X_{ID} \ge box_{maxID} - \frac{box_{width}}{step_{width}} \\ \left\lfloor \frac{box_{width}}{step_{width}} \right\rfloor, & \text{otherwise} \end{cases} \tag{3} \]

In the case of the "multiple overlapping", points are also matched to a specific $box_{ID}$ with the same Equation 1. The main difference is the handling of the overlapping zone. The $box_{ID}$ is always the most advanced box, so it is now important to calculate the maximum overlap of the zone in order to match the correct previous boxes (see Equation 3). Once the point is matched with Equation 1, the boxes containing the point are all boxes between $box_{ID}$ and $box_{ID} - \mathrm{Overlap}(X)$.

This problem applies to each dimension needed, so dimensions X and Y in this work. Algorithm 3 is the implementation of this problem for one dimension.

Algorithm 3: The part of the box slicing algorithm for one dimension from the hardware box slicing task
    width ← point cloud map width;
    box_width ← box width;
    step_width ← step width;
    repeat
        point ← receive point from CPU;
        x ← point.x + width/2;
        x_overlap ← 0;
        x_ID ← trunc(x / step_width);
        if step_width > box_width/2 then
            if x − x_ID · step_width < box_width − step_width then
                x_overlap ← 1;
            end
        else
            if x_ID < box_width/step_width then
                x_overlap ← x_ID;
            else if x_ID ≥ (width − box_width)/step_width then
                x_overlap ← width/step_width − x_ID;
            else
                x_overlap ← box_width/step_width;
            end
        end
        for id ← x_ID − x_overlap to x_ID do
            Transmit point and id;
        end
    until no more points received;

But processing each dimension individually misses the dimension intersection. So, once each dimension has been processed for a point, the per-dimension IDs need to be merged into a final $box_{ID}$ over the two dimensions X and Y (see Algorithm 4), so that the point is correctly mapped to boxes in the Point Cloud.

Algorithm 4: The dimension merging algorithm from the hardware box slicing task
    width ← point cloud map width;
    step_width ← step width;
    repeat
        point ← receive point;
        x_ID ← receive X dimension ID of point;
        y_ID ← receive Y dimension ID of point;
        box_ID ← x_ID + y_ID · (width / step_width);
        Send point and box_ID to CPU;
    until no more points and IDs received;

Finally, when all points and box IDs have been sent to the CPU, the points must be regrouped into their respective boxes to complete the main HW/SW co-design application. After this, any supervised machine learning classification system can process the boxes to identify the objects they contain. The object position in space is also known, since any box position can be decoded from the $box_{ID}$.
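The per-dimension assignment and the merging step can be restated compactly in plain C++. The sketch below follows Algorithms 3 and 4 and Equations 1-3; the IdRange structure, the function names and the integer divisibility of width, box and step are illustrative assumptions, not the HLS source of this work.

#include <cmath>

// Boxes [first..last] along one dimension contain the point (hypothetical helper type).
struct IdRange { int first, last; };

// Algorithm 3 for one dimension: map a coordinate to its most advanced box ID
// (Equation 1), then compute how many previous boxes also contain it
// (Equation 2 for simple overlapping, Equation 3 for multiple overlapping).
IdRange slice_one_dim(float coord, float width, float box_w, float step_w) {
    float x  = coord + width / 2.0f;             // shift [-width/2, width/2] to [0, width]
    int   id = (int)std::trunc(x / step_w);      // Equation (1)
    int   overlap;
    if (step_w > box_w / 2.0f) {                 // "simple overlapping"
        overlap = (x - id * step_w < box_w - step_w) ? 1 : 0;        // Equation (2)
    } else {                                     // "multiple overlapping", Equation (3)
        int max_overlap = (int)std::floor(box_w / step_w);
        int max_id      = (int)(width / step_w);
        if      (id < max_overlap)           overlap = id;           // first boxes of the row
        else if (id >= max_id - max_overlap) overlap = max_id - id;  // last boxes of the row
        else                                 overlap = max_overlap;  // middle of the row
    }
    return {id - overlap, id};
}

// Algorithm 4: merge the X and Y dimension IDs into a single box ID.
int merge_dims(int x_id, int y_id, float width, float step_w) {
    return x_id + y_id * (int)(width / step_w);
}

For each received point, the IP would emit one (point, id) pair per ID in the returned range of each dimension, and the merging step then combines one X ID and one Y ID into the final box number sent to the CPU.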
V. EXPERIMENTATION

We validate our multiple object detection and recognition application on a hardware Multi-CPU/FPGA platform as a real embedded architecture. The machine learning software was already validated in our previous work [1], so this experimentation mainly concerns the hardware acceleration on the real embedded architecture, which for this work is a ZedBoard Multi-CPU/FPGA development platform [12][13].

Three hardware accelerator IPs are made: the Occupancy Grid, the Point Filtering and the Box Slicing. The development is done with Vivado HLS [10] as the HLS tool, and the project bitstream is built as a block design project with Vivado [10]. When developing the hardware accelerators with Vivado HLS, the only directive used is the "clock period", which is set to 10; all other directives are the Vivado HLS defaults. In order to get results close to those of the software, we use floating-point numbers in the hardware accelerators instead of fixed-point numbers, even though this may mean a slower execution time. For the communication between the CPU and the FPGA, a tool called Xillybus [11] is used: an FPGA IP core for easy DMA over AXI. This tool lowers the development time because the DMA management is already done, letting us put the focus of our work on the hardware acceleration. For the software part, an OS called Xillinux, part of the Xillybus solution, is used. This OS is based on Ubuntu 16.04 LTS for ARM and runs over all cores of the CPU, hence the Multi-CPU application. The machine learning software is built with OpenCV [8] for feature calculation and LIBSVM [7] for the machine learning application. The software is cross-compiled using GCC for ARM with optimization level 2 (-O2).

For the experimentation, the configuration used for the hardware acceleration IPs is the following:

    threshold (m)       width (m)        height (m)
    0.5                 40               40
    number of points    grid_width       grid_height
    205300              120              120
    box_width (m)       box_height (m)   step_width (m)
    1                   1                1

TABLE I: Values used for the experimentation

VI. RESULTS

In this section, we present the different results of this paper. Only one implementation is done, on one specific real embedded architecture, following the design methodology explained beforehand. All measurements are taken on a real embedded architecture, a Digilent ZedBoard Zynq-7000 ARM/FPGA SoC development board.

Table II shows the execution time of the data processing tasks when running as software (see [1]) and when hardware accelerated. As shown there, hardware acceleration has a huge impact on the execution time: the software execution time on the ZedBoard was 103,460 milliseconds, and it comes down to 344 ms when hardware accelerated. This represents a 300-times acceleration.

    SW      Occupancy Grid + Point Filtering    Box Slicing    Total
    Time    90,230 ms                           13,230 ms      103,460 ms

    HW      Occupancy Grid    Point Filtering   Box Slicing    Total
    Time    98 ms             132 ms            114 ms         344 ms

TABLE II: Comparison of SW and HW application execution times

In Figure 4, we compare the execution times between the software profiling and the hardware profiling for the Segmentation (Occupancy Grid and Point Filtering) and the Box Slicing tasks. As shown in this figure, the tasks that benefited most from the hardware acceleration were the Occupancy Grid and Point Filtering (OG+PF) tasks, with a 392-times acceleration.

Table III shows the resource utilization of our system: a PS/PL system using Xillybus as communication interface and three IPs (Occupancy Grid, Point Filtering and Box Slicing).

[Figure 4 is a bar chart (time axis in units of 10^5 ms) comparing software and hardware execution times per task: OG+PF, 90,230 ms (SW) vs. 230 ms (HW); Box Slicing, 13,230 ms (SW) vs. 114 ms (HW).]

Fig. 4: Comparison between software and hardware application execution times

    Resource    Utilization    Available    Utilization (%)
    LUT         14960          53200        28.12
    LUTRAM      592            17400        3.40
    FF          15270          106400       14.35
    BRAM        63             140          45.00
    DSP         28             220          12.73
    IO          85             200          42.50
    BUFG        5              32           15.63
    MMCM        1              4            25.00
    PLL         1              4            25.00

TABLE III: FPGA resource utilization

VII. DISCUSSION

The use of a Multi-CPU/FPGA hardware platform for Point Cloud-based object detection and recognition allows the development of a fast computation system in a real embedded environment (embedded CPUs, embedded OS, etc.). One of the greatest benefits of this Multi-CPU/FPGA platform-based design methodology resides in the quick development and deployment of the system, thanks to the automation of the prototype step [9]. Implementation and prototype development are often considered the most time-consuming tasks in these kinds of applications: they require interdisciplinary knowledge and mixed efforts between HW experts, embedded SW experts, etc. Methodologies and automated tools are therefore necessary in order to accelerate these design steps and to prototype a large panel of applications rapidly. Our goal is to lower the specific knowledge requirements with such automation tools. The design methodology implemented in this work was built to decrease the development and prototyping time of any HW/SW autonomous vehicle application (on the Zynq platform). Development time is always a constraint in any type of project, so, based on this approach, we propose a generic methodology that aims to limit this constraint for many HW/SW autonomous vehicle applications. This methodology was not optimized to minimize the number of tasks, but was designed to be easily automated and thus reduce human interaction with the low-level system, which is time consuming and requires advanced knowledge. In this work, by automating the prototyping steps, we could decrease the deployment time linked to the embedded architecture configuration. With such automation, we also want to enable the development of any application using AI on any embedded HW/SW architecture for autonomous vehicles, with minimum knowledge required.

VIII. CONCLUSIONS

In this paper, we have presented our experience in designing and prototyping a multiple object detection and recognition application for autonomous vehicles using Point Cloud-based data. We proposed and validated a system which detects pedestrians from 3D sensors. We then proposed a methodology allowing the fast development and deployment of a HW/SW application with a generic real embedded architecture prototype; this phase was a key step toward the validation of the specification. Finally, we hardware accelerated the data processing tasks for a supervised machine learning classification application. Future work will focus on multiplying the hardware accelerations in order to decrease the execution time and allow the processing of more heterogeneous data. We will also migrate the software machine learning to hardware deep learning, with a hardware implementation as a Neural Processing Unit (NPU).

REFERENCES

[1] Q. Cabanes and B. Senouci, "Objects detection and recognition in smart vehicle applications: Point cloud based approach," in Proc. Ninth International Conference on Ubiquitous and Future Networks (ICUFN), IEEE, 2017.
[2] S. Thrun et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge, Springer Berlin Heidelberg, 2007, pp. 1-43.
[3] K. Kidono et al., "Pedestrian recognition using high-definition LIDAR," in IEEE Intelligent Vehicles Symposium (IV), 2011.
[4] P. J. Navarro et al., "A machine learning approach to pedestrian detection for autonomous vehicles using high-definition 3D range data," Sensors, vol. 17, no. 1, 2016.
[5] P. Nuzzo et al., "A platform-based design methodology with contracts and related tools for the design of cyber-physical systems," Proceedings of the IEEE, vol. 103, no. 11, pp. 2104-2132, 2015.
[6] B. Senouci, I. Charfi, B. Heyrman, J. Dubois, and J. Miteran, "Fast prototyping of a SoC-based smart-camera: a real-time fall detection case study," Journal of Real-Time Image Processing, 2015.
[7] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[8] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[9] zynq-boot: https://github.com/tigralt/zynq-boot
[10] Vivado Design Suite - HLx Editions: https://www.xilinx.com/products/design-tools/vivado.html
[11] Xillybus, an FPGA IP core for easy DMA over PCIe with Windows and Linux: http://xillybus.com/
[12] Zynq-7000 All Programmable SoC Product: https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
[13] ZedBoard Zynq-7000 Development Board: https://reference.digilentinc.com/reference/programmable-logic/zedboard/start

