Abstract— Embedded smart systems are Hardware/Software (HW/SW) architectures integrated in new autonomous vehicles in order to increase their smartness. A key example of such applications is camera-based automatic parking systems. In this paper we introduce a fast prototyping perspective within a complete design methodology for these embedded smart systems. One of our main objectives is to reduce development and prototyping time compared to usual simulation approaches. Based on our previous work [1], a supervised machine learning approach, we propose a HW/SW algorithm implementation for object detection and recognition around autonomous vehicles. We validate our real-time approach via a quick prototype on top of a Multi-CPU/FPGA platform (ZYNQ). The main contribution of this work is the definition of a complete design methodology for smart embedded vehicle applications which defines four main parts: specification & native software, hardware acceleration, machine learning software, and the real embedded system prototype. Toward a full automation of our methodology, several steps are already automated and presented in this work. Our hardware acceleration of point cloud-based data processing tasks is 300 times faster than a pure software implementation.

I. INTRODUCTION

During the last decade, technological advances in electronics have enabled automobile manufacturers to integrate new circuit systems into their vehicles in order to improve the smartness of cars. Embedded intelligence for vehicles can be a solution for topics such as road safety and self-driving. In automated vehicles, the more data are processed, the more accurate the decisions are, so situations like traffic regulation or accident avoidance can be managed more efficiently. Smart software for object detection and recognition is a typical example of these new embedded systems integrated in present automobiles.

Basically, behind these smart embedded applications, a HW/SW architecture solution is built in order to perform real-time object detection. The autonomous car processes data coming from several sensors and classifies them to detect the objects around it. In this work, our HW/SW co-design application uses a set of data coming from a 3D sensor as the input. These data are converted to a Point Cloud which is processed using a HW/SW algorithm. Once the Point Cloud is processed, a learning approach is performed; although it is not the focus of this paper, it will be in the continuation of this work. Our approach is implemented on top of a real embedded architecture based on a Multi-CPU/FPGA platform.

This paper is organized as follows: Section II is dedicated to the state of the art and related works for object detection and recognition systems and co-design methodology. Section III describes the design methodology. Section IV gives information about the implementation of the application. Section V is about our experimentation, Section VI presents the results, Section VII is for discussion, and Section VIII concludes the paper.

II. APPROACHES AND RELATED WORKS

To our knowledge, the general state-of-the-art methods developed recently for object detection and recognition in smart cars usually rely on 3D sensors because of the large possibilities implied by a 3D mapping of the environment around an autonomous car. We want to develop a way to ease the deployment of object detection and recognition systems without forgetting real-time constraints and power management, so they can easily be embedded in complex systems such as smart cars. Thus, the objective of our approach is the combination of object classification techniques with a design making use of hardware acceleration technology.

First, we were interested in the primal robotic approach of obstacle detection, such as the Stanley robot [2], in order to make a first processing of the environment. Such a system may be interesting for navigation management but it lacks the power of classification. Kidono et al. [3] performed the classification of pedestrians from 3D sensors. We consider this a first step toward our detection and recognition system, but the absence of a timing study was a motivation for more exploration. Navarro et al. [4] realized a prototype of a fully autonomous vehicle system integrating a machine learning algorithm for pedestrian detection. This approach is close to our goal, but we want to propose an FPGA-based HW/SW processing system with a generic methodology for a more embedded system.

Finally, in order to design a HW/SW co-design prototype using hardware acceleration to quicken our software [1], we look for existing methodologies [5][6] and improve
[Figure 1: The design methodology overview — specification & native software, algorithm development, compilation & linking, debug, and prototype steps.]
IV. MULTIPLE OBJECT DETECTION AND RECOGNITION CASE STUDY

In this section, the previous methodology (see Figure 1) is applied to this case study. First, the Specification & Native Software and the Machine Learning Software parts were already done in our previous work [1]. This section will mostly detail the Hardware Acceleration, since the Prototype parts are fully automated.

After the validation of the native software, the data processing tasks (Segmentation and Box Slicing [1]) are now converted to hardware accelerators thanks to an HLS software. A new constraint appeared when working with an embedded FPGA architecture: data access management. While data could be accessed anytime and anywhere in the RAM (Random Access Memory) when running the data processing tasks as software, access becomes First In First Out (FIFO) because, in our Multi-CPU/FPGA platform, data are streamed from the RAM to the FPGA over the AMBA (Advanced Microcontroller Bus Architecture) using the AXI (Advanced eXtensible Interface) protocol. Because of this constraint, the initial software algorithms needed modifications.

A. Segmentation

When the first Segmentation algorithm is refined as a hardware component, its tasks need to run as two steps in a row and not as concurrent tasks, because of the FIFO data access. Those two tasks are defined as the occupancy grid and the point filtering. The occupancy grid creates binary cells based on the Z-axis mean to detect whether the points they contain are relevant. The point filtering discards all points which are in inactive cells.

Algorithm 1: The occupancy grid algorithm from the hardware segmentation task

    threshold ← threshold configuration;
    width ← point cloud map width;
    height ← point cloud map height;
    repeat
        point ← receive point from CPU;
        x ← trunc[(point.x / width + 0.5) ∗ grid_width];
        y ← trunc[(point.y / height + 0.5) ∗ grid_height];
        if point.z < cell(x, y).z_min then
            cell(x, y).z_min ← point.z;
        end
        cell(x, y).z_sum ← cell(x, y).z_sum + point.z;
        cell(x, y).z_count ← cell(x, y).z_count + 1;
    until no more points received;
    for y ← 0 to grid_height do
        for x ← 0 to grid_width do
            mean ← cell(x, y).z_sum / cell(x, y).z_count − cell(x, y).z_min;
            if mean > threshold then
                Send cell(x, y) active to next IP;
            else
                Send cell(x, y) inactive to next IP;
            end
        end
    end

For the Occupancy Grid (Algorithm 1), since data are streamed, when a point is received, the cell it belongs to is identified and the Z-axis mean of that cell is updated. Once all points are received, the mean of each cell is compared to the threshold to compute the occupancy grid. In this work, the threshold, width and height data, as well as the implicit number of points, are all sent from the CPU. The Point Cloud is then streamed point by point to the occupancy grid IP (Intellectual Property). For each point received, the X and Y axis coordinates are normalized between 0 and the grid width/height. The 0.5 offset in the algorithm is due to the X and Y axis range of the original points, {X ∈ R | −width/2 ≤ X ≤ width/2} and {Y ∈ R | −height/2 ≤ Y ≤ height/2}. Once normalized, each X and Y is truncated and associated to a cell. For the corresponding cell, Z-axis values are summed, the point counter is incremented, and the minimum Z value of the cell is stored for later. When all points have been received from the CPU, each cell is computed: the mean with an offset is calculated in order to find the mean size of the object from the zero point origin. The mean is then compared to the threshold, and the result over all cells, called the occupancy grid, is sent to the next IP: the point filtering.

Algorithm 2: The points filtering algorithm from the hardware segmentation task

    width ← point cloud map width;
    height ← point cloud map height;
    occupancy_grid ← receive occupancy grid from previous IP;
    repeat
        point ← receive point from CPU;
        x ← trunc[(point.x / width + 0.5) ∗ grid_width];
        y ← trunc[(point.y / height + 0.5) ∗ grid_height];
        if cell(x, y) is active then
            Send point to CPU;
        end
    until no more points received;

The second part of the segmentation task is the Point Filtering (Algorithm 2). Each point from the Point Cloud is compared to the occupancy grid: if its cell is active the point is kept, otherwise it is discarded. As in the previous algorithm, the width and height data, as well as the implicit number of points, are all sent from the CPU. Each point received is scaled to the same range as in the occupancy grid part and mapped to the corresponding cell. If the cell is active, the point is sent to the CPU so it can be used for the next task, the box slicing.

With those two tasks as parts of the segmentation, points from the Point Cloud are sent two times over the AMBA.
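To make the two-pass structure concrete, here is a minimal software sketch of Algorithms 1 and 2 in Python. This is our own illustration, not the paper's HLS code: function names such as build_occupancy_grid are ours, and the streamed FIFO access is replaced by plain iteration over an in-memory point list.

```python
import math

def build_occupancy_grid(points, width, height, grid_w, grid_h, threshold):
    """Pass 1 (Algorithm 1): accumulate per-cell Z statistics, then mark a
    cell active when mean(z) - min(z) exceeds the threshold."""
    # Per-cell state: (z_sum, z_count, z_min).
    cells = [[(0.0, 0, math.inf) for _ in range(grid_w)] for _ in range(grid_h)]
    for px, py, pz in points:
        # Map X in [-width/2, width/2] to a cell index in [0, grid_w)
        # (the 0.5 offset from the paper); clamp the upper edge.
        x = min(int((px / width + 0.5) * grid_w), grid_w - 1)
        y = min(int((py / height + 0.5) * grid_h), grid_h - 1)
        z_sum, z_count, z_min = cells[y][x]
        cells[y][x] = (z_sum + pz, z_count + 1, min(z_min, pz))
    grid = [[False] * grid_w for _ in range(grid_h)]
    for y in range(grid_h):
        for x in range(grid_w):
            z_sum, z_count, z_min = cells[y][x]
            if z_count > 0 and (z_sum / z_count - z_min) > threshold:
                grid[y][x] = True
    return grid

def filter_points(points, grid, width, height, grid_w, grid_h):
    """Pass 2 (Algorithm 2): keep only the points falling in active cells."""
    kept = []
    for px, py, pz in points:
        x = min(int((px / width + 0.5) * grid_w), grid_w - 1)
        y = min(int((py / height + 0.5) * grid_h), grid_h - 1)
        if grid[y][x]:
            kept.append((px, py, pz))
    return kept
```

In the hardware version both passes consume the same AXI point stream, which is why the Point Cloud must be transferred twice over the AMBA.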
B. Box slicing

For the Box Slicing task, data streaming revealed a hierarchical problem. In the software algorithm from the previous work, boxes were sliding from step to step with a double for-loop. With data coming as FIFO, exactly reproducing this behavior would require the whole point cloud to be sent for each box, which would lead to congestion of the AMBA. So the initial task needs to be reversed: a box is assigned to a point whenever the point is received by the IP, in order to transfer the point cloud only once. For the explanation of this part, only the X dimension will be observed and boxes will be reduced to 2D, but everything written here applies to every other space dimension. The constraint for this task is the overlapping of boxes, which depends on their width and step size. For this work, the step width was defined as {step_width ∈ R | 0 < step_width ≤ box_width}. Hence the hierarchical problem: the overlapping depends on the step width. Two kinds of overlapping are defined in this work. We define the "simple overlapping" as the overlap case when step_width > box_width/2 (see Figure 2).

[Figure 2: Box simple overlapping hierarchical problem. The blue zone represents no box overlapping, the red zone represents two boxes overlapping.]

Algorithm 3: The part of the box slicing algorithm for one dimension from the hardware box slicing task

    width ← point cloud map width;
    box_width ← box width;
    step_width ← step width;
    repeat
        point ← receive point from CPU;
        x ← point.x + width/2;
        x_overlap ← 0;
        x_ID ← x / step_width;
        if step_width > box_width/2 then
            if x − x_ID ∗ step_width < box_width − step_width then
                x_overlap ← 1;
            end
        else
            if x_ID < box_width/step_width then
                x_overlap ← x_ID;
            else if x_ID ≥ (width − box_width)/step_width then
                x_overlap ← width/step_width − x_ID;
            else
                x_overlap ← box_width/step_width;
            end
        end
        for id ← x_ID − x_overlap to x_ID do
            Transmit point and id;
        end
    until no more points received;

Algorithm 4: The dimension merging algorithm from the hardware box slicing task

    width ← point cloud map width;
    step_width ← step width;
    repeat
        point ← receive point;
        x_ID ← receive X dimension ID of point;
        y_ID ← receive Y dimension ID of point;
        box_ID ← x_ID + y_ID ∗ width/step_width;
        Send point and box_ID to CPU;
    until no more points and IDs received;

[Figure 3: Box multiple overlapping hierarchical problem. The blue zone represents no box overlapping, the red zone represents two boxes overlapping. The green zone represents three boxes overlapping.]

    box_ID = trunc(X / step_width)    (1)

    Overlap(X) = 1, if X − box_ID ∗ step_width < box_width − step_width
                 0, otherwise    (2)

    Overlap(X) = box_ID,                     if X_ID < box_width/step_width
                 box_max_ID − box_ID,        if X_ID ≥ box_max_ID − box_width/step_width
                 ⌊box_width/step_width⌋,     otherwise    (3)

In the case of the "simple overlapping", points are first mapped to the corresponding box identifier (ID) (Equation 1). Once the box is mapped, the position of the point in this box is processed in order to find whether the point is in the overlapping zone (Equation 2).
There is an exception for the first and last box of the row, since they have no overlapping zone as defined in Equation 2: if box_ID is 0 or box_max_ID, no box overlapping is possible. Points are always matched to the most advanced box_ID, and if the position of the point in the step_width grid is considered to be in the overlapping zone, the point is also matched to the previous box ID, which is box_ID − 1.

Then there is the "multiple overlapping" (Figure 3), which we define as the overlap case when step_width ≤ box_width/2. This problem is a bit more complex because boxes always overlap; there is no "no-overlapping" zone between overlappings as in the "simple overlapping" problem.

In the case of the "multiple overlapping", points are also matched to a specific box_ID with the same Equation 1. The main difference is the handling of the overlapping zone. The box_ID is always the most advanced box, so it is now important to calculate the maximum overlap of the zone in order to match the correct previous boxes (see Equation 3). Once the point is matched with Equation 1, all boxes containing the point are the boxes between box_ID and box_ID − Overlap(X).

This problem applies to each dimension needed, so dimensions X and Y in this work. Algorithm 3 is the implementation of this problem for one dimension.

But processing each dimension individually lacks the dimension intersection. So once each dimension is processed for the point, the box_ID needs to be finally adjusted to the two dimensions X and Y (see Algorithm 4) to correctly map points to boxes in the Point Cloud. Finally, when all points and boxes are sent to the CPU, all points need to be regrouped into their respective boxes to complete the main HW/SW co-design application. After this, all processing can be done by any supervised machine learning classification system to identify the objects in the boxes. The object position in space is also known, since any box position can be decoded from the box_ID.

V. EXPERIMENTATION

We validate our multiple object detection and recognition application using a hardware Multi-CPU/FPGA platform as a real embedded architecture. The machine learning software was already validated in our previous work [1], so this experimentation is mainly about the hardware acceleration on the real embedded architecture, which for this work is a ZedBoard Multi-CPU/FPGA development platform [12][13].

Three hardware accelerator IPs are made: the Occupancy Grid, the Point Filtering and the Box Slicing. The development is done with Vivado HLS [10] as the HLS software, and the project bitstream is built as a block design project with Vivado [10]. When developing the hardware accelerators with Vivado HLS, the only directive is the "clock period", which is set to 10. All other directives are the Vivado HLS default ones. In order to get results close to the ones of the software, we use floating-point numbers in the hardware accelerators instead of fixed-point numbers, even if it may mean a slower execution time. For the communication between the CPU and the FPGA, a tool called Xillybus [11] is used; it is an FPGA IP core for easy DMA over AXI. This tool is meant to lower the development time because the DMA management is already done. We could then put the focus of our work on the hardware acceleration. For the software part, an OS called Xillinux is used. It is part of the Xillybus solution. This OS is based upon Ubuntu LTS 16.04 for ARM. The OS runs over all cores of the CPU, hence the Multi-CPU application. The machine learning software is made with OpenCV [8] for feature calculation and LIBSVM [7] for the machine learning application. The software is cross-compiled using GCC for ARM with optimization level 2 (-O2).

When proceeding to the experimentation, the configuration used for the hardware acceleration IPs is the following:

    threshold (m): 0.5       width (m): 40          height (m): 40
    number of points: 205,300    grid_width: 120    grid_height: 120
    box_width (m): 1         box_height (m): 1      step_width (m): 1

TABLE I: Values used for the experimentation

VI. RESULTS

In this section, we present the different results of this paper. Only one implementation is done, on one specific real embedded architecture, following the design methodology explained beforehand. All measurements are run on a real embedded architecture, a Digilent ZedBoard Zynq-7000 ARM/FPGA SoC development board.

Table II shows the execution time for the data processing tasks when running as software (see [1]) and when hardware accelerated. As shown there, hardware acceleration has a huge impact on the execution time. The software execution time on the ZedBoard was 103,460 milliseconds; it comes down to 344 ms when hardware accelerated. This represents a 300 times acceleration.

    SW    Occupancy Grid + Point Filtering: 90,230 ms    Box Slicing: 13,230 ms    Total: 103,460 ms
    HW    Occupancy Grid: 98 ms    Point Filtering: 132 ms    Box Slicing: 114 ms    Total: 344 ms

TABLE II: Comparison of SW and HW application execution times

In Figure 4, we compare the execution time between the software profiling and the hardware profiling for the Segmentation (Occupancy Grid and Point Filtering) and the Box Slicing tasks. As shown in this figure, the tasks that benefited from the hardware acceleration the most were the Occupancy Grid and Point Filtering (OG+PF) tasks, with a 392 times acceleration.

Table III shows the resource utilization of our system: a PS/PL system using Xillybus as communication interface and three IPs (Occupancy Grid, Point Filtering and Box Slicing).
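As a sanity check, the reported speedups can be recomputed directly from the values in Table II (Python used purely as a calculator; the dictionary keys are our own labels):

```python
# Execution times from Table II, in milliseconds.
sw_ms = {"occupancy_grid_point_filtering": 90_230, "box_slicing": 13_230}
hw_ms = {"occupancy_grid": 98, "point_filtering": 132, "box_slicing": 114}

sw_total = sum(sw_ms.values())   # 103,460 ms, the software total
hw_total = sum(hw_ms.values())   # 344 ms, the hardware total

overall_speedup = sw_total / hw_total                      # ~300x overall
og_pf_speedup = sw_ms["occupancy_grid_point_filtering"] / (
    hw_ms["occupancy_grid"] + hw_ms["point_filtering"])    # ~392x for OG+PF

print(f"overall: {overall_speedup:.0f}x, OG+PF: {og_pf_speedup:.0f}x")
# prints "overall: 301x, OG+PF: 392x"
```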
[Figure 4: Execution time comparison between the software and hardware profilings, in ms (×10^5 scale; the software bar reaches 90,230 ms).]

VIII. CONCLUSIONS

In this paper, we have presented our experience in designing and prototyping a multiple object detection and