Abstract—This paper presents a new tool flow to realize algorithms in floating-point precision on FPGAs. A customizable multicore soft GPU architecture is used on the hardware side. Two solutions to perform floating-point arithmetic in IEEE-754 single precision are investigated: using standard function calls to GPU-friendly software implementations, or by hardware upgrades to the Processing Elements (PEs). An OpenCL compiler [...]

[Fig. 2. FGPU architecture: an AXI control interface and a work-group scheduler feed the compute units; each CU contains PEs, a work-item scheduler, register files, link and enable stacks, scratchpads, and the RTM.]
perform a task, its work-items (or threads) are bundled into work-groups and scheduled on different CUs. A work-group gets executed inside a CU as a set of wavefronts, each including a fixed number of work-items. To save hardware resources, FGPU performs resource allocation and operation scheduling not for individual work-items but for complete wavefronts. For example, a single PC (Program Counter) is used for the whole wavefront. When executing a conditional branch, the original architecture does not allow the work-items of a wavefront to take different paths. For instance, they all have to execute the same number of iterations of a for-loop.

To support full thread divergence, the CU scheduler uses the Link and Enable stacks (Fig. 2). When branching into two different directions, the execution continues along the path taken by the minority of work-items. The address of the first instruction of the other path is pushed on the link stack. In addition, the top entry on the enable stack is removed and two bit masks are pushed instead. The new entries define which work-items are enabled for each path. When hitting a return instruction, the top entries of both stacks are popped out. Then, the execution continues along the path that was stored on top.

In addition, handling subroutines is realized similarly using the proposed stacks: the return address is pushed on the link stack on calling and popped out on return; the top entry of the enable stack is duplicated on calling and popped out on return as well. Fig. 3.a illustrates an example CFG (Control Flow Graph) where different paths are taken by a full wavefront. The numbers next to the arrows represent how many work-items took the corresponding path. Fig. 3.b shows how the CFG is executed on FGPU. When all work-items select the same path at a branch, the stacks remain unaffected (see path C in Fig. 3). Even if two paths merge later before a return instruction, the shared code will be executed twice (T2 is shared by paths C and D).

Following the minority path at first reduces the required maximum size of the link and enable stacks: for a full wavefront of 64 work-items, only log2(64) entries are needed to manage the worst-case scenario, even when each single work-item takes a different path. In addition, since a CU can host up to 8 wavefronts, only 1 BRAM36 memory block is needed per stack for physical realization.
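The minority-first bookkeeping described above can be sketched in software. The following Python model is illustrative only: the class and method names are our own invention, masks are plain integers, and uniform branches leave the stacks untouched as in the text.

```python
def popcount(mask: int) -> int:
    """Number of enabled work-items in a mask."""
    return bin(mask).count("1")

class DivergenceStacks:
    """Toy model of a CU's per-wavefront link and enable stacks."""

    def __init__(self, initial_mask: int):
        self.link = []                  # deferred-path and return addresses
        self.enable = [initial_mask]    # bit masks of enabled work-items

    def branch(self, taken_pc: int, fallthrough_pc: int, taken_mask: int) -> int:
        """Execute a conditional branch; returns the PC to continue at."""
        active = self.enable[-1]
        taken = taken_mask & active
        not_taken = active & ~taken_mask
        if taken == 0:
            return fallthrough_pc       # uniform branch: stacks unaffected
        if not_taken == 0:
            return taken_pc             # uniform branch: stacks unaffected
        # Divergent: pop the top enable entry, push one mask per path,
        # follow the minority first and remember the other path's address.
        self.enable.pop()
        if popcount(taken) <= popcount(not_taken):
            first, deferred = (taken_pc, taken), (fallthrough_pc, not_taken)
        else:
            first, deferred = (fallthrough_pc, not_taken), (taken_pc, taken)
        self.enable.append(deferred[1])
        self.enable.append(first[1])
        self.link.append(deferred[0])
        return first[0]

    def ret(self) -> int:
        """Pop both stacks and resume the path stored on top of the link stack."""
        self.enable.pop()
        return self.link.pop()

    def call(self, target_pc: int, return_pc: int) -> int:
        """Subroutine call: push the return address, duplicate the enable top."""
        self.link.append(return_pc)
        self.enable.append(self.enable[-1])
        return target_pc
```

Since the minority path never holds more than half of the active work-items, each nested divergence at least halves the enabled set, which is where the log2(64) = 6 entry bound for a 64 work-item wavefront comes from.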
[Fig. 3. (a) An example CFG executed by a wavefront of 64 work-items, with paths A-D and the number of work-items taking each edge; (b) the resulting execution order on FGPU with the link and enable stack contents at each step (jsub = jump to subroutine, br = branch, ret = return).]
B. Scratchpad Memories

Previously, FGPU had neither hardware nor compiler support to implement call stacks or register spilling and filling. Only 32 private registers were available for each work-item to finish all required computations. The improved design uses local scratchpad memories in each CU to solve these problems without overloading the global memory connection (Fig. 2).

Call stack operations are realized using assembly instructions different from normal loads and stores. They are executed directly on scratchpads without scheduling logic. Each work-item is mapped to a fixed region in a scratchpad memory block. The allocated size can be specified at synthesis time according to the application needs. With a single BRAM36 per PE, the call stack of each work-item running on FGPU can be extended by 16 entries. Operations on scratchpads are aligned to 32 bits.

C. Hard Floating-Point Support

All FP operations required by the targeted benchmarks are implemented in 32-bit single precision based on the floating-point IP core from Xilinx [10]. The generated cores are fully pipelined, i.e. a new result per clock cycle can be calculated. When an instruction is implemented, a hardware instance is created per PE. However, the FP logic is not realized within the ALU pipeline due to its high latency: it takes 28 clock cycles to compute a division or a square root. Instead, all FP operations are implemented within a separate module inside a CU (see Fig. 2).

To simplify instruction scheduling, it is assumed that all FP operations have the same latency. The unified value is set at synthesis time to the maximum latency of all individual FP operations. Since integer and FP instructions have different delays and they are issued concurrently from multiple wavefronts and executed on the same compute vector, simultaneous writing of ALU and FP results into the register files has to be avoided. In addition, writing data loaded from scratchpads and global memory would intensify the bottleneck at the write port of the register files. To solve this problem, we used 4 physical block memories to hold the register files with time division multiplexing.

D. Improved Scalability

Because of their high area demands, realizing FP operations may be associated with severe degradations in the operation frequency. The reference architecture uses a doubled clock frequency for the ALUs and their register files [8]. After place and route, we often realized that the critical paths are the ones which connect the two clock domains. To mitigate the frequency degradation, we eliminated the faster clock domain and redesigned the ALU pipeline. The improved design has the same throughput but it can be synthesized at frequencies over 200MHz even when 99.9% of the available FPGA slices were utilized.

[Fig. 4. The suggested tool flow based on PYNQ: host code, device code, and the bitstream are handled through a web browser and a web application containing the FGPU compiler, on top of the operating system, targeting the processing system and programmable logic of the Zynq device.]

III. TOOL FLOW

A. Compiler Support

FGPU already has an OpenCL compiler based on the LLVM framework. It uses clang as a frontend with a specially developed backend. Besides the integration of FP support, we enabled function calls and register spilling. Two different calling conventions are used at the same time. For kernel functions, which are the entry points when launching tasks on FGPU, the parameters are accessed via special assembly instructions. For all other functions that can be called from any software running on FGPU, we used calling conventions similar to the ones of the MIPS processor: the first four parameters as well as the returned value are passed in registers; more parameters can be passed on the stack. This is expected to be faster and reduces the maximum required stack size.

If the required FP operation is supported in hardware, the compiler uses an assembly instruction to implement it. Otherwise, the corresponding standard function is called, e.g. __mulsf3 for multiplication. The implementation of these functions is described later in Section III-B. FP comparison illustrates a special case: only the set if less than (fslt) variant is supported in hardware. The compiler realizes all other comparison types using fslt and some extra logical instructions.

B. Soft Floating-Point

Compiler-rt is a part of the LLVM project. It includes a runtime library for FP emulated with integer instructions, which has been used as the basis of this work. However, software implementations of FP operations use conditional code intensively. Therefore, we modified the existing implementations for faster execution on FGPU without affecting their functionality. For instance, the execution of if-else code blocks is delayed to later points if possible, or both the if- and else-parts are executed if the estimated penalty is less than that of a divergence point in the corresponding CFG.

C. Application Programming Interface

PYNQ is a recent open-source project from Xilinx to program Zynq SoCs with IPython (Interactive Python) [11]. It is based on the Jupyter Notebook, a browser-based interactive computing environment. Currently, only the PYNQ-Z1 board is officially supported. We modified the source files
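The comparison lowering mentioned in the compiler discussion (everything derived from fslt plus logical instructions) can be illustrated with a few identities. This is a sketch, not FGPU's actual code generation, and it ignores NaN semantics, which a real lowering must treat separately:

```python
def fslt(a: float, b: float) -> bool:
    """The only FP comparison assumed in hardware: set if less than."""
    return a < b

# Identities a compiler could emit to build the remaining comparisons
# from fslt and logical operations (NaN handling omitted for brevity):
def fgt(a, b): return fslt(b, a)                         # a >  b
def fle(a, b): return not fslt(b, a)                     # a <= b
def fge(a, b): return not fslt(a, b)                     # a >= b
def feq(a, b): return not fslt(a, b) and not fslt(b, a)  # a == b
def fne(a, b): return fslt(a, b) or fslt(b, a)           # a != b
```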
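The branch-avoiding rewrite applied to the soft floating-point routines can be shown in miniature. The helper below is hypothetical (compiler-rt's actual changes operate inside the emulation routines themselves); it demonstrates the pattern of computing both sides of an if-else and selecting the result, so a wavefront never diverges at this point:

```python
def select(cond: bool, if_val: int, else_val: int) -> int:
    """Branch-free select: both candidate values are already computed."""
    mask = -int(cond)                   # all-ones if cond, else zero
    return (if_val & mask) | (else_val & ~mask)

def align_exponents(exp_a: int, exp_b: int) -> tuple:
    """Instead of 'if exp_a > exp_b: ... else: ...', compute both shift
    candidates and select, so all work-items follow the same path."""
    bigger = select(exp_a > exp_b, exp_a, exp_b)
    shift = select(exp_a > exp_b, exp_a - exp_b, exp_b - exp_a)
    return bigger, shift
```

This trades a few extra integer operations for the divergence penalty described above, which pays off whenever the estimated cost of executing both parts is below that of a divergence point.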
[Table I. Our benchmarks with the required FP operations (benchmark, description, and used FP operations; the individual entries are not recoverable from this extraction).]

[Fig. 6. Average wall clock speedup over ARM and NEON when varying the problem size from 1K to 256K.]

[Fig. 8. Relative resource utilization of the major FGPU parts when 4 CUs are implemented with all considered floating-point operations (PEs, floating-point logic, memory controller, schedulers, and control logic, among others).]
[Fig. 7. Relative throughput of processed data using soft floating-point implementations with respect to the hardened ones (ranging from about 6% to 94% across the benchmarks).]

[Fig. 9. Improvement in compute density and power saving over the MicroBlaze when using FGPU with hard floating-point support.]
highest utilization ratio. About 124-180K LUTs were required to implement any of the FGPU cores we used in this work (corresponding to 57-82% of the available LUTs). In comparison to the reference MicroBlaze (about 7K LUTs), the FGPU cores are approximately 18-26x bigger.

To decide whether it is worthwhile to use FGPU, we used the compute density (CD) metric defined in [12]:

Compute Density = Throughput (kBytes/us) / Area (1K LUTs)

We calculated the CD values for FGPU and MicroBlaze for all benchmarks and problem sizes between 1-256K. Then, we averaged over problem size and plotted the ratios of FGPU values to the ones of the MicroBlaze in Fig. 9. On average, FGPU with hard FP support has 2.9x better compute density. For FFT, up to 15.4x better throughput per area is recorded. To estimate the area overhead of FGPU, we implemented only the required FP operations for each benchmark.

Since the throughput, power consumption and area overhead of a homogeneous MPSoC scale linearly in the best case with the number of processors, the CD and the energy consumption of an MPSoC cannot be better than the corresponding ones of a single processor. Hence, we may conclude that on average, using FGPUs for general purpose FP computation has a better throughput per area as well as lower energy consumption than any MicroBlaze-based MPSoC.

D. Power and Energy Consumption

Using on-board power measurements, the utilized FGPUs consumed 4.9-7.5 Watt over all benchmarks for soft- or hard-FP computation. On the other side, the MicroBlaze consumed a maximum of 1.24 Watt and about 5.4x less power on average. However, FGPU with hard FP support needs on average 11.2x less energy than the MicroBlaze (see Fig. 9). In the worst case, 2.5x less energy was consumed by FGPU to compute any task.

E. Comparison to HLS

We synthesized 4 of our benchmarks using Vivado HLS (v2016.2). We used the code of the ARM implementations as a basis for our experiments with HLS. The synthesized modules own the necessary logic to read and write the global memory without using external DMAs. Table II lists the needed number of LUTs for multiple implementations:
TABLE II
MULTIPLE HLS IMPLEMENTATIONS OF SELECTED BENCHMARKS WITH THE EXECUTION TIME AND REQUIRED AREA

                  Solution 1       Solution 2       Solution 3
Benchmark         Time    Area     Time    Area     Time    Area
                  msec    #LUTs    msec    #LUTs    msec    #LUTs
FIR (12 Taps)     7.2     2.5 K    5.6     12 K     0.23    5 K
N-body            15301   6.2 K    1910    34 K     -       -
Bitonic Sort      113     2.4 K    44      28 K     -       -
Floyd-Warshall    638     2.4 K    130     9 K      -       -

• "Solution 1" refers to the case where we applied only some simple HLS optimizations like loop pipelining. Because the problem size was not specified and hence the loop bounds were unknown, HLS could only pipeline the inner loop [13].
• In "Solution 2", we assumed that the problem size is a multiple of 64. In other words, we transformed the inner loop into two ones, where the second loop has 64 iterations. In this case, HLS can unroll the innermost loop with the fixed bound and pipeline the upper one next to it. Moreover, we used local buffering of read data with burst-shaped read and write using the memcpy function.
• "Solution 3" was applied only on the FIR filter, where all parameters were fixed at synthesis time, i.e. problem size and number of filter taps.

We could place and route all implementations at frequencies from 187-250MHz. The measured execution times for a problem size of 16K (for Floyd-Warshall) or 8K (for all others) are reported in Table II. We found that the implementations from the groups "Solution 1" and "Solution 2" are 28-81x and 6.4-35x slower than the FGPU ones, respectively. FIR-Solution 3 is slightly slower than FGPU by a factor of 1.4x.

Hence, when flexibility is demanded, e.g. sorting arrays with variable lengths, simple hardware implementations of complex algorithms with HLS are not expected to perform as well as the FGPU ones. The main difference between the two approaches is the scheduling part: while HLS allocates the hardware and schedules the required operations statically at synthesis time, FGPU solves this problem jointly with other ones, like memory stalls, at runtime using special hardware controllers.

[Fig. 10. Improvement in compute density and power efficiency using FGPU over Vivado HLS for the individual benchmark solutions.]

Fig. 10 depicts the improvement in compute density and energy saving when using FGPU over HLS. Because all task parameters are fixed at synthesis time, Vivado HLS can synthesize FIR-Solution 3 very effectively. Otherwise, FGPU can achieve better compute densities and energy efficiency in most cases. Nevertheless, efficient coding for HLS should reflect a good hardware structure, which may not be known to the developer when complex algorithms have to be implemented, e.g. sorting. On the other side, simple and compact software implementations are often enough to get the best performance out of an FGPU.

V. CONCLUSION

Soft GPUs offer a very flexible and efficient tool flow to implement FP arithmetic on FPGAs. They can deliver better processing throughput per area and need less energy than a single soft processor or homogeneous MPSoCs. In addition, faster execution can be achieved in comparison to hard vector coprocessors. In comparison to HLS, they can be programmed much more easily. Simple software implementations of complex algorithms on a soft GPU can give much better throughput than HLS ones, if the latter do not reflect efficient hardware structures. Moreover, the compute density and energy efficiency with HLS may degrade if some parameters are not fixed at synthesis time, while the ones of a soft GPU remain unaffected.

REFERENCES

[1] Nvidia Corp., "NVIDIA Tesla P100," White Paper (WP-08019-001 v01.1), 2016.
[2] D. Capalija and T. S. Abdelrahman, "A High-performance Overlay Architecture for Pipelined Execution of Data Flow Graphs," in 2013 23rd International Conference on Field Programmable Logic and Applications, Sept 2013, pp. 1-8.
[3] A. K. Jain, D. L. Maskell, and S. A. Fahmy, "Are Coarse-Grained Overlays Ready for General Purpose Application Acceleration on FPGAs?" in IEEE DASC/PiCom/DataCom/CyberSciTech, Aug 2016, pp. 586-593.
[4] "Top500 Supercomputers," http://www.top500.org [Online; accessed 05-Jan-2017].
[5] D. Chen and D. Singh, "Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms," in 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2013, pp. 297-304.
[6] Nvidia Corp., "GPU-Based Deep Learning Inference: A Performance and Power Analysis," White Paper, 2015.
[7] Xilinx, Inc., "UltraScale Architecture and Product Overview (v2.10), DS890," 2016.
[8] M. Al Kadi, B. Janssen, and M. Huebner, "FGPU: An SIMT-Architecture for FPGAs," ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 254-263.
[9] Khronos Group, "OpenCL 1.2 Specification," 2012.
[10] Xilinx, Inc., "Floating-Point Operator v7.1, LogiCORE IP Product Guide (PG060)," 2015.
[11] "PYNQ Project," http://www.pynq.io [Online; accessed 15-Jan-2017].
[12] R. Rashid, J. G. Steffan, and V. Betz, "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS," in FPT'14, Dec 2014, pp. 20-27.
[13] Xilinx, Inc., "Vivado Design Suite, User Guide (UG902 v2016.2)," 2016.