Abstract—This paper presents a new tool flow to realize algorithms in floating-point precision on FPGAs. A customizable multicore soft GPU architecture is used on the hardware side. Two solutions to perform floating-point arithmetic in IEEE-754 single precision are investigated: using standard function calls to GPU-friendly software implementations, or by hardware upgrades to the Processing Elements (PEs). An OpenCL compiler [...]

[Fig. 2. FGPU architecture: an AXI control interface and a work-group scheduler feed the compute units; each CU contains PEs, a work-item scheduler, register files, link and enable stacks, scratchpads, and the RTM.]
perform a task, its work-items (or threads) are bundled into work-groups and scheduled on different CUs. A work-group gets executed inside a CU as a set of wavefronts, each including a fixed number of work-items. To save hardware resources, FGPU performs resource allocation and operation scheduling not for individual work-items but for complete wavefronts. For example, a single PC (Program Counter) is used for the whole wavefront. When executing a conditional branch, the original architecture does not allow the work-items of a wavefront to take different paths. For instance, they all have to execute the same number of iterations of a for-loop.

To support full thread divergence, the CU scheduler uses the Link and Enable stacks (Fig. 2). When branching into two different directions, the execution continues along the path taken by the minority of work-items. The address of the first instruction of the other path is pushed on the link stack. In addition, the top entry on the enable stack is removed and two bit masks are pushed instead. The new entries define which work-items are enabled for each path. When hitting a return instruction, the top entries of both stacks are popped out. Then, the execution continues along the path that was stored on top.

In addition, handling subroutines is realized similarly using the proposed stacks: the return address is pushed on the link stack on calling and popped out on return; the top entry of the enable stack is duplicated on calling and popped out on return as well. Fig. 3.a illustrates an example CFG (Control Flow Graph) where different paths are taken by a full wavefront. The numbers next to the arrows represent how many work-items took the corresponding path. Fig. 3.b shows how the CFG is executed on FGPU. When all work-items select the same path at a branch, the stacks remain unaffected (see path C in Fig. 3). Even if two paths merge later before a return instruction, the shared code will be executed twice (T2 is shared by paths C and D).

Following the minority path at first reduces the required maximum size of the link and enable stacks: for a full wavefront of 64 work-items, only log2(64) entries are needed to manage the worst-case scenario, even when each single work-item takes a different path. In addition, since a CU can host up to 8 wavefronts, only 1 BRAM36 memory block is needed per stack for physical realization.
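The minority-first bookkeeping described above can be sketched in software. The following Python model is illustrative only: the class and method names are our own invention, masks are plain integers, and uniform branches leave the stacks untouched as in the text.

```python
def popcount(mask: int) -> int:
    """Number of enabled work-items in a mask."""
    return bin(mask).count("1")

class DivergenceStacks:
    """Toy model of a CU's per-wavefront link and enable stacks."""

    def __init__(self, initial_mask: int):
        self.link = []                  # deferred-path and return addresses
        self.enable = [initial_mask]    # bit masks of enabled work-items

    def branch(self, taken_pc: int, fallthrough_pc: int, taken_mask: int) -> int:
        """Execute a conditional branch; returns the PC to continue at."""
        active = self.enable[-1]
        taken = taken_mask & active
        not_taken = active & ~taken_mask
        if taken == 0:
            return fallthrough_pc       # uniform branch: stacks unaffected
        if not_taken == 0:
            return taken_pc             # uniform branch: stacks unaffected
        # Divergent: pop the top enable entry, push one mask per path,
        # follow the minority first and remember the other path's address.
        self.enable.pop()
        if popcount(taken) <= popcount(not_taken):
            first, deferred = (taken_pc, taken), (fallthrough_pc, not_taken)
        else:
            first, deferred = (fallthrough_pc, not_taken), (taken_pc, taken)
        self.enable.append(deferred[1])
        self.enable.append(first[1])
        self.link.append(deferred[0])
        return first[0]

    def ret(self) -> int:
        """Pop both stacks and resume the path stored on top of the link stack."""
        self.enable.pop()
        return self.link.pop()

    def call(self, target_pc: int, return_pc: int) -> int:
        """Subroutine call: push the return address, duplicate the enable top."""
        self.link.append(return_pc)
        self.enable.append(self.enable[-1])
        return target_pc
```

Since the minority path never holds more than half of the active work-items, each nested divergence at least halves the enabled set, which is where the log2(64) = 6 entry bound for a 64 work-item wavefront comes from.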
[Fig. 3. (a) An example CFG executed by a wavefront of 64 work-items, with paths A-D and the number of work-items taking each edge; (b) the resulting execution order on FGPU with the link and enable stack contents at each step (jsub = jump to subroutine, br = branch, ret = return).]
B. Scratchpad Memories

Previously, FGPU had neither hardware nor compiler support to implement call stacks or register spilling and filling. Only 32 private registers were available for each work-item to finish all required computations. The improved design uses local scratchpad memories in each CU to solve these problems without overloading the global memory connection (Fig. 2).

Call stack operations are realized using assembly instructions different from normal loads and stores. They are executed directly on scratchpads without scheduling logic. Each work-item is mapped to a fixed region in a scratchpad memory block. The allocated size can be specified at synthesis time according to the application needs. With a single BRAM36 per PE, the call stack of each work-item running on FGPU can be extended by 16 entries. Operations on scratchpads are aligned to 32 bits.

C. Hard Floating-Point Support

All FP operations required by the targeted benchmarks are implemented in 32-bit single precision based on the floating-point IP core from Xilinx [10]. The generated cores are fully pipelined, i.e. a new result per clock cycle can be calculated. When an instruction is implemented, a hardware instance is created per PE. However, the FP logic is not realized within the ALU pipeline due to its high latency: it takes 28 clock cycles to compute a division or a square root. Instead, all FP operations are implemented within a separate module inside a CU (see Fig. 2).

To simplify instruction scheduling, it is assumed that all FP operations have the same latency. The unified value is set at synthesis time to the maximum latency of all individual FP operations. Since integer and FP instructions have different delays and they are issued concurrently from multiple wavefronts and executed on the same compute vector, simultaneous writing of ALU and FP results into the register files has to be avoided. In addition, writing data loaded from scratchpads and global memory would intensify the bottleneck at the write port of the register files. To solve this problem, we used 4 physical block memories to hold the register files with time division multiplexing.

D. Improved Scalability

Because of their high area demands, realizing FP operations may be associated with severe degradations in the operation frequency. The reference architecture uses a doubled clock frequency for the ALUs and their register files [8]. After place and route, we often realized that the critical paths are the ones which connect the two clock domains. To mitigate the frequency degradation, we eliminated the faster clock domain and redesigned the ALU pipeline. The improved design has the same throughput but it can be synthesized at frequencies over 200MHz even when 99.9% of the available FPGA slices were utilized.

[Fig. 4. The suggested tool flow based on PYNQ: host code, device code, and the bitstream are handled through a web browser and a web application containing the FGPU compiler, on top of the operating system, targeting the processing system and programmable logic of the Zynq device.]

III. TOOL FLOW

A. Compiler Support

FGPU already has an OpenCL compiler based on the LLVM framework. It uses clang as a frontend with a specially developed backend. Besides the integration of FP support, we enabled function calls and register spilling. Two different calling conventions are used at the same time. For kernel functions, which are the entry points when launching tasks on FGPU, the parameters are accessed via special assembly instructions. For all other functions that can be called from any software running on FGPU, we used calling conventions similar to the ones of the MIPS processor: the first four parameters as well as the returned value are passed in registers; more parameters can be passed on the stack. This is expected to be faster and reduces the maximum required stack size.

If the required FP operation is supported in hardware, the compiler uses an assembly instruction to implement it. Otherwise, the corresponding standard function is called, e.g. __mulsf3 for multiplication. The implementation of these functions is described later in Section III-B. FP comparison illustrates a special case: only the set if less than (fslt) variant is supported in hardware. The compiler realizes all other comparison types using fslt and some extra logical instructions.

B. Soft Floating-Point

Compiler-rt is a part of the LLVM project. It includes a runtime library for FP emulated with integer instructions, which has been used as the basis of this work. However, software implementations of FP operations use conditional code intensively. Therefore, we modified the existing implementations for faster execution on FGPU without affecting their functionality. For instance, the execution of if-else code blocks is delayed to later points if possible, or both the if- and else-parts are executed if the estimated penalty is less than that of a divergence point in the corresponding CFG.

C. Application Programming Interface

PYNQ is a recent open-source project from Xilinx to program Zynq SoCs with IPython (Interactive Python) [11]. It is based on the Jupyter Notebook, a browser-based interactive computing environment. Currently, only the PYNQ-Z1 board is officially supported. We modified the source files
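The comparison lowering mentioned in the compiler discussion (everything derived from fslt plus logical instructions) can be illustrated with a few identities. This is a sketch, not FGPU's actual code generation, and it ignores NaN semantics, which a real lowering must treat separately:

```python
def fslt(a: float, b: float) -> bool:
    """The only FP comparison assumed in hardware: set if less than."""
    return a < b

# Identities a compiler could emit to build the remaining comparisons
# from fslt and logical operations (NaN handling omitted for brevity):
def fgt(a, b): return fslt(b, a)                         # a >  b
def fle(a, b): return not fslt(b, a)                     # a <= b
def fge(a, b): return not fslt(a, b)                     # a >= b
def feq(a, b): return not fslt(a, b) and not fslt(b, a)  # a == b
def fne(a, b): return fslt(a, b) or fslt(b, a)           # a != b
```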
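The branch-avoiding rewrite applied to the soft floating-point routines can be shown in miniature. The helper below is hypothetical (compiler-rt's actual changes operate inside the emulation routines themselves); it demonstrates the pattern of computing both sides of an if-else and selecting the result, so a wavefront never diverges at this point:

```python
def select(cond: bool, if_val: int, else_val: int) -> int:
    """Branch-free select: both candidate values are already computed."""
    mask = -int(cond)                   # all-ones if cond, else zero
    return (if_val & mask) | (else_val & ~mask)

def align_exponents(exp_a: int, exp_b: int) -> tuple:
    """Instead of 'if exp_a > exp_b: ... else: ...', compute both shift
    candidates and select, so all work-items follow the same path."""
    bigger = select(exp_a > exp_b, exp_a, exp_b)
    shift = select(exp_a > exp_b, exp_a - exp_b, exp_b - exp_a)
    return bigger, shift
```

This trades a few extra integer operations for the divergence penalty described above, which pays off whenever the estimated cost of executing both parts is below that of a divergence point.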
[Table I. Our benchmarks with the required FP operations (benchmark, description, and used FP operations; the individual entries are not recoverable from this extraction).]

[Fig. 6. Average wall clock speedup over ARM and NEON when varying the problem size from 1K to 256K.]

[Fig. 8. Relative resource utilization of the major FGPU parts when 4 CUs are implemented with all considered floating-point operations (PEs, floating-point logic, memory controller, schedulers, and control logic, among others).]
[Fig. 7. Relative throughput of processed data using soft floating-point implementations with respect to the hardened ones (ranging from about 6% to 94% across the benchmarks).]

[Fig. 9. Improvement in compute density and power saving over the MicroBlaze when using FGPU with hard floating-point support.]
highest utilization ratio. About 124-180K LUTs were required to implement any of the FGPU cores we used in this work (corresponding to 57-82% of the available LUTs). In comparison to the reference MicroBlaze (about 7K LUTs), the FGPU cores are approximately 18-26x bigger.

To decide whether it is worthwhile to use FGPU, we used the compute density (CD) metric defined in [12]:

Compute Density = Throughput (kBytes/us) / Area (1K LUTs)

We calculated the CD values for FGPU and MicroBlaze for all benchmarks and problem sizes between 1-256K. Then, we averaged over problem size and plotted the ratios of FGPU values to the ones of the MicroBlaze in Fig. 9. On average, FGPU with hard FP support has 2.9x better compute density. For FFT, up to 15.4x better throughput per area is recorded. To estimate the area overhead of FGPU, we implemented only the required FP operations for each benchmark.

Since the throughput, power consumption and area overhead of a homogeneous MPSoC scale linearly in the best case with the number of processors, the CD and the energy consumption of an MPSoC cannot be better than the corresponding ones of a single processor. Hence, we may conclude that on average, using FGPUs for general purpose FP computation has a better throughput per area as well as lower energy consumption than any MicroBlaze-based MPSoC.

D. Power and Energy Consumption

Using on-board power measurements, the utilized FGPUs consumed 4.9-7.5 Watt over all benchmarks for soft- or hard-FP computation. On the other side, the MicroBlaze consumed a maximum of 1.24 Watt and about 5.4x less power on average. However, FGPU with hard FP support needs on average 11.2x less energy than the MicroBlaze (see Fig. 9). In the worst case, 2.5x less energy was consumed by FGPU to compute any task.

E. Comparison to HLS

We synthesized 4 of our benchmarks using Vivado HLS (v2016.2). We used the code of the ARM implementations as a basis for our experiments with HLS. The synthesized modules own the necessary logic to read and write the global memory without using external DMAs. Table II lists the needed number of LUTs for multiple implementations:
TABLE II
MULTIPLE HLS IMPLEMENTATIONS OF SELECTED BENCHMARKS WITH THE EXECUTION TIME AND REQUIRED AREA

                  Solution 1       Solution 2       Solution 3
Benchmark         Time    Area     Time    Area     Time    Area
                  msec    #LUTs    msec    #LUTs    msec    #LUTs
FIR (12 Taps)     7.2     2.5 K    5.6     12 K     0.23    5 K
N-body            15301   6.2 K    1910    34 K     -       -
Bitonic Sort      113     2.4 K    44      28 K     -       -
Floyd-Warshall    638     2.4 K    130     9 K      -       -

• "Solution 1" refers to the case where we applied only some simple HLS optimizations like loop pipelining. Because the problem size was not specified and hence the loop bounds were unknown, HLS could only pipeline the inner loop [13].
• In "Solution 2", we assumed that the problem size is a multiple of 64. In other words, we transformed the inner loop into two ones, where the second loop has 64 iterations. In this case, HLS can unroll the innermost loop with the fixed bound and pipeline the upper one next to it. Moreover, we used local buffering of read data with burst-shaped read and write using the memcpy function.
• "Solution 3" was applied only on the FIR filter, where all parameters were fixed at synthesis time, i.e. problem size and number of filter taps.

We could place and route all implementations at frequencies from 187-250MHz. The measured execution times for a problem size of 16K (for Floyd-Warshall) or 8K (for all others) are reported in Table II. We found that the implementations from the groups "Solution 1" and "Solution 2" are 28-81x and 6.4-35x slower than the FGPU ones, respectively. FIR-Solution 3 is slightly slower than FGPU by a factor of 1.4x.

Hence, when flexibility is demanded, e.g. sorting arrays with variable lengths, simple hardware implementations of complex algorithms with HLS are not expected to perform as well as the FGPU ones. The main difference between the two approaches is the scheduling part: while HLS allocates the hardware and schedules the required operations statically at synthesis time, FGPU solves this problem jointly with other ones, like memory stalls, at runtime using special hardware controllers.

[Fig. 10. Improvement in compute density and power efficiency using FGPU over Vivado HLS for the individual benchmark solutions.]

Fig. 10 depicts the improvement in compute density and energy saving when using FGPU over HLS. Because all task parameters are fixed at synthesis time, Vivado HLS can synthesize FIR-Solution 3 very effectively. Otherwise, FGPU can achieve better compute densities and energy efficiency in most cases. Nevertheless, efficient coding for HLS should reflect a good hardware structure, which may not be known to the developer when complex algorithms have to be implemented, e.g. sorting. On the other side, simple and compact software implementations are often enough to get the best performance out of an FGPU.

V. CONCLUSION

Soft GPUs offer a very flexible and efficient tool flow to implement FP arithmetic on FPGAs. They can deliver better processing throughput per area and need less energy than a single soft processor or homogeneous MPSoCs. In addition, faster execution can be achieved in comparison to hard vector coprocessors. In comparison to HLS, they can be programmed much more easily. Simple software implementations of complex algorithms on a soft GPU can give much better throughput than HLS ones, if the latter do not reflect efficient hardware structures. Moreover, the compute density and energy efficiency with HLS may degrade if some parameters are not fixed at synthesis time, while the ones of a soft GPU remain unaffected.

REFERENCES

[1] Nvidia Corp., "NVIDIA Tesla P100," White Paper (WP-08019-001 v01.1), 2016.
[2] D. Capalija and T. S. Abdelrahman, "A High-performance Overlay Architecture for Pipelined Execution of Data Flow Graphs," in 2013 23rd International Conference on Field Programmable Logic and Applications, Sept 2013, pp. 1-8.
[3] A. K. Jain, D. L. Maskell, and S. A. Fahmy, "Are Coarse-Grained Overlays Ready for General Purpose Application Acceleration on FPGAs?" in IEEE DASC/PiCom/DataCom/CyberSciTech, Aug 2016, pp. 586-593.
[4] "Top500 Supercomputers," http://www.top500.org [Online; accessed 05-Jan-2017].
[5] D. Chen and D. Singh, "Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms," in 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2013, pp. 297-304.
[6] Nvidia Corp., "GPU-Based Deep Learning Inference: A Performance and Power Analysis," White Paper, 2015.
[7] Xilinx, Inc., "UltraScale Architecture and Product Overview (v2.10), DS890," 2016.
[8] M. Al Kadi, B. Janssen, and M. Huebner, "FGPU: An SIMT-Architecture for FPGAs," ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 254-263.
[9] Khronos Group, "OpenCL 1.2 Specification," 2012.
[10] Xilinx, Inc., "Floating-Point Operator v7.1, LogiCORE IP Product Guide (PG060)," 2015.
[11] "PYNQ Project," http://www.pynq.io [Online; accessed 15-Jan-2017].
[12] R. Rashid, J. G. Steffan, and V. Betz, "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS," in FPT'14, Dec 2014, pp. 20-27.
[13] Xilinx, Inc., "Vivado Design Suite, User Guide (UG902 v2016.2)," 2016.