
J Sign Process Syst (2016) 85:341–353

DOI 10.1007/s11265-015-1054-9

HARP2: An X-Scale Reconfigurable Accelerator-Rich Platform for Massively-Parallel Signal Processing Algorithms

Waqar Hussain1 · Roberto Airoldi1 · Henry Hoffmann2 · Tapani Ahonen1 · Jari Nurmi1

Received: 12 December 2014 / Revised: 23 June 2015 / Accepted: 21 September 2015 / Published online: 17 October 2015
© Springer Science+Business Media New York 2015

Abstract This paper presents the design, development and evaluation of an eXtra-large Scale, Homogeneous and Heterogeneous Accelerator-Rich Platform (HARP2) for massively parallel signal processing algorithms. HARP is an integrated platform of multiple Coarse-Grained Reconfigurable Arrays (CGRAs) over a Network-on-Chip (NoC), where each CGRA is scaled and tailored for a specific application. The NoC consists of nine nodes in a 3-rows×3-columns topology and acts as the communication backbone between the different CGRAs. In this experimental work, the HARP template is used to instantiate a homogeneous (HARP-hom) and a heterogeneous (HARP-het) platform. HARP-het is generated for a proof-of-concept test to verify the design and functionality of HARP; it also provides insight into many features of the design and its evaluation in terms of different performance metrics. The other version (HARP-hom) is instantiated for a relatively realistic design problem, i.e., satisfying the execution-time constraints imposed on Fast Fourier Transform processing in IEEE-802.11n demodulators. Both versions of HARP are compared against some of the existing state-of-the-art platforms using different performance metrics. The HARP versions are designed to illustrate large-scale homogeneous/heterogeneous multicore architectures while presenting the advantages of maximizing the number of reconfigurable processing resources on a single chip.

Keywords Reconfigurable · CGRA · Network-on-Chip · Heterogeneous · Homogeneous · Accelerator-Rich · Multicore

Waqar Hussain
waqar.hussain@tut.fi

Roberto Airoldi
roberto.airoldi@tut.fi

Henry Hoffmann
hankhoffmann@cs.uchicago.edu

Tapani Ahonen
tapani.ahonen@tut.fi

Jari Nurmi
jari.nurmi@tut.fi

1 Department of Electronics and Communications Engineering, Tampere University of Technology, P.O. Box 527, FI-33101 Tampere, Finland

2 Department of Computer Science, The University of Chicago, Ryerson Hall 250, 1100 E. 58th Street, Chicago, IL 60637, USA

Received: 12 December 2014 / Revised: 23 June 2015 / Accepted: 21 September 2015 / Published online: 17 October 2015

1 Introduction

Today's challenges for computing are of much larger scale than in the past and, on the other hand, the demand for performance is constrained by many factors. In single-core platforms, performance increases were achieved through technology scaling over several years. This trend eventually ended due to increasing power density; a solution was found in multicore systems, and multicore scaling proved to yield reasonable performance advantages. The scaling of multicore systems, however, could not be sustained for long and suffered from several constraints. In reference [1], one of these constraints is described as the 'utilization wall'. It states that, under a fixed power budget, all the cores of a homogeneous multicore chip cannot be operated at their maximum frequency due to the limiting factor of power density. This limiting factor imposing the utilization wall gave rise to Dark Silicon, i.e., the silicon on the chip

that is dark (switched off) or dim (operated at very low frequency). One way of dealing with the issue of Dark Silicon is to replace the dark and dim parts of the chip with special-purpose accelerators that can operate at very low frequency but yield very high performance levels. In this context, critical-priority applications can be selected for the design of application-specific accelerators that then replace the under-utilized part of the chip.

In this paper, the focus is on the design of Heterogeneous Accelerator-Rich Platforms (HARP) to deal with the issue of dark silicon. A VHDL template of HARP is designed, and two instances, HARP-hom and HARP-het, are generated from the template; they are the homogeneous and heterogeneous versions of the platform, respectively. The HARP-het version is a 408 reconfigurable Processing Element (rPE) instance generated for a proof-of-concept test-case. The test-case only serves to verify the functionality of the design and to evaluate it in terms of different performance metrics. The other instance, HARP-hom, is a relatively realistic design made to satisfy the execution-time constraints of the IEEE-802.11n standard for Fast Fourier Transform (FFT) [16] processing. The HARP-hom version is designed by instantiating a total of 256 rPEs. The communication infrastructure used in the design of HARP2 is a Network-on-Chip (NoC) of nine nodes arranged in three rows and three columns. The central node of the NoC is the master node, which contains a Reduced Instruction-Set Computing (RISC) processor, while the rest of the nodes are slave nodes containing an application-specific Coarse-Grained Reconfigurable Array (CGRA) and a Direct Memory Access (DMA) device. The CGRAs used for integration with the NoC are of application-specific dimensions and act as the accelerating engine of the slave node.

CGRAs are a powerful solution in both industrial and academic research. They have a proven track record of processing massively parallel signal processing applications. Examples of general-purpose CGRAs are BUTTER [2], MorphoSys [3], ADRES [4] and PACT-XPP [5]. On the other side, CREMA [6], AVATAR [7] and SCREMA [8] are template-based CGRAs which can be crafted to application specifications. HARP is made by integrating template-based CGRAs over its nodes. In the recent past, accelerator architectures were typically tightly coupled with a general-purpose RISC processor, so the resources of the accelerator were dedicated only to the coupled processor. In loose coupling, the resources of multiple cores can be shared for the purposes of computation and communication. Tight coupling was achieved using a Direct Memory Access (DMA) device or by directly integrating the CGRA into the datapath of the RISC processor. Examples of tightly coupled CGRA architectures are presented in [9, 10] and [11]. The advantage of tight coupling is a faster communication channel between the CGRAs and the RISC processor. The disadvantage arises in multicore systems, where multiple CGRAs cannot access each other's resources. Loose coupling eventually results in a lower communication bandwidth, but allows multiple CGRAs to access each other's computational and memory resources. An example of such a communication infrastructure is an NoC. The bandwidth drawback can be overcome if the data is transferred to all the nodes simultaneously.

This paper presents the advantages of using the loosely coupled architecture of HARP and motivates maximizing the number of rPEs on a system-on-chip. HARP-het is designed by placing seven heterogeneous accelerators around a RISC core connected to the central node, while HARP-hom is made by using four homogeneous accelerators integrated around the same RISC core. The key benefits of using HARP are as follows.

1. The number of rPEs is maximized to enhance the digital signal processing capacity of the platform. Furthermore, the scalable CGRAs composed of rPEs are integrated over an NoC such that they are loosely coupled and capable of sharing each other's resources.

2. The RISC can establish centralized, and allow distributed, supervisory control over all the processing cores for independent and simultaneous execution of multiple kernels on the platform. The control mechanism is based on a message-passing scheme over the NoC.

This paper is organized as follows. Section 2 presents state-of-the-art platforms related to HARP2. Section 3 explains the architecture of the CGRAs used for integration in HARP2, i.e., CREMA, AVATAR and SCREMA. The next section presents the architecture of HARP2, including the internal structure of the master/slave nodes of the NoC, the synchronization between nodes and the overall functionality. Section 5 presents measurements and estimations of different performance metrics related to HARP2 when prototyped on a Field Programmable Gate Array (FPGA) device. Evaluation and comparisons based on the measured and estimated performance metrics are presented in Section 6. Finally, conclusions are presented in the last section.

2 Related Work

One of the motivations of HARP2 is to provide a platform which can be finely tailored in hardware to the computational needs of the application. Similar features can also be found in many other state-of-the-art platforms. For example, the system presented in [15] is a heterogeneous platform designed to match hardware granularity with the application's granularity.

NineSilica is a homogeneous Multi-Processor System-on-Chip (MPSoC) platform built over an NoC of nine nodes arranged in a grid of three rows and three columns. Each node of the NoC is integrated with a RISC processor that acts as a computational engine. Many signal processing algorithms have been mapped on NineSilica, e.g., FFT and correlations. In one of the reported cases, NineSilica processes a 64-point FFT within 10.3 μs [22] on a 40 nm FPGA device.

In reference [24], several general-purpose microprocessors are integrated in a Multiple-Instruction Multiple-Data (MIMD) fashion. Based on user specification, additional accelerators can also be integrated with the microprocessors. The exchange of data between the microprocessors and the accelerators is supported by two separate networks. The overall system shows a performance of 19.2 Giga Operations per Second (GOPS) on a 90 nm FPGA device.

The platform P2012 is a homogeneous MPSoC made up of four clusters, where each cluster contains sixteen general-purpose microprocessors. The overall platform communicates over an NoC. The processors within a cluster are locally synchronous but globally asynchronous with respect to processors in other clusters. The platform successfully performs image processing algorithms and delivers a performance of 80 GOPS on a 28 nm technology [26].

DREAM is an example of a mid-grain (granularity ≤ 8 bits) DSP processor which contains a dynamically reconfigurable datapath unit. It can deliver a performance of 0.2 GOPS/mW using a 90 nm CMOS technology. It was later integrated into the MORPHEUS platform [27], as it is suitable for integration in a heterogeneous platform. Besides DREAM, MORPHEUS has a fine-grain fabric named FlexEOS and a CGRA called XPP-III. The overall MORPHEUS chip has an average dynamic power dissipation of 700 mW while yielding a performance of 0.02 GOPS/mW.

The platforms presented in [22, 24] and [26] are examples of loosely-coupled architectures. The design of HARP2 contains more processing resources than the above-mentioned platforms and therefore yields higher performance levels. In Section 6, both versions of HARP are subjected to a detailed comparative analysis against these state-of-the-art platforms using different performance metrics.

3 Coarse-Grained Reconfigurable Arrays

As discussed earlier in Section 1, the CGRA templates used to instantiate the accelerators placed around the supervisor node of the NoC are CREMA, AVATAR and SCREMA. All three CGRAs belong to the same family and therefore share similar architectural characteristics. The unit of computation in all three CGRAs is the rPE, depicted in Fig. 1. It contains a 32-bit Arithmetic and Logic Unit (ALU) capable of performing both integer and floating-point operations in IEEE-754 format. Each rPE has two inputs (a, b) and two outputs (A, B). The operands to be processed by the internal ALU are received at the two inputs of the rPE and stored in the internal operand registers. Output A receives the result of the operation performed by the ALU, while output B receives the operand from the operand register of input b. In this way, the operand at input b can be supplied to the neighboring rPEs for any required operation of the target application.

Figure 1 The internal architecture of a rPE, showing the operand registers, the decoder and operand-2 multiplexer, the LUT, adder, multiplier, shifter and floating-point logic, the immediate register, and the output multiplexer driving the outputs to the neighboring rPEs.

Each rPE constituting a CGRA has local and global interconnections. The local interconnections are used for exchanging data with the neighboring rPEs, while the global interconnections are used for exchanging data with rPEs that are relatively distant. The CGRA is equipped with two local data memories, and the data to be processed can be stored in either of them. The number of banks in a local memory is fixed and has to be equal to the total number of rPE inputs available in any row of the CGRA. Between the local memories and the CGRA, I/O buffers are available to provide maximum interleaving of the data exchanged between the CGRA and the local memories. During the system start-up time, the configuration stream and the data to be processed are loaded into the CGRA. The data to be processed is loaded by the DMA device [19] into the local data memory of the CGRA. Each rPE has its own configuration memory, so the configuration stream is injected by the DMA device using a pipelined infrastructure [20]. Once the loading of the data and the distribution of the configuration stream are complete, the CGRA can start processing by enabling different contexts.
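For illustration, the input/output behavior of a single rPE described above can be summarized in a short Python sketch. This is a minimal behavioral model under the stated assumptions; the operation set and names are illustrative and are not taken from the HARP VHDL sources.

# Minimal behavioral sketch of a single rPE as described above (illustrative only;
# operation names are not taken from the HARP VHDL sources).

def rpe_step(a, b, alu_op):
    """One rPE evaluation: output A carries the ALU result,
    output B forwards the operand latched from input b."""
    ops = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mul": lambda x, y: x * y,
    }
    out_a = ops[alu_op](a, b)   # result of the ALU operation
    out_b = b                   # operand b is passed on to neighboring rPEs
    return out_a, out_b

# Example: a multiply while forwarding the second operand to a neighbor.
print(rpe_step(3, 5, "mul"))    # -> (15, 5)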

Figure 2 A combined view of CREMA, AVATAR and SCREMA. The figure shows the Local and Global Interconnections, the I/O Buffers, and the first and second Local Data Memory banks on either side of the rPE array. The rPEs in black color are the reference rPEs used to depict the interconnections.

During the process of context enabling, the RISC core also sends cycle-accurate information to the CGRA control unit, i.e., read cycles, write cycles, stall cycles and local memory offsets. A context can be defined as a pattern of interconnections between different processing elements together with the set of operations to be performed by the processing elements of a CGRA. The contexts are made using a custom graphical user-interface tool at system design time. The design of the contexts is based on the execution-time requirements of a certain application and the performance capabilities of the CGRA. The CGRAs are run-time reconfigurable, as a new configuration stream can be loaded during the processing of an application using the DMA device.

CREMA is a 4-rows×8-columns rPE template-based CGRA. The number of rows can be increased to fit the requirements of a certain application. For example, in [21], the CREMA template was used to generate a 9×8 rPE radix-4 FFT accelerator. AVATAR is a 4×16 rPE CGRA template (twice the size of CREMA) and was used to generate a mixed-radix(2, 4) FFT accelerator to process 64- and 128-point FFT algorithms. SCREMA is a scalable version of CREMA and can flex to sizes of 4×4, 4×8, 4×16 and 4×32 rPEs, as presented in [8]. Figure 2 shows the combined view of all of these CGRA templates.

4 Homogeneous/Heterogeneous Accelerator-Rich Platform

As described earlier, HARP2 is a template of nine nodes arranged in three rows and three columns. The central node acts as a supervisor node and the rest of the nodes act as slaves. Since it is a template-based architecture, it is up to the user to integrate a CGRA with a particular node or to leave the node only as a data routing resource. The user can instantiate HARP2 as a heterogeneous platform by integrating application-specific CGRA-generated accelerators for different applications, or as a homogeneous platform using accelerators designed for similar applications. In this paper, HARP2 is instantiated both as a homogeneous and as a heterogeneous architecture for a detailed analysis. As a heterogeneous architecture, HARP2 was instantiated with seven different CGRAs for a proof-of-concept test to verify the functionality and present the design features in detail. This instance contains 408 rPEs in total. HARP2 is also instantiated as a homogeneous platform with four CGRAs of the same type and size, with a maximum of 256 rPEs. The homogeneous platform is motivated by satisfying the demanding FFT execution-time requirements of the IEEE-802.11x standard.
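The rPE totals quoted above (408 for HARP-het and 256 for HARP-hom) follow directly from the per-node array sizes shown in Fig. 3 and Table 1. The following sketch merely reproduces that arithmetic; the dictionaries are an illustrative encoding of the node-to-CGRA assignment, not a configuration format used by the template.

# Reproduces the rPE totals of the two instances from the per-node sizes (Fig. 3 / Table 1).
het_cgra_sizes = {"N0": (9, 8), "N1": (4, 8), "N2": (4, 16), "N3": (4, 4),
                  "N5": (4, 8), "N6": (4, 16), "N8": (4, 32)}
hom_cgra_sizes = {n: (4, 16) for n in ("N0", "N2", "N6", "N8")}

def total_rpes(sizes):
    return sum(rows * cols for rows, cols in sizes.values())

print(total_rpes(het_cgra_sizes))  # 408 rPEs in HARP-het
print(total_rpes(hom_cgra_sizes))  # 256 rPEs in HARP-hom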

Figure 3 Homogeneous/Heterogeneous Accelerator-Rich Platform Template. (a) HARP-het: 408-rPE Heterogeneous Accelerator-Rich Platform, with a 9×8 rPE CGRA on node N0, 4×8 on N1, 4×16 on N2, 4×4 on N3, 4×8 on N5, 4×16 on N6 and 4×32 on N8, each with a DMA device, and the RISC on the central node N4. (b) HARP-hom: 256-rPE Homogeneous Accelerator-Rich Platform, with a 4×16 rPE CGRA and a DMA device on each of the nodes N0, N2, N6 and N8. In the figure, the black colored N4 is the supervisor node, while the rest of the nodes are slaves.

Figure 3 shows the HARP2 architecture for both the homogeneous and the heterogeneous instance. In both figures, it can be observed that the nodes which are not integrated with any CGRA are not eliminated from the design, in order to provide the maximum number of available routing paths for data and configuration transfers.

4.1 Processing Nodes and Synchronization

The central node of HARP2 establishes supervisory control over the slave nodes. Each slave node contains a data memory, a DMA device and an optional CGRA. The data transfers between the supervisor node's data memory and a slave node's data memory are inherently synchronized, as they are operated by the same clock as the NoC. In this case, the data transfer between two nodes is one word per clock cycle after a fixed latency of one clock cycle per node encountered on the routing path. During the system startup time, the supervisor node of the NoC containing the RISC processor can send the configuration stream and the data to be processed to the slave node's data memory. The data and configuration are transferred in the form of packets, where the most significant part of the packet contains the routing information while the rest of the packet contains the data/configuration word. The packet passes through a switch before being loaded onto the network, where the transport route is selected in the east, west, north or south direction. The switch provides the supervisor node N4 and the rest of the slave nodes with variable routing possibilities.
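As a rough sketch of the transfer mechanism just described, the following Python fragment models a packet whose most significant field carries the routing information and a transfer that delivers one word per clock cycle after one cycle of latency per node on the routing path. The field widths and function names are assumptions made for illustration and are not taken from the HARP RTL.

# Rough NoC transfer model: one word per cycle plus one cycle of latency per node
# on the routing path. Packet field widths below are illustrative assumptions only.

ROUTE_BITS = 4          # assumed width of the routing field (not from the HARP RTL)
WORD_BITS = 32

def make_packet(route, word):
    """Most significant part: routing information; rest: data/configuration word."""
    return (route << WORD_BITS) | (word & ((1 << WORD_BITS) - 1))

def transfer_cycles(n_words, hops):
    """Cycles to move n_words across 'hops' nodes on the routing path."""
    return hops + n_words   # fixed per-node latency, then one word per cycle

print(hex(make_packet(route=0x3, word=0xDEADBEEF)))
print(transfer_cycles(n_words=1024, hops=2))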

Figure 4 An overview of the Supervisor and Slave nodes of the NoC, showing the master interface with its initiator, the request and response switches with the arbiter, and the two slave interfaces with the target module. In this figure, the RISC and Instruction Memory are shown for the Supervisor node and the DMA for the Slave nodes.

After this process, the DMA device in the slave nodes can load the configuration stream and the data into the respective local memories of the CGRA. In this context, the DMA is activated by a control word sent by the RISC core in the supervisor node. The loading and distribution of the configuration stream and the data by the DMA takes several cycles, and during this time period the RISC should not send another control word to the DMA until the previous data transfer task has been completed. Furthermore, the RISC can utilize this time to carry out operations in other nodes. To ensure this, synchronization needs to be established between the supervisor and slave nodes. The supervisor node establishes synchronization by allocating a shared memory space in its local data memory. The size of the shared memory space is eight words, as there are eight slave nodes. Before activating the DMA transfer of a particular node, the RISC processor writes a '1' in the corresponding shared memory location. As the DMA completes the internal data transfer, it writes a '0' over the NoC destined to its corresponding shared memory location in the data memory of the supervisor node. This process synchronizes the data transfers, as the RISC waits and keeps a check on this particular memory location and does not send another activation control word to the same node's DMA, in order to keep the sequence of data transfers.

When the configuration stream and the data to be processed have been received by the slave nodes and the slave nodes' internal data transfers have also been completed by the DMA devices, the RISC can start sending the control words to the CGRAs in all the slave nodes for cycle-accurate processing. The user can program an instance of HARP2 in different schemes. The CGRAs can work simultaneously and independently of each other unless the user implements specific data dependencies in the program flow. In this way, the CGRA (slave) nodes can exchange data with each other as well as with the central supervisor node. The computational resources of a CGRA are then allocated by the supervisor node to process the transferred data. Section 4.2 further explains the mechanism of sharing computational resources. The user can apply different execution and data transfer programming schemes based on the processing requirements of the application.
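The flag-based handshake described above can be sketched as follows: one shared word per slave node, set to '1' by the RISC before a DMA transfer is started and cleared to '0' by the DMA when the transfer completes. The polling structure and names below are illustrative only and do not reproduce the actual supervisor firmware.

# Sketch of the shared-memory handshake: one flag word per slave node, set to 1 by
# the RISC before a DMA transfer and cleared to 0 by the DMA. Illustrative only.

NUM_SLAVES = 8
sync_flags = [0] * NUM_SLAVES        # shared space in the supervisor's data memory

def start_dma(slave_id):
    sync_flags[slave_id] = 1         # RISC marks the transfer as in flight
    # ... the DMA control word for this slave would be sent over the NoC here ...

def dma_done(slave_id):
    sync_flags[slave_id] = 0         # DMA clears the flag over the NoC when finished

def wait_for(slave_id):
    while sync_flags[slave_id] == 1: # RISC polls before issuing the next control word
        pass

start_dma(3)
dma_done(3)                          # performed by the slave's DMA in hardware
wait_for(3)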

Table 1 Sizes in terms of the number of rPEs (rows×columns) for the CGRAs connected to the nodes of the HARP-het and HARP-hom instances, and the kernels accelerated on them.

Node                       rPEs    CGRA Type   Kernel
HARP-het-N0                9×8     CREMA       radix-4 64-point FFT
HARP-het-N1                4×8     CREMA       complex MVM, 64th order
HARP-het-N2                4×16    AVATAR      radix-(2, 4) 128-point FFT
HARP-het-N3                4×4     SCREMA      real MVM, 32nd order
HARP-het-N5                4×8     SCREMA      real MVM, 32nd order
HARP-het-N6                4×16    SCREMA      real MVM, 32nd order
HARP-het-N8                4×32    SCREMA      real MVM, 32nd order
HARP-hom-(N0,N2,N6,N8)     4×16    AVATAR      radix-(2, 4) 128-point FFT

All the kernels are processed in 12-bit integer format.

The node of the NoC has one master and two slave interfaces. The master interface can write to the network and can also perform data transfers within the node. In the case of the supervisor node (N4), the two slave devices are the data memory and the instruction memory, and the RISC is integrated to the master interface of the node. In the rest of the nodes, the slave interfaces are integrated to the local data memory and to the slave part of the DMA device, while the master interface of a slave node is connected to the master side of the DMA device. A node is contacted through its initiator module, and the node's arbiter then sequences between the request and response switches to establish a connection between the different modules. The data is written to the network through the target module. The combined view of the NoC's supervisor and slave nodes is shown in Fig. 4.

4.2 Application Mapping

As mentioned earlier, the instantiation and application mapping of the HARP-het version is only for proof-of-concept purposes and to demonstrate the technical capabilities of the HARP2 template. The HARP-hom version is designed for the processing of multiple FFT streams while considering the execution-time constraints of the IEEE-802.11x standard. Table 1 shows the type of CGRA used to generate each application-specific accelerator and the kernels mapped on them for both versions of HARP.

One of the motivations behind the design of HARP2 is to allow all the cores to share each other's computational and memory resources. This allows multiple algorithms to run concurrently, and the cores can exchange results with each other. Furthermore, if a core lacks the computational and memory resources to process an application, it can send part of the data to other cores and then receive back the results as the processing completes. This is demonstrated in the HARP-het version as the 9×8 rPE CGRA integrated with node N0 computes a 64-point FFT in a radix-4 scheme on the vector received from node N4. As the computation completes, the DMA device of N0 transfers the results to the data memory of node N1. The node N1 then performs a 64th-order complex Matrix-Vector Multiplication (MVM) on the vector received from node N0. As soon as the node N1's CGRA completes the processing, the results are transferred to the data memory of node N4.

Another feature of HARP2 is that it allows all the cores to process data simultaneously and independently of each other. This is demonstrated as node N2 of the HARP-het version processes a 128-point FFT using a radix-(2, 4) accelerator generated from the AVATAR CGRA template without interacting with the other cores. The nodes N3, N5, N6 and N8 are integrated with 4×4, 4×8, 4×16 and 4×32 rPE CGRAs, respectively. These nodes process 32nd-order integer MVMs simultaneously and independently of each other. The node N7 is not instantiated with any CGRA, to show the template nature of HARP and that the user can tailor the HARP template as required in the overall application scenario. The homogeneous version HARP-hom has a 4×16 rPE radix-(2, 4) FFT accelerator attached to its nodes N0, N2, N6 and N8 to process several FFT streams. As a test case, only four 128-point streams have been processed, with a workload distribution of one stream per accelerator. The design and implementation details of these CGRA-based accelerators are comprehensively described in [7, 21], and [23].

The overall system is driven by a single clock source. The number of clock cycles required for data transfer is shown in Table 2. The table presents the clock cycles required for three types of data transfers for both versions of HARP: 1) between the data memories of different nodes; 2) data transfer within a slave node using the DMA device, i.e., from the node's data memory to the CGRA's local memory; 3) data transfer from a CGRA's local memory to the NoC.

Table 2 Clock cycles required for data transfer and execution at different stages.

Transfer                 Node-to-node          D. Mem       Transfer   Execution
                         (D. Mem to D. Mem)    to CGRA      total      total
het(N4-N0)               1017                  659          1676       420
het(N0-N1)               1036                  446*         1482       457
het(N1-N4)               -                     448*         -          -
het(N4-N2)               2042                  1033         3075       571
het(N4-N3)               75622 (combined       -            75622      728
het(N4-N5)                for N3, N5,                                  609
het(N4-N6)                N6 and N8)                                   340
het(N4-N8)                                                             211
hom(N4-(N0,N2,N6,N8))    4× 2042               4× 1033      4× 3075    587

D. Mem, Trans. and Exe. stand for Data Memory, Transfer and Execution, respectively. The * sign indicates a data transfer from the CGRA to the node's data memory.
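The cycle counts of Table 2 translate directly into time once a clock frequency is fixed. As a minimal check, the sketch below converts a few of the table entries at the 100.0 MHz prototyping frequency used for the power estimates in Section 5 and at the 186.22 MHz worst-case frequency of HARP-hom quoted there.

# Converts cycle counts from Table 2 into time at a given clock frequency.
def cycles_to_us(cycles, f_mhz):
    return cycles / f_mhz            # cycles / (f in MHz) gives microseconds

print(cycles_to_us(420, 100.0))      # N0 execution: 4.2 us at 100 MHz
print(cycles_to_us(1676, 100.0))     # N4->N0 transfer total: 16.76 us at 100 MHz
print(cycles_to_us(587, 186.22))     # HARP-hom execution: ~3.15 us at 186.22 MHz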

Table 3 Resource Utilization Summary for the Stratix-V FPGA Device (5SGXEA4H1F35C1).

HARP-het   ALMs          78,845 / 158,500          50 %
           Registers     64,436 / 634,000          10 %
           Memory Bits   21,000,928 / 38,912,000   54 %
           DSP           236 / 256                 92 %

HARP-hom   ALMs          120,823 / 158,500         76 %
           Registers     67,475 / 634,000          10.6 %
           Memory Bits   13,679,616 / 38,912,000   35 %
           DSP           236 / 256                 92 %

5 Measurements and Estimations

The homogeneous and heterogeneous versions of HARP, hom and het, are synthesized on a Stratix-V FPGA device for prototyping purposes. The resource utilization summary of both versions is shown in Table 3 in terms of the number of ALMs, registers, memory bits and Digital Signal Processing (DSP) elements. ALMs are the Adaptive Logic Modules of the FPGA device, mostly used to implement the combinational part of the logic. The ALM consumption of the HARP-het version is lower than that of the HARP-hom version, although more rPEs are instantiated in the heterogeneous version than in the homogeneous one. A comparison based on this aspect alone cannot be established, as the resource utilization depends on the number of computational resources generated inside each rPE. A CGRA in the HARP-hom version performs the FFT algorithm with a mixed-radix(2, 4) scheme. The complexity of this algorithm is much higher than that of any other algorithm considered in this experimental work and, therefore, the rPEs in HARP-hom are more densely generated than in the HARP-het version. It is for this reason that the ALM consumption of the HARP-hom version is 76 % compared to the 54 % of ALMs consumed by the HARP-het version on the FPGA device. The ever-increasing density of FPGA devices offers a large number of resources; the Stratix-V FPGA device in particular offers a very high number of registers. Since the HARP design is aggressively registered, it efficiently utilizes the FPGA registers using the automatic placement and routing utility of the synthesis tool.

The HARP-het and -hom versions consume the same number of DSP resources, as the total number of multipliers instantiated in both versions is the same. Each instantiated multiplier is 32 bits wide, so it needs two 18-bit DSP resources on the FPGA.

Table 4 Node-by-node breakdown of DSP resources for both versions of HARP.

           Node     Accelerator Type     32-bit Multipliers   DSPs
HARP-het   Node-0   Radix-4 FFT          12                   24
           Node-1   Complex MVM          12                   24
           Node-2   Radix-(2, 4) FFT     28                   56
           Node-3   Real MVM             4                    8
           Node-4   RISC Core            6                    12
           Node-5   Real MVM             8                    16
           Node-6   Real MVM             16                   32
           Node-7   -                    -                    -
           Node-8   Real MVM             32                   64
           Total    -                    118                  236

HARP-hom   Node-0   Radix-(2, 4) FFT     28                   56
           Node-2   Radix-(2, 4) FFT     28                   56
           Node-4   RISC Core            6                    12
           Node-6   Radix-(2, 4) FFT     28                   56
           Node-8   Radix-(2, 4) FFT     28                   56
           Total    -                    118                  236
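Since each 32-bit multiplier occupies two 18-bit DSP blocks, the DSP totals of Table 4 follow from the multiplier counts. The sketch below reproduces that count; the dictionaries are only an illustrative encoding of the table.

# Each 32-bit multiplier occupies two 18-bit DSP blocks (Table 4).
het_multipliers = {"N0": 12, "N1": 12, "N2": 28, "N3": 4, "N4": 6,
                   "N5": 8, "N6": 16, "N8": 32}
hom_multipliers = {"N0": 28, "N2": 28, "N4": 6, "N6": 28, "N8": 28}

for name, mults in (("het", het_multipliers), ("hom", hom_multipliers)):
    total = sum(mults.values())
    print(name, total, "multipliers ->", 2 * total, "DSP blocks")   # 118 -> 236 for both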

The breakdown of the multipliers instantiated for each node's CGRA and the corresponding DSP resource consumption in both HARP versions is shown in Table 4. The number of multipliers instantiated in a particular CGRA template is based on the requirements of the application. The design of CGRA-based accelerators containing a specific number of multipliers is explained in [7, 12, 21] and [23].

The overall design of HARP2 works on a single clock source. The slow timing model (900 mV) of the HARP-het version shows a clock frequency of 214.64 MHz at 0 °C and 186.22 MHz at 85 °C. The fast timing model (900 mV) shows 274.65 MHz at 0 °C and 251.0 MHz at 85 °C. The HARP-hom version shows 208.42 MHz at 0 °C and 196.89 MHz at 85 °C for the slow timing model, also at 900 mV. The operating frequencies for the fast timing model are found to be 299.31 MHz at 0 °C and 272.63 MHz at 85 °C.

Table 5 Dynamic Power/Energy Estimation of each CGRA Node in both versions of HARP.

            Node / CGRA Size      Algorithm (Type)               Dynamic Power   Active Time   Dynamic Energy
            or Other HW                                          (mW)            (μs)          (μJ)
HARP-het    N0, 9×8 rPE           Radix-4 FFT (64-point)         129.68          4.20          0.54
            N1, 4×8 rPE           Complex MVM (64th order)       116.75          4.57          0.53
            N2, 4×16 rPE          Radix-(2, 4) FFT (128-point)   213.12          5.71          1.21
            N3, 4×4 rPE           Real MVM (32nd order)          61.55           7.28          0.44
            N4, RISC              General Flow Control           59.73           -             -
            N5, 4×8 rPE           Real MVM (32nd order)          98.97           6.09          0.60
            N6, 4×16 rPE          Real MVM (32nd order)          175.80          3.40          0.59
            N7                    -                              0.37            -             -
            N8, 4×32 rPE          Real MVM (32nd order)          327.36          2.11          0.69
            NoC                   -                              7.12            -             -
            Integration Logic     -                              302.07          -             -
            Total                 -                              1492.52         -             4.6

HARP-hom    N0, 4×16 rPE          Radix-(2, 4) FFT (128-point)   214.68
            N2, 4×16 rPE          Radix-(2, 4) FFT (128-point)   214.35
            N4, RISC              General Flow Control           59.75
            N6, 4×16 rPE          Radix-(2, 4) FFT (128-point)   212.97
            N8, 4×16 rPE          Radix-(2, 4) FFT (128-point)   213.89
            NoC                   -                              0.67
            Integration Logic     -                              295.64
            Total                 -                              1211.95         5.87          7.11

For HARP-hom, the active time of 5.87 μs and the dynamic energy of 7.11 μJ are reported for the platform as a whole.
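The dynamic energy column of Table 5 is simply the product of the dynamic power and the active time. The following sketch reproduces a few of the HARP-het entries as a sanity check.

# Dynamic energy (uJ) = dynamic power (mW) * active time (us) * 1e-3, as in Table 5.
def dynamic_energy_uj(power_mw, active_us):
    return power_mw * active_us * 1e-3

print(round(dynamic_energy_uj(129.68, 4.20), 2))   # N0 (HARP-het): ~0.54 uJ
print(round(dynamic_energy_uj(98.97, 6.09), 2))    # N5 (HARP-het): ~0.60 uJ
print(round(dynamic_energy_uj(327.36, 2.11), 2))   # N8 (HARP-het): ~0.69 uJ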

Considering a frequency of 186.22 MHz and the Clock Cycles (CC) presented in Table 2, HARP-hom processes four streams of 128-point FFT within 587 CC × 1/186.22 MHz = 3.15 μs. This shows the capability of HARP-hom to satisfy the FFT execution-time constraints of the IEEE-802.11n standard.

The power dissipation is estimated for both versions of HARP at an ambient temperature of 25 °C and at an operating frequency of 100.0 MHz. The estimates are obtained by simulating the gate-level netlist of both HARP versions and then generating the Value Change Dump (VCD) data for the applications mentioned in Section 4.2. The VCD data was then provided to the estimation tool Quartus II PowerPlay Power Analyzer, which indicated a high confidence metric in the supplied data. The tool estimated 1146.64 mW and 1492.54 mW as static and dynamic power dissipation, respectively, for the HARP-het version. For the HARP-hom version, the tool estimated the static and dynamic components of the power dissipation to be 1097.99 mW and 1211.95 mW, respectively. The I/O power dissipation of both versions is the same, i.e., 22.66 mW.

Table 5 shows the breakdown of the dynamic power and energy consumption of the CGRAs integrated in both versions of HARP. It can be observed that as the scale of a CGRA increases, its dynamic power dissipation increases. The size of a CGRA is application-specific. Furthermore, the table shows that a CGRA offering more parallelism (number of rPE columns) dissipates more dynamic power but offers a lower execution time. As the energy consumption is the product of power dissipation and execution time, in the specific case of nodes N3, N5, N6 and N8 the energy consumption does not deviate by a considerable amount while executing the same kernel, i.e., the 32nd-order integer MVM.

6 Evaluation and Comparisons

The HARP homogeneous and heterogeneous instances are synthesized for a 28 nm FPGA device. Process size scaling in Stratix FPGA devices leads to an average operating frequency increase of 20 % [28]. Xilinx claims to provide a 50 % performance increase while scaling the process size from 40 nm to 28 nm; however, this performance comes at the cost of increasing the device density by 2X, therefore expecting the user to exploit the additional 2X density for a highly pipelined and parallel design [29]. In this context, an average increase of 20 % in operating frequency can also be estimated for Xilinx FPGA devices, considering in general the effect of process size scaling on the operating frequency of Altera FPGA devices. In particular, process size scaling from 40 nm to 28 nm and from 90 nm to 28 nm is assumed to lead to 20 % and 40 % performance increases, respectively [28]. An important case study shows that a 90 nm ASIC implementation is 3.4X to 4.6X (average 4.0X) faster and requires 14 times less dynamic power on average than an equivalent 90 nm FPGA implementation [30]. Based on this discussion, Table 6 shows a comparative analysis between HARP's heterogeneous implementation on a 28 nm FPGA and the scaled performance metrics of the platforms mentioned in Section 2.

The HARP-het version performs a 64-point FFT on node N0 in 420 CC × 1/100.0 MHz = 4.2 μs. NineSilica performs the same task using all of its computational nodes in 10.3 μs on a 40 nm FPGA device. After applying a scaling factor to estimate the processing time on a 28 nm FPGA, NineSilica is expected to compute the 64-point FFT in 8.24 μs; thus HARP-het has a 1.96X gain in comparison.
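The 1.96X figure follows from the N0 cycle count at 100.0 MHz and the 20 % frequency scaling applied to NineSilica's published execution time; the arithmetic is reproduced below as a sketch.

# Reproduces the 64-point FFT comparison against NineSilica (Section 6).
harp_het_us = 420 / 100.0            # 420 CC at 100 MHz -> 4.2 us on node N0
ninesilica_us = 10.3 * (1 - 0.20)    # 40 nm -> 28 nm FPGA: 20 % speed-up -> 8.24 us

print(harp_het_us, ninesilica_us, round(ninesilica_us / harp_het_us, 2))   # gain ~1.96X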

Table 6 Comparisons based on different performance metrics with the HARP implementation at 100.0 MHz on a 28 nm FPGA.

Platform / Technology          Performance Metric         Platform's Value (PV)   Scaled PV                                                     HARP-het   Gain
NineSilica [22] / FPGA 40 nm   FFT Execution Time (μs)    10.3                    PV - PV×20 % SU fTfs = 10.3 - 10.3×0.2 = 8.24                 4.2        1.96X
[24] / FPGA 90 nm              GOPS                       19.2                    PV×40 % SU fTfs = 19.2×1.4 = 26.88                            40.8       1.51X
P2012 [26] / CMOS 28 nm        GOPS/mW                    0.04                    PV×SD aTfs×Ps = 0.04×1/4×1/2 = 0.005                          0.0153     3.0X
MORPHEUS [27] / CMOS 90 nm     GOPS/mW                    0.02                    (PV×SD aTfs×Ps)×60 % SU fTfs = (0.02×1/4×1/2)×1.6 = 0.004     0.0153     3.8X

In the table, SU fTfs, SD aTfs and Ps stand for Speed-Up for FPGA-to-FPGA scaling, Speed-Down for ASIC-to-FPGA scaling and Power scaling, respectively.
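The scaled values and gains of Table 6 can be reproduced from the FPGA-to-FPGA speed-up factors and the ASIC-to-FPGA slow-down and power factors discussed in the text. The sketch below repeats that arithmetic; small differences against the table are due to rounding.

# Reproduces the scaled performance values and gains of Table 6.
harp_gops, harp_gops_per_mw = 40.8, 0.0153

mimd_scaled = 19.2 * 1.4                            # 90 nm -> 28 nm FPGA: +40 % -> 26.88 GOPS
p2012_scaled = 0.04 * (1 / 4) * (1 / 2)             # ASIC -> FPGA: /4 speed, /2 power -> 0.005
morpheus_scaled = 0.02 * (1 / 4) * (1 / 2) * 1.6    # plus 90 nm -> 28 nm FPGA: +60 %

print(round(harp_gops / mimd_scaled, 2))            # ~1.52X (1.51X in Table 6)
print(round(harp_gops_per_mw / p2012_scaled, 1))    # ~3.1X  (3.0X in Table 6)
print(round(harp_gops_per_mw / morpheus_scaled, 1)) # ~3.8X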

A homogeneous MPSoC designed with a MIMD approach and synthesized on a 90 nm FPGA yields a performance of 19.2 GOPS [24]. Multiplication by a scaling factor of 40 % to estimate the performance on a 28 nm FPGA instead of a 90 nm FPGA gives a scaled performance of 26.88 GOPS. In comparison, the HARP-het version performs at 40.8 GOPS, therefore achieving a gain of 1.51X.

The platform P2012 [26] is synthesized using a 28 nm CMOS technology. As discussed previously, an analysis of the performance gap between a 90 nm ASIC and a 90 nm FPGA implementation is presented in [30]. Similar gap ratios are expected between a 28 nm ASIC and a 28 nm FPGA implementation. The gap shows a performance degradation by a factor of 4 and a dynamic power dissipation increase by a factor of 14 for an FPGA implementation in comparison to the same ASIC implementation. Considering a worst-case scenario, where static power is equal to dynamic power, the factor of 14 reduces to almost 2 for the overall power dissipation. The scaled performance of P2012 on a 28 nm FPGA is calculated to be 0.005 GOPS/mW, as shown in Table 6. In comparison, HARP shows a performance gain of 3.0X.

Similarly, MORPHEUS's [27] performance of 0.02 GOPS/mW at 90 nm CMOS can be scaled to a 90 nm FPGA as shown in Table 6, and then multiplied by a 60 % speed-up to estimate the performance on a 28 nm FPGA device. HARP's performance on the 28 nm FPGA shows a gain of 3.8X in comparison to the scaled performance of the MORPHEUS platform.

7 Conclusion

In this paper, a template-based Homogeneous/Heterogeneous Accelerator-Rich Platform (HARP2) is presented. The HARP2 template is used to generate both homogeneous (HARP-hom) and heterogeneous (HARP-het) versions to compute massively-parallel signal processing algorithms. The HARP-het version is generated for a proof-of-concept test-case to verify the functionality and design features of the template. The HARP-hom version executes four streams of 128-point Fast Fourier Transform (FFT) processing within 3.15 μs, under the execution-time constraint of 4.0 μs of the 802.11n standard.

The HARP-het version has been compared against other state-of-the-art platforms. It performs a 64-point FFT within 4.2 μs, achieving a gain of 1.96X in comparison to the estimated performance of NineSilica, which is a homogeneous Multi-Processor System-on-Chip (MPSoC). HARP-het also performs at 40.8 GOPS, showing a gain of 1.51X in comparison to the scaled performance of an MPSoC designed with a Multiple-Instruction Multiple-Data approach. The comparisons extend to another homogeneous platform, P2012, which is an integration of four clusters where each cluster contains sixteen general-purpose processors, and finally to a heterogeneous platform called MORPHEUS, which is made up of three different reconfigurable fabrics. HARP2's heterogeneous version, when compared against the scaled performance levels of P2012 and MORPHEUS in terms of GOPS/mW, shows performance gains of 3.0X and 3.8X, respectively.

Acknowledgment This research work is jointly conducted by the Department of Electronics and Communications Engineering, Tampere University of Technology, Finland and the Department of Computer Science, University of Chicago, Illinois, USA. It was partially funded by the Academy of Finland under contract # 258506 (DEFT: Design of a Highly-parallel Heterogeneous MP-SoC Architecture for Future Wireless Technologies) and the Tampere Doctoral Programme in Information Science and Engineering, Finland. The Department of Computer Science, University of Chicago, Illinois, USA also provided the financial and on-site resources for its implementation.

The authors sincerely acknowledge Bob Bartlett, Director of the Technical Staff at the Department of Computer Science, University of Chicago, IL, USA for his consistent training, support and making available the required computing infrastructure in a short time for the implementation of this research work.

References

1. Venkatesh, G., Sampson, J., Goulding, N., Gracia, S., Bryksin, V., Martinez, J.L., Swanson, S., & Taylor, M.B. (2010). Conservation cores: reducing the energy of mature computations. ASPLOS '10.
2. Brunelli, C., Garzia, F., & Nurmi, J. (2008). A coarse-grain reconfigurable architecture for multimedia applications featuring subword computation capabilities. Journal of Real-Time Image Processing, Springer-Verlag, 3(1-2), 21-32. doi:10.1007/s11554-008-0071-3.
3. Singh, H., Lee, M.-H., Lu, G., Kurdahi, F.J., Bagherzadeh, N., & Filho, E.M.C. (2000). MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Trans. Computers, 49(5), 465-481.
4. Mei, B., Vernalde, S., Verkest, D., Man, H.D., & Lauwereins, R. (2003). ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. Field-Programmable Logic and Applications, 2778, 61-70. ISBN 978-3-540-40822-2.
5. Baumgarte, V., Ehlers, G., May, F., Nuckel, A., Vorbach, M., & Weinhardt, M. (2003). PACT XPP-A self-reconfigurable data processing architecture. The Journal of Supercomputing, 26(2), 167-184.
6. Garzia, F., Hussain, W., & Nurmi, J. (2009). CREMA, a coarse-grain re-configurable array with mapping adaptiveness. In Proc. 19th International Conference on Field Programmable Logic and Applications (FPL 2009). Prague, Czech Republic: IEEE.
7. Hussain, W., Garzia, F., Ahonen, T., & Nurmi, J. (2012). Designing fast Fourier transform accelerators for orthogonal frequency-division multiplexing systems. Journal of Signal Processing Systems, Springer, 69, 161-171.

8. Hussain, W., Ahonen, T., & Nurmi, J. (2012). Effects of scaling a coarse-grain reconfigurable array on power and energy consumption. In Proc. SoC 2012. Finland.
9. Vassiliadis, D., Kavvadias, N., Theodoridis, G., & Nikolaidis, S. (2005). A RISC architecture extended by an efficient tightly coupled reconfigurable unit. In Proc. ARC.
10. Hussain, W., Chen, X., Ascheid, G., & Nurmi, J. (2013). A reconfigurable application-specific instruction-set processor for fast Fourier transform processing. In IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (pp. 339-345). Washington, DC.
11. Garzia, F., Ahonen, T., & Nurmi, J. (2009). A switched interconnection infrastructure to tightly-couple a RISC processor core with a coarse grain reconfigurable array. Research in Microelectronics and Electronics 2009 (PRIME 2009), 16-19. doi:10.1109/RME.2009.5201372.
12. Hussain, W., Garzia, F., & Nurmi, J. (2010). Evaluation of Radix-2 and Radix-4 FFT processing on a reconfigurable platform. In Proceedings of the 13th IEEE International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS'10) (pp. 249-254). IEEE. ISBN 978-1-4244-6610-8.
13. Garzia, F., Hussain, W., Airoldi, R., & Nurmi, J. (2009). A reconfigurable SoC tailored to software defined radio applications. In Proc. of 27th Norchip Conference, Trondheim (NO).
14. IEEE Standard for Information technology (2009). Local and metropolitan area networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 5: Enhancements for Higher Throughput. IEEE Std 802.11n-2009 (Amendment to IEEE Std 802.11-2007 as amended by IEEE Std 802.11k-2008, IEEE Std 802.11r-2008, IEEE Std 802.11y-2008, and IEEE Std 802.11w-2009), pp. 1-565, Oct. 29 2009. doi:10.1109/IEEESTD.2009.5307322.
15. Rauwerda, G.K., Heysters, P.M., & Smit, G.J.M. (2008). Towards software defined radios using coarse-grained reconfigurable hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(1), 313.
16. Cooley, J.W., & Tukey, J.W. (1965). An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19, 297-301.
17. Hussain, W., Garzia, F., & Nurmi, J. (2010). Exploiting control management to accelerate radix-4 FFT on a reconfigurable platform. In Proc. International Symposium on System-on-Chip 2010 (pp. 154-157). Tampere: IEEE. ISBN 978-1-4244-8276-4.
18. Kylliainen, J., Ahonen, T., & Nurmi, J. (2007). General-purpose embedded processor cores - the COFFEE RISC example. In J. Nurmi (Ed.), Processor Design: System-on-Chip Computing for ASICs and FPGAs (ch. 5, pp. 83-100). Kluwer Academic Publishers / Springer Publishers. ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4.
19. Brunelli, C., Garzia, F., Giliberto, C., & Nurmi, J. (2008). A dedicated DMA logic addressing a time multiplexed memory to reduce the effects of the system bus bottleneck. In Proc. 18th International Conference on Field Programmable Logic and Applications (FPL 2008) (pp. 487-490). Heidelberg, Germany.
20. Garzia, F., Brunelli, C., & Nurmi, J. (2008). A pipelined infrastructure for the distribution of the configuration bitstream in a coarse-grain reconfigurable array. In Proceedings of the 4th International Workshop on Reconfigurable Communication-centric System-on-Chip (ReCoSoC'08) (pp. 188-191). Univ. Montpellier II. ISBN 978-84-691-3603-4.
21. Hussain, W., Garzia, F., Ahonen, T., & Nurmi, J. (2011). Application-driven dimensioning of a coarse-grain reconfigurable array. In Proc. NASA/ESA Conference on Adaptive Hardware and Systems (AHS-2011) (pp. 234-239). USA.
22. Airoldi, R., Garzia, F., Anjum, O., & Nurmi, J. (2010). Homogeneous MPSoC as baseband signal processing engine for OFDM systems. International Symposium on System on Chip (SoC) 2010, 26-30. doi:10.1109/ISSOC.2010.5625562.
23. Hussain, W., Ahonen, T., & Nurmi, J. (2012). Effects of scaling a coarse-grain reconfigurable array on power and energy consumption. In Proc. SoC 2012. Tampere.
24. Bonnot, P., Lemonnier, F., Edelin, G., Gaillat, G., Ruch, O., & Gauget, P. Definition and SIMD implementation of a multi-processing architecture approach on FPGA. In Proc. of Design, Automation and Test in Europe (DATE '08) (pp. 610-615). New York: ACM.
25. Campi, F., Deledda, A., Pizzotti, M., Ciccarelli, L., Rolandi, P., Mucci, C., Lodi, A., Vitkovski, A., & Vanzolini, L. A dynamically adaptive DSP for heterogeneous reconfigurable platforms. In Proc. of Design Automation and Test in Europe (DATE '07) (pp. 9-14). San Jose: EDA Consortium.
26. Melpignano, D., Benini, L., Flamand, E., Jego, B., Lepley, T., Haugou, G., Clermidy, F., & Dutoit, D. Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications. In Proc. 49th Annual Design Automation Conference (DAC '12) (pp. 1137-1142). New York: ACM.
27. Voros, N.S., Hübner, M., Becker, J., Kühnle, M., Thomaitiv, F., Grasset, A., Brelet, P., Bonnot, P., Campi, F., Schüler, E., Sahlbach, H., Whitty, S., Ernst, R., Billich, E., Tischendorf, C., Heinkel, U., Ieromnimon, F., Kritharidis, D., Schneider, A., Knaeblein, J., & Putzke-Röming, W. (2013). MORPHEUS: A heterogeneous dynamically reconfigurable platform for designing highly complex embedded systems. ACM Transactions on Embedded Computing Systems, 12(3), Article 70, 33 pages.
28. Altera Product Catalog (2015). Release date: July 2014, Version 15.0, p. 2, www.altera.com.
29. Wu, X., & Gopalan, P. (2013). Xilinx Next Generation 28 nm FPGA Technology Overview. White Paper: 28nm Technology, July 23, 2013, Version 1.1.1, p. 5, www.xilinx.com.
30. Ian, K., & Rose, J. (2007). Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2), 203-215. doi:10.1109/TCAD.2006.884574.

Waqar Hussain is currently a Postdoctoral Researcher and lecturer at the Department of Electronics and Communications Engineering, Tampere University of Technology (TUT), Finland. He has been a Visiting Scientist at the Department of Computer Science, University of Chicago, IL, USA and also at the Institute for Communication Technologies and Embedded Systems, Rheinisch-Westfaelische Technische Hochschule (RWTH), Aachen, Germany. He has an M.Sc. and a D.Sc. (Technology) degree from TUT, Finland. Dr. Hussain's research interests include the design and development of homogeneous and heterogeneous multicore and manycore systems that are specialized for application-specific, reconfigurable and general-purpose processing. He is working on accelerator-rich architectures and issues related to communication infrastructures, e.g., Network-on-Chip, in an effort to exploit the under-utilized part of the chip known as Dark Silicon. Dr. Hussain has been serving as a technical program committee member, organizer and reviewer of several high-level conferences and journals.

Roberto Airoldi received the M.Sc. and the D.Sc. (Tech) degrees from the University of Bologna (Italy) in 2009 and Tampere University of Technology (Finland) in 2013, respectively. Currently, Dr. Airoldi works as a senior research scientist in the Department of Electronics and Communications Engineering at Tampere University of Technology, where he has the technical lead of an Academy of Finland project focused on the design of flexible architectures for the implementation of cognitive radio systems. His main research interests are: development of physical layer algorithms and flexible architectures for wireless communications, low power design, multi-processor architectures and parallel programming.

Henry Hoffmann has been an Assistant Professor in the Department of Computer Science at the University of Chicago since January 2013, where he leads the Self-aware computing group (or SEEC project) and conducts research on adaptive techniques for power, energy, and performance management in computing systems. He has spent the last 13 years working on multicore architectures and system software in both academia and industry. He completed a PhD in Electrical Engineering and Computer Science at MIT where his research on self-aware computing was named one of the ten "World Changing Ideas" by Scientific American in December 2011. He received his SM degree in Electrical Engineering and Computer Science from MIT in 2003. As a Masters student he worked on MIT's Raw processor, one of the first multicores. Along with other members of the Raw team, he spent several years at Tilera Corporation, a startup which commercialized the Raw architecture and created one of the first manycores. His implementation of the BDTI Communications Benchmark (OFDM) on Tilera's 64-core TILE64 processor still has the highest certified performance of any programmable processor. Prior to his graduate studies, he served as an Associate Staff member of MIT Lincoln Laboratory where his research produced the Parallel Vector Library (PVL), which forms the foundation of the VSIPL++ standard, an interface for parallel signal and image processing. Henry was appointed as a Lincoln Masters Scholar in 2001 and a Lincoln Doctoral Scholar in 2004. In 1999, he received his BS in Mathematical Sciences with highest honors and highest distinction from UNC Chapel Hill.

Tapani Ahonen is a Research Fellow at Tampere University of Technology (TUT) in Tampere, Finland, where he has held various positions since 2000. He is a Scientific Collaborator at Université Libre de Bruxelles (ULB) in Bruxelles, Belgium, where he was a Visiting Researcher between 2009 and 2010. His current activities include scientific coordination of a pan-European project involving 26 participant organizations from ten different countries. His work is focused on proof-of-concept driven computer systems design with emphasis on many-core processing environments. Ahonen has an MSc (2001) in Electrical Engineering and a PhD (2006) in Information Technology from TUT.

Jari Nurmi works as a Professor at Tampere University of Technology, Finland since 1999, currently at the Department of Electronics and Communications Engineering. He is working on embedded computing systems, wireless localization, positioning receiver prototyping, and software-defined radio. He held various research, education and management positions at TUT since 1987 (e.g. Acting Associate Professor 1991-1994) and was the Vice President of the SME VLSI Solution Oy 1995-1998. Since 2013 he is also a partner and cofounder of Ekin Labs Oy, a research spin-off company. He has supervised 19 PhD and about 130 MSc theses at TUT, and been the opponent or reviewer of 25 PhD theses for other universities worldwide. He is a senior member in IEEE Circuits and Systems Society, Communications Society, Computer Society, Signal Processing Society, and Solid-State Circuits Society. He is also a member of the technical committee on VLSI Systems and Applications at IEEE CAS, and board member of Tampere Convention Bureau. In 2004, he was one of the recipients of the Nokia Educational Award, and the recipient of the Tampere Congress Award 2005. He was awarded one of the Academy of Finland Research Fellow grants for 2007-2008. In 2011 he received the IIDA Innovation Award, and in 2013 the Scientific Congress Award and the HiPEAC Technology Transfer Award. He is a steering committee member of four international conferences, and participates actively in organizing conferences, tutorials, workshops, and special sessions, and in editing special issues in international journals. He has edited 3 Springer books, and has published about 300 international conference and journal articles and book chapters. He was the head of the national graduate school TELESOC for 5 years, and board member of the national GETA graduate school for 7 years. He is leading the national DELTA doctoral training network 2014-2017, and he is the coordinator of Marie Curie ITN MULTIPOS (Multi-technology positioning professionals) in 2012-2016. He is responsible for the Electrical Engineering MSc program at TUT since 2014. He is also reviewing projects for the EU and project proposals for national funding agencies in Belgium, Canada, The Netherlands, Saudi-Arabia, Slovenia, Sweden, and Switzerland.
