
A Hardware/Software Stack for


Heterogeneous Systems
Jeronimo Castrillon, Matthias Lieber, Sascha Klüppelholz, Marcus Völp†, Nils Asmussen, Uwe Aßmann,
Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andrés Goens, Sebastian Haas,
Dirk Habich, Hermann Härtig, Mattis Hasler, Immo Huismann, Tomas Karnagel§, Sven Karol,
Akash Kumar, Wolfgang Lehner, Linda Leuschner, Siqi Ling, Steffen Märcker, Christian Menard,
Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza‡, Michael Raitza, Jörg Stiller,
Annett Ungethüm, Axel Voigt, and Sascha Wunderlich
Center for Advancing Electronics Dresden — Technische Universität Dresden
† SnT-CritiX — University of Luxembourg, ‡ KRDB Research Centre — Free University of Bozen-Bolzano
§ Oracle Labs — Zürich, Switzerland

Abstract—Many novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of these new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making them ever more difficult to program. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue that these principles, built into the stack and the interfaces among its layers, will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.

Index Terms—cfaed, orchestration, post-CMOS, heterogeneous systems, programming stack, hardware/software abstractions, emerging technologies, cross-layer design

1 INTRODUCTION

Over the past five decades, the ever-increasing computational power enabled by Moore's Law has nurtured many innovations in our daily life, including the Internet, smartphones, and autonomous cars. However, the continued scaling of transistors, predicted by Moore in 1965, will ultimately end, mainly due to miniaturization limits and power density constraints [1].

There are today many coordinated research efforts to find alternative technologies that could prevent a stagnation of computational power [2], [3], [4]. A similar initiative was launched in 2012 at TU Dresden: the large-scale research project Center for Advancing Electronics Dresden (cfaed)¹ explores new technologies along with research in new hardware and software architectures. On the one hand, the project has various sub-projects focusing on developing novel technologies like silicon nanowires (SiNW) and carbon nanotubes (CNT), among others. On the other hand, the project also emphasizes the need to start thinking now about integrating these devices at the system level and to develop techniques that deal with the system-design and reliability challenges. In particular, the goal of the Orchestration sub-project of cfaed is to design hardware and software architectures for the yet unknown components and devices emerging from the aforementioned technologies and to provide significant gains in application performance.

1. http://www.cfaed.org

A couple of inflection points are notable when it comes to providing continued improvements in application performance. The first inflection point was marked by a shift from frequency scaling to parallel systems, i.e., multicores [5]. The end of Dennard scaling made it impossible to operate all parts of a chip at full frequency over extended periods of time, an effect known as dark silicon [6]. This led to a second inflection point, marked by a shift from homogeneous to heterogeneous architectures. For example, today almost every server, desktop, and high-end embedded system includes some sort of heterogeneous architecture. Examples are GPGPU systems [7], heterogeneous systems with homogeneous instruction sets [8], [9], and systems with heterogeneous cores [10]. Similarly, accelerators have become common in general-purpose [11], domain-specific [12], and supercomputing applications [13]. Additionally, as Borkar predicted in 2005 [14], the costs for compensating failing transistors will soon outweigh the benefits of technology scaling if faults or partial functionality are not exposed at the architectural level, which can be regarded as another form of heterogeneity.

Due to the ultimate scaling limits of CMOS, we expect the third inflection point to be the integration of heterogeneous components potentially based on different emerging post-CMOS technologies in combination with various design alternatives of classical CMOS hardware.


These components can be specialized processing elements (e.g., accelerators), heterogeneous (non-volatile) memories, or novel interconnects (e.g., photonic or wireless). Components can also be partially reconfigurable circuits combining different processing elements or providing a platform for application-specific and even software-delivered circuits. Regardless of which combination of processing, memory, and interconnect technology is ultimately used in future systems, we envision the heterogeneity to only increase, thus forming "wildly heterogeneous" systems. This trend has been predicted by other researchers [1], [2], [15], [16].

This paper describes the Orchestration stack, a hardware/software architecture for heterogeneous systems and a research platform towards future wildly heterogeneous systems (see Figure 1). We believe that handling the upcoming complexity of wildly heterogeneous systems requires a holistic approach that touches upon several layers of a hardware/software stack. In this paper we focus on the lower layers of the stack, which combine and extend well-understood principles from different areas, including networks-on-chip, operating systems, parallel programming models, and model checking. To design software for not yet available hardware, we rely on (i) CMOS hardware as a proxy with interfaces that can potentially allow connecting post-CMOS components, (ii) event-based simulation of processors, memories, and interconnect, as well as (iii) formal verification and quantitative analysis using (probabilistic) model checking.

The paper is organized as follows. Section 2 provides an overview of the Orchestration stack and its underlying principles. Section 3 introduces Tomahawk, a family of tiled multicore platforms that serves to demonstrate the principles in the lower layers of the stack. Section 4 describes M3, an OS that leverages hardware support in Tomahawk to encapsulate components. In Section 5, we discuss a compilation flow and a runtime system for dataflow applications to demonstrate variant generation and deployment onto heterogeneous systems. The new formal methods developed to deal with these kinds of heterogeneous designs are described in Section 6. Formal methods and the lower layers of the stack are evaluated in Section 7. We then put this paper in perspective with respect to future technologies in Section 8. This is followed by related work and conclusions in Section 9 and Section 10, respectively.

2 ORCHESTRATION STACK OVERVIEW

Fig. 1: The Orchestration stack (focus of this paper in bold).

This section gives an overview of our hardware/software stack, which is detailed in Sections 3–6. A conceptual view of the stack is shown in Figure 1. Each layer hides certain aspects of heterogeneity, propagating just enough for higher-level layers to adjust and use the available resources efficiently.

At the hardware level, all we require is that the components have a well-defined interface that allows embedding them into a tile-based architecture and that allows exchanging data and commands using some kind of network (e.g., a network-on-chip (NoC) [17]). New architectures and/or technologies providing such an interface can hence be embedded into this framework, enabling flexible design of future heterogeneous systems. As we make no additional assumptions on either the internal operational behavior of components or their healthiness, we developed hardware features to effectively isolate components at the NoC level and control mechanisms for data and command exchange.

These hardware mechanisms are then used by the operating system (OS) to isolate applications and establish communication channels to other tiles and remote memories. Since we need to support a heterogeneous underlying architecture with varying processing capabilities, it may not be feasible to run a full-fledged OS on each of the tiles. We therefore use NoC-level isolation, where kernel instances run on dedicated tiles and remotely control the applications running on the remaining tiles (e.g., an accelerator or an FPGA) on bare metal.

On top of the OS, application runtimes make decisions based on information queried from the OS. For example, they consider global metrics such as resource utilization, power, and the performance of applications and determine, among others, the mapping of applications to the tiles. During this mapping, the expected future resource-performance trade-offs are also evaluated. While these decisions are made by the runtimes, they are enabled by the compiler, which identifies application demands specific to the hardware, while exploiting heterogeneous computation, communication, and storage resources.

The upper layers of the stack in Figure 1, namely adaptable application algorithms and programming abstractions, are not the subject of this paper. These layers deal with describing, algorithmically and programmatically, application semantics in a way that makes it easier to adapt to and exploit heterogeneous hardware, striving for future-proof programming. To this end, we develop parallel skeletons [18] and domain-specific languages (DSLs) together with domain experts. Examples are parallel skeletons and semantic composition [19] for computational fluid dynamics (CFD) applications [20], DSLs for CFD and particle-based simulations [21], [22], and hardware acceleration for databases [23].

Last but not least, we integrate formal methods in our design process to quantitatively analyze low-level resource management protocols for stochastically modeled classes of applications and systems [28].

Fig. 2: Chip photos of Tomahawk generations. (a) Tomahawk1 [24] (UMC 130 nm). (b) Tomahawk2 [25] (TSMC 65 nm). (c) Tomahawk3 [26] (GF 28 nm). (d) Tomahawk4 [27] (GF 28 nm).

The model-based formal quantitative analysis, carried out using probabilistic model checking techniques, is particularly useful for the comparative evaluation of (existing and future) design alternatives, to compare the performance of heuristic approaches for system management with theoretical optimal solutions, or to determine optimal system parameter settings.

3 THE TOMAHAWK ARCHITECTURE

As mentioned in the stack overview in Section 2, we consider two key aspects at the hardware level to handle heterogeneity. First, we think tile-based systems are promising for easy integration of heterogeneous components at the physical level (see further discussion in Section 8). This architectural paradigm has already proven effective for handling design complexity and scalability up to 1000 processing elements (PEs) [29], [30], [31] and for heterogeneity [32], [33]. Second, and more importantly, the hardware should include efficient mechanisms that provide uniform access to specialized units via the NoC. Hardware support for message passing and capabilities is key for system performance, as will be discussed in Section 4. In this section we briefly describe actual silicon prototypes of tiled systems in Section 3.1, followed by a detailed description of the unified access control mechanisms in Section 3.2.

3.1 Tomahawk Overview

Tomahawk is a family of tiled heterogeneous multiprocessor platforms with a NoC-based interconnect [34]. A total of four generations of the Tomahawk architecture have been designed; they are displayed in Figure 2. A Tomahawk platform consists of multiple heterogeneous tiles, peripherals, an FPGA interface, and off-chip memory. All tiles contain interfacing logic to the associated NoC router. Apart from the required NoC interface, tiles may include different kinds of processing elements (standard or specialized instruction-set architectures (ISAs) or hardwired accelerators).

Most of the tiles found in the first four Tomahawk generations contain one or more processing elements and a local scratchpad memory. In particular, Tomahawk4 [27] includes in total six tiles which are connected by a hexagonal NoC: four data processing modules (DPM), one baseband processing module (BBPM), and one application control module (ACM), as depicted in Figure 3. The DPMs comprise a Tensilica Xtensa LX5 and an ARM Cortex-M4F processor as well as 128 kB of local memory and a data transfer unit (DTU). The Xtensa is built upon an ASIP approach and accelerates basic database operators by instruction set extensions (similar to [26]). Both cores share the same local memory, while only one core is active at a time. The DTU is responsible for data transfers between the tiles and provides isolation capabilities (more details in Section 3.2). The BBPM is tailored to signal processing applications and integrates an Xtensa LX5 as well as application-specific integrated circuits (Sphere Detector (SD), Turbo Decoder (TD)) with separate memories. The Tensilica 570T CPU, included in the ACM, is a general-purpose core with a full MMU and 32 kB cache, which is used as host processor. The FPGA interface allows chip-to-chip interconnections, and the LPDDR2 interface provides access to a 128 MB SDRAM.

3.2 Uniform Access and Control

Heterogeneity allows, on the one hand, increasing energy efficiency due to specialization. On the other hand, it creates significant challenges for hardware and software integration. A novel hardware mechanism is needed that provides unified access to and protection of system resources, but does not rely on specific features of the processing elements (e.g., user/kernel mode). This is similar to what a memory management unit (MMU) accomplishes in a traditional system. In the presence of heterogeneous tiles, this mechanism has to be provided at the tile-system interface. In this way, one can ensure that future heterogeneous tiles can be embedded safely without tampering with the state of the system. The unified access must be simple enough to not incur performance penalties on running applications.

In Tomahawk, unified access control and isolation are provided by a Data Transfer Unit (DTU) [35] in every tile. This peripheral represents a single access point to tile-external resources, which yields a mechanism to provide isolation at the NoC level. The DTU can be configured, e.g., by the OS as discussed in Section 4, so that a functionality running on a tile can communicate via message passing with a clearly defined set of tiles. Message passing is the mechanism used to implement remote system calls, which allows tiles not capable of running an OS to access system services.


Fig. 3: Block diagram of Tomahawk4 architecture.

Fig. 4: Illustration of a system call via DTU.

For memory access and message passing, the DTU provides a set of endpoints. Each of these endpoints can be configured to: (i) access a contiguous, byte-granular, tile-external range of memory, (ii) receive messages into a ring buffer in tile-internal memory, or (iii) send messages to a receiving endpoint. Once their DTU is configured, tiles can communicate and access external memory. Given the distributed memory paradigm of the hardware, memory coherency can be implemented through message passing as a layer on top of the DTU. Furthermore, for dataflow applications the DTU can be used to directly implement channels, thereby lowering the communication overhead on the CPU. The DTU implementation in the Tomahawk4 occupies an area of 0.063 mm², which is about 6% of a DPM tile.
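To make the endpoint abstraction concrete, the following sketch models the three endpoint configurations as a C data structure. All names and field layouts are illustrative assumptions derived from the description above, not the actual Tomahawk4 DTU register map.

```c
#include <stdint.h>

/* Sketch of a DTU endpoint table entry, mirroring the three endpoint
 * configurations described in the text (illustrative, not the real layout). */
typedef enum { EP_INVALID, EP_MEMORY, EP_RECEIVE, EP_SEND } ep_kind_t;

typedef struct {
    ep_kind_t kind;
    union {
        struct {                /* (i) byte-granular tile-external memory */
            uint64_t base;      /* start of the accessible range */
            uint64_t size;      /* length of the range in bytes */
            uint8_t  perms;     /* read/write permission bits */
        } mem;
        struct {                /* (ii) receive into a local ring buffer */
            uint64_t ringbuf;   /* buffer address in tile-internal memory */
            uint32_t slots;     /* number of message slots */
            uint32_t msg_size;  /* fixed size of each message */
        } recv;
        struct {                /* (iii) send to a remote receive endpoint */
            uint32_t dest_tile; /* NoC identifier of the destination tile */
            uint32_t dest_ep;   /* receive endpoint index at the destination */
            uint32_t max_msg;   /* maximum message size enforced by the DTU */
        } send;
    } cfg;
} dtu_endpoint_t;
```

Because only privileged tiles may write such entries (Section 4), a tile cannot widen its own communication rights; the DTU simply refuses transfers that do not match a configured endpoint.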
4 THE M3 OS

In a heterogeneous setting, with general-purpose and application-specific tiles like the Tomahawk, we cannot expect all units to have the ISA and other architectural properties required to run full-fledged OSs. Therefore, we designed a novel OS architecture called M3 [35], where the kernel runs on dedicated tiles and remotely controls the applications running on the remaining tiles. The M3 kernel is not entered via interrupt or exception like traditional kernels. Instead, it interacts with the applications via messages. In spite of this, we still call it "kernel" since, as with traditional kernels, its main responsibility is to decide whether an operation is allowed or not. Message passing is supported and isolation is ensured by the DTU described in Section 3.2. In this model, the task of the OS then becomes to set up the processing elements and their interaction structures using the endpoints of the DTUs. This design removes all requirements on features of heterogeneous processing elements, with the sole exception that they can copy data to a certain address to initiate a message or memory transfer to another tile. We explain how M3 ensures isolation in Section 4.1, give details of the software architecture in Section 4.2, and close with a discussion of the limitations of the approach in Section 4.3.

4.1 M3 Overview

As mentioned earlier, we use NoC-level isolation to provide isolation even for tiles that are not able to run an OS kernel. It is based on the DTU as the common interface of all tiles and the differentiation between privileged and unprivileged tiles. The key to isolation is that only privileged tiles can configure the DTU's endpoints to, e.g., create communication channels for message passing. At boot, all tiles are privileged. Since the M3 kernel deprivileges the application tiles, only the kernel tiles can configure endpoints. Once an endpoint has been configured, applications can use it directly for memory access or communication, without involving any OS kernel. Who is allowed to communicate with whom is controlled via capabilities, as discussed in Section 4.2.

There are also other approaches that target heterogeneous systems, such as Barrelfish [36], K2 [37], and Popcorn Linux [38]. However, they assume that all cores can execute a traditional kernel, requiring user-/kernel-mode separation and MMUs. Approaches using software-based isolation, such as Helios [39], do not require an MMU, but are restricted to units that execute software and limited to a certain set of programming languages. Our DTU-based approach allows M3 to control any kind of tile, e.g., containing a general-purpose core, a hardware accelerator, or a reconfigurable circuit.

We illustrate how communication via the DTU works using a system call as an example (see Figure 4). On traditional systems, system calls are issued by switching from user mode into kernel mode. On M3, system calls are requested by sending a message via the DTU to the corresponding kernel tile. First, the application prepares the message in the local memory (marked in orange in the figure) of Tile 1. The application then instructs the DTU to use the preconfigured send endpoint S in Tile 1 to send the message to the receive endpoint R at the kernel's DTU in Tile 2. In the third step, the DTU streams the message over the NoC to the receiving tile into a ring buffer of Tile 2, pointed to by the receive endpoint. The DTU wakes up the kernel, which in turn will check for new messages, and, in the final fifth step, process the message in its local memory.
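The five steps translate into very little code on either side. The sketch below is a minimal illustration under assumed primitives dtu_send(), dtu_fetch(), and dtu_wait(); the actual M3 interface is register-based and differs in detail.

```c
/* Hypothetical DTU primitives; the real interface is memory-mapped. */
void  dtu_send(int ep, const void *msg, unsigned len); /* trigger send EP */
void *dtu_fetch(int ep);                               /* next message or NULL */
void  dtu_wait(void);                                  /* sleep until DTU event */
void  handle_syscall(void *msg);                       /* kernel-side handler */

enum { EP_SYSCALL = 0, EP_KRECV = 1 };  /* endpoint indices (assumed) */

/* Application on Tile 1: steps 1 and 2. */
void do_syscall(const void *msg, unsigned len) {
    /* 1: msg was already prepared in local scratchpad memory. */
    /* 2: instruct the DTU to send via the preconfigured endpoint S;
     *    the DTU streams it over the NoC into the kernel's ring buffer (3). */
    dtu_send(EP_SYSCALL, msg, len);
}

/* Kernel on Tile 2: steps 4 and 5. */
void kernel_loop(void) {
    for (;;) {
        dtu_wait();                        /* 4: woken up by the DTU */
        void *m;
        while ((m = dtu_fetch(EP_KRECV)))  /* check ring buffer */
            handle_syscall(m);             /* 5: process in local memory */
    }
}
```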


4.2 Software Architecture

M3 follows the microkernel-based approach [40], [41], [42], where OS functionality is divided into a small kernel and several services running on top of the kernel. The reason for this design is not specifically enabling heterogeneity, but its security and reliability advantages. In such microkernel-based systems, the kernel enforces separation of processes (applications and services) and provides some forms of communication between them. To decide which processes are allowed to interact with each other, the kernel uses capabilities and enforces rules on how the capabilities can be created, passed on, and otherwise manipulated by service and application processes. A capability is thereby a pair of a reference to a kernel object and permissions for the object. In classical microkernel OSs, for example L4-based systems [41], the enforcement of separation and capabilities relies on memory management and user/kernel mode. Message passing between processes, for example between an application and a file service, requires a call to the microkernel. In contrast, in DTU/M3-based systems, separation and capability enforcement rely on the DTUs. Once a communication channel is set up, i.e., the involved send and receive capabilities have been mapped to endpoints in the sending and receiving DTUs, exchanging messages no longer needs kernel interaction. Thus, the DTU is responsible for enforcing the configured constraints (e.g., a maximum message size).
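As an illustration, the pair structure of a capability and the one-time check the kernel performs when mapping it to an endpoint can be sketched as follows. The type and permission names are assumptions for this sketch, not M3's actual definitions.

```c
/* A capability: a reference to a kernel object plus permissions for it. */
#define PERM_READ  0x1
#define PERM_WRITE 0x2
#define PERM_SEND  0x4

typedef struct {
    void    *obj;    /* reference to the kernel object (e.g., a message gate) */
    unsigned perms;  /* rights the holder has on that object */
} capability_t;

int dtu_configure(int tile, int ep, void *obj);  /* assumed kernel primitive */

/* The kernel checks a capability once, when configuring a DTU endpoint.
 * Afterwards, message exchange needs no kernel interaction: the DTUs
 * enforce the configured constraints themselves. */
int map_send_capability(const capability_t *cap, int tile, int ep) {
    if (!(cap->perms & PERM_SEND))
        return -1;                       /* insufficient rights: rejected */
    return dtu_configure(tile, ep, cap->obj);
}
```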
On top of capabilities, M3 adds POSIX-like abstractions for, e.g., files and pipes. They free the programmer from directly dealing with capabilities and also provide support for legacy applications. However, if desired, applications can also use capabilities directly. One example is Canoe, which runs dataflow applications on M3, as will be explained in more detail in Section 5.4. M3's concepts allow Canoe to run multiple, mutually distrusting dataflow applications simultaneously in an isolated fashion.

4.3 Discussion

The strict separation of application tiles and kernel tiles has several benefits. Since resources are not shared (e.g., registers, caches, or translation lookaside buffers), registers do not need to be saved on system calls, and the potential for cache misses after a system call is also reduced. With the right programming model, it is also easier to add tiles, potentially with appropriate specializations, to a running application (similar to invasive computing [32] or the mapping variants discussed in Section 5). Finally, the kernels will not compete with applications for specialized resources, as they do not necessarily benefit from the same hardware features (e.g., reconfigurable and application-specific circuits).

For more general systems, the DTU/M3 concept requires several extensions. For example, for multiple applications to share a tile, we have an extension to the M3 kernel that allows multiplexing remote tiles. We have also designed an extension to support complex cores with fully integrated caches and virtual memory (such as modern x86 cores) and to transparently add caches and virtual memory to arbitrary units such as accelerators. Finally, to scale to systems with a larger number of tiles, we have an ongoing project to allow multiple instances of the M3 kernel that synchronize their state. All these extensions are, however, out of the scope of this paper.

5 DATAFLOW PROGRAMMING AND RUNTIMES

Having introduced the Tomahawk architecture and the M3 OS, we now focus on application programming and runtime systems. Given the increased complexity of heterogeneous systems, we opt for an automatic approach instead of resource-aware programming, in which the programmer is responsible for handling resource allocation and deallocation. An automatic approach builds on (i) hardware cost models to enable heterogeneity-aware optimizations, (ii) runtime systems that intelligently assign resources to applications depending on the system state, and (iii) parallel programming models with clear execution semantics that allow reasoning about parallel schedules at compile time. To this end we employ dataflow programming languages, which cover a broad range of applications while still being mostly analyzable. The analysis and information collected at compile time is then exploited at run time to deploy the application to the heterogeneous fabric in a near-optimal way. Additionally, dataflow programming represents a good match to the message-passing, non-coherent memory architecture of the system.

In this section we provide an overview of the dataflow programming flow in Section 5.1, discuss hardware models for automatic optimization in Section 5.2, explain the methodology to find multiple static mappings in Section 5.3, and describe how these mappings are deployed at run time by the Canoe runtime system in Section 5.4.

5.1 Compilation Overview

As mentioned above, we use a dataflow programming model to represent applications. In this model, an application is represented as a graph of communicating actors (similar to task graphs). This explicit parallel representation offers better analyzability than, for example, programming with threads (or abstractions thereof like OpenMP) [43]. The clear separation between communication, state, and computation makes it easier to change the implementation of actors without affecting the rest of the application. This, in turn, enables deploying actors to adequate heterogeneous resources in a safe and transparent way. From the perspective of the overall stack (recall Figure 1), dataflow can be seen as one of multiple intermediate abstractions. Higher-level models or DSLs map to one of these abstractions, empowering the compiler to reason about resource allocation in a similar way as with dataflow (e.g., ordering data accesses for tensor computations in CFD [22]).

An overview of the compilation flow is shown in Figure 5. Dataflow applications are written in the language "C for Process Networks" (CPN) [44]. CPN allows specifying Kahn Process Networks [45], a dynamic dataflow programming model with a higher expressivity than static models like synchronous dataflow [46]. Apart from the CPN code, the compiler receives application constraints, application meta-information, and an abstract model of the target architecture. Constraints specify, among others, the set of resources the application should run on or real-time constraints (latency and throughput).


Fig. 5: Compiler tool flow.

The architecture model and the meta-information are further explained in Section 5.2. Due to the dynamic application model, we use profile-directed compilation, i.e., we collect execution information to feed the optimization process. The optimizer determines near-optimal mappings, i.e., an assignment of actors to processors and data to memories. This is done iteratively to find multiple mappings with different trade-offs (e.g., performance and resource utilization). Mapping variants differ from each other in the amount and type of resources they use. Which variant is actually deployed is decided at run time by the Canoe runtime system.

The generated variants (right-hand side of Figure 5) use target-specific C code with calls to the Canoe application programming interfaces (APIs) for resource allocation, task management, and communication. With this programming flow we aim at relieving the programmer from dealing with resource management herself. We believe that with higher heterogeneity and core counts, resource allocation, utilization, and resizing should be transparent and handled by the software stack.

There are many frameworks for analysis, mapping, and code generation for dataflow applications [47], [48], [49], [50], [51]. We focus on generating variants that exploit resource heterogeneity and on methods that allow these variants to be transformed by the runtime system depending on resource availability (see Section 5.3).
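For illustration, the sketch below shows the shape of a two-actor Kahn process network as targeted by this flow, written in plain C with hypothetical blocking channel primitives; the actual CPN syntax and the code generated against the Canoe APIs differ.

```c
/* A minimal producer/consumer KPN: two actors and one FIFO channel.
 * chan_read()/chan_write() are assumed blocking primitives with KPN
 * semantics (reads block until data is available). */
typedef struct channel channel_t;                            /* opaque FIFO */
int  chan_read(channel_t *c, void *buf, unsigned n);         /* blocking read */
void chan_write(channel_t *c, const void *buf, unsigned n);  /* blocking write */
void process(int token);                                     /* actor-local work */

void producer(channel_t *out) {
    for (int i = 0; i < 6000; i++)      /* emit a stream of tokens */
        chan_write(out, &i, sizeof i);
}

void consumer(channel_t *in) {
    int token;
    while (chan_read(in, &token, sizeof token) > 0)
        process(token);                 /* compute on each token */
}
```

Because each actor only touches its ports and local state, the optimizer is free to place producer and consumer on different tiles and to implement the channel with whichever memory the selected variant prescribes.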
5.2 Compiler: Taming Heterogeneity

To support heterogeneity, an abstract model of the hardware platform is used. The model describes the topology of the hardware, including programmable PEs, hardware accelerators, interconnect, and memories. PEs are described by the instruction set, the number of resources (e.g., functional units, pipelines, and number of registers), and latency tables [52]. Accelerators are annotated with a list of design parameters (e.g., data representation or number of points of a fast Fourier transform). Latency, area, and other cost functions are defined as arbitrary functions of the parameters [53]. This high-level modeling of hardware resources can be applied in the future to components implemented with emerging technologies.

To be able to use accelerators, if available, meta-information is added. This information indicates to the compiler that a given actor is actually an instance of a known algorithmic kernel (e.g., a filter or a decoder) [53]. Together with the abstract platform model, the compiler can then automatically generate code and use accelerators from a functional specification of the application.

Communication is modeled with cost functions that indicate how many cycles are required to transmit a given amount of bytes. These cost models account for the position of source and sink tiles as well as the selected path through the hardware topology [54]. A cost model can be obtained by measuring actual communication costs on the target platform. For instance, the upper plot in Figure 6 shows a cost measurement for communicating data tokens on the Tomahawk via DRAM. Further, the plot illustrates the derivation of a cost function using piecewise linear regression.
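A fitted cost function of this kind can be represented directly, as in the following sketch; the breakpoints and coefficients are placeholders, not the values measured on the Tomahawk.

```c
/* Piecewise linear communication cost: cycles as a function of token size.
 * Segment boundaries and coefficients below are placeholders. */
typedef struct {
    unsigned start;  /* segment start: token size in bytes */
    double   base;   /* cycles at the segment start */
    double   slope;  /* extra cycles per byte within the segment */
} cost_segment_t;

static const cost_segment_t model[] = {
    {    0, 200.0, 0.30 },
    { 1024, 500.0, 0.25 },
    { 2048, 760.0, 0.22 },
};

double comm_cost(unsigned bytes) {
    int i = (int)(sizeof model / sizeof model[0]) - 1;
    while (i > 0 && model[i].start > bytes)
        i--;                             /* find the enclosing segment */
    return model[i].base + model[i].slope * (bytes - model[i].start);
}
```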
These analytical cost models have two applications within the compiler. First, the optimizer bases its decisions on the cost models in order to find a near-optimal mapping. Second, a simulation module uses the cost models to simulate a complete execution of an application for a given variant. This allows the compiler to provide estimations of the application's performance. For instance, the middle plot in Figure 6 compares the estimated and measured costs for a simple test application². The plots show two variants, one that implements channels by placing data in the scratchpad memories of producing PEs, and one that places data in the scratchpad of consuming PEs.

2. The test application consists of an 8-stage pipeline transferring a total of 6,000 tokens.

To add better support for NoC-based platforms, we extended the analytical communication model by an annotation of shared resources (e.g., a link). During simulation, a process first has to acquire all resources before it can initiate a transfer. This allows for simulation of contention in the network. The bottom plot in Figure 6 shows performance estimation results for the same application as in the middle plot, but in a congested network. The network congestion was achieved by choosing a worst-case mapping that maximizes the interference of independent communications on the network links.

5.3 Mapping Variants and Symmetries

When generating code, we create different execution variants. These variants have different execution properties, different goals (e.g., energy efficiency or execution time), and different resource utilization. Concretely, a variant corresponds to a mapping of the application to a given set of platform resources, including tasks to PEs, communication to links, and data to memories. Based on the hardware models, each variant is annotated with the estimated performance, energy usage, and resource utilization. These annotated variants are given to the Canoe runtime system. It can then select and adapt the variant that best suits the system's load and current objectives.


Fig. 6: Abstract models of Tomahawk for compiler optimizations. (Top: measured produce/consume costs over token size with a piecewise linear regression. Middle: cost prediction without congestion. Bottom: cost prediction with congestion; both compare measured and simulated execution times for producer and consumer variants over token sizes of 0–4,096 bytes, 1,000 iterations.)

Generating variants and modifying them at run time is non-trivial for heterogeneous systems. To accomplish this, we worked on an abstract characterization of symmetries in heterogeneous systems that serves to explore the space of variants at compile time and to transform variants at run time. These two aspects are further discussed in the following.

5.3.1 Obtaining Static Variants

To obtain each variant, we fix constraints on resources and set the objectives of the variant. We then use compile-time heuristics based on profiling information and the architecture model [55], [56] to obtain a static mapping and its predicted performance, which corresponds to the desired variant.

Intelligently exploring subsets of heterogeneous resources for constraints is non-trivial³. For this purpose we exploit symmetries in the target platform to reduce the search space [57]. This is achieved by identifying that several mappings have the exact same properties, by virtue of symmetries of the architecture and application. For example, consider the mapping shown as the lower variant in Figure 7, using three tiles. To the right, four mappings that are equivalent to it are illustrated: they all schedule the same processes in equivalent tiles, and the communication patterns between said tiles are also identical. Out of these equivalent mappings, only one has to be evaluated to produce a variant during the compile-time design-space exploration (see examples in Section 7.2).

3. It is trivial in a homogeneous setup: it simply consists in iteratively adding one more resource. A naive approach for heterogeneous systems would explore the power set of the set of system resources.
5.3.2 Deploying Variants with Symmetries

Having produced several annotated mappings at compile time, they need to be deployed at run time by the system. For this, we use symmetries again. However, instead of removing equivalent mappings, we generate them. The mapping represented by a variant can be transformed to an equivalent mapping. This transformation is performance-preserving, i.e., the mappings before and after the transformation have the same expected performance [58]. Again, this is non-trivial for heterogeneous systems.

By efficiently storing these transformations, the runtime system can find any equivalent mapping for a given variant with negligible overhead [58]. When a new application is executed, the system only needs to know the resources it can allocate and a variant to be used. Given a variant, it uses the symmetries to find an equivalent mapping that fits the current, dynamic resource constraints of the system.
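Conceptually, such a transformation is a permutation of tile identifiers that respects tile types and the interconnect topology. The sketch below shows how applying one permutation turns a mapping into an equivalent one; the data layout is illustrative, and the actual representation in [57], [58] is more compact.

```c
#define NUM_TASKS 8
#define NUM_TILES 8

typedef struct {
    int tile_of_task[NUM_TASKS];   /* task index -> tile identifier */
} mapping_t;

/* perm is one platform symmetry, e.g., a rotation of the hexagonal NoC
 * that maps DPM tiles onto DPM tiles; applying it preserves the expected
 * performance of the mapping. */
mapping_t apply_symmetry(const mapping_t *m, const int perm[NUM_TILES]) {
    mapping_t out;
    for (int t = 0; t < NUM_TASKS; t++)
        out.tile_of_task[t] = perm[m->tile_of_task[t]];
    return out;
}
```

At compile time, only one canonical representative per equivalence class is evaluated; at run time, the stored permutations are enumerated until one maps the variant onto tiles that are currently free.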
5.4 The Canoe Runtime System

Canoe is a runtime environment for process networks. It supports KPNs and an extension thereof that combines KPNs with task graphs, i.e., it is possible to dynamically spawn short-lived tasks in Canoe. From the point of view of the stack, Canoe is an application on top of M3, as illustrated in Figure 8.

The Canoe application model is compatible with that of the dataflow programming model discussed in Section 5.1. In Canoe, a dataflow application is a network composed of three basic building blocks, namely tasks, channels, and blocks. A task specifies a computational process with its communication defined by a set of input and output ports. Each port has a direction (inbound or outbound) on which the data is read or written in a sequential fashion. Each task has an associated set of binaries, one for each PE type. A channel is a FIFO buffer connecting an outbound port of one task to an inbound port of another task. Finally, a block is an immutable data object that can also be connected to task ports. Blocks are intended to store intermediate results between the execution of two tasks. To honor the immutability attribute, a block can be connected to an outbound port exactly once. After that, it can be used on an arbitrary number of inbound ports. In both cases the block will act as a data sink or source that accepts or provides a number of bytes equal to the block size. Canoe ensures that all inbound blocks are fully available at task start and all outbound blocks are filled at task termination. This results in a dependency network that influences the execution order of tasks connected via blocks.
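To give an impression of how a generated variant assembles such a network, the following sketch uses a hypothetical canoe_* API surface; the real Canoe interface differs.

```c
/* Hypothetical Canoe construction API (illustrative only). */
typedef struct canoe_net   canoe_net_t;
typedef struct canoe_task  canoe_task_t;
typedef struct canoe_chan  canoe_chan_t;
typedef struct canoe_block canoe_block_t;

canoe_net_t   *canoe_net_create(void);
canoe_task_t  *canoe_task_add(canoe_net_t *n, const char *name,
                              const void *bin_xtensa, const void *bin_arm);
canoe_chan_t  *canoe_chan_add(canoe_net_t *n, unsigned fifo_size);
canoe_block_t *canoe_block_add(canoe_net_t *n, const void *data, unsigned len);
void canoe_connect_out(canoe_task_t *t, int port, canoe_chan_t *c);
void canoe_connect_in (canoe_task_t *t, int port, canoe_chan_t *c);
void canoe_connect_blk(canoe_task_t *t, int port, canoe_block_t *b);
int  canoe_deploy(canoe_net_t *n);  /* manager picks a conflict-free mapping */

void build_network(const void *bx, const void *ba,
                   const void *cfg, unsigned cfg_len) {
    canoe_net_t  *net  = canoe_net_create();
    canoe_task_t *src  = canoe_task_add(net, "src",  bx, ba); /* one binary per PE type */
    canoe_task_t *sink = canoe_task_add(net, "sink", bx, ba);

    canoe_chan_t *ch = canoe_chan_add(net, 4096);  /* FIFO channel */
    canoe_connect_out(src,  0, ch);                /* outbound port of src */
    canoe_connect_in (sink, 0, ch);                /* inbound port of sink */

    /* immutable block: written once, readable by any number of ports */
    canoe_block_t *params = canoe_block_add(net, cfg, cfg_len);
    canoe_connect_blk(sink, 1, params);

    canoe_deploy(net);
}
```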


Fig. 7: Illustration of variants for a baseband processing application. (Left: a baseband application with FFT, Disp, Eq1–Eq3, and IFFT1–IFFT3 processes serving three users. Middle: configuration variants, plotted as makespan in seconds over the number of PEs (1–8). Right: equivalent mappings on the NoC.)

Fig. 8: The software stack. Applications may run directly on top of the M3 operating system or, in the case of dataflow applications, on top of the Canoe runtime.

At startup, Canoe runs its manager process on one PE allocated by M3. From there, the manager creates a worker node pool by requesting additional PEs from M3, forking to them, and possibly later releasing them back to M3. Depending on load and policy, the manager process can grow or shrink its node pool at any time, as far as M3 provides additional PEs. The manager starts a Canoe worker instance on each PE, which is connected to the manager with two channels allowing both sides to send requests to the other. These control channels are used to define tasks, channels, and blocks in the manager as well as to instantiate these in the worker nodes. To create a channel reaching from one worker to another, the PE's DTU must be configured using an M3 capability of a communication endpoint on the remote PE's DTU. In order to establish a connection, the manager requests a communication preparation from both workers, which results in one capability each. It then instructs the kernel to transfer each capability to the other worker. Only then does it request both workers to initialize the connection. Since M3 prohibits the configuration of the local DTU by the PE itself, the worker requests the kernel to program the DTU using the received capability, finally creating the connection. Once established, neither the Canoe manager nor the M3 kernel is involved in the communication anymore.

When creating an application for Canoe, the compiler produces a description of a network from the CPN specification (see Section 5.1) using the three basic building blocks, including the FIFO sizes. The compiler also specifies one or more mapping variants that assign each task to one PE. Depending on the current system status, Canoe selects the best-suited mapping by checking the compiler-provided mappings for resource conflicts. Once a conflict-free mapping is found, the deploy process is started. The system status includes the PE pool and the already active mappings of other networks that may block some PEs. Furthermore, it has to be ensured that for each task a binary exists that matches the PE type the task is supposed to run on. Parameter sets in the input specification can be expressed as blocks that are connected to tasks as additional input. A FIFO channel is implemented by opening a communication using the DTU mechanism as described above. Since a DTU communication has a fixed message size but a FIFO does not, Canoe implements an intermediate layer that slices the FIFO data into DTU-message-sized blocks and reassembles them at the receiver side.
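A minimal sketch of this slicing layer, assuming fixed-size dtu_send()/dtu_recv() primitives (hypothetical names):

```c
#define DTU_MSG_SIZE 256u  /* placeholder for the fixed DTU message size */

void     dtu_send(int ep, const void *buf, unsigned len); /* assumed primitive */
unsigned dtu_recv(int ep, void *buf, unsigned max);       /* assumed primitive */

/* Sender: slice an arbitrarily long FIFO write into DTU-sized messages. */
void fifo_write(int ep, const char *data, unsigned len) {
    for (unsigned off = 0; off < len; off += DTU_MSG_SIZE) {
        unsigned chunk = len - off < DTU_MSG_SIZE ? len - off : DTU_MSG_SIZE;
        dtu_send(ep, data + off, chunk);
    }
}

/* Receiver: reassemble the stream in order from the ring buffer. */
unsigned fifo_read(int ep, char *buf, unsigned len) {
    unsigned off = 0;
    while (off < len) {
        unsigned got = dtu_recv(ep, buf + off, DTU_MSG_SIZE);
        if (got == 0)
            break;          /* channel closed or no more data */
        off += got;
    }
    return off;             /* number of bytes delivered to the FIFO reader */
}
```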
Apart from the actual functionality, the compiler generates code to start an M3 application. This code includes (i) a connection to the Canoe runtime to add itself to the node pool, (ii) commands to Canoe to create channels, blocks, and tasks, and (iii) logic to fill blocks with its own local data to pass it to the process network and requests to collect output data from sink blocks. The latter allows the application to communicate with the process network. Note that the M3 application is itself in the Canoe node pool, which means it can hold channel and block instances.

6 FORMAL METHODS

As mentioned in Section 2, we address cross-cutting concerns with formal methods (see Figure 1). Formal methods, e.g., for verification, have been used for a long time in hardware design to ensure correct-by-construction systems. We argue that formal methods should play a fundamental role in analyzing and verifying not only the hardware, but also the interplay with the software put in place to leverage the system's heterogeneity. Given global metrics such as resource utilization, energy consumption, performance of applications, and trade-offs thereof, formal methods can, e.g., help finding and improving existing heuristics for the mapping of application tasks to heterogeneous tiles. Such formal analysis (and synthesis) demands new formal methods for modeling and quantitative analysis. This includes automated abstraction techniques and symbolic methods that enable more efficient treatment of larger-scale system models.

In Section 6.1 we first provide a brief introduction to probabilistic model checking (PMC), which is our method of choice for a quantitative analysis of heterogeneous computing platforms. In Section 6.2 we give an overview of the essential concepts for the modeling and quantitative analysis, and in particular the new methods for a PMC-based trade-off analysis. The section closes with a short remark on the complexity of using PMC for (large-scale) heterogeneous computing platforms.


6.1 Probabilistic Model Checking

Formal methods include techniques for the specification, development, analysis, and verification of hardware/software systems, which complement the classical approach of simulation-based (performance) analysis and testing. They provide the foundations of a methodical system design and contribute to the reliability, robustness, and performance of systems (see, e.g., [59], [60], [61], [62]).

In this work, we focus on formal verification and (quantitative) analysis using (probabilistic) model checking techniques. Model checking is a formal algorithmic approach that uses graph algorithms to exhaustively explore the state space of a given model to verify/falsify a temporal logic property. The major benefit of model-based techniques such as model checking is the ability to predict and analyze systems that are still in their design phase, which is of great importance in the context of future post-CMOS technology, especially when there are no hardware realizations available yet. Model checking allows proving the functional correctness of hardware/software systems with respect to a given temporal specification. Probabilistic model checking (PMC) is a variant for the verification and quantitative analysis of stochastic systems.

Our goal is to use PMC techniques to support the design of the Orchestration stack in various ways: (i) modeling and formal analysis of orchestration concepts developed at each layer in isolation (e.g., automatic resource allocation used in M3 and Canoe as in Section 4 and Section 5.4, or the viability of the abstractions in Section 3.2), (ii) the interplay of concepts on two or more layers (e.g., advice for the compiler layer (see Section 5) with respect to OS and hardware specifics), as well as (iii) the effects on the whole system when combining orchestration methods across all layers of the stack. This demands modeling approaches and methods enabling the PMC-based quantitative analysis of heterogeneous computing platforms. The next section provides an overview of the fundamental concepts.

6.2 PMC for Heterogeneous Computing Platforms

PMC-based analysis requires abstract operational models that are compositional and capture the operational behavior of the hardware and/or software components together with their interplay. The compositional approach allows easily deriving model variants, refinements, and abstractions of a system model by exchanging specific components only. Component variants exposing the same operational behavior, but with different characteristics, e.g., in terms of performance, reliability, and energy consumption, can often be derived by simply exchanging the characteristic parameters within the model. For this part, we rely on an extension of our modeling framework for tiled architectures as introduced in [63] (used to analyze the Tomahawk4 in Section 7.1). The model is based on a feature-based modeling of probabilistic systems as presented in [64] that enables a family-based analysis, i.e., the simultaneous analysis of a model family with members for multiple different design variants. The underlying semantics are Markovian models such as Markov chains, which feature probabilistic branching only, and Markov decision processes (MDPs), which add nondeterministic choice. The nature of probabilism can be used, e.g., to model stochastic assumptions on the environment, i.e., the application load, or to abstract some complex system behavior such as caching into stochastic distributions. Another use for probabilism is, e.g., to incorporate hardware failure models.

In Markov chains, the classical PMC-based analysis allows computing single objectives such as probabilities of temporal properties. For instance, in a multicore we can compute the probability to acquire a shared resource without waiting for competing processes, or the likelihood of finishing a batch of tasks within given time bounds. Moreover, by providing annotations for costs or rewards to the models, the classical PMC analysis allows for computing expected values of the respective random variables. Examples are the expected waiting time to acquire the resource or the expected amount of energy required to finish a batch of tasks. The nondeterministic choice among different actions in the states of an MDP is used to model important decisions to be made at run time. For example, for the Tomahawk, this could be assigning a certain task to a specific PE or deciding on the powering or scaling of PEs. The classical PMC analysis allows computing minimal and maximal probabilities over all resolutions of these nondeterministic choices. From the extremal resolutions we can derive (nearly) optimal policies, which can provide valuable feedback for all layers of the Orchestration stack, e.g., for resource management with respect to performance or energy consumption. Here, the optimality refers to the single objective measure, i.e., probability or expectation.

Beyond classical PMC methods, we have developed new methods enabling the computation of more complex multi-objective criteria that combine multiple measures for costs and/or utility in different ways. Among others, we have developed new PMC algorithms for computing conditional probabilities and expectations [65], [66], [67] as well as cost-utility ratios and p-quantiles [68], [69], [70] in MDPs. The latter can, e.g., stand for the minimal time bound t such that there is a policy to schedule tasks in such a way that the execution of the tasks of a given list is completed within t time units with probability at least p. Another example of a p-quantile is the maximal utility value u such that there is a policy to schedule tasks such that the achieved utility value is at least u when having completed all tasks with probability at least p. Such methods are essential for the evaluation of the resource management policies as provided on the different layers of the Orchestration stack, as they enable multi-objective and trade-off analysis [71], [72]. With these methods at hand, one can now optimize for the trade-off between contradicting objective measures, e.g., energy and performance. For instance, one can compute a policy that minimizes the energy consumption while guaranteeing a certain throughput with sufficient probability. In Section 7.1 we report on the results of our analysis of the Tomahawk4 platform, where we apply classical PMC methods as well as our new algorithms.
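The first quantile mentioned above can be stated formally as follows, with notation assumed for this sketch: M is the MDP, σ ranges over scheduling policies, and allDone labels the states where all tasks of the given list are finished.

```latex
\[
  qu_{p} \;=\; \min\Bigl\{\, t \in \mathbb{N} \;\Big|\;
      \max_{\sigma} \Pr^{\sigma}_{\mathcal{M}}
      \bigl(\Diamond^{\leq t}\, \mathit{allDone}\bigr) \;\geq\; p \Bigr\}
\]
```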
6.3 Remark on Complexity

It is well known that model checking techniques suffer from the state-space explosion problem, as the size of the underlying model grows exponentially in the number of components.


The general problem is finding the right level of abstraction to capture exactly the information needed for the respective analysis. Apart from the expert knowledge that is needed here, there are several sophisticated (semi-)automated abstraction techniques adapted for probabilistic systems, such as partial-order reduction [73], [74], [75] and symmetry reduction [76], [77]. Besides such abstraction techniques, symbolic model representations and methods based on variants of binary decision diagrams, such as in [78], [79], [80], can be used in parallel and yield the basis for compact representations and efficient treatment of stochastic system models. In [81] we present symbolic algorithms for computing quantiles in MDPs as well as general improvements to the symbolic engine of the prominent probabilistic model checker PRISM [79], [82] that potentially help treating models of larger scale more efficiently. Beyond this, even when considering very abstract models of larger systems, e.g., when reasoning about a reduced number of PEs or hiding many operational details of the underlying hardware, we argue that our methods are still applicable and can provide guidelines for the design, e.g., by revealing trends and tendencies that are expected to occur in real hardware.

7 EVALUATION

We now study the techniques of Sections 3–6 on the Tomahawk testbed. We start by demonstrating the applicability of the formal methods to the Tomahawk hardware and particular layers of the stack. With increased system heterogeneity, these methods are crucial to ensure correct-by-construction systems. We then analyze the behavior of the lower layers of the stack on a simulated tiled architecture inspired by the Tomahawk architecture. We show how the compiler-generated variants can be deployed on different, heterogeneous resources by the Canoe runtime, and how Canoe interacts with M3 to claim resources when multiple applications run on the system. Using a simulator allows us to scale the system beyond what we currently can afford to put in a hardware prototype and, in the future, to add hardware based on emerging, not yet existing technologies.

7.1 PMC for Tomahawk

In this section we report on the main results gained in the quantitative analysis of the Tomahawk4 (T4) platform using probabilistic model checking (see Section 6.1). The general goal of our analysis is to (i) provide feedback to the compiler and the Canoe runtime system on how to use the T4 most efficiently and (ii) provide guidelines for the design of subsequent versions of the Tomahawk platform and wildly heterogeneous tiled architectures of the future in general.

In our modeling framework, a concrete instance is created by providing a list of task types (e.g., general computation task, FFT task, etc.) and/or code variants (e.g., parallel vs. sequential), a list of tiles and their type (e.g., ARM, or FFT accelerator) along with their operational modes (e.g., frequency), performance characteristics (per task type and mode), and energy profiles. This framework can easily be used for creating models for variants of the Tomahawk (e.g., increasing the number of BBPM PEs) and for including tiles based on future hardware with properties substantially different from those existing today.

TABLE 1: System parameters for PMC

(a) Power consumption of PEs

    load    @100 MHz    @400 MHz
    idle    1.3 mW      8.8 mW
    full    3.3 mW      25.5 mW

(b) Speedup for task types of PEs

    PE      DPM-accelerated    BBPM-accelerated
    DPM     12                 1
    BBPM    1                  8

7.1.1 PMC-based Analysis of the Tomahawk4

For the analysis we consider the T4 platform with four DPMs and one BBPM, as detailed in Section 3 and in [27]. The PEs can be powered on and off and operated at two different frequencies (see Table 1a). We use a discrete-time model and fix one time step to represent 10.0 ms.

On the application side we consider a batch processing scenario with two different task types (DPM- or BBPM-accelerated, see Table 1b) and a fixed number of tasks to be finished per task type. We assume that the execution time of each task follows a geometric distribution with an expected mean of 150.0 ms. The actual execution time depends on the type of the PE and the speed-step, i.e., the speedup provided by the respective PE and the selected frequency. For instance, if a DPM-accelerated task is scheduled on a fitting PE with a speedup of 12 that runs at 400 MHz, the actual expected execution time is about 20.0 ms. The performance characteristics used in Tables 1a and 1b are based on the values measured on the T4.
7 EVALUATION

We now study the techniques of Sections 3–6 on the Tomahawk testbed. We start by demonstrating the applicability of the formal methods to the Tomahawk hardware and particular layers of the stack. With increasing system heterogeneity, these methods are crucial to ensure correct-by-construction systems. We then analyze the behavior of the lower layers of the stack on a simulated tiled architecture inspired by the Tomahawk architecture. We show how the compiler-generated variants can be deployed on different, heterogeneous resources by the Canoe runtime, and how Canoe interacts with M3 to claim resources when multiple applications run on the system. Using a simulator allows us to scale the system beyond what we can currently afford to put in a hardware prototype and, in the future, to add hardware based on emerging, not-yet-existing technologies.
7.1 PMC for Tomahawk

In this section we report on the main results gained in the quantitative analysis of the Tomahawk4 (T4) platform using probabilistic model checking (see Section 6.1). The general goal of our analysis is to (i) provide feedback to the compiler and the Canoe runtime system on how to use the T4 most efficiently and (ii) provide guidelines for the design of subsequent versions of the Tomahawk platform and of wildly heterogeneous tiled architectures of the future in general.

In our modeling framework, a concrete instance is created by providing a list of task types (e.g., general computation task, FFT task, etc.) and/or code variants (e.g., parallel vs. sequential), a list of tiles and their types (e.g., ARM core or FFT accelerator) along with their operational modes (e.g., frequency), performance characteristics (per task type and mode), and energy profiles. This framework can easily be used to create models for variants of the Tomahawk (e.g., increasing the number of BBPM PEs) and to include tiles based on future hardware with properties substantially different from those existing today.
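To illustrate what such an instance looks like, the sketch below assembles a T4-like model description in plain Python. The structure, and names such as Mode and TileType, are hypothetical stand-ins for our actual input format; only the numeric values mirror Tables 1a and 1b.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Mode:
    freq_mhz: int
    busy_mw: float   # power when fully loaded (Table 1a)
    idle_mw: float   # power when idle (Table 1a)

@dataclass
class TileType:
    name: str
    modes: List[Mode]
    speedup: Dict[str, int]   # speedup per task type (Table 1b)

dpm = TileType("DPM",
               modes=[Mode(100, 3.3, 1.3), Mode(400, 25.5, 8.8)],
               speedup={"DPM-accelerated": 12, "BBPM-accelerated": 1})
bbpm = TileType("BBPM",
                modes=[Mode(100, 3.3, 1.3), Mode(400, 25.5, 8.8)],
                speedup={"DPM-accelerated": 1, "BBPM-accelerated": 8})

# A T4-like instance: four DPM tiles and one BBPM tile, two task types.
t4_instance = {
    "task_types": ["DPM-accelerated", "BBPM-accelerated"],
    "tiles": [dpm] * 4 + [bbpm],
}
print(len(t4_instance["tiles"]), "tiles")

Swapping the tile list or the mode tables then yields the model variants mentioned above, such as an instance with additional BBPM PEs.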

7.1.1 PMC-based Analysis of the Tomahawk4

For the analysis we consider the T4 platform with four DPMs and one BBPM, as detailed in Section 3 and in [27]. The PEs can be powered on and off and can be operated at two different frequencies (see Table 1a). We use a discrete-time model and fix one time step to represent 10 ms.

On the application side we consider a batch-processing scenario with two different task types (DPM- or BBPM-accelerated, see Table 1b) and a fixed number of tasks to be finished per task type. We assume that the execution time of each task follows a geometric distribution with an expected mean of 150 ms. The actual execution time depends on the type of the PE and on the speed-step, i.e., on the speedup provided by the respective PE and the selected frequency. For instance, if a DPM-accelerated task is scheduled on a fitting PE with a speedup of 12 that runs at 400 MHz, the actual expected execution time is about 20 ms. The performance characteristics in Tables 1a and 1b are based on the values measured on the T4.
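The mapping from expected execution times to the discrete-time model can be made explicit with a small sketch. We do not reproduce the exact calibration of speedups and frequencies used in our models here; the sketch only shows how a geometrically distributed completion time with a given mean is expressed as a per-step completion probability over 10 ms steps.

STEP_MS = 10.0  # one discrete time step of the model

def completion_prob(expected_ms):
    # A geometric distribution over steps with per-step success
    # probability p has mean STEP_MS / p.
    return min(1.0, STEP_MS / expected_ms)

p_base = completion_prob(150.0)   # unaccelerated task, mean 150 ms
p_fast = completion_prob(20.0)    # accelerated task (illustrative), ~20 ms
print(p_base, STEP_MS / p_base)   # 0.0666... 150.0
print(p_fast, STEP_MS / p_fast)   # 0.5 20.0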

The first goal (G1) of our analysis is to evaluate and quantify the benefits of heterogeneity. For this we compare the T4 configuration with a homogeneous variant consisting of five DPMs and no BBPM. The second goal (G2) is to evaluate the impact of using alternative code variants. We mimic this by varying the ratio between the numbers of tasks of the different types. This corresponds to having variability at the compiler layer, which can, for ten tasks, decide to generate, e.g., only DPM tasks and no BBPM tasks, or five tasks of each type. In the third setting, addressing goal (G3), we study optimal strategies for powering the PEs and choosing their frequencies. Results obtained from the formal analysis in these three dimensions support design decisions at the hardware level (see Section 3), the compiler level (see Sections 5 and 5.1), and the runtime management level as realized in Canoe (see Section 5.4), respectively. When comparing the variants in one of the above settings (G1–G3), we leave the other choices nondeterministically open; e.g., when considering (G1), we compare the heterogeneous with the homogeneous hardware variant on the basis of the best possible resolution of the remaining nondeterministic choices in the respective variant. Hence, we do the comparison in (G1) while considering a resource management policy that is optimal for the respective variant.

TABLE 2: Expected energy and time consumption for finishing a task quota

(a) homogeneous vs. heterogeneous

                      PEs           minimal expected
                #DPM    #BBPM     energy     time
homogeneous       5       0       2.5 mJ     99.9 ms
heterogeneous     4       1       0.6 mJ     90.4 ms

(b) code variants

accelerated tasks             minimal expected
DPM     BBPM                energy      time
10      0                   53.3 mJ     90.4 ms
8       2                   56.8 mJ     90.4 ms
5       5                   62.0 mJ     90.4 ms
2       8                   67.2 mJ     91.0 ms
0       10                  70.7 mJ     94.3 ms

(c) PE modes

strategy                      minimal expected
power        speed          energy      time
optimal      powersave      62.0 mJ     110.0 ms
optimal      performance    281.0 mJ    90.4 ms
always on    powersave      106.3 mJ    110.0 ms
always on    performance    599.8 mJ    90.4 ms

Fig. 9: Energy and time budgets to guarantee finishing the task quota with a certain best-case probability. Subfigures: (a) homogeneous vs. heterogeneous; (b) code variants (from 10 DPM/0 BBPM to 0 DPM/10 BBPM); (c) PE modes (always on vs. optimal, combined with performance vs. powersave). Each subfigure plots the energy [mJ] and time [ms] budgets against the probability p (0.9–1).
For the evaluation of the variants we use the following quantitative measures, each for time and energy. The first measure (M1) is the minimal expected time/energy consumption for finishing a certain number of tasks. Here, the minimum is taken among all possible resolutions of the remaining nondeterministic choices (e.g., on powering or scaling PEs) within the underlying MDP structure, as we aim to minimize the time/energy consumption. The second measure (M2) is a p-quantile: the minimal time/energy budget that guarantees the completion of all tasks with a maximal probability of at least p. Here, we use the maximal probability that can be achieved among all possible resolutions of the remaining nondeterministic choices (e.g., on assigning tasks to PEs), since we aim to maximize the probability of finishing a given number of tasks. The reason for considering the best resolution of the remaining nondeterminism in (M1) and (M2) is the assumption that this resolution is still under our control (e.g., by providing scheduling or dynamic voltage/frequency scaling strategies) and not in the hands of an uncontrollable environment.
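Both measures can be computed by standard dynamic programming over the MDP. The following self-contained sketch does so for a deliberately tiny toy model: k remaining tasks, a per-step choice between two PE configurations, and geometric completion per 10 ms step. It is a stand-in for the PRISM-based computation, and the probabilities and energy costs are invented for illustration (loosely inspired by Table 1a), not taken from our models.

STEP_MS = 10.0
# action -> (per-step completion probability, per-step energy in mJ);
# hypothetical values for illustration only.
ACTIONS = {"fast": (0.5, 0.255), "slow": (0.25, 0.033)}

def m1_min_expected_energy(tasks):
    # (M1): V(k) = min_a [ e_a + p_a*V(k-1) + (1-p_a)*V(k) ].
    # Solving the per-action fixed point gives V(k) = min_a e_a/p_a + V(k-1).
    v = 0.0
    for _ in range(tasks):
        v += min(e / p for p, e in ACTIONS.values())
    return v

def m2_time_quantile(tasks, prob, horizon=2000):
    # (M2): P[k][b] = max probability of finishing k tasks within b steps;
    # the quantile is the smallest budget b with P[tasks][b] >= prob.
    prev = [1.0] * (horizon + 1)            # zero tasks: always done
    for _ in range(tasks):
        cur = [0.0] * (horizon + 1)         # k tasks in 0 steps: impossible
        for b in range(1, horizon + 1):
            cur[b] = max(p * prev[b - 1] + (1 - p) * cur[b - 1]
                         for p, _ in ACTIONS.values())
        prev = cur
    for b, q in enumerate(prev):
        if q >= prob:
            return b * STEP_MS
    return None

print(m1_min_expected_energy(10))    # minimal expected energy [mJ]
print(m2_time_quantile(10, 0.98))    # time budget [ms] for p = 0.98

The real analysis differs mainly in scale: the states additionally track power modes and task assignments per PE, which is where the symbolic techniques of Section 6.3 become essential.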
7.1.2 Queries and Results

The full set of the following results was obtained within a few days of computation time and with at most 32 GB of memory.

In the first scenario we address goal (G1) and show that heterogeneity in hardware pays off if the application can be compiled to code that utilizes the respective strengths of the different PE types. Table 2a compares the results for measure (M1), the minimal expected energy consumption and execution time to finish five tasks of each type, for the heterogeneous T4 case and a hypothetical homogeneous case with five DPMs and no BBPM. Figure 9a depicts the results for measure (M2), the p-quantiles, namely the minimal energy and time budgets to finish ten tasks with a maximal probability of at least p. The heterogeneous configuration is both faster and more energy efficient. By considering alternative heterogeneous configurations of PEs featuring different accelerated functions, one could, for example, find an advantageous ratio of components for a certain application scenario. These insights can be used either at design time, to determine the concrete numbers within the architecture, or at run time, to decide how many PEs of certain types should be reserved for this application scenario.

In the second part of the analysis we focus on goal (G2). For this we assume that the total number of tasks to be finished is ten, for each of which the compiler can decide to produce DPM- or BBPM-accelerated code. We are now interested in the best ratio between alternative code variants and consider again measure (M1) (see Table 2b) and measure (M2) (see Figure 9b) for the comparison. Again, we measure for the best possible way of using the PEs, i.e., considering the minimal expected time/energy and the maximum probabilities among all possible resolutions of the nondeterministic choices. Surprisingly, having only DPM-accelerated tasks is the best variant in terms of minimal expected energy and, to a lesser extent, minimal expected time. This results from their greater acceleration on the corresponding PEs (see Table 1b). This result would be very different in an even more heterogeneous setting with restrictions on the compatibility of task types and PEs. Hence, for a given hardware configuration and application scenario, the results yield the basis for finding promising code variants; e.g., lookup tables could help the runtime system decide on code variants that are optimal with respect to the measures (M1) and (M2), assuming the hardware is used in the most efficient way possible.


Finally, we focus on combinations of heuristics for (1) turning PEs on and off (either always on or no heuristic) and (2) selecting their frequency mode (either powersave, performance, or no heuristic). The case where no heuristic is applied means that all choices are left open. Additionally, Table 2c confirms that the best strategy is to turn PEs off if not needed and to use the lowest frequency in order to assure maximal power savings. Locking the frequency dominates the overall energy consumption, no matter whether PEs can be turned off. On the other hand, minimal execution times can be expected if the highest frequency is chosen. However, the relative difference in execution times between the strategies is quite small. These observations carry over to the trade-off analysis shown in Figure 9c. Interestingly, the performance heuristic is less affected by the time budgets than the powersave heuristic.
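The direction of these results already follows from simple arithmetic on the power values in Table 1a, if one assumes, purely for illustration, that execution time scales inversely with frequency (this 4x scaling is our assumption, not a measured property of the T4):

# Power draw from Table 1a (mW).
P_FULL = {100: 3.3, 400: 25.5}
P_IDLE = {100: 1.3, 400: 8.8}

def energy_mj(freq_mhz, busy_ms, idle_ms=0.0, off_when_idle=True):
    idle_power = 0.0 if off_when_idle else P_IDLE[freq_mhz]
    return (P_FULL[freq_mhz] * busy_ms + idle_power * idle_ms) / 1000.0

work_at_400 = 100.0                 # ms of busy time at 400 MHz
work_at_100 = 4 * work_at_400       # assumed 4x slower at 100 MHz

print(energy_mj(400, work_at_400))  # 2.55 mJ, fastest option
print(energy_mj(100, work_at_100))  # 1.32 mJ, about half the energy
# Powersave wins on energy, performance wins on time, as in Table 2c.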
Fig. 10: The virtual platform used for evaluation, consisting of a NoC, a memory tile, 4 FFT accelerators, 12 ARM big cores, and 8 ARM little cores.
7.2 Running Software on Tomahawk

In this section we analyze the behavior of the lower layers of the stack (recall Figure 8). We first introduce the experimental setup and then discuss the results, giving insight into the run-time overheads incurred by Canoe and M3.

7.2.1 Experimental Setup
As mentioned above, we evaluate the proposed approach on a system simulator that mimics the Tomahawk architecture and allows analyzing larger, heterogeneous systems. An overview of the simulated platform is shown in Figure 10. The simulator is based on the gem5 simulation framework [83], extended by a SystemC model of the Tomahawk NoC. To enable this co-simulation, we use the SystemC binding from [84]. The simulated platform includes gem5 models of the ARM instruction set architecture (used in the T4) in a big.LITTLE configuration (to add more heterogeneity). The platform consists of 25 tiles in a five-by-five mesh topology, with 20 ARM tiles (12 big and 8 little). Apart from processors, the system contains four hardware accelerators for fast Fourier transformations (FFTs), typical in signal processing applications and similar to the ones in the actual T4. Further, there is one memory tile that connects to the off-chip RAM. Each tile includes a model of the DTU as well as a network interface that binds to the SystemC model of the Tomahawk NoC. The processing and accelerator tiles also include a scratchpad local memory. On this simulator, it is possible to run full applications, the M3 OS, and the Canoe runtime.

For the analysis in this paper, we use a dataflow implementation of a stripped-down version of the Long-Term Evolution (LTE) standard for mobile networks. The LTE-like application benchmark (see left-hand side of Figure 7) represents the baseband processing of an LTE frame at a cellular base station. It consists of applying an FFT to the data reaching the radio frontend and distributing chunks of data to user-specific data processing pipelines. For illustration purposes, we include equalization and inverse FFTs for each user. Additionally, we include a filesystem application that runs directly on top of M3, without using Canoe.
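The shape of this benchmark can be sketched as a small process network. The sketch below uses plain Python threads with bounded FIFOs and trivial kernel stubs (fft, eq, and ifft are hypothetical placeholders); it mirrors the structure of the LTE-like application, not its actual implementation on Canoe.

import threading, queue

NUM_USERS, FRAMES = 2, 3

def fft(x):  return x      # stand-ins for the real signal-processing kernels
def eq(x):   return x
def ifft(x): return x

def frontend(user_channels):
    # Apply an FFT to each incoming frame and distribute a chunk of the
    # result to every user-specific pipeline.
    for frame in range(FRAMES):
        spectrum = fft(frame)
        for chan in user_channels:
            chan.put(spectrum)

def user_pipeline(chan, sink):
    # Per-user processing: equalization followed by an inverse FFT.
    for _ in range(FRAMES):
        sink.put(ifft(eq(chan.get())))

sink = queue.Queue()
channels = [queue.Queue(maxsize=4) for _ in range(NUM_USERS)]  # bounded FIFOs
threads = [threading.Thread(target=frontend, args=(channels,))]
threads += [threading.Thread(target=user_pipeline, args=(c, sink))
            for c in channels]
for t in threads: t.start()
for t in threads: t.join()
print(sink.qsize(), "processed chunks")   # NUM_USERS * FRAMES = 6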

7.2.2 Single-application Perspective

We first evaluate the execution of the LTE-like application on the virtual platform. In order to generate code, we require a mapping of the processes of the application to hardware resources, which in turn requires a model of the target architecture. Concretely, this model is given as an XML description and includes information about the PEs, such as clock frequencies and properties of the ISA, as well as information like the interconnect types, bandwidth, and topology. Using a modified version of the commercial SLX tool suite [85], we can generate code for different mapping variants, as described in Section 5.3.
Fig. 11: Execution of different mapping variants of the LTE-like application on a heterogeneous platform.

Variant evaluation: Figure 11a shows four different mapping variants for the virtual platform from Figure 10. The first three variants, big, little, and mix, use only big ARM cores, only little ones, and both, respectively. The last mapping, accel, uses the FFT accelerators as well as big cores. The execution times for the LTE-like application with the different mappings are shown in Figure 11b. It is clear how leveraging heterogeneity is very beneficial for this application, while it obviously uses the limited resources of the system. Finally, Figure 11c shows an execution trace for the mix mapping variant. A similar trace for the big variant will be described in detail in the following. We see how, by using this programming flow, code can be generated transparently from the same source program to target heterogeneous hardware, including variants with different resource utilization and performance properties.

Application startup — Canoe and M3 signaling: As mentioned in Section 5.4, both Canoe itself and the dataflow application start as M3 applications. Canoe then creates a worker pool from resources made available by M3. The LTE application then opens a channel to the Canoe manager and sets up the dataflow network. The communication between the LTE application, the Canoe manager, and its workers is done via M3 but does not require the intervention of the kernel. An actual chart of the setup process is displayed in the bottom overlay of Figure 12. In the row labeled app, the green bars identify the intervals in which the application registers actual data with the manager, also marked data inject in the example. With its knowledge about the worker pool and the received network description, Canoe selects a mapping (marked with the label mapping in Figure 12). In the row labeled manager, we can identify the setup of application channels as green bars at the bottom. As an example, one such channel setup event is marked as comm setup in the figure. The system calls to the M3 kernel needed to set up the channels on the side of the workers and the manager are marked as thin red bars in the figure. The time spent in M3 syscalls for setting up a communication channel on the manager's side is 7 µs, which represents 1% of the total communication setup time. Table 3 lists this time alongside the times of several other events in the benchmark. Blue bars, like the ones connected with an arrow labeled block transfer in the bottom overlay of Figure 12, represent the moments at which the transfer of a data block becomes active. Finally, workers start computing; their execution is represented with gray bars, blocking reads with teal bars, and blocking writes with cyan bars. The actual transfer of data through a channel corresponds to the time elapsed between the end of a write request block (cyan) and the end of the corresponding read request block (teal). The top overlay zooms in on a series of data transfers between workers. The two red bars surround the interval of the data being transferred, which takes 15 µs, as stated in Table 3. This transition time, however, is less than 0.3% of the computation time of an FFT or EQ kernel. The dataflow of the LTE application is shown with blue arrows pointing from the origin of some data to its destination, for three selected communication hotspots marked with blue circles.

TABLE 3: Execution times of distinct actions and kernels in the simulation.

action            big          LITTLE       accelerator
app setup         12,752 µs
comm setup        664 µs
manager syscall   7 µs
worker syscall    3 µs
FIFO transition   15 µs
FFT kernel        4,629 µs     7,744 µs     10 µs
EQ kernel         15,040 µs    25,048 µs
Disp kernel       ~1 µs        ~2 µs

TABLE 4: Execution times of experimental setups with different mappings in the system simulator. The first four correspond to Figure 11b.

name           iteration     run time
big            15,254 µs     54,662 µs
little         25,313 µs     88,254 µs
mix            15,254 µs     57,838 µs
accel          15,276 µs     42,793 µs
mix (noisy)    15,254 µs     57,830 µs
mix (equiv.)   15,254 µs     58,065 µs
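The relative overheads quoted above follow directly from Table 3; a quick cross-check:

# Values from Table 3 (all in microseconds).
comm_setup, manager_syscall = 664, 7
fifo_transition, fft_big, eq_big = 15, 4629, 15040

print(f"syscall share of comm setup: {manager_syscall / comm_setup:.1%}")  # ~1%
print(f"transfer vs. FFT kernel:     {fifo_transition / fft_big:.2%}")     # ~0.3%
print(f"transfer vs. EQ kernel:      {fifo_transition / eq_big:.2%}")      # ~0.1%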
7.2.3 Multiple-application Perspective

In multi-application scenarios, it might not be possible to use a given mapping variant due to constraints on the available resources. As described in Section 5.3.2, we overcome this problem by transforming a mapping into an equivalent one, which keeps the desired properties in a predictable fashion. This is crucial in many application domains, e.g., real-time systems. To evaluate this, we first show that performance is stable among equivalent mappings. For this, we set up a system simulation with the LTE application, blocking some cores from execution to force the system to use a different but equivalent mapping.

To verify isolation and time-predictability, we also run simulations with a separate application running simultaneously with the LTE benchmark. To this end, we use a native M3 application that produces a high load on the filesystem. At setup time, the filesystem-intensive application starts as an M3 application. It uses the M3 API to connect to the filesystem and performs several operations in an endless loop.

In Table 4, we show the total run times of several simulations, as well as the length of one iteration of the LTE receiver (i.e., the inverse of the network's throughput). In addition to the four mappings from Figure 11, the mix setup has been simulated in the modified versions noisy and equiv., which denote the addition of the filesystem-intensive application and the use of a different albeit equivalent mapping, respectively. Note that, as expected, neither the introduction of a noisy environment nor the use of an equivalent mapping changes the run time.


Fig. 12: Trace of the simulator running the LTE app using the big mapping variant. Bottom overlay: setup phase concerning the application and manager core. Highlighted are the mapping decision point (mapping), an interval of the app passing data to the manager (data inject), a setup interval for an inter-worker communication channel (comm setup), and an actual data transfer resulting from a data inject (block transfer). Top overlay: detailed view of an in-KPN data transfer. Blue arrows: examples of in-KPN data transfers around the circled communication hotspots.

The run time values listed in Table 4 represent the time between the connection initiation of the application to the Canoe manager and the end of the second iteration of the pipeline's last block. They are therefore dependent on the startup phase of the network. The iteration column, however, describes the iteration length in a settled network, where the throughput of a pipeline is always limited by its slowest element. From Table 3 we learn that most of the computation time is spent in the EQ kernel; thus, the difference in iteration length only depends on the placement of these kernels on big or little cores. Unfortunately, this means that the speed-up of 450 obtained when using the FFT accelerator core cannot be fully exploited in this benchmark.
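The link between Tables 3 and 4 can be checked in a few lines: in a settled pipeline, the iteration length is the time of the slowest stage, which here is the EQ kernel.

# Kernel times from Table 3 (µs); per-transfer overheads are negligible.
stages_big    = {"FFT": 4629, "EQ": 15040, "Disp": 1}
stages_little = {"FFT": 7744, "EQ": 25048, "Disp": 2}

def iteration_us(stages):
    # A settled pipeline's throughput is limited by its slowest stage.
    return max(stages.values())

print(iteration_us(stages_big))      # 15040 -- Table 4 reports 15,254
print(iteration_us(stages_little))   # 25048 -- Table 4 reports 25,313
print(4629 / 10)                     # ~460x accelerator speedup for the FFT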
8 DISCUSSION

The previous sections described the cfaed Orchestration hardware/software stack, consisting of a tiled heterogeneous hardware platform with a network-on-chip, a capability-based operating system, as well as a runtime system and a compiler for dataflow applications. As mentioned in the introduction, we develop our stack as a research platform to prepare for what we think will be the third inflection point in computing, i.e., a shift towards "wildly heterogeneous" systems composed of fundamentally new hardware technologies. Since it is currently still unclear which transistor technology could replace CMOS and sustain the performance growth of the past [1], [2], [86], in the medium term the main increase in power and efficiency of future electronics will be driven mainly by two mutually stimulating technological directions: (i) processing: further parallelization and specialization, including application-specific, reconfigurable, neuromorphic, and analog circuits, leading to more heterogeneous hardware; and (ii) data: non-volatile and potentially heterogeneous memories, their tight (3D) integration with logic, integrated sensing, and novel interconnects (e.g., photonics or wireless).

In this regard, partner projects within cfaed are developing reconfigurable transistors based on silicon nanowires (SiNW) [87], [88] and carbon nanotubes (CNT) [89], [90], as well as radio frequency electronics based on CNTs [91], plasmonic waveguides for nanoscale optical communication [92], and sensors based on SiNWs [93].

Wildly heterogeneous systems resulting from the two mentioned directions will pose both challenges and opportunities to the hardware/software stack. We believe that the concepts discussed in this paper are key principles for handling this kind of heterogeneity. Still, several open questions remain. We discuss some of them in the following paragraphs.

Hardware architecture: Given a tight integration of heterogeneous PEs, memories, and sensors, potentially based on emerging technologies (e.g., SiNW, CNT, and novel memories), a tile-based approach with a NoC and DTUs as uniform hardware interfaces makes it easy to combine and interface these units. However, the limits of the DTU concept with respect to a potential disruption of the memory hierarchy are currently unclear.


Memory: Novel non-volatile memories, such as spin transfer torque magnetic RAM (STT-MRAM) and resistive RAM (RRAM) [2], will open a new space of optimizations for many applications. We have seen how new memory technologies can have a positive impact on both performance and energy efficiency [94]. Hybrid memory architectures are also being proposed that combine multiple different technologies [95], [96]. Given such systems, one could place data items, depending on their usage patterns, in the most efficient type of memory. Some of the novel memory technologies are more amenable to a tight integration with logic, opening opportunities for near- and in-memory computing [97]. Given a high-level specification of the application, compilers together with runtimes and flexible memory controllers can automate the process of data mapping (and near-memory operations). Additionally, all these new memory architectures will certainly impact the system software (OSs, paging, security, and others). The concrete impact is still unclear, though.
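As a toy illustration of such usage-pattern-driven placement, the sketch below assigns each data item to the memory type that minimizes its access energy. The technology labels and per-access costs are hypothetical placeholders; only the decision structure is the point.

# Hypothetical per-access energy costs (arbitrary units) per technology.
COST = {"DRAM":     {"read": 1.0, "write": 1.0},
        "STT-MRAM": {"read": 0.6, "write": 2.0},   # cheap reads, costly writes
        "RRAM":     {"read": 0.5, "write": 3.0}}

def place(reads, writes):
    # Pick the memory that minimizes total access energy for this item.
    return min(COST, key=lambda m: reads * COST[m]["read"] +
                                   writes * COST[m]["write"])

print(place(reads=10_000, writes=50))     # read-mostly item -> RRAM
print(place(reads=1_000, writes=5_000))   # write-heavy item -> DRAM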
Programming: It is questionable whether today's standards like OpenCL and OpenMP are able to address wild heterogeneity and non-von-Neumann paradigms (e.g., neuromorphic, quantum, reconfigurable, or analog) beyond manual offloading. Alternative models that provide a higher level of abstraction, like dataflow programming and domain-specific languages, could enable automatic utilization of heterogeneous resources. Therefore, we are working on abstractions for computational fluid dynamics [20], [22] and particle-based simulations [21].

Co-design: Given a huge design space, tools are needed that help co-design from the specific application problem down to the hardware architecture and the materials. For example, image processing implemented with coupled oscillators provides higher efficiency compared to an application-specific CMOS circuit [98]. As presented in Section 7, we are using system-level simulation and probabilistic model checking to evaluate design alternatives.
9 RELATED WORK

This section focuses on large initiatives that cover the hardware/software stack in a way comparable to our holistic approach. Works relevant to the individual layers are discussed throughout Sections 3–6.

The Aspire Lab at UC Berkeley emphasizes two areas: hardware-specific optimizations hidden by pattern-specific languages [99] and agile hardware development based on the hardware construction language Chisel [100]. Chisel has been applied to create cycle-accurate simulators and to tape out various research processors. OmpSs [101], [102] is a task-based programming model developed at the Barcelona Supercomputing Center that extends and influences the OpenMP standard. It has been applied to run applications on clusters, multicores, accelerators, and FPGAs. The Teraflux project [103] investigates OmpSs-based dataflow programming on a simulated massively parallel multicore architecture. The German collaborative research center InvasIC [32] introduces invasive computing as a new programming paradigm for heterogeneous tiled architectures. They use an FPGA implementation of their MPSoC architecture as well as simulation to demonstrate their software stack. In contrast to our automated approach, invasive computing is resource-aware, i.e., computing resources have to be allocated manually. Another project working on heterogeneous tiled architectures is the Euretile EU project [33]. They developed a complete stack consisting of a programming paradigm based on process networks, a runtime supporting fault management, a lightweight OS, and a hardware platform with a custom torus network. While they apply a dataflow approach similar to ours, we additionally investigate run-time adaptivity.

The above-mentioned projects work with various heterogeneous CMOS-based hardware, either off-the-shelf or custom designs. However, none of these projects considers systems with post-CMOS hardware. The IEEE Rebooting Computing Initiative [3], founded in 2012, proposes to rethink the concept of computing across the whole hardware/software stack and acts as a scientific platform for research on the CMOS scaling challenge and emerging post-CMOS technologies. Besides such important stimulating initiatives, few projects are working on a concrete hardware/software stack based on emerging technologies. The six US STARnet centers mainly focus on developing new materials, technologies, and hardware architectures, but less on the software stack [4]. N3XT, for example, is an approach within STARnet towards a memory-centric architecture based on new technologies including CNTFET, RRAM, and STT-MRAM [16], targeting big data applications. Another novel memory-centric architecture, called "The Machine", is developed by HPE. It serves as a pre-commercial testbed for large shared non-volatile memories accessed via photonic interconnects, with a big-data-centric software stack [104]. In a joint division, CEA-Leti/List [105] develops hardware based on emerging technologies as well as software for future embedded systems, but without a common integrated hardware/software stack.

10 CONCLUSION

In this paper we presented the design of a programming stack for heterogeneous systems and evaluated a prototype of it on a tiled CMOS heterogeneous multicore. We expect the layer-wise mechanisms and the interfaces among the layers to be applicable to future systems that integrate post-CMOS components. More precisely, the paper described (i) a simple hardware-level interface via message passing that will make it possible to integrate new, exotic components; (ii) an OS architecture that leverages hardware support for message passing to implement isolation and makes OS services available remotely to tiles that may not have a conventional ISA; (iii) an application runtime that reserves resources from the OS and deploys and transforms a variant accordingly; and (iv) an automatic approach for dataflow applications that uses abstract cost models of the architecture to produce near-optimal application mappings to resources. Furthermore, we developed novel formal methods to provide the kind of quantitative analysis required to reason about trade-offs and system interfaces in such a heterogeneous setup. The evaluation served to (i) demonstrate the potential of the formal methods for a real multicore platform, (ii) measure the overhead of the software layers on a virtual prototype, and (iii) show the validity of our claims (isolation, performance-preserving transformation, and run-time adaptability).

In the future we will extend the simulator to integrate models of components built with post-CMOS technologies, allowing us to test the programming stack on wildly heterogeneous systems. We believe that the simulator and the programming stack will allow technologists and system architects to explore system-level aspects of new devices. This requires more modeling formalisms for both hardware (i.e., emerging technologies) and application models (beyond dataflow models), at different levels (e.g., for automatic compilation or formal analysis). This will eventually contribute to narrowing down and structuring the current landscape of alternative technologies, both within and outside cfaed. We are convinced that more such early initiatives are needed to prepare systems that can cope, to some extent, with yet unknown technologies.


ACKNOWLEDGEMENT

This work is supported by the German research foundation (DFG) and the Free State of Saxony through the cluster of excellence "Center for Advancing Electronics Dresden" (cfaed).

REFERENCES

[1] T. N. Theis and H. S. P. Wong, "The end of Moore's Law: A new beginning for information technology," Computing in Science Engineering, vol. 19, no. 2, pp. 41–50, 2017.
[2] "More Moore," White Paper, IEEE IRDS, 2016.
[3] T. M. Conte et al., "Rebooting computing: The road ahead," Computer, vol. 50, no. 1, pp. 20–29, 2017.
[4] "STARnet Research," Semiconductor Research Corporation. [Online]. Available: https://www.src.org/program/starnet/
[5] D. Geer, "Chip makers turn to multicore processors," Computer, vol. 38, no. 5, pp. 11–13, 2005.
[6] H. Esmaeilzadeh et al., "Dark silicon and the end of multicore scaling," in Symp. on Computer Architecture, 2011, pp. 365–376.
[7] J. D. Owens et al., "A survey of general-purpose computation on graphics hardware," Comput Graph Forum, vol. 26, pp. 80–113, 2007.
[8] R. Kumar et al., "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Symp. on Microarchitecture (MICRO-36), 2003, pp. 81–92.
[9] P. Greenhalgh, "big.LITTLE processing with ARM Cortex-A15 and Cortex-A7," ARM, Tech. Rep., 2011.
[10] E. Biscondi et al., "Maximizing multicore efficiency with navigator runtime," White Paper, Texas Instruments, 2012.
[11] M. Taylor, "A landscape of the new dark silicon design regime," IEEE Micro, vol. 33, no. 5, pp. 8–19, 2013.
[12] M. Ferdman et al., "Toward dark silicon in servers," IEEE Micro, vol. 31, pp. 6–15, 2011.
[13] P. Kogge and J. Shalf, "Exascale computing trends: Adjusting to the new normal for computer architecture," Computing in Science Engineering, vol. 15, no. 6, pp. 16–26, 2013.
[14] S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, vol. 25, no. 6, pp. 10–16, 2005.
[15] L. Ceze et al., "Arch2030: A vision of computer architecture research over the next 15 years," Arch2030 Workshop, 2016.
[16] M. M. Sabry Aly et al., "Energy-efficient abundant-data computing: The N3XT 1,000x," Computer, vol. 48, no. 12, pp. 24–33, 2015.
[17] A. Hemani et al., "Network on chip: An architecture for billion transistor era," in IEEE NorChip Conference, 2000.
[18] S. Gorlatch and M. Cole, "Parallel skeletons," in Encyclopedia of Parallel Computing. Springer, 2011, pp. 1417–1422.
[19] S. Karol, "Well-formed and scalable invasive software composition," Ph.D. dissertation, Technische Universität Dresden, 2015.
[20] J. Mey et al., "Using semantics-aware composition and weaving for multi-variant progressive parallelization," Procedia Computer Science, vol. 80, pp. 1554–1565, 2016.
[21] S. Karol et al., "A domain-specific language and editor for parallel particle methods," ACM T Math Software, 2017, to appear.
[22] A. Susungi et al., "Towards compositional and generative tensor optimizations," in Conf. Generative Programming: Concepts & Experience (GPCE'17), 2017, pp. 169–175.
[23] T. Karnagel et al., "Adaptive work placement for query processing on heterogeneous computing resources," in Conf. on Very Large Data Bases (VLDB'17), 2017, pp. 733–744.
[24] T. Limberg et al., "A heterogeneous MPSoC with hardware supported dynamic task scheduling for software defined radio," in Design Automation Conference (DAC'09), 2009.
[25] B. Noethen et al., "A 105GOPS 36mm2 heterogeneous SDR MPSoC with energy-aware dynamic scheduling and iterative detection-decoding for 4G in 65nm CMOS," in Solid-State Circuits Conf. (ISSCC), 2014, pp. 188–189.
[26] S. Haas et al., "An MPSoC for energy-efficient database query processing," in Design Automation Conference (DAC'16), 2016.
[27] S. Haas et al., "A heterogeneous SDR MPSoC in 28nm CMOS for low-latency wireless applications," in DAC'17, 2017.
[28] C. Baier et al., "Probabilistic model checking for energy-utility analysis," in Horizons of the Mind. A Tribute to Prakash Panangaden, 2014, pp. 96–123.
[29] C. Ramey, "TILE-Gx100 manycore processor: Acceleration interfaces and architecture," in Hot Chips Symp., 2011.
[30] B. D. de Dinechin et al., "A clustered manycore processor architecture for embedded and accelerated applications," in High Performance Extreme Computing Conf. (HPEC), 2013, pp. 1–6.
[31] B. Bohnenstiehl et al., "A 5.8 pJ/Op 115 billion ops/sec, to 1.78 trillion ops/sec 32nm 1000-processor array," in Symp. on VLSI Technology and Circuits, 2016.
[32] J. Teich, Ed., "Invasive computing," (special issue) it – Information Technology, vol. 58, no. 6, 2014.
[33] P. S. Paolucci et al., "Dynamic many-process applications on many-tile embedded systems and HPC clusters: The EURETILE programming environment and execution platforms," Journal of Systems Architecture, vol. 69, pp. 29–53, 2015.
[34] O. Arnold et al., "Tomahawk: Parallelism and heterogeneity in communications signal processing MPSoCs," Trans Embedded Computer Systems, vol. 13, no. 3s, pp. 107:1–107:24, 2014.
[35] N. Asmussen et al., "M3: A hardware/operating-system co-design to tame heterogeneous manycores," in Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS'16), 2016, pp. 189–203.
[36] A. Baumann et al., "The multikernel: A new OS architecture for scalable multicore systems," in Symp. on Operating Systems Principles (SOSP '09), 2009, pp. 29–44.
[37] F. X. Lin et al., "K2: A mobile operating system for heterogeneous coherence domains," in ASPLOS'14, 2014, pp. 285–300.
[38] A. Barbalace et al., "Popcorn: Bridging the programmability gap in heterogeneous-ISA platforms," in Conf. on Computer Systems (EuroSys '15), 2015, pp. 29:1–29:16.
[39] E. B. Nightingale et al., "Helios: Heterogeneous multiprocessing with satellite kernels," in Symp. on Operating Systems Principles (SOSP '09), 2009, pp. 221–234.
[40] P. B. Hansen, "The nucleus of a multiprogramming system," Communications of the ACM, vol. 13, no. 4, pp. 238–241, 1970.
[41] J. Liedtke, "On µ-kernel construction," in Symp. on Operating System Principles (SOSP), 1995.
[42] D. B. Golub et al., "Microkernel operating system architecture and Mach," in USENIX Workshop on Microkernels and Other Kernel Architectures, 1992, pp. 11–30.
[43] E. A. Lee, "The problem with threads," Computer, vol. 39, no. 5, pp. 33–42, 2006.
[44] J. Castrillon, "Programming heterogeneous MPSoCs: tool flows to close the software productivity gap," Ph.D. dissertation, RWTH Aachen University, Apr. 2013.
[45] G. Kahn, "The semantics of a simple language for parallel programming," in Information Processing '74, 1974, pp. 471–475.
[46] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," Proc. IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[47] C. Ptolemaeus, Ed., System Design, Modeling, and Simulation using Ptolemy II. Ptolemy.org, 2014.
[48] L. Thiele et al., "Mapping applications to tiled multiprocessor embedded systems," in Conf. Application of Concurrency to System Design (ACSD'07), 2007, pp. 29–40.


[49] R. Jordans et al., "An automated flow to map throughput constrained applications to a MPSoC," in Bringing Theory to Practice: Predictability and Performance in Embedded Systems, 2011, pp. 47–58.
[50] J. Castrillon et al., "Trends in embedded software synthesis," in Conf. Embedded Computer Systems: Architecture, Modeling and Simulation (SAMOS'11), 2011, pp. 347–354.
[51] R. Leupers et al., "MAPS: A software development environment for embedded multicore applications," in Handbook of Hardware/Software Codesign, S. Ha and J. Teich, Eds., 2017.
[52] J. F. Eusse et al., "CoEx: A novel profiling-based algorithm/architecture co-exploration for ASIP design," ACM T Reconfigurable Technol Syst, vol. 8, no. 3, 2014.
[53] J. Castrillon et al., "Component-based waveform development: the nucleus tool flow for efficient and portable software defined radio," Analog Integr Circ Sig Proc, vol. 69, no. 2, pp. 173–190, 2011.
[54] C. Menard et al., "High-level NoC model for MPSoC compilers," in Nordic Circuits and Systems Conf. (NORCAS'16), 2016, pp. 1–6.
[55] J. Castrillon et al., "Communication-aware mapping of KPN applications onto heterogeneous MPSoCs," in DAC'12, 2012.
[56] J. Castrillon et al., "MAPS: Mapping concurrent dataflow applications to heterogeneous MPSoCs," T Ind Inform, vol. 9, no. 1, pp. 527–545, 2013.
[57] A. Goens et al., "Symmetry in software synthesis," ACM T Archit Code Op (TACO), vol. 14, no. 2, pp. 20:1–20:26, 2017.
[58] A. Goens et al., "Tetris: a multi-application run-time system for predictable execution of static mappings," in Workshop on Software and Compilers for Embedded Systems, 2017, pp. 11–20.
[59] J. Sifakis, "System design automation: Challenges and limitations," Proc. IEEE, vol. 103, no. 11, pp. 2093–2103, 2015.
[60] E. M. Clarke et al., "Model checking: Back and forth between hardware and software," in Conf. on Verified Software: Theories, Tools, Experiments (VSTTE), 2005, pp. 251–255.
[61] E. M. Clarke and Q. Wang, "25 years of model checking," in Conf. on Perspectives of System Informatics (PSI 2014), 2014, pp. 26–40.
[62] O. Grumberg and H. Veith, Eds., 25 Years of Model Checking – History, Achievements, Perspectives. Springer, 2008.
[63] C. Baier et al., "Towards automated variant selection for heterogeneous tiled architectures," in Models, Algorithms, Logics and Tools, 2017, pp. 382–399.
[64] P. Chrszon et al., "ProFeat: Feature-oriented engineering for family-based probabilistic model checking," Formal Aspects of Computing, pp. 1–31, 2017.
[65] C. Baier et al., "Computing conditional probabilities in Markovian models efficiently," in Conf. Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2014, pp. 515–530.
[66] C. Baier et al., "Maximizing the conditional expected reward for reaching the goal," in Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2017, pp. 269–285.
[67] S. Märcker et al., "Computing conditional probabilities: Implementation and evaluation," in Conf. on Software Engineering and Formal Methods (SEFM 2017), 2017, pp. 349–366.
[68] M. Ummels and C. Baier, "Computing quantiles in Markov reward models," in Conf. on Foundations of Software Science and Computation Structures (FOSSACS). Springer, 2013, pp. 353–368.
[69] C. Baier et al., "Energy-utility quantiles," in NASA Formal Methods Symp. (NFM), 2014, pp. 285–299.
[70] D. Krähmann et al., "Ratio and weight quantiles," in Symp. Mathematical Foundations of Computer Science (MFCS), 2015, pp. 344–356.
[71] C. Baier et al., "Probabilistic model checking and non-standard multi-objective reasoning," in Conf. on Fundamental Approaches to Software Engineering (FASE), 2014, pp. 1–16.
[72] K. Etessami et al., "Multi-objective model checking of Markov decision processes," Log Meth Comput Sci, vol. 4, no. 4, 2008.
[73] C. Baier et al., "Partial order reduction for probabilistic systems," in Conf. on Quantitative Evaluation of SysTems (QEST), 2004, pp. 230–239.
[74] P. R. D'Argenio and N. Peter, "Partial order reduction on concurrent probabilistic programs," in Conf. on Quantitative Evaluation of SysTems (QEST), 2004, pp. 240–249.
[75] M. Größer et al., "On reduction criteria for probabilistic reward models," in Conf. on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 2006, pp. 309–320.
[76] M. Kwiatkowska et al., "Symmetry reduction for probabilistic model checking," in Conf. on Computer Aided Verification (CAV 2006), 2006, pp. 234–248.
[77] A. F. Donaldson and A. Miller, "Symmetry reduction for probabilistic model checking using generic representatives," in Symp. on Automated Technology for Verification and Analysis (ATVA 2006), 2006, pp. 9–23.
[78] C. Baier et al., "Symbolic model checking for probabilistic processes," in Colloq. on Automata, Languages and Programming (ICALP'97), 1997, pp. 430–440.
[79] M. Z. Kwiatkowska et al., "Probabilistic symbolic model checking with PRISM: A hybrid approach," Int J Softw Tools for Technol Transfer, vol. 6, pp. 128–142, 2004.
[80] H. Hermanns et al., "On the use of MTBDDs for performability analysis and verification of stochastic systems," J Logic and Algebraic Programming, vol. 56, pp. 23–67, 2003.
[81] J. Klein et al., "Advances in probabilistic model checking with PRISM: variable reordering, quantiles and weak deterministic Büchi automata," J Softw Tools for Technol Transfer, pp. 1–16, 2017.
[82] M. Z. Kwiatkowska et al., "PRISM 4.0: Verification of probabilistic real-time systems," in Conf. Computer Aided Verification (CAV), 2011, pp. 585–591.
[83] N. Binkert et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[84] C. Menard et al., "System simulation with gem5 and SystemC: The keystone for full interoperability," in Conf. Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS'17), 2017.
[85] "SLXMapper," Silexica GmbH, 2016. [Online]. Available: http://www.silexica.com
[86] J. M. Shalf and R. Leland, "Computing beyond Moore's Law," Computer, vol. 48, no. 12, pp. 14–23, 2015.
[87] W. M. Weber et al., "Reconfigurable nanowire electronics – a review," Solid-State Electronics, vol. 102, pp. 12–24, 2014.
[88] A. Heinzig et al., "Dually active silicon nanowire transistors and circuits with equal electron and hole transport," Nano Letters, vol. 13, no. 9, pp. 4176–4181, 2013.
[89] M. Haferlach et al., "Electrical characterization of emerging transistor technologies: Issues and challenges," IEEE Trans on Nanotechnology, vol. 15, no. 4, pp. 619–626, 2016.
[90] S. Mothes et al., "Toward linearity in Schottky barrier CNTFETs," IEEE Trans on Nanotechnology, vol. 14, no. 2, pp. 372–378, 2015.
[91] M. Schröter et al., "Carbon nanotube FET technology for radio-frequency electronics: State-of-the-art overview," IEEE J Electron Devices Soc, vol. 1, no. 1, pp. 9–20, 2013.
[92] F. N. Gür et al., "Toward self-assembled plasmonic devices: High-yield arrangement of gold nanoparticles on DNA origami templates," ACS Nano, vol. 10, no. 5, pp. 5374–5382, 2016.
[93] D. Karnaushenko et al., "Light weight and flexible high-performance diagnostic platform," Advanced Healthcare Materials, vol. 4, no. 10, pp. 1517–1525, 2015.
[94] F. Hameed et al., "Efficient STT-RAM last-level-cache architecture to replace DRAM cache," in Symp. Memory Systems (MemSys'17), 2017, pp. 141–151.
[95] J. C. Mogul et al., "Operating system support for NVM+DRAM hybrid main memory," in Hot Topics in Operating Systems, 2009.
[96] S. Onsori et al., "An energy-efficient heterogeneous memory architecture for future dark silicon embedded chip-multiprocessors," IEEE Trans Emerging Topics in Computing, 2016.
[97] S. Khoram et al., "Challenges and opportunities: From near-memory computing to in-memory computing," in Symp. Physical Design (ISPD'17), 2017, pp. 43–46.
[98] N. Shukla et al., "Pairwise coupled hybrid vanadium dioxide-MOSFET (HVFET) oscillators for non-boolean associative computing," in IEEE Int. Electron Devices Meeting, 2014.
[99] B. Catanzaro et al., "SEJITS: Getting productivity and performance with selective embedded JIT specialization," University of California, Berkeley, Tech. Rep. UCB/EECS-2010-23, 2010.
[100] Y. Lee et al., "An agile approach to building RISC-V microprocessors," IEEE Micro, vol. 36, no. 2, pp. 8–20, 2016.
[101] J. Bueno et al., "Productive programming of GPU clusters with OmpSs," in Parallel Distr. Processing Symp., 2012, pp. 557–568.
[102] A. Filgueras et al., "OmpSs@Zynq all-programmable SoC ecosystem," in Symp. Field-programmable Gate Arrays, 2014, pp. 137–146.
[103] R. Giorgi et al., "Teraflux: Harnessing dataflow in next generation teradevices," Microprocessors and Microsystems, vol. 38, no. 8, 2014.
[104] K. Keeton, "Memory-driven computing," in Conf. on File and Storage Technologies (FAST'17), 2017.
[105] M. Belleville et al., Eds., Architecture and IC Design, Embedded Software Annual Research Report 2015. CEA Leti, 2016.
