Abstract—Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making them ever more difficult to program. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue that these principles, built into the stack and the interfaces among its layers, will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.
Index Terms—cfaed, orchestration, post-CMOS, heterogeneous systems, programming stack, hardware/software abstractions, emerging technologies, cross-layer design
2332-7766 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMSCS.2017.2771750, IEEE
Transactions on Multi-Scale Computing Systems
Fig. 2: Chip photos of Tomahawk generations. (a) Tomahawk1 [24] (UMC 130 nm). (b) Tomahawk2 [25] (TSMC 65 nm). (c)
Tomahawk3 [26] (GF 28 nm). (d) Tomahawk4 [27] (GF 28 nm).
design process to quantitatively analyze low-level resource management protocols for stochastically modeled classes of applications and systems [28]. The model-based formal quantitative analysis, carried out using probabilistic model checking techniques, is particularly useful for the comparative evaluation of (existing and future) design alternatives, to compare the performance of heuristic approaches for system management with theoretical optimal solutions, or to determine optimal system parameter settings.

3 THE TOMAHAWK ARCHITECTURE
As mentioned in the stack overview in Section 2, we consider two key aspects at the hardware level to handle heterogeneity. First, we think tile-based systems are promising for easy integration of heterogeneous components at the physical level (see further discussion in Section 8). This architectural paradigm has already proven effective for handling design complexity and scalability up to 1000 processing elements (PEs) [29], [30], [31] and for heterogeneity [32], [33]. Second, and more importantly, the hardware should include efficient mechanisms to provide uniform access to specialized units via the NoC. Hardware support for message passing and capabilities is key for system performance, as will be discussed in Section 4. In this section we briefly describe actual silicon prototypes of tiled systems in Section 3.1, followed by a detailed description of the unified access control mechanisms in Section 3.2.

3.1 Tomahawk Overview
Tomahawk is a family of tiled heterogeneous multiprocessor platforms with a NoC-based interconnect [34]. A total of four generations of the Tomahawk architecture have been designed, which are displayed in Figure 2. A Tomahawk platform consists of multiple heterogeneous tiles, peripherals, an FPGA interface, and off-chip memory. All tiles contain interfacing logic to the associated NoC router. Apart from the required NoC interface, tiles may include different kinds of processing elements (standard or specialized instruction-set architectures (ISAs) or hardwired accelerators).

Most of the tiles found in the first four Tomahawk generations contain one or more processing elements and a local scratchpad memory. In particular, Tomahawk4 [27] includes in total six tiles which are connected by a hexagonal NoC: four data processing modules (DPM), one baseband processing module (BBPM), and one application control module (ACM), as depicted in Figure 3. The DPMs comprise a Tensilica Xtensa LX5 and an ARM Cortex-M4F processor as well as 128 kB of local memory and a data transfer unit (DTU). The Xtensa is built upon an ASIP approach and accelerates basic database operators by instruction set extensions (similar to [26]). Both cores share the same local memory, while only one core is active at a time. The DTU is responsible for data transfers between the tiles and provides isolation capabilities (more details in Section 3.2). The BBPM is tailored to signal processing applications and integrates an Xtensa LX5 as well as application-specific integrated circuits (Sphere Detector (SD), Turbo Decoder (TD)) with separate memories. The Tensilica 570T CPU, included in the ACM, is a general-purpose core with a full MMU and 32 kB cache, which is used as host processor. The FPGA interface allows chip-to-chip interconnections and the LPDDR2 interface provides access to a 128 MB SDRAM.

3.2 Uniform Access and Control
Heterogeneity allows, on the one hand, increasing energy efficiency due to specialization. On the other hand, it creates significant challenges for hardware and software integration. A novel hardware mechanism is needed that provides unified access to and protection of system resources, but does not rely on specific features of the processing elements (e.g., user/kernel mode). This is similar to what a memory management unit (MMU) accomplishes in a traditional system. In the presence of heterogeneous tiles, this mechanism has to be provided at the tile-system interface. In this way, one can ensure that future heterogeneous tiles can be embedded safely without tampering with the state of the system. The unified access must be simple enough to not incur performance penalties on running applications.

In Tomahawk, unified access control and isolation is provided by a Data Transfer Unit (DTU) [35] in every tile. This peripheral represents a single access point to tile-external resources, which yields a mechanism to provide isolation at the NoC level. The DTU can be configured, e.g., by the OS as discussed in Section 4, so that a functionality running on a tile can communicate via message passing with a clearly defined set of tiles. Message passing is the mechanism used to implement remote system calls, which allows tiles not capable of running an OS to access system services.
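The access-control idea can be sketched in plain C: an endpoint table that only a privileged configuration step may write, and an unprivileged send that succeeds only through an endpoint the OS has wired up. The structs and function names below are illustrative assumptions for exposition, not the actual Tomahawk register interface.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ENDPOINTS 8

/* Hypothetical model of one DTU endpoint: either it permits
 * sending to one fixed destination tile, or it is unused. */
typedef struct {
    bool configured;
    int  dest_tile;   /* the only tile reachable via this endpoint */
} endpoint_t;

typedef struct {
    endpoint_t ep[MAX_ENDPOINTS];
} dtu_t;

/* Privileged operation: in this sketch, only the OS may call it. */
static void dtu_configure(dtu_t *dtu, int ep_id, int dest_tile) {
    dtu->ep[ep_id].configured = true;
    dtu->ep[ep_id].dest_tile  = dest_tile;
}

/* Unprivileged send: succeeds only through a configured endpoint,
 * so a tile can talk only to the tiles the OS wired it to. */
static bool dtu_send(const dtu_t *dtu, int ep_id, int dest_tile) {
    if (ep_id < 0 || ep_id >= MAX_ENDPOINTS)
        return false;
    const endpoint_t *e = &dtu->ep[ep_id];
    return e->configured && e->dest_tile == dest_tile;
}
```

In this toy model, a tile that was never given an endpoint to a peripheral simply has no way to address it, which mirrors the NoC-level isolation described above.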
Fig. 3: Block diagram of the Tomahawk4 architecture.

For memory access and message passing, the DTU provides a set of endpoints. Each of these endpoints can be configured to: (i) access a contiguous, byte-granular, tile-external range of memory, (ii) receive messages into a ring buffer in tile-internal memory, or (iii) send messages to a receiving endpoint. Once their DTU is configured, tiles can communicate and access external memory. Given the distributed memory paradigm of the hardware, memory coherency can be implemented through message passing as a layer on top of the DTU. Furthermore, for dataflow applications the DTU can be used to directly implement channels and thereby lower the communication overhead on the CPU. The DTU implementation in the Tomahawk4 occupies an area of 0.063 mm², which is about 6% of a DPM tile.

4 THE M3 OS
In a heterogeneous setting, with general-purpose and application-specific tiles like the Tomahawk, we cannot expect all units to have the ISA and other architectural properties required to run full-fledged OSs. Therefore, we designed a novel OS architecture called M3 [35], where the kernel runs on dedicated tiles and remotely controls the applications running on the remaining tiles. The M3 kernel is not entered via interrupt or exception like traditional kernels. Instead, it interacts with the applications via messages. In spite of this, we still call it "kernel" since, as with traditional kernels, its main responsibility is to decide whether an operation is allowed or not. Message passing is supported and isolation is ensured by the DTU described in Section 3.2. In this model, the task of the OS then becomes to set up the processing elements and their interaction structures using the endpoints of the DTUs. This design removes all requirements on features of heterogeneous processing elements, with the sole exception that they can copy data to a certain address to initiate a message or memory transfer to another tile. We explain how M3 ensures isolation in Section 4.1, give details of the software architecture in Section 4.2, and close with a discussion of the limitations of the approach in Section 4.3.

4.1 M3 Overview
As mentioned earlier, we use NoC-level isolation to provide isolation even for tiles that are not able to run an OS kernel. It is based on the DTU as the common interface of all tiles and the differentiation between privileged and unprivileged tiles. The key for isolation is that only privileged tiles can configure the DTU's endpoints to, e.g., create communication channels for message passing. At boot, all tiles are privileged. Since the M3 kernel deprivileges the application tiles, only the kernel tiles can configure endpoints. Once an endpoint has been configured, applications can use it directly for memory access or communication, without involving any OS kernel. Who is allowed to communicate with whom is controlled via capabilities, as discussed in Section 4.2.

There are also other approaches that target heterogeneous systems, such as Barrelfish [36], K2 [37], and Popcorn Linux [38]. However, they assume that all cores can execute a traditional kernel, requiring user-/kernel-mode separation and MMUs. Approaches using software-based isolation, such as Helios [39], do not require an MMU, but are restricted to units that execute software and limited to a certain set of programming languages. Our DTU-based approach allows M3 to control any kind of tile, e.g., containing a general-purpose core, a hardware accelerator, or a reconfigurable circuit.

We illustrate how the communication via the DTU works using a system call as an example (see Figure 4). On traditional systems, system calls are issued by switching from user mode into kernel mode. On M3, system calls are requested by sending a message via the DTU to the corresponding kernel tile. First, the application prepares the message in the local memory (marked in orange in the figure) of Tile 1. The application then instructs the DTU to use the preconfigured send endpoint S in Tile 1 to send the message to the receive endpoint R at the kernel's DTU in Tile 2. In the third step, the DTU streams the message over the NoC to the receiving tile into a ring buffer of Tile 2, pointed to by the receive endpoint. The DTU wakes up the kernel, which in turn will check for new messages and, in the final fifth step, process the message in its local memory.
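The ring-buffer delivery in this walkthrough can be sketched in C. The types and names below are invented for illustration and do not reflect the real DTU interface; the point is that the sender's DTU appends into the kernel tile's buffer (steps 2-3) and the woken kernel later drains it in FIFO order (steps 4-5).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4
#define MSG_BYTES  32

/* Hypothetical receive endpoint: a ring buffer in the kernel
 * tile's local memory that the sending DTU writes into. */
typedef struct {
    char   buf[RING_SLOTS][MSG_BYTES];
    size_t rd;      /* next slot the kernel reads */
    size_t count;   /* messages currently buffered */
} recv_ep_t;

/* Steps 2-3: the sender's DTU streams a prepared message
 * over the NoC into the receiver's ring buffer. */
static int dtu_deliver(recv_ep_t *r, const char *msg) {
    if (r->count == RING_SLOTS)
        return -1;                       /* ring buffer full */
    size_t wr = (r->rd + r->count) % RING_SLOTS;
    strncpy(r->buf[wr], msg, MSG_BYTES - 1);
    r->buf[wr][MSG_BYTES - 1] = '\0';
    r->count++;
    return 0;
}

/* Steps 4-5: the woken kernel checks for and consumes a message. */
static const char *kernel_fetch(recv_ep_t *r) {
    if (r->count == 0)
        return NULL;                     /* nothing pending */
    const char *m = r->buf[r->rd];
    r->rd = (r->rd + 1) % RING_SLOTS;
    r->count--;
    return m;
}
```

Because the buffer bounds and the slot size are fixed when the endpoint is configured, the receiver never has to trust the sender to stay within limits — the analogue of the DTU enforcing, e.g., a maximum message size.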
4.2 Software Architecture
M3 follows the microkernel-based approach [40], [41], [42], where OS functionality is divided into a small kernel and several services running on top of the kernel. The reason for this design is not specifically enabling heterogeneity, but its security and reliability advantages. In such microkernel-based systems, the kernel enforces separation of processes (applications and services) and provides some forms of communication between them. To decide which processes are allowed to interact with each other, the kernel uses capabilities and enforces rules on how the capabilities can be created, passed on, and otherwise manipulated by service and application processes. A capability is thereby a pair of a reference to a kernel object and permissions for the object. In classical microkernel OSs, for example L4-based systems [41], the enforcement of separation and capabilities relies on memory management and user/kernel mode. Message passing between processes, for example between an application and a file service, requires a call to the microkernel. In contrast, in DTU/M3-based systems, separation and capability enforcement relies on the DTUs. Once a communication channel is set up, i.e., the involved send and receive capabilities have been mapped to endpoints in the sending and receiving DTUs, exchanging messages no longer needs kernel interaction. Thus, the DTU is responsible for enforcing the configured constraints (e.g., a maximum message size).

On top of capabilities, M3 adds POSIX-like abstractions for, e.g., files and pipes. They free the programmer from directly dealing with capabilities and also provide support for legacy applications. However, if desired, applications can also use capabilities directly. One example is Canoe, which allows running dataflow applications on M3, as will be explained in more detail in Section 5.4. M3's concepts allow Canoe to run multiple, mutually distrusting dataflow applications simultaneously in an isolated fashion.

4.3 Discussion
The strict separation of application tiles and kernel tiles has several benefits. Since resources are not shared (e.g., registers, caches, or translation lookaside buffers), registers do not need to be saved on system calls and the potential for cache misses after a system call is also reduced. With the right programming model, it is also easier to add tiles, potentially with appropriate specializations, to a running application (similar to invasive computing [32] or the mapping variants discussed in Section 5). Finally, the kernels will not compete with applications for specialized resources, as they do not necessarily benefit from the same hardware features (e.g., reconfigurable and application-specific circuits).

For more general systems, the DTU/M3 concept requires several extensions. For example, for multiple applications to share a tile, we have an extension to the M3 kernel that allows multiplexing remote tiles. We have also designed an extension to support complex cores with fully integrated caches and virtual memory (such as modern x86 cores) and to transparently add caches and virtual memory to arbitrary units such as accelerators. Finally, to scale to systems with a larger number of tiles, we have an ongoing project to allow multiple instances of the M3 kernel that synchronize their state. All these extensions are, however, out of the scope of this paper.

5 DATAFLOW PROGRAMMING AND RUNTIMES
Having introduced the Tomahawk architecture and the M3 OS, we now focus on application programming and runtime systems. Given the increased complexity of heterogeneous systems, we opt for an automatic approach instead of resource-aware programming, where the programmer is responsible for handling resource allocation and deallocation. An automatic approach builds on (i) hardware cost models to enable heterogeneity-aware optimizations, (ii) runtime systems that intelligently assign resources to applications depending on the system state, and (iii) parallel programming models with clear execution semantics that allow reasoning about parallel schedules at compile time. To this end we employ dataflow programming languages, which cover a broad range of applications while still being mostly analyzable. The analysis and information collected at compile time is then exploited at run time to deploy the application to the heterogeneous fabric in a near-optimal way. Additionally, dataflow programming represents a good match to the message-passing, non-coherent memory architecture of the system.

In this section we provide an overview of the dataflow programming flow in Section 5.1, discuss hardware models for automatic optimization in Section 5.2, explain the methodology to find multiple static mappings in Section 5.3, and describe how these mappings are deployed at run time by the Canoe runtime system in Section 5.4.

5.1 Compilation Overview
As mentioned above, we use a dataflow programming model to represent applications. In this model, an application is represented as a graph of communicating actors (similar to task graphs). This explicit parallel representation offers better analyzability than, for example, programming with threads (or abstractions thereof like OpenMP) [43]. The clear separation between communication, state, and computation makes it easier to change the implementation of actors without affecting the rest of the application. This, in turn, enables deploying actors to adequate heterogeneous resources in a safe and transparent way. From the perspective of the overall stack (recall Figure 1), dataflow can be seen as one of multiple intermediate abstractions. Higher-level models or DSLs map to one of these abstractions, empowering the compiler to reason about resource allocation in a similar way as with dataflow (e.g., ordering data accesses for tensor computations in CFD [22]).

An overview of the compilation flow is shown in Figure 5. Dataflow applications are written in the language "C for Process Networks" (CPN) [44]. CPN allows specifying Kahn Process Networks [45], a dynamic dataflow programming model with a higher expressivity than static models like synchronous dataflow [46]. Apart from the CPN code, the compiler receives application constraints, application meta-information, and an abstract model of the target architecture. Constraints specify, among others, the set of resources the application should run on or real-time constraints
(latency and throughput). The architecture model and the meta-information are further explained in Section 5.2. Due to the dynamic application model, we use profile-directed compilation, i.e., we collect execution information to feed the optimization process. The optimizer determines near-optimal mappings, i.e., an assignment of actors to processors and data to memories. This is done iteratively to find multiple mappings with different trade-offs (e.g., performance and resource utilization). Mapping variants differ from each other in the amount and type of resources they use. Which variant is actually deployed is decided at run time by the Canoe runtime system.

The generated variants (right-hand side of Figure 5) use target-specific C code with calls to the Canoe application programming interfaces (APIs) for resource allocation, task management, and communication. With this programming flow we aim at relieving the programmer from dealing with resource management herself. We believe that with higher heterogeneity and core counts, resource allocation, utilization, and resizing should be transparent and handled by the software stack.

There are many frameworks for analysis, mapping, and code generation for dataflow applications [47], [48], [49], [50], [51]. We focus on generating variants that exploit resource heterogeneity and on methods to allow these variants to be transformed by the runtime system depending on resource availability (see Section 5.3).

5.2 Compiler: Taming Heterogeneity
To support heterogeneity, an abstract model of the hardware platform is used. The model describes the topology of the hardware, including programmable PEs, hardware accelerators, interconnect, and memories. PEs are described by the instruction set, the number of resources (e.g., functional units, pipelines, and number of registers), and latency tables [52]. Accelerators are annotated with a list of design parameters (e.g., data representation or number of points of a fast Fourier transform). Latency, area, and other cost functions are defined as arbitrary functions of the parameters [53]. This high-level modeling of hardware resources can be applied in the future to components implemented with emerging technologies.

To be able to use accelerators, if available, meta-information is added. This information indicates to the compiler that a given actor is actually an instance of a known algorithmic kernel (e.g., a filter or a decoder) [53]. Together with the abstract platform model, the compiler can then automatically generate code and use accelerators from a functional specification of the application.

Communication is modeled with cost functions that indicate how many cycles are required to transmit a given amount of bytes. These cost models account for the position of source and sink tiles as well as the selected path through the hardware topology [54]. A cost model can be obtained by measuring actual communication costs on the target platform. For instance, the upper plot in Figure 6 shows a cost measurement for communicating data tokens on the Tomahawk via DRAM. Further, the plot illustrates the derivation of a cost function using piecewise linear regression.

These analytical cost models have two applications within the compiler. First, the optimizer bases its decisions on the cost models in order to find a near-optimal mapping. Second, a simulation module uses the cost models to simulate a complete execution of an application for a given variant. This allows the compiler to provide estimations of the application's performance. For instance, the middle plot in Figure 6 compares the estimated and measured costs for a simple test application². The plots show two variants, one that implements channels by placing data in the scratchpad memories of producing PEs, and one that places data in the scratchpad of consuming PEs.

To add better support for NoC-based platforms, we extended the analytical communication model by an annotation of shared resources (e.g., a link). During simulation, a process first has to acquire all resources before it can initiate a transfer. This allows for simulation of contention in the network. The bottom plot in Figure 6 shows performance estimation results for the same application as in the middle plot, but in a congested network. The network congestion was achieved by choosing a worst-case mapping that maximizes the interference of independent communications on the network links.

5.3 Mapping Variants and Symmetries
When generating code, we create different execution variants. These variants have different execution properties, different goals (e.g., energy efficiency or execution time), and different resource utilization. Concretely, a variant corresponds to a mapping of the application to a given set of platform resources, including tasks to PEs, communication to links, and data to memories. Based on the hardware models, each variant is annotated with the estimated performance, energy usage, and resource utilization. These annotated variants are given to the Canoe runtime system. It can then select and adapt the variant that best suits the [...]

2. The test application consists of an 8-stage pipeline transferring a total of 6,000 tokens.
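A piecewise linear communication cost function of the kind derived in Figure 6 can be sketched in C. The segment breakpoints and coefficients below are invented for illustration; a real model would be fitted by regression against measurements on the target platform.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical piecewise linear cost model: segment i applies to
 * token sizes up to limit bytes and charges base + slope * bytes
 * cycles. Segments are ordered by increasing limit. */
typedef struct {
    size_t limit;   /* segment applies while bytes <= limit */
    double base;    /* fixed cost in cycles */
    double slope;   /* cycles per byte */
} cost_seg_t;

static double comm_cost(const cost_seg_t *segs, size_t n, size_t bytes) {
    for (size_t i = 0; i + 1 < n; i++)
        if (bytes <= segs[i].limit)
            return segs[i].base + segs[i].slope * (double)bytes;
    /* sizes beyond the earlier breakpoints use the last segment */
    return segs[n - 1].base + segs[n - 1].slope * (double)bytes;
}
```

A simulation module in the spirit described above would sum such per-transfer costs along the channels of a variant to estimate its end-to-end execution time, and the optimizer would compare these estimates across candidate mappings.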
[Fig. 6: Communication cost measurements and predictions (top: cost measurement over token size in bytes with a piecewise linear regression; middle: measured vs. simulated execution time in cycles for producer and consumer variants; bottom: cost prediction with congestion, 1,000 iterations).]

5.3.2 Deploying Variants with Symmetries
Having produced several annotated mappings at compile time, they need to be deployed at run time by the system. For this, we use symmetries again. However, instead of removing equivalent mappings, we generate them. The mapping represented by a variant can be transformed to [...] uses the symmetries to find an equivalent mapping that fits the current, dynamic resource constraints of the system.

5.4 The Canoe Runtime System
Canoe is a runtime environment for process networks. It [...]
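The symmetry-based deployment of Section 5.3.2 — renaming the PEs of a compiled mapping with a permutation from the platform's symmetry group — can be illustrated with a toy C function. The data layout and names are invented for exposition and are not Canoe's API.

```c
#include <assert.h>
#include <stddef.h>

/* A mapping assigns each task to a PE. Applying a permutation of
 * equivalent PEs (a platform symmetry) yields an equivalent mapping
 * with the same estimated cost, so the runtime can shift a variant
 * onto whatever equivalent PEs are currently free. */
static void apply_symmetry(const int *map, const int *perm,
                           int *out, size_t ntasks) {
    for (size_t i = 0; i < ntasks; i++)
        out[i] = perm[map[i]];   /* task i moves from PE map[i] to perm[map[i]] */
}
```

For example, if the PEs a variant was compiled for are occupied, the runtime can pick a permutation that renames them to free but architecturally identical PEs without recomputing the mapping.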
[Figure residue: an application graph (FFT, Disp, Eq1–Eq3, IFFT1–IFFT3), a plot of makespan (s) versus number of PEs (1–8), and a NoC mesh diagram.]
6.1 Probabilistic Model Checking
Formal methods include techniques for the specification, development, analysis, and verification of hardware/software systems, which complement the classical approach of simulation-based (performance) analysis and testing. They provide the foundations of a methodical system design and contribute to the reliability, robustness, and performance of systems (see, e.g., [59], [60], [61], [62]).

In this work, we focus on formal verification and (quantitative) analysis using (probabilistic) model checking techniques. Model checking is a formal algorithmic approach that uses graph algorithms to exhaustively explore the state space of a given model to verify or falsify a temporal logic property. The major benefit of model-based techniques such as model checking is the ability to predict and analyze systems that are still in their design phase, which is of great importance in the context of future post-CMOS technology, especially when there are no hardware realizations available yet. Model checking allows for proving the functional correctness of hardware/software systems with respect to a given temporal specification. Probabilistic model checking (PMC) is a variant for the verification and quantitative analysis of stochastic systems.

Our goal is to use PMC techniques to support the design of the Orchestration stack in various ways: (i) modeling and formal analysis of orchestration concepts developed at each layer in isolation (e.g., automatic resource allocation used in M3 and Canoe as in Section 4 and Section 5.4, or the viability of the abstractions in Section 3.2), (ii) the interplay of concepts on two or more layers (e.g., advice for the compiler layer (see Section 5) with respect to OS and hardware specifics), as well as (iii) the effects on the whole system when combining orchestration methods across all layers of the stack. This demands modeling approaches and methods enabling the PMC-based quantitative analysis of heterogeneous computing platforms. The next section provides an overview of the fundamental concepts.

6.2 PMC for Heterogeneous Computing Platforms
PMC-based analysis requires abstract operational models that are compositional and capture the operational behavior of the hardware and/or software components together with their interplay. The compositional approach allows one to easily derive model variants, refinements, and abstractions of a system model by exchanging specific components only. Component variants exposing the same operational behavior, but with different characteristics, e.g., in terms of performance, reliability, and energy consumption, can often be derived by simply exchanging the characteristic parameters within the model. For this part, we rely on an extension of our modeling framework for tiled architectures as introduced in [63] (used to analyze the Tomahawk4 in Section 7.1). The model is based on a feature-based modeling of probabilistic systems as presented in [64] that enables a family-based analysis, i.e., the simultaneous analysis of a model family with members for multiple different design variants. The underlying semantics are Markovian models such as Markov chains, which feature probabilistic branching only, and Markov decision processes (MDPs) with additional nondeterministic choices. Probabilistic branching can be used, e.g., to model stochastic assumptions on the environment, i.e., the application load, or to abstract some complex system behavior such as caching into stochastic distributions. Another use for probabilism is, e.g., to incorporate hardware failure models.

In Markov chains, the classical PMC-based analysis allows computing single objectives such as probabilities of temporal properties. For instance, in a multicore we can compute the probability to acquire a shared resource without waiting for competing processes, or the likelihood of finishing a batch of tasks within given time bounds. Moreover, by providing annotations for costs or rewards to the models, the classical PMC analysis allows for computing expected values of the respective random variables. Examples are the expected waiting time to acquire the resource or the expected amount of energy required to finish a batch of tasks. The nondeterministic choice among different actions in the states of an MDP is used to model important decisions to be made at run time. For example, for the Tomahawk, this could be assigning a certain task to a specific PE or deciding on the powering or scaling of PEs. The classical PMC analysis allows computing minimal and maximal probabilities over all resolutions of these nondeterministic choices. From the extremal resolutions we can derive (nearly) optimal policies, which can provide valuable feedback for all layers of the Orchestration stack, e.g., for resource management with respect to performance or energy consumption. Here, the optimality refers to the single objective measure, i.e., the probability or expectation.

Beyond classical PMC methods, we have developed new methods enabling the computation of more complex multi-objective criteria that combine multiple measures for costs and/or utility in different ways. Among others, we have developed new PMC algorithms for computing conditional probabilities and expectations [65], [66], [67] as well as cost-utility ratios and p-quantiles [68], [69], [70] in MDPs. The latter can, e.g., stand for the minimal time bound t such that there is a policy to schedule tasks in such a way that the execution of the tasks of a given list is completed within t time units with probability at least p. Another example for a p-quantile is the maximal utility value u such that there is a policy to schedule tasks such that the achieved utility value is at least u when having completed all tasks with probability at least p. Such methods are essential for the evaluation of the resource management policies as provided on the different layers of the Orchestration stack, as they enable multi-objective and trade-off analysis [71], [72]. With these methods at hand, one can now optimize for the trade-off between contradicting objective measures, e.g., energy and performance. For instance, one can compute a policy that minimizes the energy consumption while guaranteeing a certain throughput with sufficient probability. In Section 7.1 we report on the results of our analysis of the Tomahawk4 platform, where we apply classical PMC methods as well as our new algorithms.

6.3 Remark on Complexity
It is well known that model checking techniques suffer from the state-space explosion problem, as the size of
tional nondeterministic choice. The nature of probabilism the underlying model grows exponentially in the number
of components. The general problem is finding the right level of abstraction to capture exactly the information needed for the respective analysis. Apart from the expert knowledge required here, there are several sophisticated (semi-)automated abstraction techniques adapted to probabilistic systems, such as partial-order reduction [73], [74], [75] and symmetry reduction [76], [77]. Besides such abstraction techniques, symbolic model representations and methods based on variants of binary decision diagrams, such as those in [78], [79], [80], can be used in parallel and yield the basis for compact representations and efficient treatment of stochastic system models. In [81] we present symbolic algorithms for computing quantiles in MDPs as well as general improvements to the symbolic engine of the prominent probabilistic model checker PRISM [79], [82] that can potentially help treat models of larger scale more efficiently. Beyond this, even when considering very abstract models of larger systems, e.g., when reasoning about a reduced number of PEs or hiding many operational details of the underlying hardware, we argue that our methods are still applicable and can provide guidelines for the design, e.g., by revealing trends and tendencies that are expected to occur in real hardware.

7 EVALUATION

We now study the techniques of Sections 3–6 on the Tomahawk testbed. We start by demonstrating the applicability of the formal methods to the Tomahawk hardware and particular layers of the stack. With increasing system heterogeneity, these methods are crucial to ensure correct-by-construction systems. We then analyze the behavior of the lower layers of the stack on a simulated tiled architecture inspired by the Tomahawk architecture. We show how the compiler-generated variants can be deployed on different, heterogeneous resources by the Canoe runtime, and how Canoe interacts with M3 to claim resources when multiple applications run on the system. Using a simulator allows us to scale the system beyond what we currently can afford to put in a hardware prototype and, in the future, to add hardware based on emerging, not yet existing technologies.

7.1 PMC for Tomahawk

In this section we report on the main results of the quantitative analysis of the Tomahawk4 (T4) platform using probabilistic model checking (see Section 6.1). The general goal of our analysis is to (i) provide feedback to the compiler and the Canoe runtime system on how to use the T4 most efficiently and (ii) provide guidelines for the design of subsequent versions of the Tomahawk platform and of wildly heterogeneous tiled architectures of the future in general.

In our modeling framework, a concrete instance is created by providing a list of task types (e.g., general computation task, FFT task, etc.) and/or code variants (e.g., parallel vs. sequential), and a list of tiles with their type (e.g., ARM core or FFT accelerator), operational modes (e.g., frequency), performance characteristics (per task type and mode), and energy profiles. This framework can easily be used to create models for variants of the Tomahawk (e.g., with an increased number of BBPM PEs), including tiles based on future hardware with properties substantially different from those existing today.

7.1.1 PMC-based Analysis of the Tomahawk4

For the analysis we consider the T4 platform with four DPMs and one BBPM, as detailed in Section 3 and in [27]. The PEs can be powered on and off and operated at two different frequencies (see Table 1a). We use a discrete-time model and fix one time step to represent 10.0 ms.

TABLE 1: System parameters for PMC

(a) Power consumption of PEs

  load    @100 MHz    @400 MHz
  idle    1.3 mW      8.8 mW
  full    3.3 mW      25.5 mW

(b) Speedup for task types of PEs

  PE      DPM-accelerated    BBPM-accelerated
  DPM     12                 1
  BBPM    1                  8

On the application side we consider a batch processing scenario with two different task types (DPM- or BBPM-accelerated, see Table 1b) and a fixed number of tasks to be finished per task type. We assume that the execution time of each task follows a geometric distribution with a mean of 150.0 ms. The actual execution time depends on the type of the PE and the speed-step, i.e., on the speedup provided by the respective PE and the selected frequency. For instance, if a DPM-accelerated task is scheduled on a fitting PE with a speedup of 12 that runs at 400 MHz, the actual expected execution time is about 20.0 ms. The performance characteristics in Tables 1a and 1b are based on values measured on the T4.

The first goal (G1) of our analysis is to evaluate and quantify the benefits of heterogeneity. For this we compare the T4 configuration with a homogeneous variant consisting of five DPMs and no BBPM. The second goal (G2) is to evaluate the impact of using alternative code variants. We mimic this by varying the ratio between the numbers of tasks of the different types. This corresponds to having variability at the compiler layer, which, for ten tasks, could decide to, e.g., generate only DPM tasks and no BBPM tasks, or five tasks of each type. In the third setting, addressing goal (G3), we study optimal strategies for powering the PEs and choosing their frequencies. Results obtained from the formal analysis in these three dimensions support design decisions at the hardware level (see Section 3), the compiler level (see Section 5 and Section 5.1) and the runtime management level as realized in Canoe (see Section 5.4), respectively. When comparing the variants in one of the above settings (G1–G3), we leave the other choices nondeterministically open; e.g., when considering (G1), we compare the heterogeneous with the homogeneous hardware variant on the basis of the best possible resolution of the remaining nondeterministic choices in the respective variant. Hence, we perform the comparison in (G1) while considering a resource management policy that is optimal for the respective variant.
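The discrete-time abstraction of the task execution times can be made concrete with a short sketch. The 10 ms step width and the 150 ms mean are taken from the text above; the helper names are ours, and deriving the per-step completion probability as step/mean is one simple way to discretize a geometric distribution, not necessarily the exact construction used in the model:

```python
import math

STEP_MS = 10.0  # one time step of the discrete-time model

def completion_prob(mean_ms: float) -> float:
    """Per-step completion probability of a geometrically distributed
    execution time with the given mean (in ms)."""
    return STEP_MS / mean_ms

def prob_done_within(steps: int, p: float) -> float:
    """Probability that a task finishes within `steps` time steps."""
    return 1.0 - (1.0 - p) ** steps

def time_quantile(p_target: float, p: float) -> int:
    """Minimal number of steps t such that the task completes within t
    steps with probability at least p_target (a p-quantile in the
    spirit of Section 6.2)."""
    return math.ceil(math.log(1.0 - p_target) / math.log(1.0 - p))

p = completion_prob(150.0)  # mean of 150 ms -> p = 1/15 per 10 ms step
# The mean of the geometric distribution is 1/p steps = 150 ms, but a
# 0.9-quantile needs 34 steps (340 ms), considerably more than the mean.
```

This also illustrates why quantile queries carry more information than plain expectations: the guaranteed completion time at probability 0.9 is more than twice the mean.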
TABLE 2: Expected energy and time consumption for finishing a task quota for comparisons

(b) code variants

  DPM tasks    BBPM tasks    energy     time
  10           0             53.3 mJ    90.4 ms
  8            2             56.8 mJ    90.4 ms
  5            5             62.0 mJ    90.4 ms
  2            8             67.2 mJ    91.0 ms
  0            10            70.7 mJ    94.3 ms

(c) PE modes

  strategy                     minimal expected
  power        speed           energy      time
  optimal      powersave       62.0 mJ     110.0 ms
  optimal      performance     281.0 mJ    90.4 ms
  always on    powersave       106.3 mJ    110.0 ms
  always on    performance     599.8 mJ    90.4 ms

[Figure: quantile plots of energy [mJ] and time [ms] over probability (0.9–1.0), comparing the homogeneous and heterogeneous platforms, the code variants (5 DPM/5 BBPM, 2 DPM/8 BBPM, 0 DPM/10 BBPM), and the PE-mode strategies (always on vs. optimal, performance vs. powersave).]
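The qualitative shape of Table 2(c), powersave minimizing energy while performance minimizes time, can be reproduced with a toy single-PE model. The full-load powers come from Table 1a; the completion rates (a 4x speedup at 400 MHz) and the closed-form per-task solution are our own simplification for illustration, not the paper's MDP framework:

```python
STEP_MS = 10.0
# (full-load power in mW from Table 1a, per-step completion probability);
# the 4x rate increase at 400 MHz is an assumption for illustration.
SPEED_STEPS = {
    "powersave":   (3.3,  STEP_MS / 150.0),   # 100 MHz
    "performance": (25.5, STEP_MS / 37.5),    # 400 MHz
}

def best_policy(objective: str, tasks: int) -> tuple:
    """For one PE finishing `tasks` geometrically distributed tasks,
    pick the speed-step minimizing expected energy [mJ] or time [ms].
    Expected time per task is STEP_MS / p; energy is power * time."""
    results = {}
    for name, (power_mw, p) in SPEED_STEPS.items():
        exp_time_ms = STEP_MS / p
        value = exp_time_ms if objective == "time" else power_mw * exp_time_ms / 1000.0
        results[name] = value * tasks
    name = min(results, key=results.get)
    return name, results[name]

# The slow mode wins on energy and the fast mode on time, mirroring the
# powersave/performance split of Table 2(c).
```

Even though the 400 MHz mode finishes a task four times faster here, its roughly eightfold power draw makes it lose on expected energy, which is exactly the trade-off the optimal policies of Table 2(c) navigate.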
Fig. 11: Execution of different mapping variants of the LTE-like application on a heterogeneous platform.
kernel. An actual chart of the setup process is displayed in the bottom overlay of Figure 12. In the row labeled app, the green bars identify the intervals where the application registers actual data with the manager, also marked data inject in the example. With its knowledge about the worker pool and the received network description, Canoe selects a mapping (marked with the label mapping in Figure 12). In the row labeled manager, we can identify the setup of application channels as green bars at the bottom. As an example, one such channel setup event is marked as comm setup in the figure. The system calls to the M3 kernel needed to set up the channels on the side of the workers and the manager are marked as thin red bars in the figure. The time spent in M3 syscalls for setting up a communication channel on the manager's side is 7 µs, which represents 1% of the total communication setup time. Table 3 lists this time alongside the times of several other events in the benchmark.

TABLE 3: Execution times of distinct actions and kernels in the simulation.

  action             big          LITTLE       accelerator
  app setup          12,752 µs
  comm setup         664 µs
  manager syscall    7 µs
  worker syscall     3 µs
  FIFO transition    15 µs
  FFT kernel         4629 µs      7744 µs      10 µs
  EQ kernel          15,040 µs    25,048 µs
  Disp kernel        ~1 µs        ~2 µs

Blue bars, like the ones connected with an arrow labeled block transfer in the bottom overlay of Figure 12, represent the moment at which a transfer of a data block becomes active. Finally, workers start computing; their execution is represented with gray bars, blocking reads with teal bars, and blocking writes with cyan bars. The actual transfer of data through the channel corresponds to the time elapsed between the end of a write request block (cyan) and the end of the corresponding read request block (teal). The top overlay zooms in on a series of data transfers between workers. The two red bars surround the interval of the data being transferred, which takes 15 µs, as stated in Table 3. This transition time, however, is less than 0.3% of the computation time of an FFT or EQ kernel. The dataflow of the LTE application is shown with blue arrows pointing from the origin of some data to its destination in the figure for three selected communication hotspots marked with blue circles.

7.2.3 Multiple-application Perspective

In multi-application scenarios, it might not be possible to use a mapping variant due to constraints on available resources. As described in Section 5.3.2, we overcome this problem by transforming a mapping into an equivalent one, which keeps the desired properties in a predictable fashion. This is crucial in many application domains, e.g., real-time systems. To evaluate this, we first show that performance is stable among equivalent mappings. For this, we set up a system simulation with the LTE application, blocking some cores for execution to force the system to use a different but equivalent mapping.

To verify isolation and time-predictability, we also run simulations with a separate application running alongside the LTE benchmark. To this end, we use a native M3 application that produces a high load on the filesystem. At setup time, the filesystem-intensive application starts as an M3 application. It uses the M3 API to connect to the filesystem and performs several operations in an endless loop.

In Table 4, we show the total run times of several simulations, as well as the length of one iteration of the LTE receiver (i.e., the inverse of the network's throughput). In addition to the four mappings from Figure 11, the mix setup has been simulated in the modified versions noisy and equiv., which add the filesystem-intensive application and use a different albeit equivalent mapping, respectively. Note that, as expected, neither the introduction of a noisy environment nor the usage of an equivalent mapping changes the run time.

TABLE 4: Execution times of experimental setups with different mappings in the system simulator. The first four correspond to Figure 11b.

  name            iteration    run time
  big             15,254 µs    54,662 µs
  little          25,313 µs    88,254 µs
  mix             15,254 µs    57,838 µs
  accel           15,276 µs    42,793 µs
  mix (noisy)     15,254 µs    57,830 µs
  mix (equiv.)    15,254 µs    58,065 µs
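The iteration lengths in Table 4 can be related to the kernel times in Table 3: in a settled pipeline, the iteration length is bounded by the slowest stage. A minimal sketch (the `KERNEL_US` table holds the per-core times in µs from Table 3; the mapping dictionaries and the helper function are ours):

```python
# Per-core kernel execution times in microseconds (Table 3).
KERNEL_US = {
    "big":    {"FFT": 4629, "EQ": 15040},
    "LITTLE": {"FFT": 7744, "EQ": 25048},
}

def iteration_lower_bound(mapping: dict) -> int:
    """The iteration length of a settled pipeline is limited by its
    slowest element: the maximum per-stage time under the mapping."""
    return max(KERNEL_US[core][kernel] for kernel, core in mapping.items())

# EQ dominates either placement, so only its core type matters: the bound
# is 15,040 us with EQ on a big core and 25,048 us on a LITTLE core,
# close to the measured iterations of 15,254 us ("big") and 25,313 us
# ("little") in Table 4; the remainder is communication overhead.
```

This is why the big, mix, and mix-variant setups all share the same 15,254 µs iteration: they place the dominant EQ kernels identically.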
Fig. 12: Trace of the simulator running the LTE app using the big mapping variant. Bottom overlay: setup phase concerning the application and manager cores. Highlighted are the mapping decision point (mapping), an interval of the app passing data to the manager (data inject), a setup interval for an inter-worker communication channel (comm setup), and an actual data transfer resulting from a data inject (block transfer). Top overlay: detailed view of an in-KPN data transfer. Blue arrows: examples of in-KPN data transfers around the circled communication hot spots.
The run time values listed in Table 4 represent the time between the application's connection initiation to the Canoe manager and the end of the second iteration of the pipeline's last block. They are therefore dependent on the startup phase of the network. The iteration column, however, describes the iteration length in a settled network, where the throughput of a pipeline is always limited by its slowest element. From Table 3 we learn that most computation time is spent in the EQ kernel; thus, the iteration length difference depends only on the placement of these kernels on big or little cores. Unfortunately, this means that the speedup of 450 obtained when using the FFT accelerator core cannot be fully exploited in this benchmark.

8 DISCUSSION

The previous sections described the cfaed Orchestration hardware/software stack, consisting of a tiled heterogeneous hardware platform with a network-on-chip, a capability-based operating system, a runtime system, and a compiler for dataflow applications. As mentioned in the introduction, we develop our stack as a research platform to prepare for what we think will be the third inflection point in computing, i.e., a shift towards "wildly heterogeneous" systems composed of fundamentally new hardware technologies. Since it is currently still unclear which transistor technology could replace CMOS and sustain the performance growth of the past [1], [2], [86], in the medium term the main increase in the power and efficiency of future electronics will be driven by two mutually stimulating technological directions: (i) processing: further parallelization and specialization, including application-specific, reconfigurable, neuromorphic, and analog circuits, leading to more heterogeneous hardware; and (ii) data: non-volatile and potentially heterogeneous memories, their tight (3D) integration with logic, integrated sensing, and novel interconnects (e.g., photonics or wireless).

In this regard, partner projects within cfaed are developing reconfigurable transistors based on silicon nanowires (SiNW) [87], [88] and carbon nanotubes (CNT) [89], [90], as well as radio frequency electronics based on CNTs [91], plasmonic waveguides for nanoscale optical communication [92], and sensors based on SiNWs [93].

Wildly heterogeneous systems resulting from the two aforementioned directions will pose both challenges and opportunities to the hardware/software stack. We believe that the concepts discussed in this paper are key principles for handling this kind of heterogeneity. Still, several open questions remain. We discuss some of them in the following paragraphs.

Hardware architecture: Given a tight integration of heterogeneous PEs, memories, and sensors, potentially based on emerging technologies (e.g., SiNW, CNT, and novel memories), a tile-based approach with a NoC and DTUs as uniform hardware interfaces makes it easy to combine and interface these units. However, the limits of the DTU concept with respect to a potential disruption of the memory
hierarchy are currently unclear.

Memory: Novel non-volatile memories, such as spin transfer torque magnetic RAM (STT-MRAM) and resistive RAM (RRAM) [2], will open a new space of optimizations for many applications. We have seen how new memory technologies can have a positive impact on both performance and energy efficiency [94]. Hybrid memory architectures that combine multiple different technologies are also being proposed [95], [96]. Given these systems, one could place data items, depending on their usage patterns, in the most efficient type of memory. Some of the novel memory technologies are more amenable to tight integration with logic, opening opportunities for near- and in-memory computing [97]. Given a high-level specification of the application, compilers together with runtimes and flexible memory controllers can automate the process of data mapping (and near-memory operations). Additionally, all these new memory architectures will certainly impact the system software (OSs, paging, security, and others). The concrete impact is still unclear, though.

Programming: It is questionable whether today's standards like OpenCL and OpenMP are able to address wild heterogeneity and non-von-Neumann paradigms (e.g., neuromorphic, quantum, reconfigurable, or analog) beyond manual offloading. Alternative models that provide a higher level of abstraction, like dataflow programming and domain-specific languages, could enable automatic utilization of heterogeneous resources. Therefore, we are working on abstractions for computational fluid dynamics [20], [22] and particle-based simulations [21].

Co-design: Given a huge design space, tools are needed that help co-design from the specific application problem down to the hardware architecture and the materials. For example, image processing implemented with coupled oscillators provides higher efficiency compared to an application-specific CMOS circuit [98]. As presented in Section 7, we are using system-level simulation and probabilistic model checking to evaluate design alternatives.

9 RELATED WORK

This section focuses on large initiatives that cover the hardware/software stack in a manner comparable to our holistic approach. Work relevant to the individual layers is discussed throughout Sections 3–6.

The Aspire Lab at UC Berkeley emphasizes two areas: hardware-specific optimizations hidden by pattern-specific languages [99] and agile hardware development based on the hardware construction language Chisel [100]. Chisel has been applied to create cycle-accurate simulators and to tape out various research processors. OmpSs [101], [102] is a task-based programming model developed at the Barcelona Supercomputing Center that extends and influences the OpenMP standard. It has been applied to run applications on clusters, multicores, accelerators, and FPGAs. The Teraflux project [103] investigates OmpSs-based dataflow programming on a simulated massively parallel multicore architecture. The German collaborative research center InvasIC [32] introduces invasive computing as a new programming paradigm for heterogeneous tiled architectures. They use an FPGA implementation of their MPSoC architecture as well as simulation to demonstrate their software stack. In contrast to our automated approach, invasive computing is resource-aware, i.e., computing resources have to be allocated manually. Another project working on heterogeneous tiled architectures is the Euretile EU project [33]. They developed a complete stack consisting of a programming paradigm based on process networks, a runtime supporting fault management, a lightweight OS, and a hardware platform with a custom torus network. While they apply a dataflow approach similar to ours, we additionally investigate run-time adaptivity.

The above-mentioned projects work with various heterogeneous CMOS-based hardware, either off-the-shelf or custom designs. However, none of these projects considers systems with post-CMOS hardware. The IEEE Rebooting Computing Initiative [3], founded in 2012, proposes to rethink the concept of computing over the whole hardware/software stack and acts as a scientific platform for research on the CMOS scaling challenge and emerging post-CMOS technologies. Besides such important stimulating initiatives, few projects are working on a concrete hardware/software stack based on emerging technologies. The six US STARnet centers mainly focus on developing new materials, technologies, and hardware architectures, but less on the software stack [4]. N3XT, for example, is an approach within STARnet towards a memory-centric architecture based on new technologies including CNTFET, RRAM, and STT-MRAM [16], targeting big data applications. Another novel memory-centric architecture, called "The Machine", is developed by HPE. It serves as a pre-commercial testbed for large shared non-volatile memories accessed via photonic interconnects, with a big-data-centric software stack [104]. In a joint division, CEA-Leti/List [105] develops hardware based on emerging technologies as well as software for future embedded systems, but without a common integrated hardware/software stack.

10 CONCLUSION

In this paper we presented the design of a programming stack for heterogeneous systems and evaluated a prototype of it on a tiled CMOS heterogeneous multicore. We expect the layer-wise mechanisms and the interfaces among the layers to be applicable to future systems that integrate post-CMOS components. More precisely, the paper described (i) simple hardware-level interfacing via message passing that will make it possible to integrate new exotic components; (ii) an OS architecture that leverages hardware support for message passing to implement isolation and makes OS services available remotely to tiles that may not have a conventional ISA; (iii) an application runtime that reserves resources from the OS and deploys and transforms a variant accordingly; and (iv) an automatic approach for dataflow applications that uses abstract cost models of the architecture to produce near-optimal mappings of applications to resources. Furthermore, we developed novel formal methods to provide the kind of quantitative analysis required to reason about trade-offs and system interfaces in such a heterogeneous setup. The evaluation served to (i) demonstrate the potential of the formal methods for a real multicore platform, (ii) measure the overhead of the software layers on a virtual
prototype, and (iii) show the validity of the claims (isolation, performance-preserving transformation, and run-time adaptability).

In the future we will extend the simulator to integrate models of components built with post-CMOS technologies, allowing us to test the programming stack on wildly heterogeneous systems. We believe that the simulator and the programming stack will allow technologists and system architects to explore system-level aspects of new devices. This requires more modeling formalisms for both hardware (i.e., emerging technologies) and application models (beyond dataflow models), at different levels (e.g., for automatic compilation or formal analysis). This will eventually contribute to narrowing down and structuring the current landscape of alternative technologies, both within and outside cfaed. We are convinced that more such early initiatives are needed to prepare systems that can cope, to some extent, with yet unknown technologies.

ACKNOWLEDGEMENT

This work is supported by the German Research Foundation (DFG) and the Free State of Saxony through the cluster of excellence "Center for Advancing Electronics Dresden" (cfaed).

REFERENCES

[1] T. N. Theis and H. S. P. Wong, "The end of Moore's Law: A new beginning for information technology," Computing in Science Engineering, vol. 19, no. 2, pp. 41–50, 2017.
[2] "More Moore," White Paper, IEEE IRDS, 2016.
[3] T. M. Conte et al., "Rebooting computing: The road ahead," Computer, vol. 50, no. 1, pp. 20–29, 2017.
[4] "STARnet Research," Semiconductor Research Corporation. [Online]. Available: https://www.src.org/program/starnet/
[5] D. Geer, "Chip makers turn to multicore processors," Computer, vol. 38, no. 5, pp. 11–13, 2005.
[6] H. Esmaeilzadeh et al., "Dark silicon and the end of multicore scaling," in Symp. on Computer Architecture, 2011, pp. 365–376.
[7] J. D. Owens et al., "A survey of general-purpose computation on graphics hardware," Comput Graph Forum, vol. 26, pp. 80–113, 2007.
[8] R. Kumar et al., "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Symp. on Microarchitecture (MICRO-36), 2003, pp. 81–92.
[9] P. Greenhalgh, "big.LITTLE processing with ARM Cortex-A15 and Cortex-A7," ARM, Tech. Rep., 2011.
[10] E. Biscondi et al., "Maximizing multicore efficiency with navigator runtime," White Paper, Texas Instruments, 2012.
[11] M. Taylor, "A landscape of the new dark silicon design regime," IEEE Micro, vol. 33, no. 5, pp. 8–19, 2013.
[12] M. Ferdman et al., "Toward dark silicon in servers," IEEE Micro, vol. 31, pp. 6–15, 2011.
[13] P. Kogge and J. Shalf, "Exascale computing trends: Adjusting to the new normal for computer architecture," Computing in Science Engineering, vol. 15, no. 6, pp. 16–26, 2013.
[14] S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, vol. 25, no. 6, pp. 10–16, 2005.
[15] L. Ceze et al., "Arch2030: A vision of computer architecture research over the next 15 years," Arch2030 Workshop, 2016.
[16] M. M. Sabry Aly et al., "Energy-efficient abundant-data computing: The N3XT 1,000x," Computer, vol. 48, no. 12, pp. 24–33, 2015.
[17] A. Hemani et al., "Network on chip: An architecture for billion
[20] J. Mey et al., "Using semantics-aware composition and weaving for multi-variant progressive parallelization," Procedia Computer Science, vol. 80, pp. 1554–1565, 2016.
[21] S. Karol et al., "A domain-specific language and editor for parallel particle methods," ACM T Math Software, 2017, to appear.
[22] A. Susungi et al., "Towards compositional and generative tensor optimizations," in Conf. Generative Programming: Concepts & Experience (GPCE'17), 2017, pp. 169–175.
[23] T. Karnagel et al., "Adaptive work placement for query processing on heterogeneous computing resources," in Conf. on Very Large Data Bases (VLDB'17), 2017, pp. 733–744.
[24] T. Limberg et al., "A heterogeneous MPSoC with hardware supported dynamic task scheduling for software defined radio," in Design Automation Conference (DAC'09), 2009.
[25] B. Noethen et al., "A 105GOPS 36mm2 heterogeneous SDR MPSoC with energy-aware dynamic scheduling and iterative detection-decoding for 4G in 65nm CMOS," in Solid-State Circuits Conf. (ISSCC), 2014, pp. 188–189.
[26] S. Haas et al., "An MPSoC for energy-efficient database query processing," in Design Automation Conference (DAC'16), 2016.
[27] ——, "A heterogeneous SDR MPSoC in 28nm CMOS for low-latency wireless applications," in DAC'17, 2017.
[28] C. Baier et al., "Probabilistic model checking for energy-utility analysis," in Horizons of the Mind. A Tribute to Prakash Panangaden, 2014, pp. 96–123.
[29] C. Ramey, "TILE-Gx100 manycore processor: Acceleration interfaces and architecture," in Hot Chips Symp., 2011.
[30] B. D. de Dinechin et al., "A clustered manycore processor architecture for embedded and accelerated applications," in High Performance Extreme Computing Conf. (HPEC), 2013, pp. 1–6.
[31] B. Bohnenstiehl et al., "A 5.8 pJ/Op 115 billion ops/sec, to 1.78 trillion ops/sec 32nm 1000-processor array," in Symp. on VLSI Technology and Circuits, 2016.
[32] J. Teich, Ed., "Invasive computing," (special issue) it - Information Technology, vol. 58, no. 6, 2014.
[33] P. S. Paolucci et al., "Dynamic many-process applications on many-tile embedded systems and HPC clusters: The EURETILE programming environment and execution platforms," Journal of Systems Architecture, vol. 69, pp. 29–53, 2015.
[34] O. Arnold et al., "Tomahawk: Parallelism and heterogeneity in communications signal processing MPSoCs," Trans Embedded Computer Systems, vol. 13, no. 3s, pp. 107:1–107:24, 2014.
[35] N. Asmussen et al., "M3: A hardware/operating-system co-design to tame heterogeneous manycores," in Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS'16), 2016, pp. 189–203.
[36] A. Baumann et al., "The multikernel: A new OS architecture for scalable multicore systems," in Symp. on Operating Systems Principles (SOSP '09), 2009, pp. 29–44.
[37] F. X. Lin et al., "K2: A mobile operating system for heterogeneous coherence domains," in ASPLOS'14, 2014, pp. 285–300.
[38] A. Barbalace et al., "Popcorn: Bridging the programmability gap in heterogeneous-ISA platforms," in Conf. on Computer Systems (EuroSys '15), 2015, pp. 29:1–29:16.
[39] E. B. Nightingale et al., "Helios: Heterogeneous multiprocessing with satellite kernels," in Symp. on Operating Systems Principles (SOSP '09), 2009, pp. 221–234.
[40] P. B. Hansen, "The nucleus of a multiprogramming system," Communications of the ACM, vol. 13, no. 4, pp. 238–241, 1970.
[41] J. Liedtke, "On µ-kernel construction," in Symp. on Operating System Principles (SOSP), 1995.
[42] D. B. Golub et al., "Microkernel operating system architecture and Mach," in USENIX Workshop on Microkernels and Other Kernel Architectures, 1992, pp. 11–30.
[43] E. A. Lee, "The problem with threads," Computer, vol. 39, no. 5, pp. 33–42, 2006.
[44] J. Castrillon, "Programming heterogeneous MPSoCs: tool flows to close the software productivity gap," Ph.D. dissertation, RWTH Aachen University, Apr. 2013.
[45] G. Kahn, "The semantics of a simple language for parallel programming," in Information Processing '74, 1974, pp. 471–475.
[46] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow,"
transistor era,” in IEEE NorChip Conference, 2000. Proc. IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[18] S. Gorlatch and M. Cole, “Parallel skeletons,” in Encyclopedia of [47] C. Ptolemaeus, Ed., System Design, Modeling, and Simulation using
Parallel Computing. Springer, 2011, pp. 1417–1422. Ptolemy II. Ptolemy.org, 2014.
[19] S. Karol, “Well-formed and scalable invasive software composi- [48] L. Thiele et al., “Mapping applications to tiled multiprocessor
tion,” Ph.D. dissertation, Technische Universität Dresden, 2015. embedded systems,” in Conf. Application of Concurrency to System
Design (ACSD’07), 2007, pp. 29–40.
[49] R. Jordans et al., “An automated flow to map throughput constrained applications to a MPSoC,” in Bringing Theory to Practice: Predictability and Performance in Embedded Systems, 2011, pp. 47–58.
[50] J. Castrillon et al., “Trends in embedded software synthesis,” in Conf. Embedded Computer Systems: Architecture, Modeling and Simulation (SAMOS’11), 2011, pp. 347–354.
[51] R. Leupers et al., “MAPS: A software development environment for embedded multicore applications,” in Handbook of Hardware/Software Codesign, S. Ha and J. Teich, Eds., 2017.
[52] J. F. Eusse et al., “CoEx: A novel profiling-based algorithm/architecture co-exploration for ASIP design,” ACM T Reconfigurable Technol Syst, vol. 8, no. 3, 2014.
[53] J. Castrillon et al., “Component-based waveform development: the nucleus tool flow for efficient and portable software defined radio,” Analog Integr Circ Sig Proc, vol. 69, no. 2, pp. 173–190, 2011.
[54] C. Menard et al., “High-level NoC model for MPSoC compilers,” in Nordic Circuits and Systems Conf. (NORCAS’16), 2016, pp. 1–6.
[55] J. Castrillon et al., “Communication-aware mapping of KPN applications onto heterogeneous MPSoCs,” in DAC’12, 2012.
[56] ——, “MAPS: Mapping concurrent dataflow applications to heterogeneous MPSoCs,” T Ind Inform, vol. 9, no. 1, pp. 527–545, 2013.
[57] A. Goens et al., “Symmetry in software synthesis,” ACM T Archit Code Op (TACO), vol. 14, no. 2, pp. 20:1–20:26, 2017.
[58] ——, “Tetris: a multi-application run-time system for predictable execution of static mappings,” in Workshop on Software and Compilers for Embedded Systems, 2017, pp. 11–20.
[59] J. Sifakis, “System design automation: Challenges and limitations,” Proc. IEEE, vol. 103, no. 11, pp. 2093–2103, 2015.
[60] E. M. Clarke et al., “Model checking: Back and forth between hardware and software,” in Conf. on Verified Software: Theories, Tools, Experiments (VSTTE), 2005, pp. 251–255.
[61] E. M. Clarke and Q. Wang, “25 years of model checking,” in Conf. on Perspectives of System Informatics (PSI 2014), 2014, pp. 26–40.
[62] O. Grumberg and H. Veith, Eds., 25 Years of Model Checking – History, Achievements, Perspectives. Springer, 2008.
[63] C. Baier et al., “Towards automated variant selection for heterogeneous tiled architectures,” in Models, Algorithms, Logics and Tools, 2017, pp. 382–399.
[64] P. Chrszon et al., “ProFeat: Feature-oriented engineering for family-based probabilistic model checking,” Formal Aspects of Computing, pp. 1–31, 2017.
[65] C. Baier et al., “Computing conditional probabilities in Markovian models efficiently,” in Conf. Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2014, pp. 515–530.
[66] ——, “Maximizing the conditional expected reward for reaching the goal,” in Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2017, pp. 269–285.
[67] S. Märcker et al., “Computing conditional probabilities: Implementation and evaluation,” in Conf. on Software Engineering and Formal Methods (SEFM 2017), 2017, pp. 349–366.
[68] M. Ummels and C. Baier, “Computing quantiles in Markov reward models,” in Conf. on Foundations of Software Science and Computation Structures (FOSSACS). Springer, 2013, pp. 353–368.
[69] C. Baier et al., “Energy-utility quantiles,” in NASA Formal Methods Symp. (NFM), 2014, pp. 285–299.
[70] D. Krähmann et al., “Ratio and weight quantiles,” in Symp. Mathematical Foundations of Computer Science (MFCS), 2015, pp. 344–356.
[71] C. Baier et al., “Probabilistic model checking and non-standard multi-objective reasoning,” in Conf. on Fundamental Approaches to Software Engineering (FASE), 2014, pp. 1–16.
[72] K. Etessami et al., “Multi-objective model checking of Markov decision processes,” Log Meth Comput Sci, vol. 4, no. 4, 2008.
[73] C. Baier et al., “Partial order reduction for probabilistic systems,” in Conf. on Quantitative Evaluation of SysTems (QEST), 2004, pp. 230–239.
[74] P. R. D’Argenio and P. Niebert, “Partial order reduction on concurrent probabilistic programs,” in Conf. on Quantitative Evaluation of SysTems (QEST), 2004, pp. 240–249.
[75] M. Größer et al., “On reduction criteria for probabilistic reward models,” in Conf. on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 2006, pp. 309–320.
[76] M. Kwiatkowska et al., “Symmetry reduction for probabilistic model checking,” in Conf. on Computer Aided Verification (CAV 2006), 2006, pp. 234–248.
[77] A. F. Donaldson and A. Miller, “Symmetry reduction for probabilistic model checking using generic representatives,” in Symp. on Automated Technology for Verification and Analysis (ATVA 2006), 2006, pp. 9–23.
[78] C. Baier et al., “Symbolic model checking for probabilistic processes,” in Colloq. on Automata, Languages and Programming (ICALP’97), 1997, pp. 430–440.
[79] M. Z. Kwiatkowska et al., “Probabilistic symbolic model checking with PRISM: A hybrid approach,” Int J Softw Tools for Technol Transfer, vol. 6, pp. 128–142, 2004.
[80] H. Hermanns et al., “On the use of MTBDDs for performability analysis and verification of stochastic systems,” J Logic and Algebraic Programming, vol. 56, pp. 23–67, 2003.
[81] J. Klein et al., “Advances in probabilistic model checking with PRISM: variable reordering, quantiles and weak deterministic Büchi automata,” J Softw Tools for Technol Transfer, pp. 1–16, 2017.
[82] M. Z. Kwiatkowska et al., “PRISM 4.0: Verification of probabilistic real-time systems,” in Conf. Computer Aided Verification (CAV), 2011, pp. 585–591.
[83] N. Binkert et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[84] C. Menard et al., “System simulation with gem5 and SystemC: The keystone for full interoperability,” in Conf. Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS’17), 2017.
[85] “SLXMapper,” Silexica GmbH, 2016. [Online]. Available: http://www.silexica.com
[86] J. M. Shalf and R. Leland, “Computing beyond Moore’s Law,” Computer, vol. 48, no. 12, pp. 14–23, 2015.
[87] W. M. Weber et al., “Reconfigurable nanowire electronics – a review,” Solid-State Electronics, vol. 102, pp. 12–24, 2014.
[88] A. Heinzig et al., “Dually active silicon nanowire transistors and circuits with equal electron and hole transport,” Nano Letters, vol. 13, no. 9, pp. 4176–4181, 2013.
[89] M. Haferlach et al., “Electrical characterization of emerging transistor technologies: Issues and challenges,” IEEE Trans on Nanotechnology, vol. 15, no. 4, pp. 619–626, 2016.
[90] S. Mothes et al., “Toward linearity in Schottky barrier CNTFETs,” IEEE Trans on Nanotechnology, vol. 14, no. 2, pp. 372–378, 2015.
[91] M. Schröter et al., “Carbon nanotube FET technology for radio-frequency electronics: State-of-the-art overview,” IEEE J Electron Devices Soc, vol. 1, no. 1, pp. 9–20, 2013.
[92] F. N. Gür et al., “Toward self-assembled plasmonic devices: High-yield arrangement of gold nanoparticles on DNA origami templates,” ACS Nano, vol. 10, no. 5, pp. 5374–5382, 2016.
[93] D. Karnaushenko et al., “Light weight and flexible high-performance diagnostic platform,” Advanced Healthcare Materials, vol. 4, no. 10, pp. 1517–1525, 2015.
[94] F. Hameed et al., “Efficient STT-RAM last-level-cache architecture to replace DRAM cache,” in Symp. Memory Systems (MemSys’17), 2017, pp. 141–151.
[95] J. C. Mogul et al., “Operating system support for NVM+DRAM hybrid main memory,” in Hot Topics in Operating Systems, 2009.
[96] S. Onsori et al., “An energy-efficient heterogeneous memory architecture for future dark silicon embedded chip-multiprocessors,” IEEE Trans Emerging Topics in Computing, 2016.
[97] S. Khoram et al., “Challenges and opportunities: From near-memory computing to in-memory computing,” in Symp. Physical Design (ISPD’17), 2017, pp. 43–46.
[98] N. Shukla et al., “Pairwise coupled hybrid vanadium dioxide-MOSFET (HVFET) oscillators for non-boolean associative computing,” in IEEE Int. Electron Devices Meeting, 2014.
[99] B. Catanzaro et al., “SEJITS: Getting productivity and performance with selective embedded JIT specialization,” University of California, Berkeley, Tech. Rep. UCB/EECS-2010-23, 2010.
[100] Y. Lee et al., “An agile approach to building RISC-V microprocessors,” IEEE Micro, vol. 36, no. 2, pp. 8–20, 2016.
[101] J. Bueno et al., “Productive programming of GPU clusters with OmpSs,” in Parallel Distr. Processing Symp., 2012, pp. 557–568.
[102] A. Filgueras et al., “OmpSs@Zynq all-programmable SoC ecosystem,” in Symp. Field-programmable Gate Arrays, 2014, pp. 137–146.
[103] R. Giorgi et al., “Teraflux: Harnessing dataflow in next generation teradevices,” Microprocessors and Microsystems, vol. 38, no. 8, 2014.
[104] K. Keeton, “Memory-driven computing,” in Conf. on File and Storage Technologies (FAST’17), 2017.
[105] M. Belleville et al., Eds., Architecture and IC Design, Embedded Software Annual Research Report 2015. CEA Leti, 2016.