Architecture of Computing Systems – ARCS 2016: 29th International Conference, Nuremberg, Germany, April 4–7, 2016, Proceedings. 1st Edition. Frank Hannig.
LNCS 9637

Architecture of Computing Systems – ARCS 2016
29th International Conference
Nuremberg, Germany, April 4–7, 2016
Proceedings
Lecture Notes in Computer Science 9637
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zürich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7407
Frank Hannig • João M.P. Cardoso (Eds.)
Editors
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
João M.P. Cardoso, Faculty of Engineering (FEUP), University of Porto, Porto, Portugal
Thilo Pionteck, Universität zu Lübeck, Lübeck, Germany
Dietmar Fey, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
Wolfgang Schröder-Preikschat, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
Jürgen Teich, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
General Co-chairs
Dietmar Fey, Friedrich-Alexander University Erlangen-Nürnberg, Germany
Wolfgang Schröder-Preikschat, Friedrich-Alexander University Erlangen-Nürnberg, Germany
Jürgen Teich, Friedrich-Alexander University Erlangen-Nürnberg, Germany

Program Co-chairs
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg, Germany
João M.P. Cardoso, University of Porto, Portugal

Publication Chair
Thilo Pionteck, Universität zu Lübeck, Germany
Program Committee
Michael Beigl, Karlsruhe Institute of Technology, Germany
Mladen Berekovic, Technische Universität Braunschweig, Germany
Simon Bliudze, École Polytechnique Fédérale de Lausanne, Switzerland
Florian Brandner, École Nationale Supérieure de Techniques Avancées, France
Jürgen Brehm, Leibniz Universität Hannover, Germany
Uwe Brinkschulte, Universität Frankfurt, Germany
João M.P. Cardoso, University of Porto, Portugal
Luigi Carro, Universidade Federal do Rio Grande do Sul, Brazil
Albert Cohen, Inria, France
Nikitas Dimopoulos, University of Victoria, Canada
Ahmed El-Mahdy, Egypt-Japan University for Science and Technology, Egypt
Fabrizio Ferrandi, Politecnico di Milano, Italy
Additional Reviewers
Abstract. Energy efficiency and the need for high performance have steered computing platforms toward customization. General-purpose computing, however, remains a challenge, as on-chip resources continue to increase while yielding only limited performance improvement. To truly improve processor performance, a major reconsideration at the microarchitectural level must be undertaken with regard to the compiler, ISA, and overall architecture, without an explicit dependence on transistor scaling and deeper cache hierarchies. In an attempt to devote the processor's transistor budget to engineering ingenuity, this paper presents the concept of Configurable Computing Units (CCUs). CCUs are designed to make reconfigurability in general-purpose computing a reality by introducing the concept of logical and physical compilation, which allows both the application and the underlying architecture to be considered during the compilation process. Experimental results demonstrate that a single CCU core (consisting of two engines) achieves dual-core performance with half the area and power consumption of a conventional monolithic CPU.
1 Introduction

Traditional microprocessors have long benefited from the transistor density gains of Moore's law. Diminishing transistor speed gains and practical energy limits, however, have created new technological challenges: the exponential performance improvements we grew accustomed to in previous computing generations are slowly coming to an end. A common response has been to increase core counts in multiprocessors and to employ heterogeneity (i.e., including accelerators such as GPUs, FPGAs, etc.) for offloading and executing the dataflow-like phases of an application. This style of computing has been enabled by parallel programming models and languages such as OpenCL, OpenMP, CUDA, and OmpSs [2, 7]. Much research on Chip Multiprocessors (CMPs) has also revolved around energy efficiency and specialization, noting that processor performance improvement is largely attributable to increased cache levels and transistor scaling [1, 3]. Although these approaches have increased performance to an extent, the fundamental problem remains of how a single core's organization and design may be improved and then applied to the multiprocessor domain [4, 5, 8–10].
Conventional CPU architectures impose a boundary between the compiler and the underlying hardware through the processor's Instruction Set Architecture (ISA). Current ISAs are limited by program counters, control flow, and fine-grained instruction execution, lacking the richness needed to express a programmer's intent. Specifically, the programmer and software can provide a considerable amount of information about an application prior to compilation, yet ISAs cannot express this information during code generation, so the compiler cannot convey this knowledge to the underlying CPU [10]. The majority of the processor's hardware units must then try to rediscover the program's characteristics at a cost in power and area. Hence, there is much room for ingenuity if transistor budgets are put toward improving and refining the performance of processing elements, rather than following the mantra of integrating ever more simple cores and memory on a single die [5].
Given these factors, this work presents a nuanced approach to general-purpose computing built on the concepts of Configurable Computing Units (CCUs) and logical and physical compilation. CCUs are configurable processors that execute tasks formed using the OmpSs programming model (Sect. 3.1). Each CCU processor consists of multiple variable-sized engines, where each engine comprises unique functional units (FUs) connected through a registerSwitch (rS) interconnect, as shown in Fig. 1. The rS interconnect provides distributed storage and single-cycle multi-hop data communication among its FUs. It thereby avoids the need to constantly access centralized register files, bypass networks, and tile-based hotspots by revising the datapath layout. Using the programmable rS interconnect and knowledge of the underlying architecture, the engines can adapt to a general-purpose application's communication patterns, transferring data among the functional units only when necessary. The configuration data (generated by the physical compiler) therefore allows the engine to configure itself temporally on every clock cycle to support the various data transfers and storage required. To maintain the memory ordering imposed by the logical compiler, a small external register file and load/store unit(s) are also included in the CCU back-end. A banked memory setup for storing configuration data is used to reduce configuration times and enable reconfigurability in general-purpose processors.

The focus of this work is the CCU back-end, which is prototyped in hardware; the full CCU is simulated in software and compared to a conventional monolithic CPU and multiprocessor.
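The engine organization just described can be rendered as a toy data model. Everything here is an illustrative assumption (the class names, the 2×3 grid matching the six-FU engine of Fig. 1, the two-engine core from the abstract, the three-value rS capacity discussed later); it sketches the structure, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class RSUnit:
    # Each registerSwitch (rS) unit provides a few slots of distributed
    # storage for operands in flight (capacity assumed: 3 input ports).
    capacity: int = 3
    stored: list = field(default_factory=list)

@dataclass
class Engine:
    # One variable-sized engine: a grid of FUs joined by rS units.
    rows: int
    cols: int
    rs_grid: list = field(init=False)

    def __post_init__(self):
        self.rs_grid = [[RSUnit() for _ in range(self.cols)]
                        for _ in range(self.rows)]

@dataclass
class CCUCore:
    # A CCU core with two engines (the dual-engine setup evaluated in the
    # paper), plus the shared back-end structures: a small external
    # register file and banked configuration memory (bank count assumed).
    engines: list = field(default_factory=lambda: [Engine(2, 3), Engine(2, 3)])
    external_regfile: dict = field(default_factory=dict)
    config_banks: int = 2

core = CCUCore()
```

The point of the sketch is only that storage is distributed across rS units inside each engine, with a thin external register file behind them, rather than centralized in one large register file.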
2 Overview

Fig. 1. Six-FU (single-engine) CCU architecture. Fig. 2. Proposed CCU processor design flow.
Assuming the engine is also ready, the task ID is input to the engine's lookup table to determine the configuration memory address to be read (the necessary configuration bits are stored contiguously from that address). Reading from this address initiates the engine's configuration and the external read process (in which external data is sent to the Read Register Buffer (RRB)). Once configuration is complete, task execution commences. After execution, final values pending in the Write Register Buffer (WRB) are written back to the external register file and stored to the cache, respectively.
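This lookup-and-configure sequence can be sketched minimally as follows. The data structures are invented for illustration (the lookup table, the configuration record, and the summation standing in for the engine's dataflow execution are all assumptions):

```python
# Sketch of the sequence described above:
# task ID -> lookup table -> configuration memory -> execute -> write back.

def run_task(task_id, lookup_table, config_memory, engine_ready=True):
    if not engine_ready:
        return None  # engine busy: the task must wait
    # 1. The task ID indexes a lookup table holding the base address at
    #    which the task's configuration bits are stored contiguously.
    base_addr = lookup_table[task_id]
    # 2. Reading that address starts engine configuration; external reads
    #    fill the Read Register Buffer (RRB) concurrently.
    config = config_memory[base_addr]
    rrb = list(config["external_reads"])
    # 3. Execute: a simple fold over the buffered operands stands in for
    #    the engine's actual dataflow execution.
    result = sum(rrb)
    # 4. Pending values in the Write Register Buffer (WRB) are written
    #    back to the external register file after execution.
    return {config["dest_reg"]: result}

lookup = {7: 0x40}
cfg_mem = {0x40: {"external_reads": [3, 4], "dest_reg": "r2"}}
```

Calling `run_task(7, lookup, cfg_mem)` walks one task through configuration, execution, and writeback.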
In the first cycle of the example, the XOR and XNOR instructions are executed (as buffered in their respective FU queues), each obtaining its source data from the RRB (i.e., the external register reads and/or immediate values buffered to the RRB during configuration). The result generated by the XNOR instruction is then needed by three consumers, ADD, SUB, and OR, all executing in the second cycle. The XOR result, however, is needed in the next clock cycle by the ADD and SUB instructions, and in the third cycle by the AND instruction. The operand result ('4') therefore requires both propagation and temporary storage for its consumers in the 2nd and 3rd clock cycles, respectively. Thus, during the 3rd clock cycle, when AND executes, the operand is selected from the rS unit storage (marked 'S' in the 2nd row, 2nd column) and propagated to its consumer as programmed during configuration.
This process continues until all instructions complete; each FU contains a queue to buffer multiple task instructions. In terms of distributed storage, each rS unit may temporally store up to three values (i.e., one per input port), and the torus connection transfers results between opposite edges of the rS interconnect (right to left, and bottom to top) to sustain a continuing stream of dataflow execution using the xy routing protocol. Note that stores are also sent to the WRB during execution and are then forwarded to the store unit, which handles writebacks. Once all instructions have finished executing, the final register values present in the WRB are written to the external register file as required.
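The worked example boils down to one rule: a result consumed more than one cycle after it is produced must be parked in rS storage, not merely forwarded. A toy check of that rule over the example's opcodes and cycles (FU placement and routing details are omitted; the dependence structure is taken from the text):

```python
def schedule(deps):
    """deps maps op -> (cycle, [producer ops]). Returns the set of
    producers whose results must be held in rS storage because some
    consumer executes more than one cycle after the producer."""
    cycle_of = {op: c for op, (c, _) in deps.items()}
    stored = set()
    for op, (cycle, producers) in deps.items():
        for p in producers:
            # Forwarding covers a one-cycle gap; anything longer needs
            # temporary storage in an rS unit along the path.
            if cycle - cycle_of[p] > 1:
                stored.add(p)
    return stored

deps = {
    "XOR":  (1, []),
    "XNOR": (1, []),
    "ADD":  (2, ["XOR", "XNOR"]),
    "SUB":  (2, ["XOR", "XNOR"]),
    "OR":   (2, ["XNOR"]),
    "AND":  (3, ["XOR"]),  # consumes XOR two cycles later -> rS storage
}
```

Running `schedule(deps)` flags only XOR, matching the example: XNOR's consumers all execute in the very next cycle, while AND reads XOR two cycles after it was produced.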
3 Compilers

3.1 OmpSs Programming Model

OmpSs is a task-based programming model that exploits task-level parallelism on a shared-memory architecture. It enables programmers to annotate standard sequential applications with pragmas that help the compiler and runtime identify dependent and independent tasks for parallel execution while maintaining programming familiarity. The programmer therefore does not need to synchronize or manage task execution, but merely exposes task side effects by annotating each kernel's operands as input, output, or inout (bidirectional) [7]. Using the clauses specified by the programmer, the communication patterns among the tasks are obtained and analyzed by the physical compiler to determine inter-task dependencies.
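How such clauses might induce inter-task dependencies can be sketched as follows. The clause handling is a deliberate simplification, not the OmpSs runtime's actual algorithm: here a task simply depends on the most recent earlier task that wrote any operand it touches.

```python
def build_dependencies(tasks):
    """tasks: ordered list of (name, {operand: 'input'|'output'|'inout'}).
    Returns dependency edges (producer_task, consumer_task)."""
    last_writer = {}  # operand -> most recent task that wrote it
    edges = []
    for name, clauses in tasks:
        for operand, mode in clauses.items():
            # Any access to an operand written by an earlier task creates
            # an edge (read-after-write for input, write-after-write for
            # output/inout).
            if operand in last_writer:
                edges.append((last_writer[operand], name))
            if mode in ("output", "inout"):
                last_writer[operand] = name
    return edges

tasks = [
    ("t1", {"a": "output"}),
    ("t2", {"a": "input", "b": "output"}),
    ("t3", {"b": "inout"}),
]
```

For this annotation, t2 must follow t1 (it reads `a`) and t3 must follow t2 (it updates `b`), while independent tasks would produce no edges and could run in parallel.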
dedicated unit (see Fig. 1). Loads may be divided into two categories: external (a load needed to start task execution) and internal (a load requested by an instruction during task execution). External loads have a separate configuration memory bank that is used to read data values during configuration, concurrently while the other units are configured. The loaded values are then written to the RRB so that they can propagate to their consumers as required during the execution stage. Internal loads, in contrast, are configured in the WRB and must wait for their producer data to be generated in the engine. Once ready, the address is sent to the load unit, and the data is brought back to the RRB when successfully read. Store instructions are likewise buffered by the WRB during task execution and sent to the store unit when ready. Similarly, external "register" values to be read before and/or written after task execution are configured in the RRB and WRB memory, respectively (in the same manner as loads and stores). Note that the multiplier/divider unit itself does not need to be configured; the RRB and WRB are configured in the same manner as for loads.
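The two load categories can be summarized as a small dispatch sketch. The phase argument and the returned path strings are illustrative labels for the flows described above, not the hardware's actual interface:

```python
def issue_load(load, phase):
    """load: {'kind': 'external'|'internal'}; phase: 'config'|'execute'.
    Returns the path a load takes in the given phase, or None if it is
    not handled in that phase."""
    if load["kind"] == "external" and phase == "config":
        # External loads are read from their own configuration memory
        # bank while the other units configure; values land in the RRB.
        return "RRB"
    if load["kind"] == "internal" and phase == "execute":
        # Internal loads wait in the WRB for their producer's address,
        # then go to the load unit; the data returns to the RRB.
        return "WRB->load unit->RRB"
    return None
```

The asymmetry is the key point: external loads overlap with configuration, while internal loads are serialized behind their producers during execution.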
4 Hardware Design
maximum achievable frequency due to the simple design of the rS (Sect. 5.2.5, Frequency of Operation).