Invasive Computing for Mapping Parallel Programs to Many-Core Architectures
1st Edition
Andreas Weichslgartner

Computer Architecture and Design Methodologies

Andreas Weichslgartner
Stefan Wildermann
Michael Glaß
Jürgen Teich

Invasive Computing for Mapping Parallel Programs to Many-Core Architectures
Computer Architecture and Design
Methodologies

Series editors
Anupam Chattopadhyay, Noida, India
Soumitra Kumar Nandy, Bangalore, India
Jürgen Teich, Erlangen, Germany
Debdeep Mukhopadhyay, Kharagpur, India
The twilight zone of Moore's law is affecting computer architecture design like never
before. The strongest impact on computer architecture is perhaps the move from
unicore to multicore architectures, represented by commodity architectures like
general-purpose graphics processing units (GPGPUs). Besides that, the deep impact of
application-specific constraints from emerging embedded applications is presenting
designers with new, energy-efficient architectures such as heterogeneous multi-core,
accelerator-rich Systems-on-Chip (SoCs). These effects, together with the security,
reliability, thermal, and manufacturability challenges of nanoscale technologies, are
forcing computing platforms to move towards innovative solutions. Finally, the
emergence of technologies beyond conventional charge-based computing has led to
a series of radical new architectures and design methodologies.
The aim of this book series is to capture these diverse, emerging architectural
innovations as well as the corresponding design methodologies. The scope covers
the following:

- Heterogeneous multi-core SoCs and their design methodology
- Domain-specific architectures and their design methodology
- Novel technology constraints, such as security and fault tolerance, and their impact on architecture design
- Novel technologies, such as resistive memory, and their impact on architecture design
- Extremely parallel architectures

More information about this series at http://www.springer.com/series/15213


Andreas Weichslgartner · Stefan Wildermann · Michael Glaß · Jürgen Teich


Invasive Computing
for Mapping Parallel
Programs to Many-Core
Architectures

Andreas Weichslgartner
Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Erlangen, Bayern, Germany

Stefan Wildermann
Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Erlangen, Bayern, Germany

Michael Glaß
Embedded Systems/Real-Time Systems
University of Ulm
Ulm, Baden-Württemberg, Germany

Jürgen Teich
Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Erlangen, Bayern, Germany

ISSN 2367-3478 ISSN 2367-3486 (electronic)


Computer Architecture and Design Methodologies
ISBN 978-981-10-7355-7 ISBN 978-981-10-7356-4 (eBook)
https://doi.org/10.1007/978-981-10-7356-4
Library of Congress Control Number: 2017958628

© Springer Nature Singapore Pte Ltd. 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Acknowledgements

This work originated from within the Transregional Collaborative Research Center
89 "Invasive Computing" (abbr. InvasIC), in which a novel paradigm for the design
and resource-aware programming of future parallel computing systems is investigated.
For systems with 1000 and more cores on a chip, resource-aware programming is of
utmost importance to obtain high utilization as well as high computational and
energy efficiency, but also to achieve predictable qualities of execution of parallel
programs. The basic principle and innovation of invasive computing is to give a
programmer explicit handles to specify and argue about resource requirements
desired or required in different phases of execution.
InvasIC is funded by the Deutsche Forschungsgemeinschaft (DFG), aggregating
researchers from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU),
Karlsruher Institut für Technologie (KIT), and Technische Universität München
(TUM). Its scientific team includes specialists in parallel algorithm design, hardware
architects for reconfigurable MPSoC development, as well as language, tool,
application, and operating system designers.
At this point, we would like to thank all participating scientists of InvasIC, who
enabled and jointly contributed to the achievements of InvasIC in general and to the
results summarized in this book in particular. Our particular thanks go to the DFG
for funding InvasIC.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 (A) Decentralized Application Mapping . . . . . . . . . . . . . . 3
1.1.2 (B) Hybrid Application Mapping . . . . . . . . . . . . . . . . . . . 4
1.1.3 (C) Nonfunctional Properties . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Outline of this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Invasive Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Principles of Invasive Computing . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Invasive Programming Language . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Invade, Infect, Retreat, and Claims . . . . . . . . . . . . . . . . . . 12
2.2.2 Communication-Aware Programming . . . . . . . . . . . . . . . . 13
2.2.3 Actor Model and Nonfunctional Properties . . . . . . . . . . . . 15
2.3 Overhead Analysis of Invasive Computing . . . . . . . . . . . . . . . . . . 19
2.3.1 Invasive Speedup and Efficiency Analysis . . . . . . . . . . . . . 21
2.4 Invasive Hardware Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Invasive Tightly Coupled Processor Arrays . . . . . . . . . . . . 25
2.4.2 The Invasive Core—i-Core . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.3 Dynamic Many-Core i-let Controller—CiC . . . . . . . . . . . . 27
2.5 Invasive Network on Chip—i-NoC . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.1 Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Invasive Network Adapter—i-NA . . . . . . . . . . . . . . . . . . . 31
2.5.3 Control Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Invasive Run-Time and Operating System . . . . . . . . . . . . . . . . . . 34
2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


3 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1 Application Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Application Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Composability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.1 -Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 Self-embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Self-embedding Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Incarnations of Embedding Algorithms . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Path Load and Best Neighbor . . . . . . . . . . . . . . . . . . . . . . 64
4.2.2 Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Seed-Point Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Hardware-Based Acceleration for Self-embedding . . . . . . . . . . . . 68
4.4.1 Application Graph Preprocessing . . . . . . . . . . . . . . . . . . . 69
4.4.2 Serialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.4 Random Walk with Weighted Probabilities . . . . . . . . . . . . 77
4.5.5 Hardware-Based Self-embedding . . . . . . . . . . . . . . . . . . . 79
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5 Hybrid Application Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1 HAM Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2 Static Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.1 Composable Communication Scheduling . . . . . . . . . . . . . . 92
5.2.2 Composable Task Scheduling . . . . . . . . . . . . . . . . . . . . . . 94
5.3 Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Generation of Feasible Application Mappings . . . . . . . . . . 98
5.3.2 Optimization Objectives and Evaluation . . . . . . . . . . . . . . 99
5.4 Run-Time Constraint Solving . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.1 Constraint Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.2 Run-Time Mapping of Constraint Graphs . . . . . . . . . . . . . 102
5.4.3 Backtracking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.4 Run-Time Management and System Requirements . . . . . . . 106

5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


5.5.1 Comparison Run-Time Management . . . . . . . . . . . . . . . . . 112
5.5.2 MMKP-Based Run-Time Heuristic . . . . . . . . . . . . . . . . . . 114
5.5.3 Considering Communication Constraints . . . . . . . . . . . . . . 117
5.5.4 Objectives Related to Embeddability and
Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.5 Temporal Isolation Versus Spatial Isolation . . . . . . . . . . . . 121
5.5.6 Execution Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.5.7 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.6.1 Techniques for Static, Dynamic, and Hybrid
Application Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.6.2 Communication Models in Hybrid Application
Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6 Hybrid Mapping for Increased Security . . . . . . . . . . . . . . . . . . . . . . 137
6.1 Hybrid Mapping for Security . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.1.1 Attacker Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.1.2 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.2 Shape-Based Design-Time Optimization . . . . . . . . . . . . . . . . . . . 142
6.3 Run-Time Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3.1 First-Fit Mapping Heuristic . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.2 SAT-Based Run-Time Mapping . . . . . . . . . . . . . . . . . . . . 147
6.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.5 Region-Based Run-Time Mapping in the i-NoC . . . . . . . . . . . . . . 151
6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Abbreviations

2D Two dimensional, 25, 28, 47, 137, 138, 141, 158


AG Address generator, 26
AHB Advanced high-performance bus, 31
AMBA Advanced microcontroller bus architecture, 31
API Application programming interface, 38
BCET Best-case execution time, 46, 47, 51, 52
BE Best effort, 28, 29, 30, 32
BN Best neighbor, 65, 67, 76
bps Bits per second, 17
BRAM Block random access memory, 79
CA Cluster agent, 80
CAP Communication-aware programming, 13, 14, 19, 57, 157
CDF Cumulative distribution function, 123
CiC Dynamic many-core i-let controller, 27, 34
CPU Central processing unit, 20, 23, 35, 51
CSP Constraint satisfaction problem, 105
DAARM Design-time application analysis and run-time mapping, 4,
86, 88, 132, 158
DMA Direct memory access, 13, 31
DOP Degree of parallelism, 11, 20, 21, 22, 23
DSE Design space exploration, 4, 18, 35, 39, 68, 85
E3S Embedded system synthesis benchmarks suite, 75, 111,
114, 120, 148
EA Evolutionary algorithm, 96, 120, 127
FCFS First-come, first-served, 50
FF First free, 50
FIFO First in, first out, 31, 32, 72, 79
flit Flow control digit, 29, 31, 32, 49
FPGA Field-programmable gate array, 27, 35, 79
fps Frames per second, 16


FSM Finite state machine, 17, 33


GA Global agent, 80
GC Global controller, 26
GPU Graphics processing unit, 35
GS Guaranteed service, 28, 29, 30, 31, 49
HAL Hardware abstraction layer, 34
HAM Hybrid application mapping, 39, 45, 49, 59, 85, 86, 87, 88,
127, 128, 129, 132, 141, 148
HPC High-performance computing, 1, 24
HW Hardware, 35
I/O Input/Output, 25, 26, 139
ID Identifier, 33, 70, 72
ILP Integer linear programming, 127
IM Invasion manager, 26
i-Core Invasive core, 24, 27, 39
i-NA Invasive network adapter, 13, 25, 26, 28, 29, 30, 31, 32,
33, 79, 81
i-NoC Invasive network on chip, 3, 5, 9, 19, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 47, 59, 68, 70, 73
iRTSS Invasive run-time system, 18, 34
L1 Level 1, 24, 51
L2 Level 2, 24
LRU Least recently used, 51, 55
LUT Lookup table, 79, 154
MAC Minimal average channel load, 64, 80
MIT Massachusetts Institute of Technology, 52
MMC Minimum maximum channel load, 64
MMKP Multi-dimensional multiple choice knapsack problem, 107,
109, 110, 111, 114, 115, 117, 119
MPSoC Multiprocessor system-on-chip, 17, 21, 22, 23, 24, 52, 138,
154
MTTF Mean time to failure, 160
NA Network adapter, 47, 48, 49
NN Nearest neighbor, 80
NoC Network on chip, 3, 4, 5, 9, 14, 28, 33, 47, 48, 49, 57, 58,
59, 64
OP Operating point, 19, 89, 101, 112, 113, 114, 120, 158
OpenMP Open multi-processing, 11
OS Operating system, 9, 12, 26, 27, 28, 31, 34, 59, 95
OSAL Operating system abstraction layer, 34
P2P Point to point, 86, 88, 117, 129
PE Processing element, 26
PFH Probability of failure per hour, 16, 19
PGAS Partitioned global address space, 11, 17, 40
PiP Picture in picture, 14

PL Path load, 64, 65, 76, 81


QoS Quality-of-service, 28, 31, 48, 89, 92, 128, 131
QSDPCM Quad-tree structured differential pulse code modulation, 75
RAM Random access memory, 141
RANSAC Random sample consensus, 125
RISC Reduced instruction set computing, 14, 19, 27, 57, 58
RM Run-time management, 5, 32, 85, 86, 87, 90, 101, 102,
106, 112, 113, 114, 122, 127, 128, 133, 141, 145, 147, 158
RR Round robin, 50
RTC Real-time calculus, 103
RW Random walk, 65, 77, 78
RWW Random walk weighted, 65, 66, 67, 77
SA Simulated annealing, 127
SAT Boolean satisfiability problem, 137, 138, 147, 148, 149,
150, 151, 154
SEM Self-embedding module, 72, 73, 74, 79, 80
SIFT Scale-invariant feature transform, 125, 126
SIL Safety integrity level, 216
SL Service level, 19, 28, 31
TCB Trusted computing base, 141, 154
TCPA Tightly coupled processor array, 25, 26, 125
TDM Time division multiplexing, 50, 52, 131
TDMA Time division multiple access, 55
TGFF Task graphs for free, 55, 92, 128, 129, 131, 132, 139
TLM Tile local memory, 13, 14, 24, 31, 50
VC Virtual channel, 28, 30, 31, 32, 101, 102, 103, 141
VLIW Very long instruction word, 25
WCET Worst-case execution time, 47, 51, 52, 94, 99, 111
WCRT Worst-case response time, 50
WRR Weighted round robin, 28, 52, 92, 129, 132
XML Extensible markup language, 36
Symbols

A Attacker, 17
A Variable assignment in CSP, 105
ai Underutilization factor, 21
avgnet Average network load, 75
B Communication channel, 102, 105, 141, 148
BE Best-case execution time, 46
b Map task, 55, 59, 60, 64, 65, 67, 91, 92, 95, 100, 102, 104,
107, 111, 113, 120, 142
bCG Map task cluster of constraint graph function, 103, 105
bDSE Map task function in DSE, 102
bw Minimum required message bandwidth, 46, 47, 49, 59, 64,
71, 98
C Task cluster, 101, 102, 103, 104, 105, 112, 141
c Cost function for self-embedding algorithm, 60, 62
cap Link capacity, 49, 59, 66, 98
CL Worst-case communication latency, 93
Conf Confidentiality, 17
D Domain in CSP, 106
d Deadline, 46, 91
Dc Number of cores to invade, 23
E Set of edges of an application graph, 46, 47, 49, 50
e Edge of an application graph, 46
ECPU Overall maximal processor energy consumption of a
mapping, 99
Einc Energy consumption of all mapped operating points by
incremental RM, 114
embAlg Embedding algorithm, 60
EMMKP Energy consumption of all mapped operating points by
MMKP RM, 114
ENOC Overall maximal NoC energy consumption of a mapping,
99


Env Environment, 17
EOV Overall maximal energy consumption of a mapping, 90,
99, 107, 109
2 Conf 2-confidentiality, 17
equaltype Checks if the resource type of a tile matches a certain
resource type, 101
Erel Relative energy consumption of MMKP RM and incre-
mental RM, 114
ELbit Energy consumption of one bit in a NoC link, 99, 112
ESbit Energy consumption of one bit in a NoC router, 99, 112
ENoCbit Energy consumption of routing one bit over a NoC router,
99
gR NoC router delay, 93
f Frequency, 49, 69, 125
GAPP0 ðV; EÞ Example application graph, 46, 47, 50
GArch Short notation of architecture graph, 106
GArch0 ðU; LÞ Example architecture graph, 48
GArch ðU; LÞ Architecture graph, 47, 50, 91, 102, 141
GApp ðV; EÞ Application graph, 46, 49, 71, 91, 103, 141
GApp ðV ¼ T [ M; EÞ Application graph, 70, 71
GC Short notation of constraint graph, 106
GC ðVC ; EC Þ Constraint graph, 101, 102, 141
gettype Determines the resource type of a tile: U ! R, 47, 49, 50,
94, 96, 98, 99, 101, 144
hop Hop constraint in the constraint graph, 102, 103
Hþ Hop distance, 48, 49, 93, 102, 104
H Manhattan distance, 48, 49, 50, 66, 67, 93, 99, 102, 104
h Max hop distance in self-embedding algorithm, 60, 61, 62,
64, 65, 66, 76
Hqþ Hop distance of a route, 93, 99
Hq Manhattan distance of a route, 50, 93, 99
I Input space, 53
i Running variable, 18, 19, 20, 66
=E Invasive efficiency, 22, 23
INF Infinum, 53
INFL Best-case end-to-end latency, 125,
INFLComp Best-case tile latency, 125
INFLNoC Best-case NoC latency, 125
INFTrNoC Best-case NoC throughput, 126
=P Average number of processors utilized, 22, 23
=S Invasive speedup, 22, 23
isrouted Function evaluates whether a message is routed over the
NoC or utilizes local tile communication, 100
=T Invasive execution time, 22, 23

j Running variable, 114


k Number of bits of a head flit of a serialized task graph, 71
Kmax Maximum number of tasks for schedule, 94, 95, 96, 102,
104, 105, 120
L Worst-case end-to-end latency of a path, 92
L Worst-case end-to-end latency of an application, 90, 91,
98, 125, 126
L Set of NoC links, 48, 49, 50, 64, 75, 102, 105
l NoC link, 48, 49, 50, 59, 60, 64, 75, 98, 143
k Lagrangian multipliers, 109, 111
load Load induced of a task onto a tile, 59, 66, 70, 102, 104
LW Link width, 49, 72, 125
M Set of messages of an application graph, 46, 48, 60, 61, 63,
70, 71, 93, 97, 98, 100, 147
m Degree of parallelism, 20, 21
m Message of an application graph, 46, 47, 48, 60, 63, 64, 65,
70, 76,82, 92, 100, 122, 124, 147, 148
MC Communication channels, 102, 103, 106, 141, 148
n Number of cores, 20, 21, 22
n Number of applications, 106, 107, 108, 111, 118
nf Number of flits, 93
o Objective, 53, 54, 99, 100,138, 139, 143, 144, 149
obj Number of detected objects in the Harris corner algorithm,
125
Ot Invasive overhead function, 21
Ot þ Invasive overhead function for invade and infect, 22
Ot Invasive overhead function for retreat and infect, 22
Oinfect Invasive overhead function for infect, 23
Oinvade Invasive overhead function for invade, 23
Oretreat Invasive overhead function for retreat, 23
P Period of the application, 46, 49, 98, 102
p Program, 52, 53
path Path in mapping, 91
pathLoad Cost function for self-embedding which evaluates a NoC
path, 83
paths Set of paths, 91
power Power function returns the power consumption of a
processing core, 99
pr Priority of a task, 94, 95, 96, 101
pred Function which returns the predecessor vertex in a graph,
62
PredT Function which returns the predecessor tasks, 95, 106
hpr(t), 8t 2 Ci List of priorities, 102
P Si Set of incarnations of a shape of application i, 147

PS Set of incarnations of a shape, 144, 146, 148


PS Incarnation of a shape, 145, 147
PS SAT variable for shape, 147
P1S Incarnation of shape, 142
P2S Incarnation of shape, 142
W Mapping step in self-embedding algorithm, 62
Q State space, 53, 54
R Set of resource types, 47, 48, 101, 102, 108, 109, 110, 102,
125
r Resource type, 47, 48, 49, 50, 89, 90, 98, 99, 100, 102,
125, 142, 180
ratelink Link utilization, 64, 75
req Task memory requirements, 59, 70
res Tile memory resources, 57
q Routing of a message, 52, 59, 64, 91, 92, 93, 94, 95, 96,
97, 98, 99, 100, 101, 102, 106, 107, 108, 111, 143, 145,
150
qCG Routing of a message cluster of a constraint graph, 102,
104, 105
S Speedup, 20
S Set of shapes, 142, 144, 145, 147
S Shape, 142, 143, 144, 145, 146, 147, 148
SI Scheduling interval, 94, 95, 96, 98, 102, 120
SIos Scheduling interval os overhead, 94, 95, 96, 98, 102, 120
size Message size 46, 71, 99
SL Service level in the NoC, 49, 52, 70, 76, 84, 96, 110, 116,
120, 128, 129, 130, 131, 132, 140, 155, 180, 190
SLmax Number of scheduling intervals on the NoC, 52, 43, 44, 67,
68, 94, 97, 80, 115, 119, 120, 124, 125, 126, 128, 129,
130, 131, 154, 155, 175
Soft Software, 17
succ Successor function returns all successors of a vertex, 60,
71, 106
SUP Supremum, 53
SUPL Worst-case end-to-end latency, 114
SUPLComp Worst-case tile latency, 125
SUPLNoC Worst-case NoC latency, 125
SUPTrNoC Worst-case NoC throughput, 125
T Execution time, 20, 21, 23
T Set of tasks of an application graph, 46, 47, 50, 59, 60,62,
66, 71, 79, 91, 98, 99, 100, 101, 102, 143, 17
t Task of an application graph, 64, 65, 68, 69, 81, 82, 83, 84,
86, 87, 89, 98, 93, 94, 96, 97, 115, 116, 117, 118, 121,
122, 123, 124, 125, 126, 128, 129, 132, 154, 177, 179

s Cycle length, 50, 94


TC Task clusters, 102, 103, 104, 105, 141
hr Instances of resource type available, 108, 109, 110
TL Worst-case computing latency, 91, 94, 95, 96
tr Throughput, 49
type Resource type function: T ! R, 47, 71, 91
typeCG Type constraint of a constraint graph, 102, 116
U Set of tiles of an architecture graph, 48, 49, 59, 64, 65, 66,
98, 103, 144, 145, 147
u SAT variable for the tile u, 146
u Tile, 46, 47, 48, 59, 60, 62, 65, 66, 67, 73, 75, 88, 89, 94,
97, 98, 101, 105, 107, 144, 145, 146, 149
V Set of application graph vertices, 45
v Number of bits of a serialized task, 67, 69
random Function which returns a normally distributed random value
within the specified bounds, 66, 67
valid Function which determines if a tile u is valid, i.e., has valid
coordinates, 66, 67
W Amount of work, 20, 21
WE Worst-case execution time, 47, 49, 50, 60, 95, 96, 97, 98,
99
WR Worst-case response time, 51, 52
v NoC width, 48, 49, 60, 64, 66, 76, 144
x X-coordinate, 47, 66, 144, 145
X Set operating point, 106, 110, 111
x Operating point, 106, 107
N Temporal interference, 50
Y NoC height, 48, 50, 59, 64, 66, 76, 144
y Y-coordinate, 48, 66, 144, 146
z Number of bits of a serialized task, 71
Abstract

In recent years, heterogeneous multi- and many-core systems have emerged as the
architectures of choice to harness the performance potential of the ever-increasing
transistor density of modern semiconductor chips. As traditional communication
infrastructures such as shared buses or crossbars do not scale for these architectures,
networks-on-chip (NoCs) have been proposed as a novel communication paradigm.
However, these NoC-based many-core architectures require a different way of
programming, new operating systems, compilers, etc. Invasive computing addresses
these issues and combines research on a holistic solution from software to hardware,
for current and future many-core systems. Using the three invasive primitives
invade, infect, and retreat, the application developer can exclusively claim resources,
use them for parallel execution, and make them available for other applications
after the computation is finished.
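The three-phase resource lifecycle can be sketched schematically. The book's actual programming interface is the X10-based invasive language; the Python toy model below, including the names ResourceManager, Claim, invade, infect, and retreat signatures, is an illustrative assumption only, not the real API:

```python
# Toy model of the invade/infect/retreat lifecycle (names and signatures
# are hypothetical; the real interface is an X10-based language).
from concurrent.futures import ThreadPoolExecutor

class Claim:
    """A set of exclusively reserved processing cores."""
    def __init__(self, cores):
        self.cores = cores

class ResourceManager:
    def __init__(self, total_cores):
        self.free_cores = list(range(total_cores))

    def invade(self, requested):
        """Phase 1: exclusively claim up to `requested` cores."""
        granted, self.free_cores = (self.free_cores[:requested],
                                    self.free_cores[requested:])
        return Claim(granted)

    def infect(self, claim, ilet, data):
        """Phase 2: run the program snippet (i-let) on the claimed cores."""
        with ThreadPoolExecutor(max_workers=len(claim.cores)) as pool:
            return list(pool.map(ilet, data))

    def retreat(self, claim):
        """Phase 3: release the claimed cores for other applications."""
        self.free_cores.extend(claim.cores)
        claim.cores = []

rm = ResourceManager(total_cores=8)
claim = rm.invade(4)                                   # invade
squares = rm.infect(claim, lambda x: x * x, range(4))  # infect
rm.retreat(claim)                                      # retreat
```

The sketch only captures the phase structure: resources are exclusive between invade and retreat, which is what enables resource-aware, composable execution.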
In the realm of invasive computing, this book proposes methodologies to map
applications, i.e., invading computing and network resources for a static application
graph, to NoC-based many-core architectures.
The first method is called self-embedding and utilizes the inherent task-level
parallelism of applications, modeled by application graphs, by distributing the
mapping process to the different tasks. Each task is responsible for mapping its
direct successor task in the application graph and the communication towards it.
Exploring only the status and resource availability of the mapping task's local
neighborhood, instead of performing a global search, makes this application mapping
highly scalable while offering competitive quality in terms of NoC utilization.
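The local, per-task decision can be illustrated with a small sketch on a 2D mesh. The neighborhood bound, the cost function (least load, then nearest by Manhattan distance), and all names below are simplified assumptions, not the embedding algorithms evaluated in Chapter 4:

```python
# Sketch of self-embedding on a 2D mesh: each mapped task places its
# direct successor on a tile within a bounded hop distance, using only
# local load information instead of a global search (simplified model).
def neighborhood(tile, width, height, max_hops):
    x, y = tile
    return [(i, j) for i in range(width) for j in range(height)
            if abs(i - x) + abs(j - y) <= max_hops]  # Manhattan distance

def self_embed(task_chain, seed, load, width, height, max_hops=2):
    """Map a task chain; each task chooses the tile of its successor."""
    mapping = {task_chain[0]: seed}
    load[seed] += 1
    here = seed
    for task in task_chain[1:]:
        # Local decision: least-loaded tile, ties broken by distance.
        here = min(neighborhood(here, width, height, max_hops),
                   key=lambda u: (load[u],
                                  abs(u[0] - here[0]) + abs(u[1] - here[1])))
        mapping[task] = here
        load[here] += 1
    return mapping

load = {(i, j): 0 for i in range(4) for j in range(4)}  # idle 4x4 mesh
m = self_embed(["t0", "t1", "t2"], seed=(0, 0), load=load,
               width=4, height=4)
```

Because every decision inspects only a constant-size neighborhood, the cost per mapped task is independent of the mesh size, which is the source of the scalability claimed above.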
The second contribution of this book targets guarantees on non-functional
execution properties of applications. While self-embedding maps applications in a
distributed, scalable, and fast manner, it is performed strictly at run time and
does not conduct any analysis, which is indispensable for predictable execution. As a
remedy, we propose a novel HAM methodology which combines
compute-intensive analysis at design time with run-time decision making. The
design-time analysis is performed during a DSE which aims to find Pareto-optimal
mappings regarding the multi-objective optimization criteria such as timing, energy
consumption, or resource usage. With the concept of composability, applications
can be analyzed individually and then combined during run time to arbitrary


application mixes. Composability is enabled through spatial or temporal isolation


on computing and communication resources. As an intermediate representation to
hand over Pareto-optimal mappings to the RM, we propose constraint graphs which
abstract from the concrete mapping and give general rules to find mappings that
adhere to the analyzed non-functional properties. To map these constraint graphs to
the architecture, we propose a backtracking algorithm. In contrast to related work,
which neglects or simplifies NoC communication, the proposed approach performs
a true multi-objective DSE with a detailed model of a packet-switched NoC.
As a third contribution, the book introduces methodologies to incorporate
security and communication reliability into the introduced HAM flow. To prevent
side-channel attacks in packet-switched NoCs, we propose the total spatial isolation
of applications. To this end, we introduce so-called shapes, which encapsulate
the computation and communication of an application for isolated
execution. To prevent the fragmentation of the system, these shapes are optimized
during a DSE. The run-time mapping can then be performed by an extended
backtracking algorithm or a region-based mapping algorithm. For the latter, the book
proposes fast heuristics and exact mechanisms based on Boolean satisfiability (SAT). Further, this book
investigates disjoint-path and adaptive routing algorithms and their implications for
HAM. If two communicating tasks are mapped to the same row or column of a 2D
mesh NoC, there is no minimal communication over two disjoint paths. In
consequence, we incorporate a location constraint into the DSE and extend the
constraint graph. We also show that adaptive routing algorithms may profit from
increased adaptivity under such a location constraint.
Chapter 1
Introduction

Abstract One of the most important trends in computer architecture in recent years
is the paradigm shift toward multi- and many-core chips. This chapter outlines the
implications and challenges of future many-core architectures and gives an overview
of the book’s contributions.

One of the most important trends in computer architecture in recent years is the
paradigm shift toward multi- and many-core chips. Until the year 2005, the per-
formance gain of new processor generations mainly stemmed from advances in the
microarchitecture and an increased frequency (see Fig. 1.1).
Then, frequency scaling reached its limit and additional performance gains by
improving the core architecture would result in a huge increase of power
consumption [5]. As Moore’s law still holds, the number of transistors keeps
increasing exponentially. These additional transistors contribute best to a higher
performance when
used for increasing the core count. By exploiting parallelism, multiple “weaker”
cores can outperform a single core. To accelerate programs which cannot profit from
parallelism, specialized hardware blocks (e.g., for cryptography, signal processing),
or mixtures of powerful and weaker processors (e.g., ARM big.LITTLE) can
be used. This heterogeneity might also help to circumvent the problem of dark
silicon [9]. The term dark silicon describes the fact that not all transistors on a chip
can be utilized concurrently because of power density limits, as otherwise the
temperature would exceed safe bounds. As a direct consequence, some parts of the
chip perform no computation at all and stay “dark.” Overall, heterogeneous
many-core architectures seem
to be the most promising solution to cope with the aforementioned problems. This
affects all markets and branches ranging from high-performance computing (HPC),
over gaming and mobile devices, to even processors in the automotive and embedded
sector. Targeting the HPC market, Intel’s latest many-core generation, the Intel
Xeon Phi Knights Landing [12], offers a chip with 72 Atom cores. Also, the leading
supercomputer TaihuLight of the Top 500 list [13] incorporates clusters of many-
core systems. Altogether, the system consists of 40,960 nodes where each node is an
SW26010 processor with 260 cores [6]. TaihuLight not only outperforms all other
systems which rely on processors with fewer cores or use acceleration of graphics

© Springer Nature Singapore Pte Ltd. 2018


A. Weichslgartner et al., Invasive Computing for Mapping
Parallel Programs to Many-Core Architectures, Computer Architecture
and Design Methodologies, https://doi.org/10.1007/978-981-10-7356-4_1
[Figure 1.1: log-scale plot over the years 1970–2010 of transistor count, performance (SPECint), power (W), frequency (MHz), and number of cores]
Fig. 1.1 Development of processor architectures over the last decades. While the frequency and
the power have saturated, the number of transistors still increases exponentially and performance
gains result mainly from an increased core count (cf. [1]; plotted with data from [11])

processing units (GPUs) but is also more energy-efficient than other supercomput-
ers. The TILE-Mx100 from Mellanox (previously Tilera) incorporates 100 ARMv8
cores on a single chip [7]. The company markets the chip for the networking and
data center area. Also, academic research aims at massive many-core chips. In [3],
the design and implementation of KiloCore, a chip with 1,000 processing cores, is
presented.
It can be observed that the aforementioned chips do not use specially designed and
sophisticated processor cores but rather employ already developed energy-efficient
cores from the embedded domain. The design focus shifts to the so-called uncore.
Uncore describes everything on a chip which is not a processing core, e.g., last-level
cache, communication infrastructure, and memory controller. Obviously,
conventional single arbitrated buses or crossbars do not scale up to thousands of
cores. Networks on chip (NoCs) with regular structures and simple building blocks
have therefore emerged as an easily extensible communication infrastructure [2].
As these computing systems pervade our daily life more and more, ranging
from industrial automation over transportation to the internet of things and smart
devices, requirements with respect to nonfunctional execution properties drastically
increase. A functionally correct execution of a program alone often no longer
suffices. Nowadays, the energy consumption or a predictable execution time of a
program already plays an important role. Especially for mobile and embedded
devices, a small power
footprint is crucial. For example, the uncore already consumes over 50% of a
processor’s power budget [4]. But other nonfunctional execution properties also
gain more and more importance: In safety-critical environments, e.g., automotive
or aerospace, hard real-time requirements are a prerequisite. Additionally, to meet
certain safety standards, e.g., safety integrity level (SIL), programs and
communication may be executed redundantly. Even in non-safety-critical
environments, nonfunctional execution properties gain importance. For example, the user
wants to have a minimum video throughput and quality for a paid streaming service
and has high demands on privacy and security of his/her data and programs.
In summary, modern chip architectures comprise more and more heterogeneous
computing cores interconnected with a NoC. To efficiently exploit the computational
performance of these systems while considering nonfunctional execution properties
is one of the major challenges in computer science today. To tackle these problems,
Teich proposed invasive computing [14]. Invasive computing gives the appli-
cation programmer the possibility to invade resources, according to her/his specified
constraints, infect the returned claim of resources with program code, and retreat
from these resources after the computation is finished. Invasive computing com-
prises various aspects of computing, ranging from language development to invasive
hardware. One aspect is mapping applications onto many-core architectures. This is
a challenging task, especially when considering nonfunctional execution properties,
resource utilization, and short mapping times.

1.1 Contributions

The book at hand investigates and proposes novel application mapping
methodologies. As detailed in Fig. 1.2, application mapping can take different
amounts of time and fulfill various nonfunctional execution requirements.
The main contributions of this book may be summarized as follows:
(A) An approach to decentrally map applications to NoC architectures with a focus
on communication [16] and the possibility of hardware acceleration inside NoC
routers [18].
(B) A hybrid application mapping flow that enables combining the strengths of static
analysis and dynamic adaptivity [15, 17, 19, 21, 22].
(C) Assuring nonfunctional properties such as timeliness [15, 17, 19] and secu-
rity [8, 20].

1.1.1 (A) Decentralized Application Mapping

Application graphs, besides other application characteristics, express the task-level
parallelism of applications. This parallelism can also be exploited during the
mapping process. The concept of self-embedding [16] describes a class of distributed
algorithms where each task is responsible for mapping its succeeding tasks and the
communication in between. Also, these algorithms are highly scalable as they do not
require global knowledge and make their mapping decision based on local informa-
tion. Dedicated hardware modules, attached to each network router inside an invasive
network on chip (i-NoC) [10], have direct access to the i-NoC link utilization and
can accelerate the self-embedding [18].
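As an illustration, the core idea of self-embedding — every mapped task placing its direct successors using only local neighborhood information, with no global search — can be sketched as follows. This is a deliberately simplified Python sketch with hypothetical names; it is not the book's i-NoC implementation and ignores link bandwidth, heterogeneity, and the hardware-assisted variant of [18]:

```python
from collections import deque

def neighborhood(tile, radius, width, height):
    """Tiles within Manhattan distance `radius` of `tile` on a 2D mesh."""
    x, y = tile
    return [(i, j) for i in range(width) for j in range(height)
            if 0 < abs(i - x) + abs(j - y) <= radius]

def self_embed(app_graph, root_task, start_tile, free_tiles, width, height, radius=2):
    """Greedy self-embedding sketch: starting from the root task, every mapped
    task maps its direct successors onto the closest free tile found in its own
    local neighborhood only (no global knowledge required)."""
    mapping = {root_task: start_tile}
    free = set(free_tiles) - {start_tile}
    queue = deque([root_task])
    while queue:
        task = queue.popleft()
        tile = mapping[task]
        for succ in app_graph.get(task, []):
            if succ in mapping:
                continue
            # Only the local neighborhood of the mapping task is explored.
            candidates = [t for t in neighborhood(tile, radius, width, height)
                          if t in free]
            if not candidates:
                raise RuntimeError(f"no free tile near {tile} for task {succ}")
            x, y = tile
            best = min(candidates, key=lambda t: abs(t[0] - x) + abs(t[1] - y))
            mapping[succ] = best
            free.discard(best)
            queue.append(succ)
    return mapping
```

Because each placement decision touches only a bounded neighborhood, the cost per task is constant in the mesh size, which is what makes the scheme scale; the price is that a locally full neighborhood can fail even though free tiles exist elsewhere.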

[Figure 1.2: overview diagram — Chap. 2: introduction to invasive computing and invasive programming; overview of invasive architectures and hardware. Chap. 3: formal models for applications and architectures and fundamentals. (Hybrid) application mapping: Chap. 4: fast mapping heuristic (self-embedding) with hardware support; Chap. 5: hybrid application mapping methodology; Chap. 6: hybrid application mapping for security-critical applications]
Fig. 1.2 Overview of the structure and the contributions of this book. Chapters 2 and 3 introduce
the required context and fundamentals while Chaps. 4–6 present the contributions in the area of
application mapping

1.1.2 (B) Hybrid Application Mapping

Dynamic application mapping algorithms have a limited time budget to find a suitable
mapping. Hence, they cannot perform extensive formal analyses to determine bounds
on nonfunctional properties to ensure a predictable program execution. In contrast,
static approaches are unable to react to run-time events or to changes in the
composition of the executed applications (i.e., inter-application scenarios). As the
number of possible scenarios is exponential in the number of applications,
scenario-based approaches suffer from poor scalability. In contrast to existing hybrid application
mapping approaches, this book proposes the design-time application analysis and
run-time mapping (DAARM) design flow, which is capable of exploring multiple
objectives rather than only timing and energy. Most existing approaches also sim-
plify the NoC communication in their analysis and the run-time mapping process.
For more realistic results, a detailed model of the invasive NoC [10] for latency,
throughput, link utilization, and energy is an integral part of the DAARM design
flow [17, 19]. During a design space exploration (DSE) at design time, infeasible
mappings which overutilize NoC resources can already be discarded. Only feasible
mappings, together with the explored objectives, are handed over to the run-time
management (RM). As an intermediate representation, the book at hand proposes
the notion of a constraint graph. This graph encodes all constraints which need to
hold for the run-time mapping so that it adheres to the objectives evaluated at design
time. To perform the run-time mapping of this constraint graph, this book proposes
a backtracking algorithm.
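The run-time step can be conveyed with a toy backtracking search. In this hypothetical, greatly simplified model (not the book's actual algorithm), constraint-graph vertices demand a tile type and edges bound the hop distance between the tiles of communicating tasks; the search assigns vertices to tiles depth-first and undoes assignments on failure:

```python
def backtrack_map(vertices, demands, edges, tiles, tile_types):
    """Map constraint-graph vertices to mesh tiles by depth-first backtracking.
    demands[v] : required tile type for vertex v
    edges      : {(u, v): max_hops} hop-distance constraints between vertices
    tile_types : {tile: type}; tiles are (x, y) mesh coordinates."""
    def hops(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def consistent(v, tile, partial):
        if tile_types[tile] != demands[v]:
            return False
        for (u, w), max_h in edges.items():
            other = w if u == v else u if w == v else None
            if other in partial and hops(tile, partial[other]) > max_h:
                return False
        return True

    def search(i, partial, used):
        if i == len(vertices):
            return dict(partial)        # all vertices placed
        v = vertices[i]
        for tile in tiles:
            if tile not in used and consistent(v, tile, partial):
                partial[v] = tile
                used.add(tile)
                result = search(i + 1, partial, used)
                if result is not None:
                    return result
                del partial[v]          # undo and try the next tile
                used.discard(tile)
        return None                     # triggers backtracking in the caller

    return search(0, {}, set())
```

If the search returns `None`, no mapping satisfying the analyzed constraints exists on the current resource state, and the RM can reject or defer the application — exactly the admission decision the constraint graph is meant to enable.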

1.1.3 (C) Nonfunctional Properties

As detailed before, the proposed hybrid application mapping approach makes it
possible to give upper bounds for nonfunctional execution properties. These
properties are derived by static analysis but hold true even in the context of
dynamic run-time mapping.
In this book, we consider the following objectives: (a) timing (best-case/worst-case
end-to-end latency) [15, 17, 19] (see Chap. 5), (b) energy consumption [17, 19] (see
Chap. 5), and (c) security (spatial isolation of communication and computation) [8,
20] (see Chap. 6). We present the needed analysis models and methodologies to
integrate these nonfunctional execution properties and investigate the implications
on the mapping process.

1.2 Outline of this Book

This book is organized as follows. Chapter 2 introduces the main principles of


invasive computing and gives an overview of an invasive programming language
(invadeX10) and invasive hardware (invasive core (i-Cores), invasive network on
chip (i-NoC), and tightly coupled processor arrays (TCPAs)). A special focus of this
work lies on the i-NoC which is the communication backbone of invasive architec-
tures, and its composable nature plays an integral part in the hybrid application map-
ping (HAM) methodology proposed by this book. In Chap. 3, the underlying models
for application mapping are introduced. Further, the concepts of composability and
predictability are detailed. Chapter 4 describes the concepts of self-embedding for
applications without strict deadlines and a possible hardware acceleration. The hybrid
application mapping methodology is the center of Chap. 5. Afterwards, Chap. 6 gives
details on how spatially isolated mapping can be integrated into the HAM methodology
to consider the nonfunctional execution property security. Finally, Chap. 7 concludes
the book and outlines possible future work.

References

1. Batten C (2014) Energy-efficient parallel computer architecture. www.csl.cornell.edu/cbatten/
pdfs/batten-xloops-afrl2014.pdf. Accessed 08 Aug 2016
2. Benini L, Micheli GD (2002) Networks on chips: a new SoC paradigm. IEEE Comput 35(1):70–
78. https://doi.org/10.1109/2.976921
3. Bohnenstiehl B, Stillmaker A, Pimentel J, Andreas T, Liu B, Tran A, Adeagbo E, Baas B (2016)
KiloCore: a 32 nm 1000-processor array. In: Proceedings of the IEEE HotChips Symposium
on High-Performance Chips (HotChips). IEEE. https://doi.org/10.1109/HOTCHIPS.2016.
7936218
4. Cheng H, Zhan J, Zhao J, Xie Y, Sampson J, Irwin MJ (2015) Core vs. uncore: The heart
of darkness. In: Proceedings of the Design Automation Conference (DAC). ACM, pp 121:1–
121:6. https://doi.org/10.1145/2744769.2747916
5. Danowitz A, Kelley K, Mao J, Stevenson JP, Horowitz M (2012) CPU DB: recording micro-
processor history. Commun ACM 55(4):55–63. https://doi.org/10.1145/2133806.2133822
6. Dongarra J (2016) Report on the Sunway TaihuLight system. Technical Report, University of
Tennessee
7. Doud B (2015) Accelerating the data plane with the TILE-Mx manycore processor. www.
tilera.com/files/drim__EZchip_LinleyDataCenterConference_Feb2015_7671.pdf. Accessed
26 April 2016
8. Drescher G, Erhardt C, Freiling F, Götzfried J, Lohmann D, Maene P, Müller T, Verbauwhede
I, Weichslgartner A, Wildermann S (2016) Providing security on demand using invasive com-
puting. it - Inf Technol 58(6):281–295. https://doi.org/10.1515/itit-2016-0032
9. Esmaeilzadeh H, Blem ER, Amant RS, Sankaralingam K, Burger D (2012) Dark silicon and
the end of multicore scaling. IEEE Micro 32(3):122–134. https://doi.org/10.1109/MM.2012.
17
10. Heisswolf J, Zaib A, Weichslgartner A, Karle M, Singh M, Wild T, Teich J, Herkersdorf A,
Becker J (2014) The invasive network on chip - a multi-objective many-core communication
infrastructure. In: Proceedings of the International Workshop on Multi-Objective Many-Core
Design (MOMAC). VDE, pp 1–8. http://ieeexplore.ieee.org/document/6775072/
11. Rupp K (2015) 40 years of microprocessor trend data. https://www.karlrupp.net/2015/06/40-
years-of-microprocessor-trend-data/
12. Sodani A, Gramunt R, Corbal J, Kim H, Vinod K, Chinthamani S, Hutsell S, Agarwal R, Liu Y
(2016) Knights landing: second-generation Intel Xeon Phi product. IEEE Micro 36(2):34–46.
https://doi.org/10.1109/MM.2016.25
13. Strohmaier E, Dongarra J, Simon H (2016) Top 10 sites for June 2016. http://www.top500.org/
lists/2016/06/. Accessed 18 Jul 2016
14. Teich J (2008) Invasive algorithms and architectures. it - Inf Technol 50(5):300–310. https://
doi.org/10.1524/itit.2008.0499
15. Teich J, Glaß M, Roloff S, Schröder-Preikschat W, Snelting G, Weichslgartner A, Wildermann
S (2016) Language and compilation of parallel programs for *-predictable MPSoC execu-
tion using invasive computing. In: Proceedings of the International Symposium on Embedded
Multicore/Many-core Systems-on-Chip. IEEE, pp 313–320. https://doi.org/10.1109/MCSoC.
2016.30
16. Weichslgartner A, Wildermann S, Teich J (2011) Dynamic decentralized mapping of tree-
structured applications on NoC architectures. In: Proceedings of the International Symposium
on Networks-on-Chip (NOCS). ACM, pp 201–208. https://doi.org/10.1145/1999946.1999979,
http://ieeexplore.ieee.org/document/5948565/
17. Weichslgartner A, Gangadharan D, Wildermann S, Glaß M, Teich J (2014) DAARM: Design-
time application analysis and run-time mapping for predictable execution in many-core systems.
In: Proceedings of the Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS). ACM, pp 34:1–34:10. https://doi.org/10.1145/2656075.2656083
18. Weichslgartner A, Heisswolf J, Zaib A, Wild T, Herkersdorf A, Becker J, Teich J (2015) Position
paper: Towards hardware-assisted decentralized mapping of applications for heterogeneous
NoC architectures. In: Proceedings of the International Workshop on Multi-Objective Many-
Core Design (MOMAC). VDE, pp 1–4. http://ieeexplore.ieee.org/document/7107099/
19. Weichslgartner A, Wildermann S, Gangadharan D, Glaß M, Teich J (2017) A design-time/run-
time application mapping methodology for predictable execution time in MPSoCs. ArXiv
e-prints pp 1–30, arXiv: 1711.05932
20. Weichslgartner A, Wildermann S, Götzfried J, Freiling F, Glaß M, Teich J (2016) Design-
time/run-time mapping of security-critical applications in heterogeneous MPSoCs. In: Proceed-
ings of the Conference on Languages, Compilers and Tools for Embedded Systems (SCOPES).
ACM, pp 153–162. https://doi.org/10.1145/2906363.2906370
21. Wildermann S, Weichslgartner A, Teich J (2015) Design methodology and run-time manage-
ment for predictable many-core systems. In: Proceedings of the Workshop on Self-Organizing
Real-Time Systems (SORT). IEEE, pp 103–110. https://doi.org/10.1109/ISORCW.2015.48
22. Wildermann S, Bader M, Bauer L, Damschen M, Gabriel D, Gerndt M, Glaß M, Henkel J,
Paul J, Pöppl A, Roloff S, Schwarzer T, Snelting G, Stechele W, Teich J, Weichslgartner A,
Zwinkau A (2016) Invasive computing for timing-predictable stream processing on MPSoCs.
it - Inf Technol 58(6):267–280. https://doi.org/10.1515/itit-2016-0021
Chapter 2
Invasive Computing

Abstract As this book originates in the context of invasive computing, this chapter
gives an overview of the invasive computing paradigm and its realization in software
and hardware. It starts with its basic principles and then gives an overview of how
the paradigm is expressed at the language level. Afterwards, a formal definition
and analysis of invasive speedup and efficiency according to Teich et al. is given.
For the formal analysis of individual application programs independently of each
other through composability, presented in the later chapters of this book, it is a
prerequisite to consider an actual invasive hardware architecture. Therefore, a tiled
invasive architecture with its building blocks is detailed with a focus on the invasive
network on chip (i-NoC).
Finally, a brief description of the employed operating system is given before other
approaches which deal with heterogeneous many-core systems are reviewed.

Efficiently leveraging the performance of future many-core systems is one of
the key challenges of our days, as motivated in Chap. 1. One approach to tackle this
challenge in a holistic manner is invasive computing [50, 52]. As this book originates
in the context of invasive computing, the following chapter gives a broad overview
of the whole paradigm and its realization in software and hardware. Section 2.1
starts with the principles of invasive computing and Sect. 2.2 gives an overview of
the expression of the invasive paradigm at the language level. Afterwards, Sect. 2.3
gives a formal definition and analysis of invasive speedup and efficiency according
to [53]. For the formal and composability-exploiting analysis presented in Chap. 5, it
is a prerequisite to consider an actual invasive hardware architecture. Therefore, we
detail tiled invasive architectures and their building blocks in Sect. 2.4 with a focus
on the invasive network on chip (i-NoC) in Sect. 2.5. Finally, we briefly describe
the operating system (OS) in Sect. 2.6 and review other approaches which deal with
heterogeneous many-core systems in Sect. 2.7.

2.1 Principles of Invasive Computing

Future and even today’s many-core systems come with various challenges
and obstacles. Namely, programmability, adaptivity, scalability, physical constraints,
reliability, and fault-tolerance are mentioned in [52]. These issues motivate the new
computing paradigm invasive computing, first proposed by Teich in [50], which
introduces resource-aware programming. This gives the application programmer the
possibility to distribute the workload of the application based on the availability
and status of the underlying hardware resources. In [52], Teich et al. define invasive
computing as follows:

Definition 2.1 (invasive programming) “Invasive programming denotes the capa-
bility of a program running on a parallel computer to request and temporarily claim
processor, communication and memory resources in the neighborhood of its actual
computing environment, to then execute in parallel the given program using these
claimed resources, and to be capable to subsequently free these resources again.”

In contrast to statically mapped applications, resources are only claimed when they
are actually needed and are available for other applications after they are freed. This
increases the resource utilization drastically and hence the efficiency (for a formal
analysis of the invasive efficiency, see Sect. 2.3). Also, each application can adapt
itself to the amount and types of available resources. For example, if there are more
computing resources available, it can utilize a higher degree of parallelism. Or, if
there is a special accelerator module available, the programmer can use this resource
to execute an implementation variant of the algorithm which is tailored for exactly
this accelerator. Additionally, the application can retreat from resources which are
becoming too hot or unreliable [22]. All this is done in a decentralized manner and
is thus highly scalable, which is crucial for systems with 1,000 cores and more.
Invasive computing relies on three basic primitives: invade, infect, and retreat.
Their typical state transitions are depicted by the chart in Fig. 2.1. First, an
initial claim is assembled by issuing an invade call. A claim can consist of
computing resources such as processor cores, communication (e.g., NoC bandwidth),
and memory (e.g., caches, scratch pads). Subsequently, infect starts the application’s
code on the allocated cores of the claim. After the execution finishes, the claim size
can be increased by issuing another invade, also known as re-invade, or decreased by
a retreat, also known as a partial retreat. It is also possible to call infect, or so-called
reinfect, with another application on the same claim. After the program execution
terminates, the retreat primitive frees the claim and makes the resources available to
other applications.

[Figure 2.1: state transitions of an invasive program: start → invade → infect → retreat → exit]
Fig. 2.1 State chart of an invasive program (cf. [22])
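In code, this lifecycle can be sketched as follows. This is an illustrative Python sketch only — the real primitives are provided by the invadeX10 runtime described in Sect. 2.2, and every class and method name here is hypothetical:

```python
class Claim:
    """Illustrative claim: a set of exclusively held processing elements (PEs)."""
    def __init__(self, system, pes):
        self.system, self.pes = system, pes

    def infect(self, kernel, *args):
        # infect: run the application code on every claimed resource.
        return [kernel(pe, *args) for pe in self.pes]

    def reinvade(self, extra):
        # re-invade: try to grow the claim by `extra` resources.
        self.pes += self.system.invade(extra).pes

    def retreat(self):
        # retreat: free all resources so other applications can claim them.
        self.system.free |= set(self.pes)
        self.pes = []

class System:
    def __init__(self, n_pes):
        self.free = set(range(n_pes))

    def invade(self, n):
        # invade: exclusively claim n free processing elements.
        if len(self.free) < n:
            raise RuntimeError("invade failed: not enough free resources")
        pes = [self.free.pop() for _ in range(n)]
        return Claim(self, pes)

system = System(8)
claim = system.invade(4)                        # invade: claim 4 PEs
results = claim.infect(lambda pe, x: x * x, 3)  # infect: run code on the claim
claim.retreat()                                 # retreat: free the resources
```

The essential point mirrored from the definition above is exclusivity: resources held by one claim are unavailable to every other application until retreat (or a partial retreat) returns them to the free pool.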


With these invasive primitives, different kinds of applications are supported. In
the following, we present application classes which may profit from the invasive
computing paradigm:
• Applications with dynamic degree of parallelism (DOP): Depending on the phase
of the algorithm, the degree of parallelism (DOP) can vary and the application
programmer specifically requests the number and type of cores. These kinds of
applications are the target of the analysis in Sect. 2.3.1.
• Malleable applications: The application can vary its DOP almost arbitrarily.
Typically, this kind of application is equipped with a speedup or hint curve which
specifies the performance gain with respect to the DOP. The system can then
maximize the average speedup among all malleable applications. For example, the
multi-agent system DistRM performs this kind of optimization in a decentralized
way [35]. Further, Wildermann et al. showed, based on game-theoretical analysis,
that local and decentralized core allocation schemes for malleable applications
converge to an optimum [57].
• Static application graphs: Applications with strict real-time requirements need
to be statically analyzed. To do so, the data and control flow need to be known
at design time. Thus, for this kind of application (see Sect. 3.1), the invasion
is performed on a static graph structure rather than by dynamically changing the
DOP. Static application graphs build the foundation of the mapping methodologies
presented in Chaps. 4–6.
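For malleable applications, the speedup-maximizing allocation mentioned above can be illustrated with a toy example: given per-application speedup curves, an allocator hands out cores one at a time to whichever application gains the most. This is a centralized greedy sketch for illustration only, not the decentralized DistRM scheme of [35]:

```python
def greedy_allocate(speedup, n_cores):
    """Assign n_cores to malleable applications one core at a time, always
    giving the next core to the application with the largest marginal gain.
    speedup: {app: [s(1), s(2), ...]} monotone speedup curves, s(k) being the
    speedup with k cores."""
    alloc = {app: 0 for app in speedup}
    for _ in range(n_cores):
        def gain(app):
            k = alloc[app]
            curve = speedup[app]
            if k >= len(curve):
                return 0.0          # curve exhausted: no further benefit
            prev = curve[k - 1] if k > 0 else 0.0
            return curve[k] - prev  # marginal speedup of one more core
        best = max(alloc, key=gain)
        if gain(best) <= 0:
            break                   # no application benefits from more cores
        alloc[best] += 1
    return alloc
```

For concave (diminishing-returns) speedup curves, this greedy rule yields an allocation in which no single core could be moved to increase the total speedup, which matches the intuition behind the convergence result of [57].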

2.2 Invasive Programming Language

In principle, invasive computing is a novel computing paradigm which can be
utilized by any programming language by implementing the three primitives invade,
infect, and retreat. Besides existing ports to C++ [36] or OpenMP [19], the major
language research was performed based on the programming language X10 [46].
X10 was developed by IBM within the productive, easy-to-use, reliable computing
system (PERCS) project and was funded, alongside Cray’s Chapel and Sun’s Fortress,
by DARPA’s high productivity computing systems project. According to [29],
X10 “brings modern features to the field of scientific computing by addressing par-
allelization from the start of the application development.” X10 offers state-of-the-art
features such as system modularity, type safety, and generic programming. In con-
trast to, e.g., C++, it inherently supports concurrency and builds upon the partitioned
global address space (PGAS) model, which is perfectly suited for tiled many-core
systems such as those targeted by invasive computing. This model partitions the global
memory into so-called places. One place corresponds to a computing tile in invasive
architectures (see Sect. 2.4).
In addition, it does not rely on automatic mechanisms but directly involves the
program developer who is most familiar with the algorithm: “The language imple-
mentation is not expected to automatically discover more concurrency than was