
The REDEFINE Execution Model

Document Version 1.2-draft

C Madhava Krishna¹ (MK), S K Nandy² (SK), Ranjani Narayan¹ (RN)

¹Morphing Machines Pvt. Ltd., India
²CAD Lab, Indian Institute of Science, India
madhav@morphing.in, nandy@iisc.ac.in, ranjani@morphing.in

February 22, 2021

Revision History

Revision   Date        Author(s)  Description
1.0        04.04.2020  MK         First release
1.1-draft  29.09.2020  MK         FALLOC and END extended
1.2-draft  17.02.2021  MK         FDELETE extended
Contents

1 Introduction and Background
  1.1 The Programming Flow

2 REDEFINE Execution Model
  2.1 HyperOp: Representation and Operation Semantics
  2.2 Context Memory
  2.3 Global Memory
  2.4 Orchestrator
  2.5 Compute Element
  2.6 Distributed Runtime Support
  2.7 Memory Consistency Model
  2.8 Summary

3 C with HyperOps Programming Abstraction
  3.1 HyperOp as a Function
  3.2 Inter-hyperOp communication
  3.3 Fork-Join Parallelism - Code Discussion
  3.4 Example Discussion - Recursive Fibonacci
  3.5 Pipeline Parallelism - Code Discussion
  3.6 Executing a Kernel on Multiple CRs
  3.7 Example Discussion - Vector Addition
  3.8 Locks
    3.8.1 Code Discussion

4 REDEFINE Compiler

1 Introduction and Background


The REDEFINE many-core is a co-processor used for accelerating compute-intensive parts of an application.
It provides high performance by efficiently exploiting task parallelism at a finer granularity than conven-
tional multithreading. This document describes the REDEFINE execution model and its programming
framework.
The REDEFINE many-core processor is programmed using the OpenCL programming framework. OpenCL
explicitly separates a program execution into host side execution and device side execution. The host
controls the device side execution through OpenCL runtime APIs. The code that runs on the device
is called the kernel. In the OpenCL framework, a kernel is written in OpenCL C using an SPMD
(single program, multiple data) programming style and compiled using the device-specific compiler. In
the REDEFINE programming environment, we do not use OpenCL C for implementing kernel code.
Instead, an explicitly parallel C dialect is used to express kernels. This parallel C programming interface, called C with hyperOps, expresses kernels in terms of the REDEFINE execution model.
Figure 1 shows an abstract representation of the REDEFINE OpenCL programming framework. The
interface between host environment and device environment is realized through REDEFINE resource
manager (RRM). The host side RRM includes an implementation of the OpenCL runtime APIs and the
device driver for the REDEFINE many-core accelerator. The device side RRM is a hardware IP that
serves as a gateway between the host processor and the REDEFINE many-core accelerator.

Figure 1: Host-device interface in the REDEFINE programming framework. The host (with the OpenCL runtime APIs) runs the host code, the device runs the kernel code, and the RRM sits between them.

The REDEFINE many-core architecture is implemented as a network of nodes, as shown in Figure 2a. Figure 2b shows the architecture of a single node. Each node is organised as a cluster of 4 compute elements (CEs). Further details of the architecture relevant to the execution model are discussed in Section 2.
Figure 2: REDEFINE many-core architecture with 64 CEs. (a) Tiled architecture with toroidal mesh interconnect; (b) composition of a node. L1$: private L1-cache for the global memory address space. CM$: cache for the context memory address space. DSM-bank: Distributed Shared Memory bank, hosting a region of global memory and context memory.

1.1 The Programming Flow


Figure 3 shows the REDEFINE OpenCL programming flow. An application program includes a host-
side program and a device-side program, as shown in Figure 3a.

Figure 3: REDEFINE OpenCL framework programming flow. (a) Host source code (.c) and kernel source code (.c). (b) Host code compilation: a standard C compiler with the OpenCL runtime library produces the host binary. (c) Kernel code compilation: the REDEFINE compiler produces the kernel binary. (d) Host-device execution environment: the host binary runs on the host processor and the kernel binary runs on the REDEFINE many-core processor.

As per the OpenCL programming
model [1], the host code uses the OpenCL runtime APIs for device initialization, kernel management,
buffer management, sending inputs to the device, and receiving outputs from the device. The device
(kernel) code is described using C with hyperOps programming interfaces. Section 3 lists the C with
hyperOps programming interfaces used for kernel description. As shown in Figure 3c, the REDEFINE
compiler compiles the kernel code to an executable. Figure 3d shows the host binary executing on the
host processor and the kernel binary executing on the REDEFINE many-core processor.

2 REDEFINE Execution Model


The REDEFINE execution model is inspired by the macro-dataflow model [2]. In this model, an application is described as a hierarchical dataflow graph, as shown in Figure 4. In this graph, the vertices are
called hyperOps, and the directed edges represent explicit data transfer or execution order requirement
between the connected hyperOps. The graph is called hyperOp dependency graph (HDG). A hyperOp
is a multiple-input and multiple-output macro operation. The graph is hierarchical because a hyperOp
(node) may contain another dataflow graph. A hyperOp is ready for execution as soon as all its operands
are available and all its execution order or synchronization dependencies are satisfied. Apart from the
arithmetic, control, and memory load and store instructions, the REDEFINE execution model includes
primitives for adding new nodes (hyperOp instances) and edges (dependencies) to the application exe-
cution graph. Thus the execution model supports dynamic (data-dependent) parallelism. The runtime
unit named Orchestrator schedules ready hyperOps onto compute resources. Compute resources comprise
several compute elements (CEs). Each hyperOp instance is executed on a CE.
According to the hybrid dataflow/von Neumann execution model classification described in [3], the REDEFINE execution model falls under the Dataflow/Control Flow class. Fig. 5 shows the abstract machine
for the REDEFINE execution model. The remainder of this section provides a detailed description of
hyperOps and each component in Fig. 5.


Figure 4: HyperOp dependency graph (HDG). An application is described as a hierarchical dataflow graph in which vertices represent hyperOps and edges represent explicit data transfer or execution order requirements between connected hyperOps. (a) and (b) are two different representations of the same HDG, showing all the hyperOp types and their relations in the application. The application has four different hyperOps labeled H0, H1, H2, and H3. Note that H1 is a recursive hyperOp.
Figure 5: Block diagram of the abstract machine. GM: global memory; CM: context memory; Orch: Orchestrator; CE: compute element.

2.1 HyperOp: Representation and Operation Semantics


A hyperOp is the unit of scheduling. The execution model enforces data-driven scheduling at the level of hyperOps; within a hyperOp, execution follows conventional control-flow (program-counter driven) scheduling of instructions. HyperOp launching follows strict function semantics, i.e., a hyperOp is launched only after it has received all its operands and all its synchronization dependencies are satisfied. Each hyperOp is represented by its computation and its static-metadata. The hyperOp's computation is encoded as a sequence of instructions. The constituents of a hyperOp's static-metadata are described below.
Static-metadata ::= ( codePointer, arity, annotations )
annotations ::= ( Start, End, Join, ... )

• codePointer is the reference (code-pointer) to the instruction sequence that represents the hyperOp computation.
• arity specifies the number of operands of the hyperOp.
• annotations give attributes to the hyperOp. Table 2 describes a few annotations; additional annotations are introduced where required.
Apart from the usual arithmetic operations, control operations, and memory load and store operations,
a hyperOp includes special instructions to communicate, synchronize, and spawn other hyperOps. The
operation semantics of these special instructions are described in Section 2.5. At runtime, each instance of
a hyperOp is associated with a context frame. Each frame¹ holds operands and instance metadata (refer
to Section 2.2) of its corresponding hyperOp instance. Each hyperOp can have at most 16 operands.
Outputs produced during the execution of a hyperOp may serve as inputs (operands) to other hyperOps.
¹For the sake of brevity, we use the term frame when referring to a context frame.


Start — Start-hyperOp: the entry point of the application; the root node in the HDG. An application run should involve only one instance of the Start-hyperOp.

End — End-hyperOp: the last hyperOp of the application; the application ends after the End-hyperOp executes. An application run should involve only one instance of the End-hyperOp.

Join — Join-hyperOp: implements the join operation as in the fork-join model. A Join-hyperOp can have only 15 operands; the last operand is reserved for holding the synchronization wait-count, which is not accessible to the hyperOp.

Table 2: HyperOp annotations

A producer hyperOp directly writes operands to the consumer hyperOp’s frame. The consumer hyperOp
starts executing as soon as all its operands are available and its synchronization dependencies are satisfied.
Execution of a consumer hyperOp may overlap with the execution of its producer hyperOps.

2.2 Context Memory


Context memory is the storage for context frames. Each hyperOp instance is associated with a context
frame. Supporting dynamic parallelism with a large number of fine grain tasks involves frequent allocation
and deallocation of frames. In order to perform allocation and deallocation of frames with low runtime
overhead, context memory is partitioned into frames of uniform size and the frame management is
implemented in hardware. In this implementation, each frame can hold up to 16 operands. Restricting the maximum operand count of a hyperOp to 16 does not limit the computational data-set size of a hyperOp to 16 operands, as explained in the following section (refer to Section 2.3). Apart from operands, a frame
also holds metadata of the hyperOp instance called instance metadata.
Context-frame ::= ( operands, instance-metadata )
operands ::= ( Op0, ... , Op15 )
instance-metadata::= ( HyperOpId, WaitCount, ... )

• HyperOpId is the reference (pointer) to the static metadata of the hyperOp that the frame holds.
• WaitCount specifies the number of operands yet to be received. A hyperOp is ready to execute when its WaitCount becomes 0.
A frame is associated with only one instance of a hyperOp at any instant. A producer hyperOp must know the frame addresses of its consumer hyperOps in order to deliver data to them. Launching a ready hyperOp onto a CE involves reading the operands from the associated frame and loading them into the CE's register-file. A hyperOp can write to any frame but can read only its associated frame, through the register-file of the CE. The operand slots in a frame are write-once (single-assignment) locations, and their data is destroyed (deleted) after the read access. Thus each operand slot in a frame has exactly one writer and one reader. This property simplifies the hardware implementation required for ensuring memory consistency for context memory.
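As an illustration, the context-frame layout implied by the grammar above can be sketched as a C structure. This is a non-normative sketch; the actual hardware layout and the widths of the metadata fields are not specified in this document.

typedef struct ContextFrame {
    uint32_t operands[16]; //Op0 .. Op15, write-once operand slots
    //instance metadata
    uint32_t hyperOpId;    //reference to the static metadata of the bound hyperOp
    uint32_t waitCount;    //operands yet to be received; 0 means ready
} ContextFrame;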

2.3 Global Memory


Global memory is the storage for data and code. Note that the code segment includes the hyperOps' instruction sequences and static metadata. A hyperOp can perform loads (reads) and stores (writes) on global memory. Although the context frame size limits a hyperOp to 16 operands, the data-set size of a hyperOp computation is not restricted to 16 operands; hyperOps can exchange data through global memory. A producer hyperOp can store data in global memory and communicate the address of the data to its consumer hyperOp through the context frame. The consumer hyperOp then accesses the data through the address received as an operand. Such global memory accesses require synchronization between the reading and writing hyperOps, which is enforced through context frame writes.

2.4 Orchestrator
The Orchestrator manages context memory and the scheduling of hyperOps. It includes two storage structures called the free-list and the ready-list. The free-list keeps track of unallocated frames. The ready-list contains the addresses of frames whose hyperOps are ready to execute. The functionalities of the Orchestrator are listed below.


• Allocation and deallocation of context frames.

• Keeping track of the status of active hyperOps. All context memory updates happen through the Orchestrator.

• Monitoring the status of CEs: idle (i.e., ready to accept a hyperOp for execution) or busy.

• Prioritizing ready hyperOps and launching them onto idle CEs.
It is important to note that the abstract machine model assumes that all communications and memory
operations are performed instantaneously. In the presence of network delays and memory hierarchy with
caches, Orchestrator also manages cache coherence and ensures memory consistency.

2.5 Compute Element


The execution model assumes the CE is an instruction-set processor, and the computations of hyperOps are specified as sequences of instructions. Launching a hyperOp onto a CE involves loading the program-counter (PC) with the codePointer of the hyperOp computation and loading the register-file (RF) with the operands of the hyperOp. After executing the hyperOp, the CE invalidates the contents of the PC and RF, thereby leaving itself in a "clean" state. Thus execution of a hyperOp has side-effects² only in terms of writes to global memory (with data produced by the hyperOp) and writes to context memory (data that serve as operands to other hyperOp(s), and events that enforce execution order among hyperOps). In addition to arithmetic and control instructions, Load, Store, FAlloc, FBind, FDelete, WriteCM, Sync, CreateInst, and End form the basic instruction set of a CE in the REDEFINE execution model. The operation semantics of these instructions are specified below.
r := Load(a)
Load register r with the value from global memory address a.
Store(a, v)
Store value v at global memory address a.
r := FAlloc(n)
Allocates n contiguous frames, i.e., an array of n context frames cf[0], ..., cf[n-1]; r is loaded with the address of cf[0]. n must be less than or equal to 16. The allocated frames are in the valid state, i.e., each frame is reserved and not associated with any hyperOp instance.
FBind(cf, hid)
Create an instance of hid hyperOp and bind it with frame cf. cf is the base address of the frame.
Each frame is uniquely identified by its address in the context memory address space. Binding
(associating) a frame with a hyperOp instance changes the state of the frame from valid to active.
FDelete(cf)
Add frame cf to the free-list. Only frames in the valid state, i.e., frames that are not associated with any hyperOp instance, can be deleted. Frames allocated by FAlloc can be deallocated using FDelete.
WriteCM(ca, v)
[ca] := v, i.e., write the value v at the context memory address ca. The WaitCount (refer to Section 2.2) associated with the frame that contains ca is decremented by 1. If the WaitCount becomes 0, the frame is added to the ready-list.
Sync(ca, v)
[ca] := [ca] + v, i.e., update the operand at address ca³ by adding the value v to it. The sum ([ca] + v) represents the updated synchronization wait-count of the frame (associated hyperOp instance) that contains ca. Sync operates only on hyperOp instances that are annotated as Join, and only operand-15 is used to hold the synchronization wait-count; thus ca is assumed to refer to operand-15 of the frame. If this is the first Sync operating on the frame, operand-15 is initialized with the value v. For a Join-hyperOp to be ready, both its synchronization wait-count and its WaitCount must become 0. For example, a Join-hyperOp awaiting n workers is initialized with Sync(ca, n), after which each worker executes Sync(ca, -1). The synchronization wait-count can never be negative. It is forbidden to update a synchronization wait-count that is already 0, and such Sync instructions are considered illegal.
²Programmer-observable state changes.
³Note that [ca] represents the operand value at address ca.


r := CreateInst(hid)
Allocate a frame and bind it with an instance of the hyperOp hid. The frame address is written to register r. This instruction is equivalent to the instruction sequence r := FAlloc(1); FBind(r, hid). The frame allocated by a CreateInst instruction is deallocated (deleted) automatically after the execution of its associated hyperOp instance; there is no need for explicit deallocation using FDelete.
End
A hyperOp executes an End instruction to transfer control of the CE to the Orchestrator. The last instruction of a hyperOp is always an End instruction. The CE self-invalidates its PC and RF, and notifies the Orchestrator that it is idle.
Enumeration and description of the arithmetic and control instructions is avoided, as they only modify the CE's internal state and are not critical for understanding the execution model.

2.6 Distributed Runtime Support


Figure 6: Abstract machine with a distributed Orchestrator. GM: global memory; CM: context memory; Orch: Orchestrator; CE: compute element; CR: compute resource. Each CR groups an Orchestrator, a context memory bank CMi, and a set of CEs; all CRs share the global memory.

The Orchestrator is responsible for managing context memory and scheduling hyperOps. With a large number of CEs and light-weight tasks, a centralized Orchestrator, with centralized runtime support, would limit the achievable parallel speedup. Figure 6 shows the abstract machine with a distributed Orchestrator that scales with the number of CEs. To distribute the Orchestrator's functionality, the context memory and the CEs are partitioned and grouped into clusters called compute resources (CRs). The context memory is partitioned into multiple banks, shown as CM0, ..., CMn in Figure 6. Each instance of the Orchestrator manages the CEs and the context memory bank within one CR. The ready-list and free-list of an Orchestrator hold frames that belong to the context memory bank of the same CR; thus each Orchestrator can allocate and deallocate only frames of its own CR. The effect is that an FAlloc or CreateInst instruction executed in a cluster allocates a frame that belongs to the same CR, and an Orchestrator schedules hyperOps onto CEs within the same CR. The abstraction of a single address space for context memory is preserved, and this allows a hyperOp executing in one CR to write to context frames of another CR. Figure 7 shows the bit-field representation of a context memory address.

CR Id | Addr. within Context Memory Bank

Figure 7: Fields in the context memory address bits
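As an illustration only (the actual field widths are implementation-defined and not given in this document), decoding the CR Id from a 32-bit context memory address with a hypothetical 8-bit CR Id field could look like:

//Hypothetical field width -- for illustration only.
#define CR_ID_SHIFT 24
static inline uint8_t crIdOfCMAddr(CMAddr ca){
    return (uint8_t)(ca >> CR_ID_SHIFT); //upper bits hold the CR Id
}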

Work Distribution
A limitation of this distributed Orchestrator is that hyperOps created (using CreateInst, or FAlloc and FBind) in one CR cannot be executed on another CR. This is an impediment to efficient work distribution. To address this issue, the instruction set is extended with a remote frame allocate instruction called RFAlloc, with operational semantics as described below.
r := RFAlloc(n, crid)
Allocate an array of n context frames, cf[0], ..., cf[n-1], in the CR indexed by crid; register r is loaded with the address of cf[0]. n must be less than or equal to 16. The allocated frames are in the valid state, i.e., reserved and not associated with any hyperOp instance.
The FBind and FDelete instructions are used to bind a hyperOp instance to a remote frame and to delete a remote frame, respectively. With the number of CRs known, work can be distributed across the CRs using the RFAlloc and FBind instructions, as sketched below.
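As a sketch in the instruction notation above (hid denotes the static metadata of some hyperOp, here assumed to have one operand), delegating a hyperOp instance to a remote CR amounts to:

r := RFAlloc(1, crid)  // allocate one frame on the CR indexed by crid
FBind(r, hid)          // bind a hyperOp instance to the remote frame
WriteCM(r, v)          // supply operand-0; the remote Orchestrator schedules the instance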


2.7 Memory Consistency Model


The REDEFINE execution model prescribes a weak ordering memory model [4] that guarantees sequential consistency for data-race-free programs. A weak ordering model relies on explicit synchronization to avoid data races. A data race occurs when at least two unordered memory operations access the same memory location and at least one of them is a write. In the REDEFINE execution model, the WriteCM and Sync instructions are used for synchronizing global memory accesses. The REDEFINE memory model guarantees that "all global memory accesses that come before a synchronization operation (WriteCM or Sync) in the producer hyperOp's program order are observed by the (synchronizing) consumer hyperOp." Note that in conventional shared-memory programming, synchronization enforces an order among shared variable accesses; in the REDEFINE execution model, synchronization enforces an execution order among hyperOps, which in turn imposes an order among shared variable accesses. This does not mean that any two ordered hyperOps enforce an order among all their shared variable accesses, as ordered hyperOps can still overlap their execution. To enforce an order among shared variable accesses, the producer hyperOp must therefore finish the shared variable access before enabling the consumer hyperOp. The situation is illustrated with the example shown in Figure 8. The Sync instruction in s2 enforces a read-after-write dependency on variable 'a' between s1 and s4, and thus guarantees that 'c' holds the value '1' after the execution of statement s4. But the accesses to variable 'b' in s3 and s5 are not synchronized and involve a data race. Such unsynchronized shared variable accesses cause undefined behavior and should be avoided at all costs.
Hp:  s1: a = 1;  s2: Sync(ca, -1);  s3: b = 2
Hc:  s4: c = a;  s5: d = b

Figure 8: An execution order among two hyperOps does not guarantee an order among their shared variable accesses. Hp and Hc are the producer and consumer hyperOps, respectively. Hp enables Hc by decrementing its synchronization wait-count using the Sync instruction in s2. This creates an execution order between Hp and Hc such that Hp ≺ Hc. In Hp, the REDEFINE memory model guarantees that s1 ≺ s2, thus s1 ≺ s4. But s3 and s5 are unordered and may create a data race.

The REDEFINE execution model assumes that programs are properly synchronized, i.e., data-race-free. With no data races, a program can be reasoned about by executing hyperOps as per their producer-consumer relationships and executing the instructions within each hyperOp in program order.
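To make Figure 8 concrete, the two hyperOps can be sketched in the C with hyperOps notation of Section 3. This is a sketch only; it assumes Hc is a Join-hyperOp and that the address ca of its operand-15 is delivered to Hp as an operand.

int a, b, c, d;

__hyperOp void Hp(CMAddr selfId, Op32 ca$){
    a = 1;                //s1
    sync(ca$.cmAddr, -1); //s2: enables Hc; orders s1 before s4
    b = 2;                //s3: unordered w.r.t. s5 -- a data race
}

__hyperOp void Hc(CMAddr selfId){
    c = a; //s4: guaranteed to observe a == 1
    d = b; //s5: races with s3; the value read is undefined
}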

2.8 Summary
In the REDEFINE execution model, an application is described as a hierarchical dataflow graph called the hyperOp dependency graph (HDG). The HDG statically defines all the hyperOp types and their relations, or control and data dependencies, in the program. Independent hyperOps can execute in parallel. Each hyperOp instance is associated with a context frame. Execution order is enforced between producer and consumer hyperOps through context frame writes. The execution model has two address spaces: the global memory address space and the context memory address space. Context memory is the storage for context frames. A hyperOp can perform reads and writes to global memory. Global memory accesses across hyperOps are ordered/synchronized through context frame writes, and all global memory accesses are assumed to be free from data races. A hyperOp can write to any frame but can read only its own frame, i.e., a hyperOp can read only its own operands. The operands of a hyperOp are loaded directly into the CE's register-file. All communications among hyperOps are one-sided, i.e., only the producer hyperOp initiates and completes a communication. Thus, with sufficient parallelism, all communications can overlap with computations. The amount of parallelism that an application can expose is limited only by the size of the context memory.


3 C with HyperOps Programming Abstraction


This section presents a programming abstraction in C, called C with hyperOps, that matches the REDEFINE execution model. It can be used as an intermediate representation for compilers and also as a low-level language for writing efficient programs. In this programming abstraction, the REDEFINE instructions WriteCM, Sync, FAlloc, FBind, FDelete, CreateInst, and RFAlloc are expressed as function calls. A standard C compiler tool-chain is extended (refer to Section 4) to lower these function calls to machine instructions. Table 3 shows the instructions and their corresponding function interfaces. Note that there is no equivalent function call for the End instruction; it is added implicitly by the compiler. The FAlloc instruction is extended with a variant that takes the Id of the CE that will execute the hyperOp instances bound to the allocated frames.

r := FAlloc(n)          CMAddr fAlloc( int n );
                        0 < n ≤ 16

r := FAlloc(n, ceId)    CMAddr fAllocWithCe( int n, int ceId );
                        0 < n ≤ 16, 0 ≤ ceId < 15

FBind(cf, hid)          void fBind( CMAddr cf, const SMD *smd );
                        smd is the static metadata of the hyperOp.
                        Note that hid is the pointer to the static metadata.

FDelete(cf, selfId)     void fDelete( CMAddr cf, CMAddr selfId );

WriteCM(ca, v)          void writeCM( CMAddr ca, int v );
                        void writeCM( CMAddr ca, float v );
                        void writeCM( CMAddr ca, CMAddr v );
                        void writeCM( CMAddr ca, void* v );
                        Function overloading is used in implementing writeCM.

Sync(ca, v)             void sync( CMAddr ca, int v );

r := CreateInst(hid)    CMAddr createInst( const SMD *smd );

r := RFAlloc(n, crid)   CMAddr rFAlloc( int n, CrId crId );
                        0 < n ≤ 16

Table 3: The REDEFINE execution model's instructions and their programming interfaces. The CMAddr type is 32 bits wide and holds a context memory address. SMD is the structure that holds a hyperOp's static metadata (refer to Section 2.1).
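As a minimal sketch of how these interfaces compose (assuming a hyperOp with static metadata smdFoo and arity 1), spawning one instance and supplying its operand could look like:

CMAddr cf = fAlloc(1); //allocate one context frame
fBind(cf, &smdFoo);    //bind a new instance of the hyperOp to the frame
writeCM(cf, 42);       //write operand-0; WaitCount reaches 0 and the instance becomes ready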

3.1 HyperOp as a Function


Each hyperOp's computation is expressed as a function. As mentioned earlier, data transfer between hyperOps is realized through context memory writes or through global memory accesses using pointers that are explicitly communicated. Thus hyperOp functions have void return type. The function prototype for a hyperOp with 2 operands is shown below.

#define __hyperOp __attribute__((hyperOp))

__hyperOp void hyOpFoo1(CMAddr selfId, Op32 op0, Op32 op1);

The custom function attribute __hyperOp specifies that the function is a hyperOp function. op0 and op1 are the operands of the hyperOp, and selfId is the address of the context frame associated with the hyperOp instance. The argument selfId is not counted as an operand of the hyperOp. The function prototype of a hyperOp with zero operands is shown below.

__hyperOp void hyOpFoo2(CMAddr selfId);

As the current implementation is a 32-bit machine, Op32 can hold any data type that fits in 32 bits. Op32 is defined as a union data type as shown below.

typedef uint32_t CMAddr;

typedef union Op32{
    int i32;
    float f32;
    CMAddr cmAddr; //address of an operand slot in context memory
    void* ptr;     //global memory pointer
} Op32;

As mentioned earlier, each hyperOp's code (hyperOp function) is associated with a static metadata record (refer to Section 2.1), defined as the type struct SMD as shown below.

typedef void (*__HyOpFunc)(CMAddr);

typedef struct SMD{
    __HyOpFunc fptr; //function pointer
    uint8_t ann;     //hyperOp annotations
    uint8_t arity;
} SMD;

#define __SMD const SMD

In struct SMD, the ann field holds the hyperOp annotations (refer to Table 2). The arity field holds the hyperOp's operand count. And fptr holds the hyperOp's function pointer. The variables that hold static metadata (of struct SMD type) are constants, declared with the const qualifier (via the __SMD macro). The code below shows the declaration of an SMD variable smdEnd. Its hyperOp function is end, it requires 2 operands, and it is annotated as ANN_END, i.e., an End-hyperOp (refer to Table 2).

__SMD smdEnd = {.ann = ANN_END, .arity = 2, .fptr = (__HyOpFunc)end};
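For completeness, a hyperOp function matching this declaration (taken from Listing 5 below) looks like:

__hyperOp void end(CMAddr selfId, Op32 v$, Op32 fibN$){
    int* fibN = fibN$.ptr; //operand-1: output pointer in global memory
    *fibN = v$.i32;        //operand-0: the computed value
}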

3.2 Inter-hyperOp communication

Figure 9: Inter-hyperOp communication. (a) Data transfer through context memory: the producer writes an operand value directly into the consumer's frame, writeCM(opAddr, opValue). (b) Data transfer through global memory: the producer (1) updates an array in GM and (2) communicates its address with writeCM(opAddr, arrayAddr); the consumer (3) loads the array from GM.

As shown in Figure 9, a producer hyperOp can transfer data to its consumer hyperOps in two different ways. Listing 1 shows the code snippet for data transfer through context memory, and Listing 2 for data transfer through global memory.
1 int sum = 0;
2 CMAddr consumerAddr; //consumer hyperOp's context frame address
3
4 __hyperOp void producer(CMAddr selfId, Op32 a$, Op32 b$){
5     int x = a$.i32 + b$.i32;
6     //providing scalar value as operand-0 to the consumer hyperOp
7     writeCM(consumerAddr, x);
8 }
9 __SMD smdProducer = {.ann = ANN_NONE, .arity = 2, .fptr = (__HyOpFunc)producer};
10
11 __hyperOp void consumer(CMAddr selfId, Op32 v$){
12     sum = v$.i32;
13 }
14 __SMD smdConsumer = {.ann = ANN_NONE, .arity = 1, .fptr = (__HyOpFunc)consumer};

Listing 1: Communicating a scalar value between two hyperOps.

In Listing 1, the producer hyperOp (lines 4-9) transfers a scalar value to the consumer hyperOp (lines 11-14). Note that the producer hyperOp needs to know the context-frame address of its consumer hyperOp. The global variable consumerAddr (line 2) holds the context-frame address of the consumer hyperOp instance. The consumer hyperOp takes only one operand, named v$ (line 11), and this operand occupies the 0th operand slot in its context frame. At line 7, the producer hyperOp assigns the 0th operand of the consumer hyperOp the scalar value x, using the writeCM instruction.
1 int sum;
2 int data[64]; //array in global memory
3 CMAddr consumerAddr; //consumer hyperOp's context frame address
4
5 __hyperOp void producer(CMAddr selfId, Op32 init$, Op32 incrBy$){
6     //Updating array
7     for(int i=0; i<64; i++){
8         data[i] = init$.i32 + (i * incrBy$.i32);
9     }
10    //Providing address of the array as operand-0 to consumer hyperOp.
11    writeCM( consumerAddr, (void*)data);
12 }
13 __SMD smdProducer = {.ann = ANN_NONE, .arity = 2, .fptr = (__HyOpFunc)producer};
14
15 __hyperOp void consumer(CMAddr selfId, Op32 inArray$){
16     sum = 0;
17     int* inArray = inArray$.ptr;
18     for(int i=0; i<64; i++){
19         sum += *(inArray + i);
20     }
21 }
22 __SMD smdConsumer = {.ann = ANN_NONE, .arity = 1, .fptr = (__HyOpFunc)consumer};

Listing 2: Communicating an array of values between two hyperOps.

In Listing 2, the producer hyperOp (lines 5-13) transfers an array of values to the consumer hyperOp (lines 15-22) through the global variable data (line 2). The variable consumerAddr (line 3) holds the context-frame address of the consumer hyperOp instance. The producer hyperOp first writes the data to be communicated into global memory (lines 7-9) and then communicates the address of the data to the consumer hyperOp (line 11). Note that the writeCM instruction (line 11) ensures that the producer hyperOp's write operations (line 8) happen before the consumer hyperOp's read operations (line 19) on the shared variable data.
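Listings 1 and 2 leave consumerAddr unassigned; some parent hyperOp is expected to set it up. A minimal sketch of such a parent (a hypothetical addition, following the spawning pattern of Listing 3 below) is:

__hyperOp void parent(CMAddr selfId){
    //spawn the consumer first and record its frame address
    consumerAddr = createInst(&smdConsumer);
    //spawn the producer and supply its two operands
    CMAddr prodFr = createInst(&smdProducer);
    writeCM( prodFr, 1);            //operand-0
    writeCM( opAddr(prodFr, 1), 2); //operand-1
}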

3.3 Fork-Join Parallelism - Code Discussion

Figure 10: HyperOp dependency graph implementing fork-join parallelism. A master hyperOp forks worker hyperOps worker0 ... workern, which synchronize with a join hyperOp.

1 __hyperOp void worker(CMAddr selfId, Op32 workerId$, Op32 syncAddr$){
2
3     someParallelWork( workerId$.i32 );
4
5     //decrease the synchronization wait-count of 'join' hyperOp by 1
6     sync(syncAddr$.cmAddr, -1);
7 }
8 __SMD smdWorker = {.ann = ANN_NONE, .arity = 2, .fptr = (__HyOpFunc)worker};
9
10 __hyperOp void join(CMAddr selfId){
11     //some sequential work following the fork-join
12     ....
13     ....
14 }
15 /* join hyperOp annotated with "ANN_JOIN".
16  * Note that the arity is set to '1' with zero operands as arguments
17  * to the function "join".
18  */
19 __SMD smdJoin = {.ann = ANN_JOIN, .arity = 1, .fptr = (__HyOpFunc)join};
20
21 //master hyperOp spawns a sub-graph to realize fork-join parallelism
22 __hyperOp void master(CMAddr selfId, Op32 nWork$){
23
24     /* create a join hyperOp with synchronization wait-count
25      * initialized to the number of worker hyperOps.
26      */
27     CMAddr joinFrAddr = createInst(&smdJoin);
28     //sync. wait-count assigned to last operand
29     CMAddr syncAddr = opAddr(joinFrAddr, 15);
30     int nWork = nWork$.i32;
31     sync( syncAddr, nWork); //synchronization wait-count initialization
32
33     // spawning worker hyperOps
34     for(int i=0; i < nWork; i++){
35         // creating worker node
36         CMAddr workFrAddr = createInst(&smdWorker);
37         writeCM( workFrAddr, i); //assign operand-0
38         // creating an edge between worker node and join node.
39         writeCM( opAddr(workFrAddr, 1), syncAddr);
40     }
41 }
42 __SMD smdMaster = {.ann = ANN_NONE, .arity = 1, .fptr = (__HyOpFunc)master};

Listing 3: Code for implementing fork-join parallelism

Listing 3 shows code for realizing the fork-join model [5] of parallelism. Figure 10 shows the hyperOp dependency graph described in the code. The code contains three types of hyperOps. The functions worker (lines 1-8), join (lines 10-19), and master (lines 22-42) are the hyperOp functions, and smdWorker, smdJoin, and smdMaster are their static metadata, respectively. The function opAddr(frId, opId) (in lines 29 and 39) returns the context memory address of the operand at index opId in the frame frId.
The parallel computation is distributed among many worker hyperOp instances. join performs the sequential computation that follows the parallel computation. The join hyperOp waits for all worker hyperOp instances to finish execution. This is realized by annotating the join hyperOp as ANN_JOIN (line 19) and using the sync instruction (line 6). Note that each worker hyperOp instance finishes execution by synchronizing with the join hyperOp at line 6. Even though the join function doesn't receive any operands as arguments (line 10), its arity is set to '1' (line 19). The synchronization wait-count of a Join-hyperOp (refer to Table 2) is one of the operands of the hyperOp but is not available as a function argument.
The master hyperOp spawns, or builds, a sub-graph that implements the fork-join pattern of parallelism. New nodes are added to the sub-graph at lines 27 and 36. Edges connecting each worker node to the join node are created at line 39. Here, creating an edge means specifying a consumer hyperOp's operand address to its producer hyperOp. Operand-15 of the join hyperOp holds the synchronization wait-count (line 31). Since all worker hyperOps need to decrement this synchronization wait-count to enable the join hyperOp, the address of the join hyperOp's operand-15 is forwarded to all worker hyperOp instances (line 39).

3.4 Example Discussion - Recursive Fibonacci


1 int fibSeq(int n){
2 if(n<2){
3 return n;
4 } else {
5 return fibSeq(n-1) + fibSeq(n-2);
6 }
7 }
8
9 void kernelFibSeq(int N, int* fibN){
10 *fibN = fibSeq(N);
11 }
Listing 4: Fibonacci in C.

Listing 4 shows the C code for computing a Fibonacci number using recursion. Listing 5 shows the C with hyperOps version of the same Fibonacci kernel. From here on, the C with hyperOps version is referred to as parallel-fib. This parallel-fib code is used as a running example to illustrate certain details of the C with hyperOps programming abstraction. The line count of the parallel-fib code is much larger than that of the sequential code in Listing 4. This is because, apart from the actual computation, parallel-fib includes instructions (statements) to construct the application HDG.


1 #include "redefine.h"
2
3 /* end is a leaf node. The kernel execution ends once the execution
4 * returns from this hyperOp. There should be only one instance of this hyperOp in
5 * the entire execution of the kernel.
6 */
7 hyperOp void end( CMAddr selfId, Op32 v$, Op32 fibN$){
8 int* fibN = fibN$.ptr
9 *fibN = v$.i32;
10 }
11 SMD smdEnd = {.ann = ANN END, .arity = 2, .fptr = ( HyOpFunc)end};
12
13 //sum is a leaf node.
14 hyperOp void sum( CMAddr selfId, Op32 i1$, Op32 i2$, Op32 outAddr$){
15 int r = i1$.i32 + i2$.i32;
16 //put r on the output edge
17 writeCM( outAddr$.cmAddr, r);
18 }
19 SMD smdSum = {.ann = ANN NONE, .arity = 3, .fptr = ( HyOpFunc)sum };
20
21 //forward declaration of fib hyperOp function.
22 hyperOp void fib( CMAddr selfId, int, CMAddr);
23 SMD smdFib = {.ann = ANN NONE, .arity = 2, .fptr = ( HyOpFunc)fib };
24
25 //fib is a hierarchical node.
26 hyperOp void fib( CMAddr selfId, Op32 n$, Op32 outAddr$){
27
28 if(n<2){
29 writeCM(outAddr$.cmAddr, n$.i32);
30 } else {
31 //spawn two fib and one sum hyperOp instances
32 CMAddr fib1Fr = createInst(&smdFib);
33 CMAddr fib2Fr = createInst(&smdFib);
34 CMAddr sumFr = createInst(&smdSum);
35
36 /* creating edges between child hyperOp instances.
37 * opAddr(frId, i), returns address of an operand of index ’i’ in the
38 * contextframe ‘frId’
39 */
40 writeCM( opAddr(fib1Fr, 1), sumFr ); //fib1 output as sum’s operand0
41 writeCM( opAddr(fib2Fr, 1), opAddr(sumFr, 1) ); //fib2 output as sum’s operand1
42
43 //binding this fib output to child sum hyperOp output
44 writeCM( opAddr(sumFr, 2), outAddr$.cmAddr);
45
46 int n = n$.i32;
47 //providing inputs to fib
48 writeCM( fib1Fr, (n1) );
49 writeCM( fib2Fr, (n2) );
50 }
51 }
52
53 //kernel’s entry function. StarthyperOp calls this function.
54 kernel void parallelFib(int N, int* fibN){
55 //create two child hyperOps
56 CMAddr fibFr = createInst(&smdFib);
57 CMAddr endFr = createInst(&smdEnd);
58
59 //creating an edge between child hyperOp instances
60 //connecting fib’s output as end’s operand0
61 writeCM( opAddr(fibFr, 1), endFr );
62
63 //providing input to fib hyperOp
64 writeCM( fibFr, N);
65
66 //providing input to end hyperOp
67 writeCM( opAddr(endFr, 1), fibN);
68 }
Listing 5: Fibonacci in C with hyperOps

The parallelFib function, with the qualifier __kernel at line 54, is the entry function of the kernel. All input and output buffers are allocated by the host code (as per OpenCL semantics) and provided as arguments to this function; in this case, N and fibN are the input and output, respectively. The functions fib, sum, and end, annotated as __hyperOp (hyperOp functions), describe the respective hyperOps' computations. The static metadata (SMD) variables associated with these hyperOps are smdFib, smdSum, and smdEnd. Each hyperOp type represents a static computation. At run time, the kernel may involve the execution of multiple instances of these hyperOp types, except the end-hyperOp (refer to Table 2). An end-hyperOp is one with the SMD annotation ANN_END; in parallel-fib, end is the end-hyperOp. Note that smdEnd's ann field is assigned ANN_END at line 11. The start-hyperOp (refer to Table 2) is not part of the kernel code. The runtime invokes the start-hyperOp, and the start-hyperOp calls⁴ the kernel's entry function, in this case the parallelFib function. From a kernel programmer's perspective, the parallelFib function can be viewed as a hyperOp with no SMD associated with it.

Figure 11: Hierarchical dataflow graph representation of the code in Listing 5 in two different ways. start and fib are hierarchical nodes, and sum and end are leaf nodes. In (b), the sum node's output edge is bound to the output edge of its parent fib node. The start hyperOp is part of the startup code (C runtime) and calls the kernel entry function, parallelFib.

Figure 12: Execution graph for the code in Listing 5 with input N = 3. Each node represents a hyperOp instance of the type specified inside the node; node 'fib(i)' represents an instance of the fib hyperOp with input i. (a) Control dependence (parent-child) graph; (b) data dependence graph.

⁴Function calls are allowed within a hyperOp.


Parallel-fib has dynamic task-parallelism; thus its execution graph is input-data dependent. Figure 12 shows the execution graph for input N=3. The programmer statically defines the hyperOp types and their relations, or control and data dependencies. Dynamic instances of hyperOps may be generated at runtime, and the dependencies between the dynamic instances are the same as the statically defined relations. Figure 11 shows all the hyperOp types and their relations in the parallel-fib code as a hierarchical dataflow graph in two different representations. In this figure, the sum and end hyperOps are leaf nodes, and start and fib are hierarchical nodes. A hierarchical node contains a child graph, and the child graph itself can have hierarchical nodes and leaf nodes. Apart from instructions to spawn child graph nodes, a hierarchical node may include instructions to create new edges connecting the child nodes and instructions to bind its output edge(s) with one or more child nodes' output edge(s). Here, creating or binding edges means specifying a consumer hyperOp's context memory address to the producer hyperOp. In parallel-fib, the parallelFib function creates a child graph with two nodes: lines 56-57 spawn one instance of fib and one instance of end, and at line 61 an edge is created connecting the output of fib to one of the inputs of end. Similarly, the fib hyperOp can conditionally create a child graph with three nodes: lines 32-34 spawn two fib and one sum hyperOp instances, lines 40-41 create edges connecting the outputs of the fibs to the inputs of sum, and line 44 binds the output edge of the child sum to the output edge of the parent fib. The leaf nodes, end and sum, do not create any new nodes or edges.
In the entire code of Listing 5, only lines 17, 29, 48, 49, 64, and 67 perform data transfers. The remaining part of the code, with REDEFINE-specific instructions, describes the kernel HDG. Figure 12 shows parallel-fib's execution graph as a DAG. In this execution graph, the vertices represent hyperOp instances, and the directed edges represent dependencies between the connected hyperOp instances. Since parallel-fib has dynamic parallelism, the kernel's control and data dependence DAG is input dependent. Figure 12 shows parallel-fib's execution graph for computing the 3rd Fibonacci number. Figure 12a shows the parent-child relationship, or control dependence, in the kernel; note that only hierarchical nodes have children. Figure 12b shows the data dependence, or dataflow, in the kernel.

3.5 Pipeline Parallelism - Code Discussion


Figure 13: Example pipeline with three stages. P, Q, and R are the computational stages; in, buf1, buf2, and out are the pipes connecting them.

Pipeline parallelism can be viewed as data flowing through several computational stages to accomplish a task [5]. Figure 13 shows a pipeline with three stages. In this figure, P, Q, and R are the computational stages, and in, buf1, buf2, and out are the pipes (queues) that connect the computations. Each computational stage reads data from its input pipe, transforms it, and writes it to its output pipe.
Listing 6 shows a sequential C code that implements a pipeline computation similar to the one shown in Figure 13. In this C implementation, the functions P, Q, and R are the stage computations, and these functions communicate through buffers (arrays) instead of pipes. Figure 14 shows the dependencies among the functions and buffers. The required RAW and WAR dependencies are implicitly enforced by sequential execution.
1 int in[N][32];
2 int out[N][32];
3 int buf1[32], buf2[32];
4
5 for(int i=0; i<N; i++) {
6 P(in[i], buf1);
7 Q(buf1, buf2);
8 R(buf2, out[i]);
9 }
Listing 6: C code for pipeline shown in Figure 13.

Listing 7 shows a C with hyperOps code implementing the pipeline in Figure 14. The three hyperOps stageP (lines 8-23), stageQ (lines 25-42), and stageR (lines 44-60) implement the three computational stages P, Q, and R, respectively, and the required RAW and WAR dependencies are enforced through explicit synchronizations. The synchronization wait-count of each hyperOp is set as per its number of RAW and WAR dependences shown in Figure 14: for stageP (line 16) and stageR (line 52) it is 1, and for stageQ (line 33) it is 2. The pipeline stage hyperOps are executed in a loop iterating N times. Note that the context frames for these hyperOps are allocated once (lines 65, 70, 76) and reused N times. The loop control is delegated individually to each hyperOp; thus each hyperOp creates its instance for the next iteration (lines 14, 31, 50). Upon reaching the loop exit condition, each hyperOp deletes its context frame explicitly using fDelete (lines 18, 37, 56). The context memory addresses required for synchronization are communicated through the global variables syncP, syncQ, and syncR. These addresses remain constant for the entire loop execution.

Figure 14: Dependencies among functions and buffers for the code in Listing 6. RAW: read-after-write dependence; WAR: write-after-read dependence. Note that there are no cyclic dependencies between function instances P and Q, or Q and R. The RAW dependence is between function instances of the same iteration, and the WAR dependence is between the current iteration and the next.
1 int in[N][32];
2 int out[N][32];
3 int buf1[32], buf2[32];
4
5 CMAddr syncP, syncQ, syncR;
6 CMAddr syncLoopEnd;
7
8 __hyperOp void stageP(CMAddr selfId, Op32 i$){
9     int i = i$.i32;
10    P(in[i], buf1);
11    i++;
12    if(i < N){
13        //Reinitialize frame for the next iteration
14        fBind(selfId, &smdStageP);
15        writeCM(selfId, i);
16        sync( opAddr(selfId,15), 1); //sync. wait-count init.
17    } else {
18        fDelete(selfId);
19    }
20    //Enable stageQ, RAW dependence
21    sync( syncQ, -1);
22 }
23 __SMD smdStageP = {.ann = ANN_JOIN, .arity = 2, .fptr = (__HyOpFunc)stageP};
24
25 __hyperOp void stageQ(CMAddr selfId, Op32 i$){
26     int i = i$.i32;
27     Q(buf1, buf2);
28     i++;
29     if(i < N){
30         //Reinitialize frame for the next iteration
31         fBind(selfId, &smdStageQ);
32         writeCM(selfId, i);
33         sync( opAddr(selfId,15), 2); //sync. wait-count init.
34         //Enable stageP, WAR dependence
35         sync( syncP, -1);
36     } else {
37         fDelete(selfId);
38     }
39     //Enable stageR, RAW dependence
40     sync( syncR, -1);
41 }
42 __SMD smdStageQ = {.ann = ANN_JOIN, .arity = 2, .fptr = (__HyOpFunc)stageQ};
43
44 __hyperOp void stageR(CMAddr selfId, Op32 i$){
45     int i = i$.i32;
46     R(buf2, out[i]);
47     i++;
48     if(i<N){
49         //Reinitialize frame for the next iteration
50         fBind(selfId, &smdStageR);
51         writeCM( selfId, i);
52         sync( opAddr(selfId,15), 1); //sync. wait-count init.
53         //Enable stageQ, WAR dependence
54         sync( syncQ, -1);
55     } else {
56         fDelete(selfId);
57         sync(syncLoopEnd, -1);
58     }
59 }
60 __SMD smdStageR = {.ann = ANN_JOIN, .arity = 2, .fptr = (__HyOpFunc)stageR};
61
62 __hyperOp void loopBody(CMAddr selfId, Op32 syncAddr$){
63     syncLoopEnd = syncAddr$.cmAddr;
64
65     CMAddr pFrId = fAlloc(1);
66     fBind( pFrId, &smdStageP);
67     writeCM(pFrId, 0);
68     syncP = opAddr(pFrId, 15);
69
70     CMAddr qFrId = fAlloc(1);
71     fBind( qFrId, &smdStageQ);
72     writeCM( qFrId, 0);
73     syncQ = opAddr(qFrId, 15);
74     sync( syncQ, 1); //sync wait-count init.
75
76     CMAddr rFrId = fAlloc(1);
77     fBind( rFrId, &smdStageR);
78     writeCM( rFrId, 0);
79     syncR = opAddr(rFrId,15);
80     sync( syncR, 1); //sync wait-count init.
81
82     //Enable stageP
83     sync( opAddr(pFrId,15), 0); //sync wait-count init.
84 }
85 __SMD smdLoopBody = {.ann = ANN_NONE, .arity = 1, .fptr = (__HyOpFunc)loopBody};
86
87 __hyperOp void loopEnd(CMAddr selfId){
88     //some sequential work following the loop.
89     ....
90 }
91 __SMD smdLoopEnd = {.ann = ANN_JOIN, .arity = 1, .fptr = (__HyOpFunc)loopEnd};
92
93 __hyperOp void loopStart(CMAddr selfId){
94     CMAddr loopEndCF = createInst(&smdLoopEnd);
95     CMAddr loopEndSync = opAddr(loopEndCF, 15);
96     sync( loopEndSync, 1 ); //sync. wait-count initialization
97
98     CMAddr loopBodyCF = createInst(&smdLoopBody);
99     writeCM( loopBodyCF, loopEndSync);
100 }

Listing 7: C with hyperOps implementation of the code in Listing 6

Figure 15 shows the hyperOp dependence graph for the code in Listing 7. Note that the data edges are differentiated from the synchronization edges. Figure 16 shows the execution graph with control and data dependencies. As shown in Figure 16a, the pipeline stage hyperOps are recursive hyperOps: each creates its own instance for the next iteration. In Figure 16b, a synchronization edge connecting hyperOp instances of the same iteration enforces a RAW dependence, and a synchronization edge connecting hyperOps of different iterations enforces a WAR dependence. For example, the edge connecting P0 and Q0 enforces a RAW dependence, and the edge connecting Q0 and P1 enforces a WAR dependence.

3.6 Executing a Kernel on Multiple CRs


In the REDEFINE execution model, computational resources are organized in a two-level hierarchy of CEs and CRs, as shown in Figure 6. As mentioned in Section 2.6, each Orchestrator schedules hyperOps onto CEs within its own CR. The execution model provides primitives, RFAlloc and FBind, to delegate a hyperOp from one CR to another. Using these primitives, the programmer can express the mapping between hyperOps and CRs. With no dynamic load balancing support, work (load) distribution is part of the kernel code in the REDEFINE execution model.

Figure 15: Hierarchical dataflow graph representation of the code in Listing 7 in two different ways, with data edges distinguished from synchronization edges. Node Pi represents the instance of hyperOp P for iteration i. The start hyperOp is part of the startup code (C runtime) and calls the kernel entry function, loopStart.

Figure 16: Execution graph for the code in Listing 7. (a) Control dependence (parent-child) graph; (b) data dependence graph, with data edges distinguished from synchronization edges.



The computational resources, or number of CRs, for a kernel are statically known or specified at compile time by the user. The computational resources are specified in terms of rectangular regions of CRs, called the fabric. Figure 6 shows a 1D arrangement of CRs, but the hardware implementation follows a 2D arrangement of CRs. Thus a kernel's resource requirements are specified as the dimensions of a rectangular fabric. The programming abstraction provides two preprocessor macros named __NUMCOL__ and __NUMROW__ through which the programmer can define the kernel's resource requirements. For a fabric size of 2 columns and 3 rows, the user must define these macros as the compiler options -D__NUMCOL__=2 -D__NUMROW__=3. If not defined, the default value '1' is assigned to these macros. __NUMCR__ is another useful macro that holds the number of CRs allocated to the kernel, defined as

#define __NUMCR__ (__NUMCOL__ * __NUMROW__)

Table 3 shows the rFAlloc API; one of the arguments of rFAlloc is the CR Id, of type CrId, defined as below.

typedef struct CrId {
    uint8_t idX; // column index
    uint8_t idY; // row index
} CrId;

In a kernel's fabric, each CR is uniquely identified by its 2D-index of type CrId. The kernel's start hyperOp is always executed on CR(0,0). Thus a kernel execution starts at CR(0,0) and then gets distributed to the rest of the fabric.
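As a brief sketch (an arbitrary position inside a hypothetical 2 × 3 fabric; smdFoo is a placeholder SMD), constructing a CR index and allocating a frame on that CR looks like:

CrId crid = {.idX = 1, .idY = 2}; //column 1, row 2
CMAddr cf = rFAlloc(1, crid);     //allocate one frame on that CR
fBind(cf, &smdFoo);               //bind a hyperOp instance to the remote frame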

3.7 Example Discussion - Vector Addition


1 float *in1, *in2;
2 float *out;
3
4 void v32Add(int offset){
5 for(int i=offset; i<offset+32; i++){
6 out[i] = in1[i] + in2[i];
7 }
8 }
9
10 void kernelVecAdd(int N, float* a, float* b, float* c){
11 in1 = a; in2 = b; out = c;
12 for(int i=0; i<N; i=i+32){
13 v32Add(i);
14 }
15 }
Listing 8: Vector addition in C.

Listing 8 shows the C code for adding two arrays of length N. The computation is realized as multiple smaller vector additions of length 32, defined as the v32Add function. It is assumed that N is a multiple of 32 and that v32Add is the basic unit of work. Listing 9 shows a C with hyperOps version of adding two arrays using multiple CRs. With the v32Add hyperOp as the basic unit of work, it implements the work distribution on a fabric of size 2 × 2.
1 #include"redefine.h"
2
3 float *in1, *in2;
4 float *out;
5
6 hyperOp void v32Add( CMAddr selfId, Op32 offset$, Op32 localSync$){
7 for(int i=offset$.i32; i< offset$.i32 + 32; i++){
8 out[i] = in1[i] + in2[i];
9 }
10 CMAddr syncAddr = localSync$.cmAddr;
11 //synchronization with localjoin
12 sync(syncAddr, -1);
13 }
14 SMD smdV32Add = {.ann = ANN NONE, .arity = 2, .fptr = ( HyOpFunc)v32Add };
15
16 hyperOp void crSync( CMAddr selfId, Op32 globalSync$){
17 CMAddr syncAddr = globalSync$.cmAddr;
18 //synchronization with globaljoin
19 sync(syncAddr, -1);
20 }
21 SMD smdCrSync = {.ann = ANN JOIN, .arity = 2, .fptr = ( HyOpFunc)crSync};
22

Copyright ©2016-2020 REDEFINE - Morphing Machines Pvt. Ltd.. All Rights Reserved 18
Morphing Machines Pvt. Ltd.
REDEFINE - Reconfigurable silicon cores

23 hyperOp void crWrk(CMAddr selfId, Op32 offset$, Op32 size$, Op32 globalSync$){
24   //fork-join parallelism within a CR
25   CMAddr crSyncId = createInst(&smdCrSync);
26   writeCM(crSyncId, globalSync$);
27   CMAddr localSync = opAddr(crSyncId, 15);
28
29   int nHyOps = size$.i32/32; // (N/NUMCR) is a multiple of 32
30   sync(localSync, nHyOps);
31
32   int offset = offset$.i32;
33   for(int i=0; i<nHyOps; i++){
34     CMAddr frId = createInst(&smdV32Add);
35     writeCM(frId, offset + i*32);
36     writeCM(opAddr(frId, 1), localSync);
37   }
38   fDelete(selfId);
39 }
40 SMD smdCrWork = {.ann = ANN_NONE, .arity = 3, .fptr = (HyOpFunc)crWrk};
41
42 hyperOp void end(CMAddr selfId){}
43 SMD smdEnd = {.ann = ANN_END|ANN_JOIN, .arity = 1, .fptr = (HyOpFunc)end};
44
45 kernel void vecAdd(int N, float* a, float* b, float* c){
46   in1 = a; in2 = b; out = c;
47
48   //fork-join parallelism across CRs
49   CMAddr endId = createInst(&smdEnd);
50   CMAddr globalSync = opAddr(endId, 15);
51   sync(globalSync, NUMCR);
52
53   int perCr = N/NUMCR; //N is a multiple of NUMCR
54
55   //Distributing work across the (NUMCOL x NUMROW) fabric
56   int offset = 0;
57   for(int j=0; j<NUMROW; j++){
58     for(int i=0; i<NUMCOL; i++){
59       CrId crid = {.idX=i, .idY=j};
60       //remote frame allocation
61       CMAddr frId = rFAlloc(1, crid);
62       fBind(frId, &smdCrWork);
63       writeCM(frId, offset);
64       writeCM(opAddr(frId,1), perCr);
65       writeCM(opAddr(frId,2), globalSync);
66       offset = offset + perCr;
67     }
68   }
69 }
Listing 9: Vector addition in C with hyperOps, with fabric size 2×2. The fabric-size macros are defined via the compiler options -D NUMCOL=2 -D NUMROW=2.

In Listing 9, the vecAdd kernel executes on CR(0,0) and performs the work distribution across the fabric (lines 56-66). rFAlloc (line-61) and fBind (line-62) spawn a new crWrk hyperOp on each CR. Note that frames allocated using rFAlloc are explicitly deleted (garbage collected) using fDelete (line-38).
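The allocate-bind-delete idiom for remote frames can be distilled as follows (a restatement of Listing 9, lines 61-62 and 38; no new API is introduced here):

CMAddr frId = rFAlloc(1, crid);  // allocate a context frame on a remote CR
fBind(frId, &smdCrWork);         // bind the frame to a hyperOp's SMD
// ... writeCM(...) calls supply the operands ...
fDelete(selfId);                 // inside crWrk: the spawned hyperOp frees its own rFAlloc'd frame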
Figure 17 shows the HDG of the code in Listing 9. As shown there, the kernel implements two-level hierarchical fork-join parallelism, a pattern that matches the underlying hierarchical organization of CEs. Figure 18 shows the execution graph of the kernel for input N=256, including the mapping between hyperOps and CRs as specified in the source code.

3.8 Locks
Locks are another way to synchronize hyperOps and ensure exclusive access to a shared variable. A lock is used when two or more hyperOps access a shared variable and their order of access is not known statically. The REDEFINE execution model provides two special hyperOps to support locks, called the lock-server and the lock-client. A lock-server hyperOp guards a shared variable and synchronizes lock-client hyperOps to ensure mutual exclusion. In the REDEFINE execution model, creating a lock involves allocating a context frame and binding it to the lock-server hyperOp; thus a lock is uniquely identified by its lock-server hyperOp's context frame address. The REDEFINE runtime system maintains all lock-client hyperOps waiting for the same lock in a queue, referred to as the lock-queue. A lock is either free or owned by a lock-client hyperOp instance; when a lock is free and its lock-queue is not empty, the lock-server hyperOp enables the lock-client hyperOp instance at the head of the lock-queue.

Figure 17: Hierarchical dataflow graph representation of the code in Listing 9 in two different ways. The crWrk hyperOp (hierarchical node) encompasses the work allocated to each CR. Note that the code assumes the input vector length N is a multiple of (NUMCR × 32).

Figure 18: Execution graph for the code in Listing 9 with input N = 256: (a) control dependence (parent-child) graph; (b) data dependence graph. Each node represents a hyperOp instance of the type named inside the node. The kernel implements fork-join parallelism at two levels, within a CR and across CRs. The start and end hyperOps execute on CR(0,0). Each CR executes two v32Add hyperOps.
To acquire and release locks, C with hyperOps offers two programming interfaces listed below.


void re_lock_acquire(CMAddr serverFrameAddr, CMAddr clientFrameAddr);

void re_lock_release(CMAddr serverFrameAddr);

re_lock_acquire atomically enqueues the lock-client's frame address (clientFrameAddr) to the lock-queue. re_lock_release releases the lock so that the lock-server can subsequently enable the next lock-client at the head of the lock-queue.
The hyperOp annotations ANN_LSER and ANN_LCLI are used for defining lock-server and lock-client hyperOps, respectively. A lock-client hyperOp can have a maximum of 15 operands, and a lock-server hyperOp a maximum of 13; the unused operand slots are used by the runtime system to implement the lock-queue. Further, the 0th operand of a lock-server hyperOp holds the head of the lock-queue, i.e., the next lock-client hyperOp instance that must be enabled. Section 3.8.1 presents an example in which multiple hyperOps write to the same buffer using lock-based synchronization.
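Putting these rules together, the lock-server's context frame can be pictured as follows. This is a schematic inferred from the description above, assuming the 16 operand slots (0-15) seen elsewhere in this document (e.g., opAddr(·,15)); it is not an API definition:

/* Lock-server context frame (schematic):
 *   operand 0       : head of the lock-queue, i.e., the frame address of the
 *                     next lock-client hyperOp instance to be enabled
 *   operands 1..13  : up to 13 user-defined operands
 *   remaining slots : reserved by the runtime to implement the lock-queue
 */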

3.8.1 Code Discussion


1  #include "redefine.h"
2
3  int *buffer;
4  CMAddr lockId;
5
6  //Lock-client hyperOp
7  hyperOp void bufferWrite(CMAddr selfId, Op32 wrPtr$, Op32 nSol$, Op32 joinAddr$){
8    int wrPtr = wrPtr$.i32;
9    int nSol = nSol$.i32;
10
11   //Reserve buffer space of "nSol" capacity
12   writeCM(opAddr(lockId,1), wrPtr+nSol); //Update free write pointer
13   re_lock_release(lockId); //Release lock
14
15   for(int i=0; i<nSol; i++){
16     buffer[wrPtr+i] = ...... ;
17   }
18   sync(joinAddr$.cmAddr, -1);
19 }
20 SMD smdBufferWrite = {.ann = ANN_LCLI, .arity=3, .fptr = (HyOpFunc)bufferWrite};
21
22 hyperOp void writeReq(CMAddr selfId, Op32 joinAddr$){
23   int nSol = ....... ; //no. of writes
24   CMAddr clientId = createInst(&smdBufferWrite);
25   writeCM(opAddr(clientId,1), nSol);
26   writeCM(opAddr(clientId,2), joinAddr$.cmAddr);
27
28   //Request lock on behalf of the lock-client hyperOp (pass its frame address)
29   re_lock_acquire(lockId, clientId);
30 }
31 SMD smdWriteReq = {.ann = ANN_NONE, .arity=1, .fptr = (HyOpFunc)writeReq};
32
33 hyperOp void end(CMAddr selfId){
34   fDelete(lockId); //destroying the lock
35 }
36 SMD smdEnd = {.ann = ANN_END|ANN_JOIN, .arity = 1, .fptr = (HyOpFunc)end};
37
38 /* Lock-server hyperOp; operand0 is reserved for the lock-client frame address,
39  * which is provided as the second argument to re_lock_acquire. */
40 hyperOp void lockSrv(CMAddr selfId, Op32 client$, Op32 wrPtr$){
41   //Sending the write pointer to the lock-client hyperOp
42   writeCM(client$.cmAddr, wrPtr$.i32);
43 }
44 SMD smdLockSrv = {.ann = ANN_LSER, .arity = 2, .fptr = (HyOpFunc)lockSrv};
45
46 kernel void nWriteBuffer(int N, int* mem){
47   buffer = mem;
48
49   CMAddr lockFr = fAlloc(1);
50   fBind(lockFr, &smdLockSrv); //creating the lock
51   writeCM(opAddr(lockFr,1), 0); //initializing write pointer
52   lockId = lockFr;
53
54   CMAddr endId = createInst(&smdEnd);
55   CMAddr endJoin = opAddr(endId, 15);


56   sync(endJoin, N);
57
58   for(int i=0; i<N; i++){
59     CMAddr wrReq = createInst(&smdWriteReq);
60     writeCM(wrReq, endJoin);
61   }
62 }
Listing 10: HyperOps synchronizing using a lock

In this example, multiple hyperOps write to an integer array. To avoid overwriting each other's array elements, the hyperOps first reserve their required array space and then write to the array. The critical section in this example is the reservation of array space, and the shared variable that needs to be protected holds the array index pointing to the head of the unreserved space.
In Listing 10, the bufferWrite hyperOp reserves the array space at line-12, releases the lock at line-13, and performs writes to the array in lines 15-16. The lockSrv hyperOp, lines 40-44, is the lock-server hyperOp that guards the lock; note its ANN_LSER annotation at line-44. The shared variable holding the index of the unreserved space is realized in context memory as the 1st operand of lockSrv, referred to as wrPtr$ at line-40. The bufferWrite hyperOp, lines 7-20, is the lock-client hyperOp; it is enabled when the lockSrv hyperOp provides its 0th operand, referred to as wrPtr$ at line-7. Note that the lockSrv hyperOp receives the bufferWrite hyperOp's 0th operand address (its frame address) through re_lock_acquire at line-29. Lines 49-51 create the lock and initialize the write pointer, and line-34 deallocates the lock.
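The complete handshake for one writer can be traced schematically as follows (a narrative trace of the steps described above; the comments reference Listing 10 line numbers, and this is not executable code):

/* Lock handshake for one writeReq instance:
 * 1. writeReq creates a bufferWrite instance and enqueues it:
 *      re_lock_acquire(lockId, clientId);      // line-29
 * 2. When the lock is free, the runtime spawns lockSrv, which delivers the
 *    current write pointer and thereby enables bufferWrite:
 *      writeCM(client$.cmAddr, wrPtr$.i32);    // line-42
 * 3. bufferWrite reserves its space and releases the lock:
 *      writeCM(opAddr(lockId,1), wrPtr+nSol);  // line-12
 *      re_lock_release(lockId);                // line-13
 * 4. bufferWrite performs its writes outside the critical section (lines 15-17). */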

Figure 19: Hierarchical dataflow graph representation of the code in Listing 10 in two different ways. Multiple bufferWrite hyperOp instances write to the same integer array and avoid overwriting one another's writes by using lock-based synchronization. Note that the dependencies among bufferWrite hyperOp instances are not statically defined.

Figure 19 shows the HDG of the code in Listing 10. Nodes corresponding to lockSrv are not shown in Figure 19, as the runtime system spawns its instances and their dependencies are not statically known. Figure 20 shows the hyperOp execution graph with input N=3; this execution graph does include the lockSrv hyperOp instance nodes, since their dependencies are fully known only after the kernel has executed.

4 REDEFINE Compiler
The RISC-V instruction-set architecture (ISA) [6] is used for implementing the REDEFINE many-core processor. RISC-V is an open ISA and supports customization. The custom-0 opcode map of the RISC-V ISA is used for implementing the REDEFINE execution model-specific instructions; this custom extension of the RISC-V ISA is called XR. Table 4 shows the XR instruction encodings. Each CE, shown in Figure 2b, is an in-order, single-issue, 5-stage pipelined RV32IMFXR core. Figure 21 shows an overview of the compilation flow. The XR instructions are added to the riscv-llvm compiler infrastructure as intrinsics, and GNU Binutils5 is updated to recognize XR instructions.
5 https://github.com/riscv/riscv-gnu-toolchain


Figure 20: Execution graph for the code in Listing 10 with input N = 3: (a) control dependence (parent-child) graph; (b) data dependence graph. The lockSrv hyperOp instances are spawned by the runtime system.

References
[1] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, pp. 66–73, 2010.
[2] M. Alle, K. Varadarajan, A. Fell, R. R. C., N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, "REDEFINE: Runtime reconfigurable polymorphic ASIC," ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 11:1–11:48, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1596543.1596545
[3] F. Yazdanpanah, C. Alvarez-Martinez, D. Jimenez-Gonzalez, and Y. Etsion, "Hybrid dataflow/von-Neumann architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1489–1509, 2014.
[4] S. V. Adve and M. D. Hill, "Weak ordering - a new definition," in Proceedings of the 17th Annual International Symposium on Computer Architecture, ser. ISCA '90. New York, NY, USA: ACM, 1990, pp. 2–14. [Online]. Available: http://doi.acm.org/10.1145/325164.325100


bits 31:25   24:20   19:15   14:12   11:7       6:0       instruction
0000000      rs2     rs1     000     00000      0001011   END rs1, rs2
0000001      00000   rs1     000     rd         0001011   CREATEINST rd, rs1
0000010      rs2     rs1     000     rd         0001011   FALLOC rd, rs1, rs2
0000011      rs2     rs1     000     rd         0001011   RFALLOC rd, rs1, rs2
0000100      rs2     rs1     000     00000      0001011   FBIND rs1, rs2
0000101      rs2     rs1     000     00000      0001011   FDELETE rs1, rs2
imm[11:5]    rs2     rs1     001     imm[4:0]   0001011   WRITECM rs1, rs2, imm
imm[11:5]    rs2     rs1     010     imm[4:0]   0001011   SYNC rs1, rs2, imm

Table 4: XR - a RISC-V ISA custom extension to support the REDEFINE execution model. XR maps
to RISC-V ISA’s custom-0 opcode encoding.
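As an illustration of how these encodings surface to software, the sketch below wraps the SYNC instruction from Table 4 in GNU inline assembly. This is a hedged sketch, not the actual toolchain mechanism: the real compiler emits XR instructions through riscv-llvm intrinsics, and the sync_xr wrapper, the operand roles, and the use of inline asm here are illustrative assumptions (the SYNC mnemonic and operands follow Table 4, and the updated GNU Binutils is assumed to accept it).

// Illustrative only: wrapping the XR SYNC instruction (funct3=010,
// opcode=0001011 per Table 4) in inline asm. The production compiler
// lowers the sync() runtime call through an LLVM intrinsic instead.
static inline void sync_xr(void *cmAddr, int count) {
    // SYNC rs1, rs2, imm -- rs1: context-memory address, rs2: count (assumed roles)
    asm volatile("sync %0, %1, 0" : : "r"(cmAddr), "r"(count));
}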

Figure 21: C with hyperOps compiler implementation. The flow: C with hyperOps source → Modified Clang → LLVM bitcode with REDEFINE intrinsics → LLVM IR optimizer → Modified RISC-V backend → .o/.as → Modified RISC-V GNU Binutils → .elf.

[5] T. Mattson, B. Sanders, and B. Massingill, Patterns for Parallel Programming, 1st ed. Addison-
Wesley Professional, 2004.
[6] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, “The RISC-V Instruction
Set Manual, Volume I: User-Level ISA, Version 2.0,” EECS Department, University
of California, Berkeley, Tech. Rep. UCB/EECS-2014-54, May 2014. [Online]. Available:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-54.html
