Race

Race Conditions

Race Conditions

Christoph von Praun
Georg-Simon-Ohm University of Applied Sciences, Nuremberg, Germany

Synonyms
Access anomaly; Critical race; Determinacy race; Harmful shared-memory access; Race; Race hazard

Definition
A race condition occurs in a parallel program execution when two or more threads access a common resource, e.g., a variable in shared memory, and the order of the accesses depends on the timing, i.e., the progress of individual threads.

The disposition for a race condition is in the parallel program. In different executions of the same program on the same input, access events that constitute a race can occur in different order, which may but does not generally result in different program behaviors (non-determinacy).

Race conditions that can cause unintended non-determinacy are programming errors. Examples of such programming errors are violations of atomicity due to incorrect synchronization.

Discussion
Introduction
Race conditions are common on access to shared resources that facilitate inter-thread synchronization or communication, e.g., locks, barriers, or concurrent data structures. Such structures are called concurrent objects [], since they operate correctly even when being accessed concurrently by multiple threads.

The term concurrent refers to the fact that the real-time order in which accesses from different threads execute is not predetermined by the program. Concurrent may but does not necessarily mean simultaneous, i.e., that the execution of accesses overlaps in real time.

Concurrent objects can be regarded as arbiters that decide on the outcome of race conditions. Threads that participate in a race when accessing a concurrent object will find agreement about the outcome of the race; hence access to concurrent objects is a form of inter-thread synchronization.

A common and intuitive correctness criterion for the behavior of a concurrent object is linearizability [], which informally means that accesses from different threads behave as if the accesses of all threads were executed in some total order that is compatible with the program order within each thread and the real-time order.

Linearizability is not a natural property of concurrent systems. Ordinary load and store operations on weakly ordered shared memory, e.g., can expose many more behaviors than permitted by the strong rules of linearizability. Hence, when implementing concurrent objects on such systems, specific algorithms or hardware synchronization instructions are required to ensure correct, i.e., linearizable, operation.

Race conditions on access to resources that are not designed as concurrent objects, e.g., shared memory locations with ordinary load and store operations, are commonly considered programming errors. An example of such errors is the data race. Informally, a data race occurs when multiple threads access the same variable in shared memory; the accesses may occur simultaneously and at least one access modifies the variable.
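A minimal sketch of such a data race (not taken from this entry; it assumes a POSIX threads environment): both threads perform unsynchronized, conflicting updates of the shared variable counter, so the accesses may occur simultaneously and the result depends on thread timing.

#include <pthread.h>
#include <stdio.h>

static int counter = 0;                 /* shared resource */

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;                      /* unsynchronized read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* The printed value depends on the timing of the threads and is usually
     * smaller than 200000 because concurrent increments are lost. */
    printf("counter = %d\n", counter);
    return 0;
}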


Model and Formalization
The definition and formalization of the terminology is adopted from Netzer and Miller [].

Events  An event represents an execution instance of a statement or statement sequence in the program. Every event specifies the set of shared memory locations that are read and/or written. In this entry, events typically refer to individual read or write operations that access a single shared memory location.

Simultaneous events  Events do not necessarily occur instantaneously, and in parallel programs, events can be unordered. A simple model that relates the occurrence of events to wall-clock time is as follows: The timing of an event is specified by its start and end instants. Events e1, e2 are simultaneous if the start of e1 occurs after the start of e2 but before the end of e2, or vice versa. Figure 1 illustrates the concept.

Temporal ordering relation  Events that are not simultaneous are ordered by a temporal ordering relation →T, such that e1 →T e2 :⇔ end(e1) < start(e2). In the execution of a sequential program, →T is a total order; for parallel programs, →T is typically a partial order. For clarity of the presentation and without limiting generality, the temporal ordering relations in the examples of this entry are total orders.

Shared data dependence relation  Another relation among events is the shared data dependence relation →D that specifies how events communicate through memory. e1 →D e2 holds if e2 reads a value that was written by e1. This relation serves to distinguish executions that have the same temporal ordering but differ in the communication among simultaneous events.

Race Conditions. Fig. 1  Start and end instants of three events on a wall-clock time line: e1 is simultaneous to e2 and e3

Program execution  A program execution is a triplet P = ⟨E, →T, →D⟩.

Feasible program executions  To characterize a race condition in a program execution P, it is necessary to contrast P with other possible executions P′ of the same program on the same input. P′ can differ from P in the temporal ordering or shared data dependence relation (or both). Moreover, P′ should be a prefix of P, which means that each thread in P′ performs the same or an initial sequence of events as the corresponding thread in P. In other words, the projection for each thread in P′ is the same or a prefix of the projection of the corresponding thread in P. The set of such program executions P′ is called the set of feasible program executions Xf. A feasible execution is one that adheres to the control and data dependencies prescribed by the program.

It is reasonable to contrast P with execution prefixes P′, since the focus of interest is on the earliest point of the program execution P where a race condition occurred and thus non-determinacy may have been introduced. The execution that follows from that point on may contain further race conditions or erroneous program behavior, but these may merely be a consequence of the initial race condition.

Xf contains, e.g., program executions P′ that differ from P only in the temporal ordering relation. In other words, the timing in P′ can deviate from the timing in P, but not such that any shared data dependencies among events from different threads are resolved differently. Thus P and P′ have the same functional behavior. An execution that differs from P in Fig. 2 in such a way is, e.g., P′ = ⟨{e1, e2, ..., e5}, e4 →T′ e1 →T′ e2 →T′ e5 →T′ e3, ∅⟩: Reordering the read of shared variable x does not alter the shared data dependence relation.

A somewhat more substantial departure from P, but still within the scope of feasible program executions, are executions where shared data dependencies among events from different threads are resolved differently. An execution that differs from P in Fig. 2 in such a way is, e.g., P′′ = ⟨{e1, e2, e4, e5}, e4 →T′′ e5 →T′′ e1 →T′′ e2, e5 →D′′ e1⟩: The update of x (event e5) occurs in this execution before the read of x in event e1. This induces a new data dependence. It is sufficient that P′′ specifies a prefix of P, namely, up to the event where the change in the shared data dependence manifests.

initially x = 0

thread-1              thread-2
(1) r1 = x            (4) r2 = x
(2) if (r1 == 0)      (5) x = r2 + 1
(3) x = r1 + 1

Race Conditions. Fig. 2  Feasible execution: P = ⟨{e1, e2, ..., e5}, e1 →T e4 →T e2 →T e5 →T e3, ∅⟩. Events ei correspond to the execution of statements marked (i). There are no shared data dependencies in this execution

Finally, an execution of the program in Fig. 2 that is not feasible: P′′′ = ⟨{e1, e2, e3, e4, e5}, e4 →T′′′ e5 →T′′′ e1 →T′′′ e2 →T′′′ e3, e5 →D′′′ e1⟩. This execution is not possible, since e3 would not be executed if e1 read a value different from 0.

Conflicting events  A pair of events is called conflicting, if both events access the same shared resource and at least one access modifies the resource [].

General Races and Data Races
Given a program execution P = ⟨E, →T, →D⟩, let Xf be the set of feasible executions corresponding to P.

General race  A general race exists between conflicting events a, b ∈ E, if there is a P′ = ⟨E′, →T′, →D′⟩ ∈ Xf, such that a, b ∈ E′ and

1. b →T′ a if a →T b, or
2. a →T′ b if b →T a, or
3. a ↛T′ b.

In other words, there is some order among events a and b but the order is not predetermined by the program (cases 1 and 2), or executions are feasible where a and b occur simultaneously (case 3).

Data race  A data race is a special case of a general race. A data race exists between conflicting memory accesses a, b ∈ E, if there is a P′ = ⟨E′, →T′, →D′⟩ ∈ Xf, such that a, b ∈ E′ and a ↛T′ b. In other words, executions are feasible where a and b occur simultaneously.

Programmers commonly use interleaving semantics when reasoning about concurrent programs. Interleaving semantics mean that the possible behaviors of a parallel program correspond to the sequential execution of events interleaved from different threads. This property of an execution is also called serializability []. The resulting intuitive execution model is also known as sequential consistency [].

The motivation for the distinction between data races and general races is to explicitly term those race conditions where the intuitive reasoning according to interleaving semantics may fail; these are the data races. A common situation where interleaving semantics fails in the presence of data races are concurrent accesses to shared memory in systems that have a memory consistency model that is weaker than sequential consistency [].

A data race ⟨a, b⟩ means that events a and b have the potential of occurring simultaneously in some execution. Simultaneous access can be prevented by augmenting the access sites in the program code with some form of explicit concurrency control. That way, every data race can be turned into a general race that is not a data race.

Race conditions and concurrent objects  For linearizable concurrent objects, the distinction between data races and general races among events is immaterial. This is because linearizability requires that events on concurrent objects that are simultaneous behave as if they occurred in some order.

Feasible and Apparent Races
Feasible races  So far, the definition of a race condition in P is based on the set of feasible program executions Xf. Thus, locating a race condition requires that the feasibility of a program execution P′, i.e., P′ ∈ Xf, is assessed, and that in turn requires a detailed analysis of shared data and control dependencies. This problem is NP-hard [, ]. Race conditions defined on the basis of feasible program executions are called feasible general races and feasible data races, respectively.

Apparent races  Due to the difficulty of verifying the feasibility of a program execution P, common race detection methods widen the set of program executions to consider. The set Xa ⊇ Xf contains all executions P′ that are prefixes of P with the following restriction: The temporal ordering →T′ obeys the program order and inter-thread dependencies due to explicit synchronization; such a relation is also known as the happened-before relation []. It is not required that P′ be feasible with respect to shared data dependencies or control dependencies that result from those. Figure 3 gives an example.

initially x,y = 0

thread-1              thread-2
(1) x = 1             (3) r1 = y
(2) y = 1             (4) if (r1 == 1)
                      (5) r2 = x

Race Conditions. Fig. 3  This program does not have explicit synchronization that would restrict the temporal ordering →T. Thus the execution P′ = ⟨{e1, ..., e5}, e3 →T′ e4 →T′ e5 →T′ e1 →T′ e2, ∅⟩ is in Xa. P′ is however not feasible, since control dependence does not allow execution of e5 if the read in e3 returns 0

Race conditions that are determined using a superset of feasible program executions, e.g., Xa, are called apparent general races and apparent data races, respectively. Note that this definition of race conditions depends on the specifics of the synchronization mechanism.

The notion of feasible races is more restricting than the notion of apparent races. This means that an algorithm that detects apparent races considers executions that are not feasible and therefore may report race conditions that could never occur in a real program execution. Such reports are called spurious races.

Figure 4 shows another example to illustrate the notion of apparent races further. This program does not have explicit synchronization and has read and write accesses to shared variables x and y.

initially x,y = 0

thread-1              thread-2
(1) r1 = x            (4) r2 = y
(2) if (r1 == 1)      (5) if (r2 == 1)
(3) y = 1             (6) x = 1

Race Conditions. Fig. 4  Program with read and write accesses to shared variables that has neither a feasible nor an apparent data race

Perhaps surprisingly, the program in Fig. 4 has neither an apparent nor a feasible data race. The reason is that there is no program execution P that contains event e3 or e6. Hence the sets Xf or Xa, which are defined based on P, cannot contain those events either, and hence neither Xf nor Xa would lead us to determine a feasible or an apparent race condition, respectively.

Some of the literature does not explicitly distinguish apparent and feasible races, i.e., the term "race" may refer to either of the two definitions, depending on the context.

Actual data races  An actual data race exists in execution P = ⟨E, →T, →D⟩, if events a, b ∈ E are conflicting and both events occur simultaneously, i.e., a ↛T b.

Race Conditions as Programming Errors
Some race conditions are due to programming errors and some are part of the intended program behavior. The distinction between the two is commonly made along the distinction of data races and general races: data races are programming errors, whereas general races that are not data races are considered as "normal" behavior. While such a distinction is oftentimes true, exceptions exist and are common.

First, there are correct programs that entail the occurrence of data races at run time. In non-blocking concurrent data structures [], e.g., data races are intended and considered by the algorithmic design. The development and implementation of such algorithms that "tolerate" race conditions is extremely difficult, since it requires a deep understanding of the shared memory consistency model. Moreover, the implementation of such algorithms is typically not portable, since different platforms may have different shared memory semantics (see "Data-Race-Free Programs").

Data races that are not manifestations of program bugs are called benign data races.

Second, there are incorrect programs that have executions with general races that are not data races. An example of such a programming error is a violation of atomicity []: Accesses to a shared resource occur in separate critical sections, although the programmer's intent was to have the accesses in the same atomicity scope.
condition. atomicity []: Accesses to a shared resource occur in

Non-determinacy  Race conditions can but do not necessarily introduce non-determinacy in parallel programs. For a detailed discussion of this aspect, refer to the entry on "Determinacy".

Data-Race-Free Programs
Data races are an important concept to pinpoint certain behaviors of parallel programs that are manifestations of programming errors. But beyond this role, the concept of data races is also important for system architects who design shared memory consistency models.

Intuitively, a shared memory consistency model defines the possible values that a read may return considering prior writes. Consider the program execution in Fig. 5.

thread-1              thread-2
(1) write(x, 1)       (4) write(y, 5)
(2) write(x, 2)       (5) write(y, 7)
(3) read(y, 5)        (6) read(x, 1)

Race Conditions. Fig. 5  Program execution with a data race. This execution is not serializable

Despite the fact that this execution is correct according to the memory consistency model of common multiprocessor architectures [], such program behavior is difficult to analyze and explain. The reason is that the program execution is not serializable, i.e., there is no interleaving of events from different threads that obeys program order and that would justify the results returned by the reads [].

Why should such behavior be permitted? For optimization of the memory access path, a processor or compiler can reorder and overlap the execution of memory accesses such that the temporal ordering relation →T, and with it the shared data dependence relation →D, can differ from the program order. For the program in Fig. 5, one of the processors hoists the read before the writes, thus permitting the result.

To allow such optimization on the one hand and to give programmers an intuitive programming abstraction on the other hand, computer architects have phrased a minimum guarantee that shared memory consistency models should adhere to: Data-race-free programs should only have executions that could also occur on sequentially consistent hardware. In other words, the behavior of parallel programs that are data-race-free follows the intuitive interleaving semantics. So what does data-race-free mean?

Data-race-free in computer architecture  Adve and Hill [] define a data race as a property of a single program execution: A pair of conflicting accesses participates in a data race, if their execution is not ordered by the happened-before relation []. The happened-before relation is defined as the combination of program order and the synchronization order due to synchronization events from different threads. This definition allows, unlike the definition according to Netzer and Miller [], to validate the presence or absence of a data race by inspecting a single program execution.

An execution is data-race-free, if it does not have a data race. A program is data-race-free, if all possible executions are data-race-free. "Possible executions" are thereby implicitly defined as any executions permitted by the processor, for which the memory model is defined.

Data-race-free in parallel programming languages  The developers of parallel programming languages also specify the semantics of shared memory using a memory model [, ]. Like memory models at the hardware level, the minimum requirement is that data-race-free programs can only have behaviors that could result from sequentially consistent executions.

The definition of what constitutes a data race is however more complex, since building the model entirely around the happened-before relation is not sufficient. This is because the starting point for the definition of data races are programs, not program executions. It is necessary to explicitly define which executions are feasible for a given program.

Causality  Manson et al. [] define a set of requirements called causality that a program execution has to meet to be feasible. As an example, consider the program in Fig. 4. The causality requirement serves to rule out the feasibility of executions that contain events e3 or e6. Intuitively, causality requires an acyclic chain of "facts" that justify the execution of a statement. The writes e3, e6 may only execute if the reads e1, e4 returned the value 1. That in turn requires that the writes executed. This kind of cyclic justification order is forbidden by the Java memory model. This program is hence data-race-free, since executions that include the writes e3, e6 are not feasible due to a violation of the causality rule.

More generally, the causality requirement reflects on the overall definition of "data-race-free" as follows: Programs that have no data races in sequentially consistent executions are called data-race-free. Executions of such programs are always sequentially consistent.

Alternatives to causality  Causality has led to significant complexity in the description of the Java memory model and it is not evident that the rules for causality are complete. Hence alternative proposals have been made to specify feasible program executions, e.g., by Saraswat et al. [].

Effects of Data Races on Program Optimization
Midkiff and Padua [] observed that program transformations that preserve the semantics of sequential programs may not be correct for parallel programs with data races. Thus, the presence of data races inhibits the optimization potential of compilers for parallel programs.

Race Conditions in Distributed Systems
Race conditions can also occur in distributed systems, e.g., when multiple client computers communicate with a shared server. Conceptually, the shared server is a concurrent object that gives certain guarantees on the behavior of concurrent accesses. Typically, concurrency control at the level of the communication protocol or at the level of the server implementation will ensure that semantics follow certain intuitive and well-understood correctness conditions such as linearizability.

In MPI programs [], e.g., a race condition may occur at a receive operation (MPI_Recv) where the source of the message to be received can be arbitrary (MPI_ANY_SOURCE). The order in which concurrent messages from different sources are received can be different in different program runs. The receive operation serializes the arrival of consecutive messages and hence exercises concurrency control.
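A small sketch of such a race at a wildcard receive (not taken from this entry; it assumes a standard MPI installation and at least three ranks): the order in which rank 0 matches the incoming messages is decided by the runtime, not by the program.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        for (int i = 1; i < size; i++) {
            int payload;
            MPI_Status status;
            /* Which sender is matched first is not determined by the program;
             * the MPI runtime arbitrates the race among concurrent messages. */
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            printf("received %d from rank %d\n", payload, status.MPI_SOURCE);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}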
Related Entries
Determinacy
Memory Models
Race Detection Techniques

Bibliographic Notes and Further Reading
Race conditions and data races have been discussed in early literature about the analysis and debugging of parallel programs [, , , , ]. An informal characterization of the term "race" was, e.g., given by Emrath and Padua []: A race exists when two statements in concurrent threads access the same variable in a conflicting way and there is no way to guarantee execution ordering. A formal definition is given by Netzer and Miller [], which is based on the existence of program executions with certain properties as discussed by this entry. This definition however does not lend itself to the design of a practical algorithm that validates the presence or absence of a race condition in a given program execution (see the related entry on Race Detection Techniques).

Hence for practical applications, slightly different definitions of data races are common. In the context of shared memory models, the interest is mainly in characterizing the absence of data races, hence the definition of data-race-free program executions and data-race-free programs in [, , ]. Most definitions for data races in the context of static analysis tools make a conservative approximation of the program's data-flow and control-flow and thus are akin to Netzer and Miller's concept of apparent data races. For dynamic data-race detection, the approximations made for the definition of what constitutes a data race may not be conservative any longer: While the goal of dynamic race detection tools is to detect all feasible races on the basis of a single-input-single-execution (SISE) [], there are feasible data races according to [] that may be overlooked according to these definitions.

Bibliography
Adve S, Gharachorloo K () Shared memory consistency models: a tutorial. IEEE Comput :–

Adve S, Hill M (June ) A unified formalization of four shared-memory models. IEEE Trans Parallel Distrib Syst ():–
Arvind, Maessen J-W () Memory model = instruction reordering + store atomicity. SIGARCH Comput Archit News ():–
Dinning A, Schonberg E (December ) Detecting access anomalies in programs with critical sections. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, pp –
Emrath PA, Padua DA (January ) Automatic detection of nondeterminacy in parallel programs. In: Proceedings of the ACM workshop on parallel and distributed debugging, pp –
Flanagan C, Freund SN () Atomizer: a dynamic atomicity checker for multithreaded programs. In: POPL': Proceedings of the st ACM SIGPLAN-SIGACT symposium on principles of programming languages. ACM, New York, pp –
Herlihy MP, Wing JM (July ) Linearizability: a correctness condition for concurrent objects. ACM Trans Program Lang Syst (TOPLAS) :–
Lamport L (July ) Time, clocks, and the ordering of events in a distributed system. Commun ACM ():–
Lamport L (July ) How to make a correct multiprocess program execute correctly on a multiprocessor. IEEE Trans Comput ():–
Manson J, Pugh W, Adve S () The Java memory model. In: Proceedings of the symposium on principles of programming languages (POPL'). ACM, New York, pp –
Mellor-Crummey J (May ) Compile-time support for efficient data race detection in shared-memory parallel programs. In: Proceedings of the workshop on parallel and distributed debugging. ACM, New York, pp –
Message Passing Interface Forum (June ) MPI: a message passing interface standard. http://www.mpi-forum.org/
Michael MM, Scott ML () Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: PODC': Proceedings of the th annual ACM symposium on principles of distributed computing. ACM, New York, pp –
Midkiff S, Padua D (August ) Issues in the optimization of parallel programs. In: Proceedings of the international conference on parallel processing, pp –
Nelson C, Boehm H-J () Sequencing and the concurrency memory model (revised). The C++ Standards Committee, Document WG/N J/-
Netzer R, Miller B () On the complexity of event ordering for shared-memory parallel program executions. In: Proceedings of the international conference on parallel processing, Pennsylvania State University, University Park. Pennsylvania State University Press, University Park, pp –
Netzer R, Miller B (July ) Improving the accuracy of data race detection. Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming PPOPP, published in ACM SIGPLAN NOTICES ():–
Netzer R, Miller B (March ) What are race conditions? Some issues and formalizations. ACM Lett Program Lang Syst ():–
Nudler I, Rudolph L () Tools for the efficient development of efficient parallel programs. In: Proceedings of the st Israeli conference on computer system engineering
Saraswat VA, Jagadeesan R, Michael M, von Praun C () A theory of memory models. In: PPoPP': Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp –
Shasha D, Snir M (April ) Efficient and correct execution of parallel programs that share memory. ACM Trans Program Lang Syst ():–
Sterling N (January ) WARLOCK: a static data race analysis tool. In: USENIX Association (ed) Proceedings of the USENIX winter conference, San Diego, pp –
Weaver DL, Germond T () The SPARC architecture manual (version )

Race Detection Techniques

Christoph von Praun
Georg-Simon-Ohm University of Applied Sciences, Nuremberg, Germany

Synonyms
Anomaly detection

Definition
Race detection is the procedure of identifying race conditions in parallel programs and program executions. Unintended race conditions are a common source of error in parallel programs and hence race detection methods play an important role in debugging such errors.

Discussion
Introduction
The following paragraphs summarize the notion of race conditions and data races. An elaborate discussion and detailed definitions are given in the corresponding related entry on "race conditions."

Race conditions. A widely accepted definition of race conditions is given by Netzer and Miller []: A general race, or race for short, is a pair of conflicting accesses to a shared resource for which the order of accesses is not guaranteed by the program, i.e., the accesses may execute in either order or simultaneously. A subclass of general races are data races, for which conflicting memory accesses may occur simultaneously.
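A brief sketch of a general race that is not a data race (not taken from this entry; POSIX threads are assumed): the two conflicting writes are protected by the same lock and therefore cannot occur simultaneously, but the program does not determine their order.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int x = 0;

static void *writer(void *arg) {
    pthread_mutex_lock(&m);
    x = *(int *)arg;            /* ordered by the lock, but in either order */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int one = 1, two = 2;
    pthread_create(&t1, NULL, writer, &one);
    pthread_create(&t2, NULL, writer, &two);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);      /* prints 1 or 2, depending on which thread wins */
    return 0;
}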

Data races. The bulk of literature on race detection discusses methods for the detection of data races. Most data races are programming errors, typically due to omitted synchronization. Data races can potentially cause the atomicity of critical sections to fail.

Violations of atomicity. Race conditions that are not data races are common in parallel program executions. Some of such races are however programming errors related to incorrect use of synchronization. A common example are accesses to a shared resource that occur in separate critical sections, although the programmer's intent was to have the accesses in the same atomicity scope. Such programming errors are commonly called violations of atomicity []. Taken literally, this terminology seems perhaps too general, since data races, which are not implied by this term, can also cause the atomicity of a critical section to fail.

Detection accuracy. An ideal race detection should have two properties: (1) A race condition is reported if the program or a program execution, whatever is checked, has some race condition. This property is called soundness. Notice that if there are several race conditions, soundness does not require that all of them are reported. (2) Every reported race condition is a genuine race condition in a feasible program execution. This property is called completeness [].

Detection methods. Static methods analyze the program code and do not require that the program executes. Sound and complete detection of feasible race conditions in a parallel program by static analysis is in general undecidable due to a theoretical result by Ramalingam [].

Dynamic methods analyze the event stream from a single program execution, and thus determine the occurrence of a race condition in one program execution. Such analysis can be sound and complete. Naturally, dynamic methods cannot make general statements about the disposition of a race condition in a program. However, the accuracy of a dynamic analysis can be more or less sensitive to the timing of threads. Ideally, a dynamic race detection algorithm has the single input, single execution (SISE) property [], which means that "A single execution instance is sufficient to determine the existence of an anomaly (data race) for a given input." In other words, the detection accuracy should be oblivious to the timing of the execution.

When designing a static or dynamic race detection method, trade-offs between performance, resource usage, and accuracy are made. In practice, it is not uncommon that useful tools are unsound, incomplete, or both with respect to the detection of feasible race conditions.

Data Race Detection
There are two principal approaches to detect data races: happened-before-based (HB-based) and lockset-based methods.

HB-Based Data Race Detection
HB-based data race detection operates typically at run-time, thus validating the occurrence of data races in a program execution.

Data race definition. A data race ⟨a, b⟩ is deemed to have occurred in a program execution, if a and b are conflicting accesses to the same resource, and the accesses are causally unordered. The causal order among run-time events in a concurrent system is thereby defined through the happened-before relation by Lamport [].

This definition assesses the existence of a data race on one particular program execution and its HB-relation. Note that this is different from the definition of feasible and apparent data races by Netzer and Miller [], since their definition assesses the occurrence of a data race on a set of program executions. Thus, it is possible that an execution is classified as "data-race-free" according to the above (HB-based) definition but it is not data-race-free according to Netzer and Miller's definition []. An example of such a program execution, which is detailed in a later paragraph, is given in Fig. 1.

Happened-before relation. A pair of events ⟨a, b⟩ is ordered according to the happened-before relation (HB-relation), if a and b occur in the same thread and are ordered according to the sequential execution of that thread, or a and b are synchronization events in different threads that are ordered due to the synchronization order. Overall, the happened-before relation is a transitive partial order.

The notion of synchronization order depends on the underlying programming model and has been originally described for message passing in distributed systems []. The synchronization order has also been defined for synchronization operations on multiprocessor architectures [, ], for fork-join synchronization [], and for critical sections [].

Algorithmic principles. HB-based race detectors perform three tasks: (1) track the HB-relation within each thread; (2) keep an access history as a sequence of logical timestamps for each shared resource; (3) validate that, for every resource, critical accesses are ordered by the HB-relation.

Various HB-based race detection algorithms have been proposed; they differ in the precise methods and run-time organization of computing the three tasks.

Race Detection Techniques. Fig. 1  Two executions of a program with three threads and lock-based synchronization (edges: intra-thread order, synchronization order, happened-before order). In execution (a), the assignments to variable x in thread-1 and thread-3 are ordered by the HB-relation. There is however a feasible data race, since these assignments may occur simultaneously as, e.g., in execution (b)

Accuracy. If the HB-relation is recorded precisely, HB-based data race detection on a specific program execution can be complete and also sound according to the HB-based definition of data races. However, HB-based data race detection is not sound with respect to data races according to Netzer and Miller's definition []: Fig. 1 illustrates two executions of the same program, one where the assignments to x are ordered by the HB-relation (a), and one where the assignments occur simultaneously (b). An HB-based race detector applied to execution (a) would not report the data race. In other words, HB-based race detection does not have the SISE property.

Encodings and algorithms for computing the HB-relation. The HB-relation can be represented and recorded precisely during a program execution with standard vector clocks [, ]. Race detectors that build on standard vector clocks are, e.g., the family of Djit detectors by Pozniansky and Schuster [, ]. The algorithm requires that the maximum number of threads is specified in advance.

A theoretical result from Charron-Bost [] implies that the precise computation of vector clock values for n events in an execution with p processors requires O(np) time and space. Due to this result, the encoding of the HB-relation with vector clocks is considered to be impractical [].

Performance optimizations. There are two main sources of overhead in HB-based race detection: First, the tracking of logical time, usually by vector clocks or variations. Second, access operations to shared memory and synchronization operations are instrumented to record access histories and verify orderings.

Flanagan and Freund [] optimize vector clocks with an adaptive representation of timestamps as follows: The majority of memory accesses target thread-local, lock-protected, or shared-read data; for such "safe" access patterns a very efficient and compact timestamp representation is used. This representation is adaptively enhanced to a full vector clock when the access patterns deviate from the safe ones; then, access ordering can be validated according to the HB-relation.
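The following simplified sketch (not taken from this entry) illustrates tasks (1) and (3) with plain vector clocks for a fixed number of threads; the names and the two-thread scenario are illustrative only, and practical detectors such as Djit or the adaptive scheme above are considerably more elaborate.

#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 2   /* assumed fixed thread count, as in classic vector-clock detectors */

typedef struct { unsigned c[NTHREADS]; } vclock;

/* join: component-wise maximum, applied when a thread synchronizes with
 * (e.g., acquires a lock released by) another thread */
static void vc_join(vclock *dst, const vclock *src) {
    for (int i = 0; i < NTHREADS; i++)
        if (src->c[i] > dst->c[i]) dst->c[i] = src->c[i];
}

/* hb: does the event stamped 'a' happen before (or equal) the event stamped 'b'? */
static bool vc_hb(const vclock *a, const vclock *b) {
    for (int i = 0; i < NTHREADS; i++)
        if (a->c[i] > b->c[i]) return false;
    return true;
}

/* Per-variable access history: here only the timestamp of the last write;
 * full detectors also keep read timestamps and check read-write pairs. */
typedef struct { vclock last_write; bool written; } shadow;

/* Task (3): the new write races with the previous write if the previous
 * write does not happen before it according to the HB-relation. */
static void check_write(shadow *s, const vclock *now, int tid) {
    if (s->written && !vc_hb(&s->last_write, now))
        printf("data race reported at write by thread %d\n", tid);
    s->last_write = *now;
    s->written = true;
}

int main(void) {
    vclock t0 = {{0}}, t1 = {{0}};
    shadow x = {{{0}}, false};

    t0.c[0]++; check_write(&x, &t0, 0);   /* thread 0 writes x                */
    t1.c[1]++; check_write(&x, &t1, 1);   /* thread 1 writes x: unordered, race */

    vc_join(&t1, &t0);                    /* thread 1 synchronizes with thread 0 */
    t1.c[1]++; check_write(&x, &t1, 1);   /* ordered write: no race reported    */
    return 0;
}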

Variants of vector clocks with alternative encodings of the HB-relation have been developed to improve the efficiency of tracking logical time and checking the ordering among pairs of accesses. Examples for such encodings are the English-Hebrew labeling (EH) by Nudler and Rudolph [], task recycling (TR) by Dinning and Schonberg [], and offset-span labeling (OS) by Mellor-Crummey []. These algorithms make trade-offs at different levels: For example, OS does, unlike TR, avoid global communication when computing the timestamps; this makes the algorithm practical for distributed environments. However, OS is limited to nested fork-join synchronization patterns.

The implementation of encodings can be optimized further for efficiency, sometimes however at the cost of reducing the accuracy of the detection. Dinning and Schonberg [], e.g., limit the size of access histories per variable. Some data races may not be reported due to this optimization, which seemed to occur infrequently in their experiments.

A similar optimization is described by Choi and Min []: Their algorithm records only the most recent read and write timestamp for a shared variable; as a consequence, data races that occurred in an execution may not be reported. They argue however that this optimization preserves the soundness of the detection, i.e., if there is some data race according to the HB-definition, it is reported. An iterative debugging process based on deterministic replay will eventually identify all data races and thus lead to a program execution that is data-race-free (again according to the HB-definition). The significant contribution of Choi and Min [] is their systematic and provably correct debugging methodology for data races.

Compilers can reduce the overhead of race detection by pruning the instrumentation of accesses that are provably not involved in data races, such as access to thread-local data. Moreover, accesses that lead to redundant access events [, ] at run-time may also be recognized by a compiler. Static analyses of concurrency properties support this process (see "Static Analysis").

Even if redundant accesses cannot be safely recognized by a compiler, techniques have been developed by Choi et al. [] to recognize and filter redundant access events at run-time.

Another run-time technique to reduce the overhead of dynamic data race detection is to include only small fractions of the overall program execution (sampling) in the checking process []. Perhaps surprisingly, the degradation of accuracy due to an incomplete event trace is moderate and acceptable in the light of significant performance gains.

Race detection in distributed systems. Race detection in distributed systems refers to HB-based data race detection in distributed shared memory (DSM) systems. Perkovic and Keleher, e.g., [] present an optimized DSM protocol where the tracking of HB-information is piggybacked on the coherence protocol. Another DSM race detection technique is presented by Richards and Larus [].

Lockset-Based Data Race Detection
Lockset-based data race detection is a technique tailored to programs that use critical sections as their primary synchronization model. Instead of validating the absence of data races directly, the idea of lockset-based data race detection is to validate that a program or program execution adheres to a certain programming policy, called locking discipline [].

Locking discipline. A simple locking policy could, e.g., state that threads that access a common memory location must hold a mutual exclusion lock when performing the access. Compliance with this locking discipline implies that executions are data-race-free. The validation of the locking discipline is done with static or dynamic program analysis, or combinations thereof.

The key ideas of lockset-based data race detection are due to Savage et al. [], although the concept of lock covers, which is related to locksets, is discussed earlier by Dinning and Schonberg [].

Algorithmic principles. Checking a locking discipline works as follows: Each thread tracks at run-time the set of locks it currently holds. Conceptually, each variable in shared memory has a shadow location that holds a lockset. On the first access to a shared variable, the shadow memory is initialized with the lockset of the current thread. On subsequent accesses, the lockset in shadow memory is updated by intersecting it with the lockset of the accessing thread. If the intersection is empty and the variable has been accessed by different threads, a potential data race is reported.
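A compact sketch of this lockset check (not taken from this entry; it follows the spirit of the algorithm by Savage et al., with locks represented as bits of an integer mask and all names chosen for illustration):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned locks;                 /* bit i set => lock i is held */

typedef struct {
    locks candidate;                    /* C(v): locks that protected every access so far */
    bool  accessed;
    int   first_tid;                    /* thread that performed the first access */
    bool  shared;                       /* accessed by more than one thread */
} shadow_word;

/* Called on every read or write of the monitored variable. */
static void check_access(shadow_word *s, int tid, locks held) {
    if (!s->accessed) {                 /* first access: initialize C(v) */
        s->candidate = held;
        s->first_tid = tid;
        s->accessed = true;
        return;
    }
    if (tid != s->first_tid)
        s->shared = true;
    s->candidate &= held;               /* refine C(v) by intersection */
    if (s->shared && s->candidate == 0)
        printf("locking discipline violated: potential data race (thread %d)\n", tid);
}

int main(void) {
    shadow_word v = {0, false, 0, false};
    enum { L1 = 1u << 0, L2 = 1u << 1 };

    check_access(&v, 0, L1);            /* thread 0 accesses v holding L1          */
    check_access(&v, 1, L1 | L2);       /* thread 1 holds L1 too: C(v) = {L1}, ok  */
    check_access(&v, 1, L2);            /* thread 1 holds only L2: C(v) = {}, report */
    return 0;
}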

Accuracy. Lockset-based detection is sound. In fact, the validation of a locking discipline has the SISE property, i.e., all race conditions that occur in a certain execution are reported. The detection is however incomplete, since accesses that violate the locking discipline may be ordered by other means of synchronization.

The principles of lockset-based race detection have been refined with two goals in mind: To increase the accuracy (reduce overreporting) and to improve the performance of dynamic checkers.

Increasing the accuracy. There are several common parallel programming idioms that are data-race-free but violate the simple locking discipline stated before. Examples are initialization without lock protection, read-sharing, or controlled data handoff. To accommodate these patterns, the simple locking policy can be extended as follows: An abstract state machine is associated with each shared memory location. The state machine is designed to identify safe programming idioms in the access stream and to avoid overreporting.

For example, read and write accesses of the initializing thread can proceed safely, even without locking. If other threads read the variable and only read accesses follow, a read-sharing pattern is deemed to have occurred and no violation of the locking policy is reported, even if the accessing threads do not hold a common lock at the time of the accesses []. Subsequent work has picked up these ideas and refined the state model to capture more elaborate sharing patterns that are safe [, ].

The state model can introduce unsoundness, namely, in cases where the timing of threads in an execution lets an access sequence look like a safe programming idiom, where indeed an actual data race occurred. In practice, the benefits of reduced overreporting outweigh the drawback due to potential unsoundness.

Hybrid data race detection. Some methods for dynamic data race detection [, , , ] combine the lockset algorithm [] with checks of Lamport's happened-before relation. Such a procedure mitigates the shortcomings of either approach and improves the accuracy of the detection.

Performance optimizations. Dynamic tools can be supported by a compiler that reduces instrumentation and thus the frequency of run-time checks. For example, accesses that are known at compile-time not to participate in data races don't have to be instrumented [, , ].

Another strategy to reduce the run-time overhead is to track the lockset and state information in shadow memory not per fixed-size variable but at a larger granularity, e.g., at the granularity of programming language objects [] or minipages [].

Application to non-blocking synchronization models. Although most work has studied programs with blocking synchronization (locks), the principles are also applicable to optimistic synchronization [, ].

Dynamic Methods
So far, data race detection has been described as a dynamic program analysis. Dynamic methods are also called trace-based, since the subject of the analysis is an execution trace.

Architecture. There are different styles for organizing the data collection and analysis phase:

● Post-mortem methods record a complete set of relevant run-time events in a persistent trace for offline analysis. This method was common in early race detection systems, e.g., by Allan and Padua []. Due to the possibly large size of the trace, this method is limited to short program runs.
● Online methods, also called on-the-fly detection, record events temporarily and perform the data race analysis entwined with the actual program execution. Much of the recorded information can be discarded as the analysis proceeds. Depending on the algorithm, this may but does not necessarily compromise the accuracy of the detection. Most dynamic data race tools choose this architecture, e.g., Mellor-Crummey [] for an HB-based analysis and Savage et al. [] for a method based on locksets. In distributed systems, online methods are also the preferred analysis architecture.

Implementation techniques. Dynamic race detection is an aspect that is crosscutting and orthogonal to the remaining functionality of the software. Several techniques have been used to incorporate race detection in a software system:

● Program instrumentation augments memory and synchronization accesses with checking code. Instrumentation can occur at compile-time [], or at run-time [].
● Race detection as part of a software-based protocol implementation in distributed shared memory systems [, , ].
● Hardware extensions for HB-based [, , , ] or lockset-based [] race detection.

Limitations. An inherent limitation of dynamic race detection is that it is based on information from a single program execution.

This limitation is twofold. First, not all possible control paths may have been exercised on the given input. Second, possible race conditions may have been covert in the specific thread schedule of the execution trace, as, e.g., in the example in Fig. 1. This second limitation, called scheduling dependence, can be mitigated or entirely avoided by so-called predictive analyses [, , , ]. The key idea of predictive (dynamic) analyses is to consider not only the order of events recorded in a specific execution trace but also permutations thereof. This technique increases the coverage of the dynamic analysis of concurrent program executions, since it is capable of exposing thread interleavings (event orders) that have not been exercised.

Static Methods
Static program analysis does not require that the program is executed.

Pragmatic methods. A pragmatic approach to find programming errors is to identify situations in the source code where the program deviates from common programming practice. Naturally, this approach lends itself to find deviations from common programming idioms in concurrent programs that may lead to unintended race conditions. This approach is called pragmatic, since the analysis is neither sound nor complete with respect to identifying data races. In practice, pragmatic analysis methods have turned out to be very effective, with the additional benefit that in most cases, analyses do not require sophisticated data-flow or concurrency analysis and hence can be very efficient.

Findbugs by Ayewah et al. [] is a tool for pragmatic analyses of Java programs. Hovemeyer and Pugh [] describe numerous analyses for concurrency-related errors. One programming idiom for concurrent programs in Java is, e.g., that accesses to a shared mutable variable are consistently protected by synchronized blocks. Violations of this programming practice are described as a bug pattern called "inconsistent synchronization." Inconsistent synchronization occurs, if a class contains mixed (synchronized and unsynchronized) accesses to a field variable and no more than one third of the accesses (writes weighed higher than reads) occur outside a synchronized block.

RacerX [] is a static data race and deadlock detection system targeted to check large operating system codes. The tool builds a call graph and verifies a locking discipline along a calling-context-sensitive traversal of the code. The system does however not have a pointer analysis and approximates aliasing through variable types. Moreover, the system uses a heuristic to classify sections of code that are sequential and those that are concurrent. These approximations facilitate the analysis of very large codes (> K lines of code). As the analysis issues a significant number of spurious reports, a clever ranking scheme is used to prioritize the large numbers of reports according to their likelihood of being an actual bug.

Methods based on data-flow analysis. May-happen-in-parallel (MHP) analysis is the foundation of many compile-time analyses for concurrent programs. MHP analysis approximates the order of statements executed by different threads and computes the may-happen-in-parallel relation among statements. MHP analysis in combination with a compile-time model of program data can serve as the foundation of compile-time data race detection.

Bristow [] used an inter-process precedence graph for determining anomalies in programs with post-wait synchronization. Taylor [] and Duesterwald and Soffa [] extend this work and define a model for parallel tasks in Ada programs with rendez-vous synchronization. The program representation in [] is modular and enables an efficient analysis of programs with procedures and recursion based on a data-flow framework. Masticola and Ryder [] generalize and improve the approach of []. Naumovich et al. [] compute the potential concurrency in Java programs at the level of statements. The authors have shown that the precision of their data-flow algorithm is optimal for most of the small applications that have been evaluated. The approach requires that the number of real threads in the system is specified as input to the analysis. The combination of MHP information with a model of program data (heap shape and reference information) could be used to determine conflicting data accesses. This approach is discussed by Midkiff, Lee, Padua, and Sura [, ].

Static race detection for Java programs has been developed, e.g., by Choi et al. [] and Naik et al. []. Both systems are based on a whole program analysis to determine an initial set of potentially conflicting object accesses. This set is pruned (refined) by several successive analysis steps, which are alias analysis, thread escape analysis, and lockset analysis. The Chord checker by Naik et al. [] is object context-sensitive [, ], and this feature is found to be essential, in combination with a precise, inclusion-based alias analysis, to achieve a high detection accuracy; object context sensitivity incurs however a significant scalability cost. Chord sacrifices soundness since it approximates lock identity through may-alias information when determining common lock protection.

Type-based methods. Type systems can model and express data protection and locking policies in data and method declarations. Compliance of data accesses and method invocations with the declared properties can be checked mostly statically. The main advantage of the type-based approach is its modularity, which makes it, in contrast to a whole program analysis, well amenable to treat incomplete and large programs. Type systems that can prove data-race-freedom have been proposed as extensions to existing programming languages by Bacon et al. [] and by Boyapati and Rinard []. Flanagan and Freund [, ] present a type system that is able to specify and check lock protection of individual variables. In combination with an annotation generator [], they applied the type checker to Java programs of up to  KLOC. The annotation generator is able to recognize common locking patterns and further uses heuristics to classify as benign certain accesses without lock protection. The heuristics are effective in reducing the number of spurious warnings; some are however unsound, which has not been a problem for the benchmarks investigated in [].

Model checking. The principle of model checking is to explore every possible control-flow path and variable value assignment for undesired program behavior. Since this procedure is obviously intractable, models of data and program are explored instead. An additional source of complexity in parallel versus sequential programs is the timing of threads, resulting in myriads of possible interleavings of actions from different threads. The main challenge of model checking for concurrency errors, such as data races, is hence to reduce the state space to be explored. One idea is to consider only those states and transitions of individual threads that operate on shared data, and are thus visible to other threads. Another idea is to aggregate the possible interleavings, e.g., by modeling multiple threads and their transitions in one common state transition space.

Model checking has been applied to the detection of access anomalies in concurrent programs, e.g., in [, , , , , ]. Stoller's model checker [] verifies adherence to a locking policy. Henzinger et al. [] describe a model checker that identifies conflicting accesses that are not ordered according to the HB-relation. Both works assume an underlying sequentially consistent execution platform. Model checking for executions on weakly-ordered memory systems has been conceived as well [], though not particularly for the purpose of data race detection.

Detection of Determinacy and Atomicity Violations
The emphasis of this entry is on the detection of race conditions that are data races. Related to race conditions are also violations of determinacy and violations of atomicity. Techniques for their detection are discussed in the corresponding essays on determinacy and atomicity.

Complexity
Netzer and Miller [] characterize race conditions in terms of feasible program executions. A program execution is feasible, if the control flow and shared data dependencies in the execution could actually occur at run-time in accordance with the semantics of the program.

The problem of detecting a race condition is at least as hard as determining if a feasible program execution exists where a pair of suspect statements occurs concurrently. Netzer and Miller [] found the problem to be intractable even if shared data dependencies are ignored. Helmbold and McDowell [] confirm and refine this result by restricting programs to certain control-flow and synchronization models. For unrestricted programs, if data dependencies are not ignored, the problem of deciding if two conflicting statements participate in a race is as hard as the halting problem [].

In practice, most dynamic and also static data-flow-based race detection analyses have a complexity that is quasi-linear in the number of synchronization and shared resource accesses.

Related Entries
Determinacy
Formal Methods–Based Tools for Race, Deadlock, and Other Errors
Race Conditions

Bibliography
Abadi M, Flanagan CE, Freund SN () Types for safe locking: static race detection for Java. Trans Program Lang Syst (TOPLAS) ():–
Adve S, Hill M (June ) Weak ordering — A new definition. In: Proceedings of the annual international symposium on computer architecture (ISCA'), pp –
Adve S, Hill M, Miller B, Netzer R (May ) Detecting data races on weak memory systems. In: Proceedings of the annual international symposium on computer architecture (ISCA'), pp –
Allen TR, Padua DA (August ) Debugging fortran on a shared memory machine. In: Proceedings of the international conference on parallel processing, pp –
Ayewah N, Hovemeyer D, Morgenthaler JD, Penix J, Pugh W () Using static analysis to find bugs. IEEE Softw ():–
Bacon D, Strom R, Tarafdar A (October ) Guava: a dialect of Java without data races. In: Proceedings of the conference on object-oriented programming, systems, languages, and applications (OOPSLA'), pp –
Balasundaram V, Kennedy K () Compile-time detection of race conditions in a parallel program. In: Proceedings of the international conference on supercomputing (ISC'), pp –
Boyapati C, Lee R, Rinard M (November ) Ownership types for safe programming: preventing data races and deadlocks. In: Proceedings of the conference on object-oriented programming, systems, languages, and applications (OOPSLA'), pp –
Bristow G, Dreay C, Edwards B, Riddle W () Anomaly detection in concurrent programs. In: Proceedings of the international conference on software engineering (ICSE'), pp –
Burckhardt S, Alur R, Martin MMK () CheckFence: checking consistency of concurrent data types on relaxed memory models. In: PLDI': Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp –
Charron-Bost B () Concerning the size of logical clocks in distributed systems. Inf Process Lett ():–
Chen F, Serbanuta TF, Rosu G () jpredictor: a predictive runtime analysis tool for java. In: ICSE': Proceedings of the th international conference on software engineering. ACM, New York, pp –
Choi J-D, Min SL () Race frontier: reproducing data races in parallel program debugging. In: PPOPP': Proceedings of the third ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp –
Choi J-D, Lee K, Loginov A, O'Callahan R, Sarkar V, Sridharan M (June ) Efficient and precise datarace detection for multithreaded object-oriented programs. In: Conference on programming language design and implementation (PLDI'), pp –
Dinning A, Schonberg E () An empirical comparison of monitoring algorithms for access anomaly detection. In: PPOPP': Proceedings of the second ACM SIGPLAN symposium on principles & practice of parallel programming. ACM, New York, pp –
Dinning A, Schonberg E (December ) Detecting access anomalies in programs with critical sections. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, pp –
Duesterwald E, Soffa M () Concurrency analysis in the presence of procedures using a data-flow framework. In: Proceedings of the symposium on testing, analysis, and verification (TAV), pp –
Engler D, Ashcraft K (October ) RacerX: Effective, static detection of race conditions and deadlocks. In: Proceedings of the symposium on operating systems principles (SOSP'), pp –
Fidge CJ () Timestamp in message passing systems that preserves partial ordering. In: Proceedings of the th Australian computing conference, pp –
Flanagan C, Freund SN (June ) Type-based race detection for Java. In: Proceedings of the conference on programming language design and implementation (PLDI'), pp –
Flanagan C, Freund SN (June ) Detecting race conditions in large programs. In: Proceedings of the workshop on program analysis for software tools and engineering (PASTE'), pp –
Flanagan C, Freund SN () FastTrack: Efficient and precise dynamic race detection. In: PLDI': Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp –
Flanagan C, Qadeer S (June ) A type and effect system for atomicity. In: Proceedings of the conference on programming language design and implementation (PLDI'), pp –
Flanagan C, Leino R, Lillibridge M, Nelson G, Saxe J, Stata R (June ) Extended static checking for Java. In: Proceedings of the conference on programming language design and implementation (PLDI'), pp –
Race Detection Techniques R 

conference on programming language design and implementa- . Milanova A, Rountev A, Ryder BG () Parameterized object
tion (PLDI’), pp – sensitivity for points-to analysis for Java. ACM Trans Softw Eng
. Flanagan C, Freund SN, Yi J () Velodrome: a sound and com- Methodol ():–
plete dynamic atomicity checker for multithreaded programs. In: . Min SL, Choi J-D () An efficient cache-based access anomaly
PLDI’: Proceedings of the  ACM SIGPLAN conference detection scheme. In: ASPLOS-IV: Proceedings of the th inter-
on programming language design and implementation. ACM, national conference on architectural support for programming
New York, pp – languages and operating systems. ACM, New York, pp –
. Helmbold DP, McDowell CE (September ) A taxonomy . Musuvathi M, Qadeer S, Ball T, Basler G, Nainar PA, Neamtiu I
of race detection algorithms. Technical Report UCSC-CRL-- () Finding and reproducing heisenbugs in concurrent pro-
, University of California, Santa Cruz, Computer Research grams. In: OSDI’: Proceedings of the th USENIX conference
Laboratory on operating systems design and implementation. USENIX Asso-
. Henzinger TA, Jhala R, Majumdar R () Race checking by ciation, Berkeley, pp –
context inference. In: PLDI’: Proceedings of the ACM SIG- . Muzahid A, Suárez D, Qi S, Torrellas J () Sigrace: signature-
PLAN  conference on programming language design and based data race detection. In: ISCA’: Proceedings of the th
implementation. ACM, New York, pp – annual international symposium on computer architecture. ACM,
. Hovemeyer D, Pugh W (July ) Finding concurrency bugs in New York, pp –
java. In: Proceedings of the PODC workshop on concurrency and . Naik M, Aiken A, Whaley J (June ) Effective static race detec-
synchronization in Java programs tion for Java. In: Proceedings of the conference on programming
. Jannesari A, Bao K, Pankratius V, Tichy WF () Helgrind+: language design and implementation (PLDI’), pp –
An efficient dynamic race detector. In: Proceedings of the . Naumovich G, Avrunin G, Clarke L (September ) An efficient
rd international parallel & distributed processing symposium algorithm for computing MHP information for concurrent Java
(IPDPS’). IEEE, Rome programs. In: Proceedings of the European software engineer-
. Kidd N, Reps T, Dolby J, Vaziri M () Finding concurrency- ing conference and symposium on the foundations of software
related bugs using random isolation. In: VMCAI’: Proceedings engineering, pp –
of the th international conference on verification, model check- . Netzer R, Miller B (August a) Detecting data races in parallel
ing, and abstract interpretation. Springer-Verlag, Heidelberg, program executions. Technical report TR-, Department of
pp – Computer Science, University of Wisconsin, Madison
. Lamport L (July ) Time, clock and the ordering of events in a . Netzer R, Miller B (January b) On the complexity of event
distributed system. Commun ACM ():– ordering for shared-memory parallel program executions. Tech-
. Lhoták O, Hendren L (March ) Context-sensitive points-to nical report TR , Computer Sciences Department, University
analysis: is it worth it? In: Mycroft A, Zeller A (eds) International of Wisconsin, Madison
conference of compiler construction (CC’), vol  of LNCS. . Netzer R, Miller B (March ) What are race conditions? Some
Springer, Vienna, pp – issues and formalizations. ACM Lett Program Lang Syst ():
. Mattern F () Virtual time and global states of distributed sys- –
tems. In: Proceedings of the Parallel and distributed algorithms . Netzer R, Brennan T, Damodaran-Kamal S () Debugging race
conference. Elsevier Science, Amsterdam, pp – conditions in message-passing programs. In: SPDT’: Proceed-
. Marino D, Musuvathi M, Narayanasamy S () Literace: effec- ings of the SIGMETRICS symposium on parallel and distributed
tools. ACM, New York, pp –
tive sampling for lightweight data-race detection. In: PLDI’:
Proceedings of the  ACM SIGPLAN conference on program- . Nudler I, Rudolph L () Tools for the efficient development
R
ming language design and implementation. ACM, New York, of efficient parallel programs. In: Proceedings of the st Israeli
pp – conference on computer system engineering
. Masticola S, Ryder B () Non-concurrency analysis. In: Pro- . O’Callahan R, Choi J-D (June ) Hybrid dynamic data race
ceedings of the symposium on principles and practice of parallel detection. In: Symposium on principles and practice of parallel
programming (PPoPP’), pp – programming (PPoPP’), pp –
. Mellor-Crummey J (November ) On-the-y detection of . Perkovic D, Keleher PJ (October ) Online data-race detection
data races for programs with nested fork-join parallelism. In: via coherency guarantees. In: Proceedings of the nd symposium
Proceedings of the supercomputer debugging workshop, pp on operating systems design and implementation (OSDI’),
– pp –
. Mellor-Crummey J (May ) Compile-time support for efficient . Pozniansky E, Schuster A (June ) Efficient on-the-y data race
data race detection in shared-memory parallel programs. In: Pro- detection in multi-threaded c ++ programs. In: Proceedings of the
ceedings of the workshop on parallel and distributed debugging, symposium on principles and practice of parallel programming
pp – (PPoPP’), pp –
. Midkiff S, Lee J, Padua D (June ) A compiler for multi- . Pozniansky E, Schuster A () Multirace: efficient on-the-y data
ple memory models. In: Rec. Workshop compilers for parallel race detection in multithreaded c + + programs: research articles.
computers (CPC’) Concurrency Comput: Pract Exper ():–
 R Race Detectors for Cilk and Cilk++ Programs

. Prvulovic M, Torrellas J () Reenact: using thread-level spec- . Wang L, Stoller SD () Accurate and efficient runtime detec-
ulation mechanisms to debug data races in multithreaded codes. tion of atomicity errors in concurrent programs. In: PPoPP’:
SIGARCH Comput Archit News ():– Proceedings of the eleventh ACM SIGPLAN symposium on
. Rajwar R, Goodman JR () Speculative lock elision: enabling Principles and practice of parallel programming. ACM, New York,
highly concurrent multithreaded execution. In MICRO : Pro- pp –
ceedings of the th annual ACM/IEEE international symposium . Welc A, Jagannathan S, Hosking AL (June ) Transactional
on microarchitecture. IEEE Computer Society, Washington, DC, monitors for concurrent objects. In: Proceedings of the Euro-
pp – pean conference on object-oriented programming (ECOOP’),
. Ramalingam G () Context-sensitive synchronization- pp –
sensitive analysis is undecidable. ACM Trans Program Lang Syst . Yu Y, Rodeheffer T, Chen W (October ) RaceTrack: Effi-
(TOPLAS) :– cient detection of data race conditions via adaptive tracking. In:
. Richards B, Larus JR () Protocol-based data-race detection. Proceedings of the symposium on operating systems principles
In SPDT’: Proceedings of the SIGMETRICS symposium on (SOSP’), pp –
parallel and distributed tools. ACM, New York, pp – . Yu Y, Rodeheffer T, Chen W () Racetrack: efficient detec-
tion of data race conditions via adaptive tracking. In SOSP’:
. Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T (Octo-
Proceedings of the th ACM symposium on operating systems
ber a) Eraser: a dynamic data race detector for multi-threaded
principles. ACM, New York, pp –
programs. In: Proceedings of the symposium on operating sys-
. Zhou P, Teodorescu R, Zhou Y () Hard: hardware-assisted
tems principles (SOSP’), pp –
lockset-based race detection. In: HPCA’: Proceedings of the
. Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T
 IEEE th international symposium on high performance
(b) Eraser: a dynamic data race detector for multithreaded
computer architecture. IEEE Computer Society, Washington, DC,
programs. ACM Transactions on computer systems ():
pp –
–
. Scheurich C, Dubois M (June ) Correct memory operation of
cache-based multiprocessors. In: Proceedings of th annual sym-
posium on computer architecture, Computer Architecture News,
pp – Race Detectors for Cilk and Cilk++
. Sen K, Rosu G, Agha G () Runtime safety analysis of Programs
multithreaded programs. In: ESEC/FSE-: Proceedings of the
th European software engineering conference held jointly
Jeremy T. Fineman , Charles E. Leiserson
with th ACM SIGSOFT international symposium on 
foundations of software engineering. ACM, New York, pp –
Carnegie Mellon University, Pittsburgh, PA, USA

. Shacham O, Sagiv M, Schuster A () Scaling model check-
Massachusetts Institute of Technology, Cambridge,
ing of dataraces using dynamic information. In: PPoPP’: Pro- MA, USA
ceedings of the tenth ACM SIGPLAN symposium on princi-
ples and practice of parallel programming. ACM, New York, pp
–
Synonyms
. Stoller SD (October ) Model-checking multi-threaded dis-
tributed Java programs. Int J Softw Tools Technol Transfer ():
Cilkscreen; Nondeterminator
–
. Sura Z, Fang X, Wong C-L, Midkiff SP, Lee J, Padua DA (June Definition
) Compiler techniques for high performance sequentially
The Nondeterminator race detector takes as input an
consistent Java programs. In: Proceedings of the symposium
principles and practice of parallel programming (PPoPP’), ostensibly deterministic Cilk program and an input data
pp – set and makes the following guarantee: it will either
. Taylor RN (May ) A general purpose algorithm for analyzing determine at least one location in the program that is
concurrent programs. Commun ACM ():– subject to a determinacy race when the program is run
. Visser W, Havelund K, Brat G, Park S () Model checking on the data set, or else it will certify that the program
programs. In: ASE’: Proceedings of the th IEEE international
always behaves the same on the data set, no matter
conference on automated software engineering. IEEE Computer
Society, Washington, p 
how it is scheduled. The Cilkscreen race detector does
. von Praun C, Gross T (October ) Object race detection. much the same thing for Cilk++ programs. Both can
In: Conference on object-oriented programming, systems, lan- also detect data races, but the guarantee is somewhat
guages, and applications (OOPSLA’), pp – weaker.
Race Detectors for Cilk and Cilk++ Programs R 

Discussion int x;

Introduction cilk void foo()


{
Many Cilk programs are intended to be determinis- x = x + 1;
tic, in that a given program produces the same behav- return;
ior no matter how it is scheduled. The program may }

behave nondeterministically, however, if a determinacy cilk int main() /* F */


race occurs: two logically parallel instructions update {
the same location, where at least one of the two instruc- x = 0; /* e0 */
spawn foo(); /* F1 */
tions writes the location. In this case, different runs of /* e1 */
the program on the same input may produce differ- spawn foo(); /* F2 */
/* e2 */
ent behaviors. Race bugs are notoriously hard to detect
sync;
by normal debugging techniques, such as breakpoint- printf("x is %d\n", x); /* e3 */
ing, because they are not easily repeatable. This arti- return 0;
}
cle describes the Nondeterminator and Cilkscreen race
detectors, which are systems for detecting races in Cilk Race Detectors for Cilk and Cilk++ Programs. Fig.  A
and Cilk++ programs, respectively. simple Cilk program that contains a determinacy race. In
Determinacy races have been given many different the comments at the right, the Cilk strands that make up
names in the literature. For example, they are sometimes the procedure main() are labeled
called access anomalies [], data races [], race condi-
tions [], harmful shared-memory accesses [], or gen-
eral races []. Emrath and Padua [] call a determinis-
tic program internally deterministic if the program exe- F
cution on the given input exhibits no determinacy race main()
e0 e1 e2 e3
and externally deterministic if the program has deter-
minacy races but its output is deterministic because of
the commutative and associative operations performed
on the shared locations. The Nondeterminator program F1 F2
checks whether a Cilk program is internally determin- foo() foo()
istic. Cilkscreen allows for some internal nondetermin-
spawn node sync node
ism in a Cilk++ program, specifically, nondeterminism
encapsulated by “reducer hyperobjects” []. Race Detectors for Cilk and Cilk++ Programs. Fig.  The
To illustrate how a determinacy race can occur, con- parallel control-flow dag of the program in Fig. . A spawn R
sider the simple Cilk program shown in Fig. . The node of the dag represents a spawn construct, and a sync
parallel control flow of this program can be viewed as node represents a sync construct. The edges of the dag
the directed acyclic graph, or dag, illustrated in Fig. . are labeled to correspond with code fragments from Fig. 
The vertices of the dag represent parallel control con-
structs, and the edges represent strands: serial sequences
of instructions with no intervening parallel control con-
structs. In Fig. , the strands of the program are labeled perform a read from x, increment the value, and then
to correspond to code fragments from Fig. , and the write the value back into x. Since these operations are
subdags representing the two instances of foo() are not atomic, both might update x at the same time.
shaded. In this program, both of the parallel instantia- Figure  shows how this determinacy race can cause
tions of the procedure foo() update the shared vari- x to take on different values if the strands compris-
able x in the x = x + 1 statement. This statement ing the two instantiations of foo() are scheduled
actually causes the processor executing the strand to simultaneously.
 R Race Detectors for Cilk and Cilk++ Programs

The Nondeterminator The Nondeterminator was implemented by modi-


The Nondeterminator [] determinacy-race detector fying the ordinary Cilk compiler and runtime system.
takes as input a Cilk program and an input data set and Each read and write in the user’s program is instru-
either determines at least one location in the program mented by the Nondeterminator’s compiler to perform
that is subject to a determinacy race when the program determinacy-race checking at runtime. The Nondeter-
is run on the data set, or else it certifies that the program minator then takes advantage of the fact that any Cilk
always behaves the same when run on the data set. If program can be executed as a C program. Specifically, if
a determinacy race exists, the Nondeterminator local- the Cilk keywords are deleted from a Cilk program, a C
izes the bug, providing variable name, file name, line program results, called the Cilk program’s serialization
number, and dynamic context (state of runtime stack, (also called serial elision in the literature), whose seman-
heap, etc.). tics are a legal implementation of the semantics of the
The Nondeterminator is not a program verifier, Cilk program []. The Nondeterminator executes the
because the Nondeterminator cannot certify that the user’s program as the serialization would execute, but it
program is race-free for all input data sets. Rather, it is performs race-checking actions when reads, writes, and
a debugging tool. The Nondeterminator only checks a parallel control statements occur.
program on a particular input data set. What it verifies is Since the Nondeterminator was implemented by
that every possible scheduling of the program execution modifying the Cilk compiler, it is unable to detect
produces the same behavior. If the program relies on any races that occur in precompiled library code. In con-
random choices or runtime calls to hardware counters, trast, the Cilk++ race detector, called Cilkscreen, uses
etc., these values should also be viewed as part of the binary instrumentation [, , ]. Cilkscreen dynam-
input data set. ically replaces memory accesses in a Cilk++ program
The Nondeterminator is a serial program that oper- binary with instrumented memory accesses, and thus it
ates on-the-fly, meaning that it detects races as it sim- can operate on executable binaries for which no source
ulates the execution of the program, rather than by code is available. Cilkscreen employs much the same
logging and subsequent analysis. As it executes, it algorithmic technology as the Nondeterminator, how-
maintains various data structures for determining the ever.
existence of determinacy races. An “access history”
maintains a subset of strands that access each particular
memory location. An “SP-maintenance” data structure Detecting Races in Programs that Use Locks
maintains the series-parallel (SP) relationships among The original Nondeterminator [] is designed to detect
strands. Specifically, the race detector must determine determinacy races in a Cilk program. Typically a pro-
whether two strands (that access the same memory grammer may add locks to the program to protect
location) operate logically in parallel (i.e., whether it is against the externally nondeterministic interleaving
possible to schedule both strands at the same time), or given in Fig. , thus intending and sanctioning non-
whether there is some serial relationship between the deterministic executions. Thus, it may not make sense
strands. to test the program for determinacy races, but some
weaker form of race detection may be desired.

Case 1 Case 2
F1 F2 e3 F1 F2 e3
read x = 0 read x = 0
write x = 1 read x = 0
read x = 1 write x = 1
write x = 2 write x = 1
“x is 2” “x is 1”

Race Detectors for Cilk and Cilk++ Programs. Fig.  An illustration of a determinacy race in the code from Fig. .
The value of the shared variable x read and printed by strands e can differ depending on how the instructions in the two
instances F and F of the foo() procedure are scheduled
Race Detectors for Cilk and Cilk++ Programs R 

The Nondeterminator- [] detects data races,


F
which occur when two logically parallel strands hold- S
ing no locks in common update the same location, S
S

where at least one of the two instructions modi- S S e


e P
fies the location. Although the Nondeterminator and e P S
S
Nondeterminator- test for different types of races, a S e P
F e P
significant chunk of the implementations and algo- P
S
F e
rithms used (notably, the SP-maintenance algorithm) e
e F e P
remains the same in both debugging tools. This article F
e
F
focuses on determinacy-race detection.
F

Series-Parallel Parse Trees


The structure of a Cilk program can be interpreted as a
Race Detectors for Cilk and Cilk++ Programs. Fig.  The
series-parallel (SP) parse tree rather than a series-parallel
canonical series-parallel parse tree for a generic Cilk
dag. Figure  shows a parse tree for the dag in Fig. .
procedure. The notation F represents the SP parse tree of
In the SP parse tree, each internal node is either an S-
any subprocedure spawned by this procedure, and e
node, denoted by S, or a P-node, denoted by P, and each
represents any strand of the procedure. All nodes in the
leaf is a strand of the dag. If two subtrees are children of
shaded areas belong to the procedure, and the nodes in
the same S-node, then the strands in the left subtree log-
each oval belong to the same sync block. A sequence of
ically precede the strands in the right subtree, and (the
S-nodes forms the spine of the SP parse tree, composing
subcomputation represented by) the left subtree must
all sync blocks in series. Each sync block contains an
execute before (that of) the right subtree. If two subtrees
alternating sequence of S-nodes and P-nodes. Observe
are children of the same P-node, then the strands in the
that the left child of an S-node in a sync block is always a
left subtree operate logically in parallel with those in the
strand, and that the left child of a P-node is always a
right subtree, and no ordering holds between (the sub-
subprocedure
computations represented by) the two subtrees. For two
strands e and e′ , the notation e ∥ e′ means that e and
e′ operate logically in parallel, and the notation e ≺ e′
means that e logically precedes e′ .
A canonical SP parse tree for a Cilk dag, shown strands and subprocedures on the left, and the series
in Fig. , can be constructed as follows: First, build a compositions of the sync blocks are applied in order
parse tree recursively for each child of the root proce- from last to first, then the parse tree is unique. Figure 
dure. Each sync block of the root procedure contains shows such a canonical parse tree for the Cilk dag in R
a series of spawns with strands of intervening C code Fig. .
(some of which may be empty). Then, create a parse tree One convenient feature of the SP parse tree is that
for the sync block alternately applying series and par- the logical relationship between strands can be deter-
allel composition to the child parse trees and the root mined by looking at the least common ancestor (lca)
strands. Finally, string the parse trees for the sync blocks of the strands in the parse tree. For example, consider
together into a spine for the procedure by applying a the parse tree in Fig. . The least common ancestor of
sequence of series compositions to the sync blocks. Sync strands e and e , denoted by lca(e , e ), is the root
blocks are composed serially, because a sync statement S-node, and hence e ≺ e . In contrast, lca(F , F ) is a
is never passed until all previously spawned subproce- P-node, and hence F ∥ F . To see that the dag and the
dures have completed. The only ambiguities that might parse tree represent the same control structure requires
arise in the parse tree occur because of the associativity only observing that lca(ei , ej ) is an S-node in the parse
of series composition and the commutativity of paral- tree if and only if there is a directed path from ei to ej in
lel composition. If, as shown in Fig. , the alternating the corresponding dag. A proof of this fact can be found
S-nodes and P-nodes in a sync block always place in [].
 R Race Detectors for Cilk and Cilk++ Programs

Fψ: S write a shared location  by strand e:


e3 if reader[]  e or writer[]  e
then a determinacy race exists
S
writer[] ← e

e0 P read a shared memory location  by strand e:


if writer[]  e
S then a determinacy race exists
if reader[] ≺ e
F1 e1 P then reader[] ← e

e2 Race Detectors for Cilk and Cilk++ Programs. Fig. 


Pseudocode describing the implementation of read and
F2 write operations for a determinacy-race detector. The
access history is updated on these operations
Race Detectors for Cilk and Cilk++ Programs. Fig.  The
canonical series-parallel parse tree for the Cilk dag in Fig. 

An execution of a Cilk program can be interpreted structure described in section “Series-Parallel Mainte-
as a walk or traversal of the corresponding SP parse tree. nance.” The main goal when designing an access history
The order in which nodes are traversed depends on the is to reduce the number of strands stored; the larger the
scheduler. A partial execution must obey series-parallel access history the more series-parallel queries need to
relationships, namely, that the tree walk cannot enter be performed.
the right subtree of an S-node until the left subtree has For a serial determinacy race detector, an access
been fully executed. Both subtrees of a P-node, how- history of size O() per memory location suffices. In
ever, can be traversed in arbitrary order or in parallel. particular, the access history associates with each mem-
The canonical parse tree is such that an ordinary, left- ory location ℓ two values reader[ℓ] and writer[ℓ], each
to-right, depth-first tree walk visits strands in the same storing a single strand that has previously read from or
order as the program’s serialization visits them. written to ℓ, respectively. Specifically, writer[ℓ] stores a
unique runtime ID of the strand that has most recently
Access History written to ℓ. Similarly, reader[ℓ] stores the ID of some
The Nondeterminator and Cilkscreen race detectors previous reader of ℓ, although it need not be the most
execute a Cilk program in serial, depth-first order while recent reader.
maintaining two data structures. The SP-maintenance The Nondeterminator updates the access history as
data structure, described later in this article, is used new memory accesses occur. Each read and write
to query the logical relationships among strands. The operation of the original program is instrumented to
access history maintains information as to which update the access history and discover determinacy
strands have accessed which memory locations. races. Figure  gives pseudocode describing the instru-
At a high level, an access history maintains for each mentation of read and write operations. A race
shared-memory location two sets of strands that have occurs if a strand e writes a location ℓ and discovers
read from and written to the location. As the Nonde- that either the previous reader or the previous writer of
terminator executes, strands are added to and removed ℓ operates logically in parallel with e. Similarly, a race
from the access history. Whenever an access of memory occurs whenever e reads a location ℓ and discovers that
location v occurs, each of the strands in v’s access history the previous writer operates logically in parallel with e.
are compared against the currently executing strand. If Whenever a location ℓ is written, the access history
any of these strands operates logically in parallel with writer[ℓ] is updated to be the current strand e. The read
the currently executing strand, and one of the accesses history reader[ℓ] is updated to e when a read occurs, but
is a write, then a race is reported. This query of logical only if the previous reader operates logically in series
relationships is determined by the SP-maintenance data with e.
Race Detectors for Cilk and Cilk++ Programs R 

The cost of maintaining the access history is O() access() in strand e with lock set H:
for each e, H ∈ lockers[]
plus the cost of O() series-parallel queries, for each do if e  e and H ∩ H = ∅
memory access. A correctness proof of the access his- then a data race exists
tory can be found in []. The proof involves show- redundant ← false
for each e , H   ∈ lockers[]
ing that an update (or lack of an update) to reader[ℓ] do if e ≺ e and H  ⊇ H
does not discard any important information. Specifi- then lockers[] ← lockers[] − {e , H  }
if e  e and H  ⊆ H
cally, consider three strands e , e , and e that occur in
then redundant ← true
that order in the serial execution, where e is the current if redundant = false
value of reader[ℓ], e is the currently executing strand then lockers[] ← lockers[] ∪ {e, H}
reading ℓ, and e is some future strand. If e ≺ e and
e ∥ e , then e ∥ e . Thus, a test for a conflict between Race Detectors for Cilk and Cilk++ Programs. Fig. 
e and e produces the same result as a test for a conflict Pseudocode describing the implementation of read and
between e and e . On the other hand, if e ∥ e and e ∥ write operations for a data-race detector. The access
e , then e ∥ e , and so keeping e gives at least as much history is updated on these operations using the lock-set
information as e . Since updates to reader[ℓ] do not lose algorithm
information, the access history properly detects a race
whenever the current writer operates logically in paral-
lel with any previous reader. The full proof considers the accesses is a write.” The idea is to introduce a fake lock
other two race cases separately (the current writer oper- for read accesses called the r-lock, which is implic-
ates logically in parallel with a previous writer, or the itly acquired immediately before a read and released
current reader operates logically in parallel with a pre- immediately afterward. The r-lock behaves from the
vious writer), showing that either a race is reported on race detector’s point of view just like a normal lock,
the access or an earlier writer-writer race was reported. but during an actual computation, it is never actually
acquired and released (since it does not actually exist).
Data-Race Detection The use of r-lock allows the condition for a data race to
In contrast to the Nondeterminator, which detects be stated more succinctly: If the lock sets of two parallel
determinacy races, the Nondeterminator- [] and accesses to the same location have an empty intersection,
Cilkscreen can also detect data races. The difference then a data race exists. By this condition, a data race
between the Nondeterminator and these other two (correctly) does not exist for two read accesses, since
debugging programs lies in how they deal with the their lock sets both contain the r-lock.
access history. The Nondeterminator- and Cilkscreen The access history for the algorithm for detecting
tools report data races for locking protocols in which data races records a set of readers and writers for each
every lock is acquired and released within a single memory location and their lock sets. For a given loca- R
strand and cannot be held across parallel control con- tion ℓ, the entry lockers[ℓ] stores a list of lockers: strands
structs. Strands may hold multiple locks at one time, that access ℓ, each paired with the lock set that was held
however. These tools do not detect deadlocks, but only during the access. If ⟨e, H⟩ ∈ lockers[ℓ], then location
whether during a serial left-to-right depth-first execu- ℓ is accessed by strand e while it holds the lock set H.
tion, two logically parallel strands holding no locks in Figure  gives pseudocode for the data-race-detection
common access a shared variable, where at least one of algorithm employed by the Nondeterminator- and
the strands modifies the location. Cilkscreen.
The lock set of an access (read or write) is the set This algorithm for data-race detection does not
of locks held by the strand performing the access when provide as strong a guarantee as the algorithm for
the access occurs. If the lock sets of two parallel accesses determinacy-race detection. Programming with locks
to the same location have an empty intersection, and at introduces intentional nondeterminism which may not
least one of the accesses is a write, then a data race be exposed during a serial left-to-right depth-first exe-
exists. To simplify the race-detection algorithm, a small cution. Thus, although the algorithm always finds a
trick avoids the extra condition that “at least one of the data race if one is exposed, some data races may not
 R Race Detectors for Cilk and Cilk++ Programs

be exposed. Nevertheless, for abelian programs, where Algorithm Space Time per
per node Strand Query
critical sections protected by the same lock “commute” – creation
produce the same effect on memory no matter in which English-Hebrew [32] Θ(f) Θ(1) Θ(f)
order they are executed – a guarantee can be provided. Offset-Span [24] Θ(d) Θ(1) Θ(d)
SP-bags [16] Θ(1) Θ(α(v,v ))* Θ(α(v,v ))*
For a computation generated by a deadlock-free abelian Modified SP-bags [17] Θ(1) Θ(1)* Θ(1)
program running on a given input, the algorithm guar- SP-order [4] Θ(1) Θ(1) Θ(1)
antees to find a data race if one exists, and otherwise f = number of forks/spawns in the program
d = maximum depth of nested parallelism
guarantee that all executions produce the same final
v = number of shared locations being monitored
result. A proof of the correctness of this assertion can
be found in [].
The cost of querying the access history for data- Race Detectors for Cilk and Cilk++ Programs. Fig. 
race detection is much more than the O() cost for Comparison of serial, SP-maintenance algorithms. An
determinacy-race detection. There are at most nk entries asterisk (*) indicates an amortized bound. The function α is
for each memory location, where n is the number of Tarjan’s functional inverse of Ackermann’s function
locks in the program and k ≤ n is the number of
locks held simultaneously. Thus, the running time of the
data-race detector may increase proportionally to nk in
is desirable, and hence SP-bags and SP-order are more
the worst case. In practice, however, few locks are held
appealing algorithms for serial race detectors. The sit-
simultaneously, and the running time tends to be only
uation is exacerbated in a data-race detector where the
slightly worse than for determinacy-race detection.
number of queries can be larger than T.

Series-Parallel Maintenance
The main structural component of the two Nonde-
SP-Bags
terminator programs and Cilkscreen is their algo-
Since the logical relationship between strands can be
rithm for maintaining series-parallel (SP) relationships
determined by looking at their least common ances-
among strands. These race detectors execute the pro-
tor in the SP parse tree, a straightforward approach
gram in a left-to-right depth-first order while main-
for SP-maintenance would maintain this tree explic-
taining the logical relationships between strands. As
itly. Querying the relationship between strands can then
new strands are encountered, the SP-maintenance data
be implemented naively by climbing the tree starting
structure is updated. Whenever the currently executing
from each strand until converging at their least com-
strand accesses a memory location, the Nondetermina-
mon ancestor. This approach yields expensive queries,
tor queries the SP-maintenance data structure to deter-
as the worst-case cost is proportional to the height of
mine whether the current strand operates in series or in
the tree, which is not bounded.
parallel with strands (those in the access history) that
SP-bags [] is indeed based on finding the least
made earlier accesses to the same memory location.
common ancestor, but it uses a clever algorithm that
Figure  compares the serial space and running
yields far better queries. The algorithm itself is an
times of several SP-maintenance algorithms. The SP-
adaptation of Tarjan’s offline least-common-ancestors
bags and SP-order algorithms, described in this sec-
algorithm.
tion, are employed by various implementations of the
The SP-bags algorithm maintains a collection of dis-
Nondeterminator. The English-Hebrew [] and Offset-
joint sets, employing a “disjoint sets” or “union-find”
Span [] algorithms are used in different race detectors
data structure that supports the following operations:
and displayed for comparison. The modified SP-bags
entry reflects a different underlying data structure than . Make-Set(x) creates a new set whose only mem-
the SP-bags entry, but the algorithm for both is the same. ber is x.
As the number of queries performed by a determinacy . Union(x, y) unites the sets containing x and y.
race detector can be as large as O(T) for a program that . Find(x) returns a representative for the set contain-
runs serially in T time, keeping the query time small ing x.
Race Detectors for Cilk and Cilk++ Programs R 

A classical disjoint-set data structure with “union by into PF , since the descendants of F ′ can execute in par-
rank” and “path compression” heuristics [, , ] sup- allel with the remainder of the sync block in F. When a
ports m operations on n elements in O(mα(m, n)) time, sync occurs (traversing a spine node in the parse tree) in
where α is Tarjan’s functional inverse of Ackermann’s F, the bag PF is emptied into SF , since all of F’s executed
function [, ]. descendants precede any future strands in F.
As the program executes, for each active procedure, As a concrete example, consider the program given
the SP-bags algorithm maintains two bags (unordered in Fig. . The execution order of strands and proce-
sets) with the following contents at any given time: dures is e , F , e , F , e , e . Figure  shows the state
of the S- and P-bags during each step of the execu-
● The S-bag SF of a procedure F contains the descen-
tion. As the S- and P-bags are only specified for active
dant procedures of F that logically precede the cur-
procedures, the number of bags changes during the
rently executing strand. (The descendant procedures
execution.
of F include F itself.)
To determine whether a previously executed proce-
● The P-bag PF of a procedure F contains the com-
dure (or strand) F is logically in series or in parallel with
pleted descendant procedures of F that operate logi-
the currently executing strand, simply check whether
cally in parallel with the currently executing strand.
F belongs to an S-bag by looking at the identity of
The S- and P-bags are represented using a disjoint sets Find-Set(F).
data structure as described above. Figure  illustrates the correctness of SP-bags for
The SP-bags algorithm is given in Fig. . As the the program Fig. . For example, when executing F , all
Cilk program executes in a serial, depth-first fashion, previously executed instructions in F (namely, e and
the SP-bags algorithm performs additional operations e ) serially precede F , and F does indeed belong to
whenever one of the three following actions occurs: an S-bag SF . Moreover, the procedure F operates logi-
spawn, sync, return, resulting in updates to the S- cally in parallel with F , and F belongs to the P-bag PF .
and P-bags. Whenever a new procedure F (entering the For a proof of correctness, which is omitted here, refer
left subtree of a P-node) is spawned, new bags are cre- to [].
ated. The bag SF is initially set to contain F, and PF is Combining SP-bags with the access history yields
set to be empty. Whenever a procedure F ′ returns to an efficient determinacy race detector. Consider a Cilk
its parent procedure F (completing the walk of the left program that executes in time T on one processor
subtree of a P-node), the contents of SF′ are unioned and references v shared memory locations. The SP-bags
algorithm can be implemented to check this program
for determinacy races in O(Tα(v, v)) time.
A proof of this theorem, which is provided in [],
spawn a procedure F :
is conceptually straightforward. The number of Make- R
SF ← Make-Set(F )
PF ← ∅ Set, Union, and Find-Set operations is at most O(T),
yielding a total running time of O(Tα(m, n)) for
return from a procedure F  to parent F :
PF ← Union(PF , SF  )
some value of m and n. It turns out that these val-
SF  ← ∅ ues can be reduced to v when garbage collection is
employed.
sync in a procedure F :
SF ← Union(SF , PF )
The theoretical running time of SP-bags can be
PF ← ∅ improved by replacing the underlying disjoint sets
data structure. Since the Unions are structured nicely,
Race Detectors for Cilk and Cilk++ Programs. Fig.  The Gabow and Tarjan’s data structure [], which features
SP-bags algorithm. Whenever one of three actions occurs O() amortized time per operation, can be employed.
during the serial, depth-first execution of a Cilk parse tree, The SP-bags algorithm can be implemented to check a
the operations in the figure are performed. These program for determinacy races in O(T) time, where the
operations cause SP-bags to manipulate the disjoint sets program executes in time T on one processor. A full
data structure discussion of this improvement appears in [].
 R Race Detectors for Cilk and Cilk++ Programs

Execution step State of the Bags


e0 SF = {F } PF = ∅
F1 SF = {F } PF = ∅ SF1 = {F1} PF1 = ∅
e1 SF = {F } PF = {F1}
F2 SF = {F } PF = {F1} SF2 = {F2} PF2 = ∅
e2 SF = {F } PF = {F1, F2}
e3 SF = {F, F1, F2} PF = ∅

Race Detectors for Cilk and Cilk++ Programs. Fig.  The state of the SP-bags algorithm when run on the parse tree
from Fig. 

SP-Order F: S
e3
SP-order uses two total orders to determine whether (6, 6)
S
strands are logically parallel, an English order and a
Hebrew order. In the English order, the nodes in the left e0 P
subtree of a P-node precede those in the right subtree of (1, 1)
the P-node. In the Hebrew order, the order is reversed: S
the nodes in the right subtree of a P-node precede those
F1 e1 P
in the left. In both orders, the nodes in the left subtree (3, 2)
(2, 5)
of an S-node precede those in the right subtree of the
e2
S-node. (5, 3)
Figure  shows English and Hebrew orderings for F2
the strands in the parse tree from Fig. . Notice that if x (4, 4)
belongs in the left subtree of an S-node and y belongs to
Race Detectors for Cilk and Cilk++ Programs. Fig.  An
the right subtree of the same S-node, then E[x] < E[y]
English ordering E and a Hebrew ordering H for strands in
and H[x] < H[y]. In contrast, if x belongs to the left
the parse tree of Fig. . Under each strand/procedure e is
subtree of a P-node y belongs to the right subtree of the
an ordered pair (E[e], H[e]) giving its rank in each of the
same P-node, then E[x] < E[y] and H[x] > H[y].
two orders. Equivalently, the English order is
The English and Hebrew orderings capture the SP
e , F , e , F , e , e and the Hebrew order is e , e , e , F , F , e
relationships in the parse tree. Specifically, if one strand
x precedes another strand y in both orders, then x ≺ y,
since lca(x, y) is an S-node, and hence there is a directed Labeling a static SP parse tree with an English-
path from x to y in the dag. If x precedes y in one order Hebrew ordering is straightforward. To compute the
but x follows y in the other, then x ∥ y, since lca(x, y) is English ordering, perform a depth-first traversal visit-
a P-node, and hence there is no directed path from one ing left children of both P-nodes and S-nodes before
to the other in the dag. For example, in Fig. , the fact visiting right children (an English walk). Assign label i to
e precedes e can be determined by observing E[e ] < the ith strand visited. To compute the Hebrew ordering,
E[e ] and H[e ] < H[e ]. Similarly, it follows that F perform a depth-first traversal visiting right children of
and F are logically parallel since E[F ] < E[F ] and P-nodes before visiting left children but left children of
H[F ] > H[F ]. The following lemma, proved in [], S-nodes before visiting right children (a Hebrew walk).
shows that this property always holds. Assign labels to strands as before, in the order visited.
In race-detection applications, however, these order-
Lemma  Let E be an English ordering of strands of an ings must be generated on-the-fly before the entire
SP-parse tree, and let H be a Hebrew ordering. Then, for parse tree is known. If the parse tree unfolds accord-
any two strands x and y in the parse tree, x ≺ y if and only ing to an English walk, as in the Nondeterminator,
if E[x] < E[y] and H[x] < H[y]. Equivalently, x ∥ y if then computing the English ordering is trivial. Unfor-
and only if E[x] < E[y] and H[x] > H[y], or if E[x] > tunately, computing the Hebrew ordering during an
E[y] and H[x] < H[y]. English walk is problematic. In a Hebrew ordering, the
Race Detectors for Cilk and Cilk++ Programs R 

label of a strand in the left subtree of a P-node depends OM-Insert(Heb, S, x, y) to insert these children in the
on the number of strands in the right subtree. This same order after S in the Hebrew ordering Heb. When
number is unknown, while performing an English walk, first traversing a P-node P with left child x and right
until the right subtree has unfolded completely. child y, also perform OM-Insert(Eng, P, x, y) as before,
Nudler and Rudolph [], who introduced English- but perform OM-Insert(Heb, P, y, x) to insert these
Hebrew labeling for race detection, addressed this prob- children in the opposite order in the Hebrew ordering.
lem by using large static strand labels. In particular, the To determine whether a previously executed strand
number of bits in a label in their scheme can grow lin- e′ is logically in series or in parallel with the currently
early in the number of P-nodes in the SP parse tree. executing strand e, simply check whether OM-Precedes
Although they gave a heuristic for reducing the size (Heb, e′ , e). (It is always the case that OM-Precedes
of labels, manipulating large labels is the performance (Eng, e′ , e), as the program executes in the order of an
bottleneck of their algorithm. English walk.) If OM-Precedes(Heb, e′ , e), then e′ ≺ e.
The solution in SP-order is to employ order- If not, then e′ ∥ e.
maintenance data structures [, , , ] to main- While the preceding description of SP-order with
tain the English and Hebrew orders dynamically rather respect to the parse tree fully describes the algorithm,
than using the static labels described above. An order- it is somewhat opaque with respect to the effect on Cilk
maintenance data structure is an abstract data type that code and the Cilk compiler. Figure , which is analo-
supports the following operations: gous to Fig.  for the SP-bags algorithm, describes the
SP-order algorithm with respect to Cilk keywords. The
● OM-Precedes(L, x, y): Return true if x precedes y
main point of this figure is to show that, like SP-bags,
in the ordering L. Both x and y must already exist in
incorporating SP-order into a Cilk program is fairly
the ordering L.
lightweight. In contrast to SP-bags, whose data struc-
● OM-Insert(L, x, y , y , . . . , yk ): In the ordering L,
tural elements are procedures, SP-order builds data
insert new elements y , y , . . . , yk , in that order,
structures over strands. As such, new objects or IDs are
immediately after the existing element x. Any pre-
created whenever new strands are encountered. On a
viously existing elements that followed x now fol-
low yk .
The OM-Precedes operation can be supported in
O() worst-case time. The OM-Insert operation can spawn a child procedure F  of F :
be supported in O() worst-case time for each node create strand IDs for:
e0 : the first strand in F 
inserted. In the efficient order-maintenance data struc- ec : the continuation strand in F
tures, labels are assigned to each element of the data OM-Insert(Eng, current[F ], e0 , eC )
structure, and the relative ordering of two elements OM-Insert(Heb, current[F ], ec , e0 ) R
(OM-precedes) is determined by comparing these return from a procedure F  to parent F :
labels. Keeping the labels small guarantees good query nothing
cost. The labels change dynamically as insertions occur.
following a spawn of procedure F or a sync in F :
The SP-order data structure consists of two order- create strand ID for:
maintenance data structures to maintain English and eS : the strand following the next sync in F
OM-Insert(Eng, current[F ], eS )
Hebrew orderings. (In fact, the English ordering can
OM-Insert(Heb, current[F ], eS )
be maintained implicitly during a left-to-right tree
walk. For conceptual simplicity, however, both order- Race Detectors for Cilk and Cilk++ Programs. Fig.  The
ings are represented with the data structure here.) SP-order algorithm. Whenever a spawn, sync, or return
With the data structure chosen, the implementation occurs during the serial, depth-first execution of the Cilk
of SP-order is remarkably simple. When first travers- parse tree, the operations in the figure are performed.
ing an S-node S with left child x and right child y, These operations cause SP-order to manipulate the order
perform OM-Insert(Eng, S, x, y) to insert x then y maintenance data structures. The value current[F] denotes
after S in the English ordering Eng. Also perform the currently executing strand in procedure F
 R Race Detectors for Cilk and Cilk++ Programs

spawn, strands are created for both the spawned pro- only a factor of . A full description of such a par-
cedure and the continuation strand within the parent allel access history and its correctness can be found
procedure. These strands are then inserted into Eng in [, ].
and Heb in opposite orders, as they operate logically in The other challenge in parallelizing the access his-
parallel. Whenever beginning to execute a sync block tory is in performing concurrent updates to the data
in procedure F, either after spawn’ing F or sync’ing structure. Specifically, two parallel readers may attempt
within F, a new strand eS is created to represent the to update the same reader[ℓ] at the same time. (If the
start of the next sync block. This eS is then inserted goal is to just report a single race in the program, updat-
after the currently executing strand in both Eng and ing writer[ℓ] in parallel is not as much of an issue, as
Heb, as it operates logically after the currently execut- discovering such a parallel update indicates a race on
ing strand. Unlike SP-bags, no data structural changes its own. If the goal is to report one race for each mem-
occur in SP-order when return’ing from a procedure. ory location, as in the Nondeterminator, then concur-
One interesting feature of SP-order is that elements rent updates to writer[ℓ] also matter here.) Some extra
(strands) are only added to the data structure. No ele- machinery is required to guarantee the correct value is
ments are ever reordered once added. Unreferenced recorded when concurrent updates occur. As discussed
elements may be removed as part of garbage collection. by Fineman [], concurrent updates by P processors
The (incomplete) English and Hebrew orderings at any can be resolved in O(t lg P) worst-case time, where t
time are thus consistent with the a posteriori orderings, is the serial cost of an access-history update: t = O()
with not-yet-encountered strands removed. The cor- for a determinacy-race detector, and t = O(nk ) for a
rectness of SP-order, the proof of which is given in [], data-race detector.
is arguably easier to convince oneself of than the proof
of SP-bags. Parallel SP-Maintenance
The SP-bags algorithm is inherently serial, as the cor-
Making the Nondeterminator Run rectness relies on an execution order corresponding
in Parallel to an English walk of the parse tree. Correctness of
The two Nondeterminators and Cilkscreen are serial SP-order, on the other hand, does not rely on any par-
race detectors. Even though these are debugging tools ticular execution order. SP-order thus appears like a
intended for parallel programs, they themselves run natural choice for a parallel algorithm. The main com-
serially. This section discusses how to make these race plication arises in performing concurrent updates to the
detectors run in parallel. Both the access history and underlying order-maintenance data structures of SP-
SP-maintenance algorithms must be augmented to per- order. The obvious approach for parallelizing SP-order
mit a parallel algorithm. Moreover, contention due to directly requires locking the order-maintenance data
concurrent updates to these data structures needs to be structures whenever performing an update. As with the
addressed. access history, a lock may cause all the other P −  pro-
cessors to waste time waiting for the lock, thus causing
A Parallel Access History a Θ(P) overhead per lock acquisition. For programs
The access history described in section “Access History” with short strands, a Θ(P) overhead per strand cre-
is appealing because it requires only a single reader ation results in no improvement as compared to a serial
and writer for a determinacy-race detector, thus keep- execution.
ing the cost of maintaining and querying against the The SP-ordered-bags algorithm, previously called
access history low. This strategy, however, is inherently SP-hybrid [, ], uses a two-tiered algorithm with a
serial. If only a single reader and writer are recorded, global tier and a local tier to overcome the scalabil-
a parallel race detector may not discover certain races ity problems with lock synchronization. The global tier
in the program []. Fortunately, Mellor-Crummey [] uses a parallel SP-order algorithm with locks, and the
shows that recording two readers and writers suffices local tier uses the serial SP-bags algorithm without
to guarantee correctness in a parallel execution, which locks. By bounding the number of updates to the global
increases the number of SP-maintenance queries by tier, the locking overhead is reduced.
Race Detectors for Cilk and Cilk++ Programs R 

SP-ordered-bags leverages the fact that the Cilk scheduler executes a program mostly in an English order. The only exception occurs on steals, when the thief processor causes a violation in the English ordering. The thief then continues executing its local subtree serially in an English order. Thus, any serial SP-maintenance algorithm, like SP-bags, is viable within each of these local serially executing subtrees. (SP-bags is also desirable as it interacts well with the global tier; details are omitted here.) Moreover, each local SP-bags data structure is only updated by the single processor executing that subtree, and thus no locking is required. The local tier permits SP queries within a single local subcomputation.
The global tier is responsible for arranging the local subcomputations to permit SP queries among these subcomputations. SP-ordered-bags employs the parallel SP-order algorithm to arrange the subcomputations. Although the global tier is locked whenever an update occurs, updates occur only when the Cilk scheduler causes a steal. Since the Cilk scheduler causes at most O(PT∞) steals, where T∞ is the span of the program being executed, updates to the global tier add only O(PT∞) to the running time of the program.
Combining the local and global tiers together into SP-ordered-bags yields a parallel SP-maintenance algorithm that runs in O(T1/P + PT∞) time on P processors, where T1 and T∞ are the work and span of the underlying program being tested. A detailed discussion of SP-ordered-bags, including correctness and performance proofs, can be found in [, ].

Related Entries
Cilk
Race Conditions
Race Detection Techniques

Bibliographic Notes and Further Reading
Static race detectors [, , , , , ] analyze the text of the program to determine whether a race occurs. Static analysis tools may be able to determine whether a memory location can be involved in a race for any input. These tools are inherently conservative, however, sometimes reporting races that do not exist, since the static debuggers cannot fully understand the control-flow and synchronization semantics of a program. For example, dynamic control-flow constructs (e.g., a fork statement) inside if statements or loops are particularly difficult to deal with. Mellor-Crummey [] proposes using static tools as a way of pruning the number of memory locations being monitored by a dynamic race detector.
Dynamic race detectors execute the program given a particular input. Some dynamic race detectors perform a post-mortem analysis based on program-execution logs [, , , , –], analyzing a log of program-execution events after the program has finished running. On-the-fly race detectors, like the Nondeterminators and Cilkscreen, report races during the execution of the program. Both dynamic approaches are similar and use some form of SP-maintenance algorithm in conjunction with an access history. On-the-fly race detectors benefit from garbage collection, thereby reducing the total space used by the tool. Post-mortem tools, on the other hand, must keep exhaustive logs.
Netzer and Miller [] provide a common terminology to unify previous work on dynamic race detection. A feasible data race is a race that can actually occur in an execution of the program. Netzer and Miller show that locating feasible data races in a general program is NP-hard. Instead, most race detectors, including the ones in this article, deal with the problem of discovering apparent data races, which is an approximation of the races that may actually occur. These race detectors typically ignore data dependencies that may make some apparent races infeasible, instead considering only explicit coordination or control-flow constructs (like forks and joins). As a result, these race detectors are conservative and report races that may not actually occur.
Dinning and Schonberg's "lock-covers" algorithm [] detects apparent races in programs that use locks. Cheng et al. [] generalize this algorithm and improve its running time with their All-Sets algorithm, which is the one used in the Nondeterminator-2 and Cilkscreen.
Savage et al. [] give an on-the-fly race detector called Eraser that does not use an SP-maintenance algorithm, and hence would report races between strands that operate in series if applied to Cilk. Their Eraser tool works on programs that have static threads (i.e., no nested parallelism) and enforces a simple locking discipline: a shared variable must be protected by a particular lock on every access, or a race is reported.
algorithm [] employs SP maintenance while enforc- Bibliography


ing an “umbrella” locking discipline, which generalizes . Appelbe WF, McDowell CE () Anomaly reporting: a tool for
Eraser’s locking discipline. Brelly was incorporated into debugging and developing parallel numerical algorithms. In: Pro-
the Nondeterminator-, but although it theoretically ceedings of the st international conference on supercomputing
systems. IEEE, pp –
runs faster than the All-Sets algorithm in the worst case,
. Balasundaram V, Kennedy K () Compile-time detection of
most users preferred the All-Sets algorithm for the com- race conditions in a parallel program. In: Proceedings of the rd
pleteness of coverage, and the typical running time of international conference on supercomputing, ACM Press, New
All-Sets is not onerous. York, pp –
Nudler and Rudolph [] introduced the English- . Bender MA, Cole R, Demaine ED, Farach-Colton M, Zito J
() Two simplified algorithms for maintaining order in a list.
Hebrew labeling scheme for their SP-maintenance
In: Proceedings of the European syposium on algorithms, pp
algorithm. Each strand is assigned two static labels, –
similar to the labeling described for SP-order. They . Bender MA, Fineman JT, Gilbert S, Leiserson CE () On-the-
do not, however, use a centralized data structure fly maintenance of series-parallel relationships in fork-join multi-
to reassign labels. Instead, label sizes grow propor- threaded programs. In: Proceedings of the sixteenth annual ACM
tionally to the maximum concurrency of the pro- symposium on parallel algorithms and architectures, Barcelona,
Spain, June , pp –
gram. Mellor-Crummey [] proposed an “offset-
. Bruening D () Efficient, transparent, and comprehensive
span labeling” scheme, which has label lengths pro- runtime code manipulation. Ph.D. thesis, Department of Electri-
portional to the maximum nesting depth of forks. cal Engineering and Computer Science, Massachusetts Institute
Although it uses shorter label lengths than the English- of Technology
Hebrew scheme, the size of offset-span labels is not . Callahan D, Sublok J () Static analysis of low-level syn-
chronization. In: Proceedings of the  ACM SIGPLAN and
bounded by a constant as it is SP-order. Both of these
SIGOPS workshop on parallel and distributed debugging, ACM
approaches perform local decisions on strand creation Press, New York, pp –
to assign static labels. Although these approaches result . Cheng G-I, Feng M, Leiserson CE, Randall KH, Stark AF ()
in no locking or synchronization overhead for SP- Detecting data races in Cilk programs that use locks. In: Pro-
maintenance, and are inherently parallel algorithms, the ceedings of the ACM symposium on parallel algorithms and
large labels can drastically increase the work of the race architectures, June , pp –
. Choi J-D, Miller BP, Netzer RHB () Techniques for debugging
detector.
parallel programs with flowback analysis. ACM Trans Program
Dinning and Schonberg’s “task recycling” algo- Lang Syst ():–
rithm [] uses a centralized data structure to main- . Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduc-
tain series-parallel relationships. Each strand (block) tion to algorithms, rd edn. MIT Press, Cambridge
is given a unique task identifier, which consists of a . Dietz PF () Maintaining order in a linked list. In: Proceedings
of the ACM symposium on the theory of computing, May ,
task and a version number. A task can be reassigned
pp –
(recycled) to another strand during the program exe- . Dietz PF, Sleator DD () Two algorithms for maintaining order
cution, which reduces the total amount of space used in a list. In: Proceedings of the ACM symposium on the theory of
by the algorithm. Each strand is assigned a parent vec- computing, May , pp –
tor that contains the largest version number, for each . Dinning A, Schonberg E () An empirical comparison of
task, of its ancestor strands. To query the relationship monitoring algorithms for access anomaly detection. In: Proceed-
ings of the ACM SIGPLAN symposium on principles and practice
between an active strand e and a strand e recorded in
of parallel programming, pp –
the access history, task recycling simply compares the . Dinning A, Schonberg E () Detecting access anomalies
version number of e ’s task against the version number in programs with critical sections. In: Proceedings of the
stored in the appropriate slot in e ’s parent vector, which ACM/ONR workshop on parallel and distributed debugging,
is a constant-time operation. The cost of creating a new May , ACM Press, pp –
. Emrath PA, Ghosh S, Padua DA () Event synchronization
strand, however, can be proportional to the maximum
analysis for debugging parallel programs. In: Proceedings of the
logical concurrency. Dinning and Schonberg’s algo-  ACM/IEEE conference on supercomputing, November ,
rithm also handles other coordination between strands, pp –
like barriers, where two parallel threads must reach a . Emrath PA, Padua DA () Automatic detection of nondeter-
particular point before continuing. minacy in parallel programs. In: Proceedings of the workshop
on parallel and distributed debugging, Madison, Wisconsin, May , pp –
. Feng M, Leiserson CE () Efficient detection of determinacy races in Cilk programs. In: Proceedings of the ACM symposium on parallel algorithms and architectures, June , pp –
. Fineman JT () Provably good race detection that runs in parallel. Master's thesis, Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science, August 
. Frigo M, Halpern P, Leiserson CE, Lewin-Berlin S () Reducers and other Cilk++ hyperobjects. In: Proceedings of the twenty-first annual symposium on parallelism in algorithms and architectures, pp –
. Frigo M, Leiserson CE, Randall KH () The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, pp –
. Gabow HN, Tarjan RE () A linear-time algorithm for a special case of disjoint set union. J Comput System Sci ():–
. Helmbold DP, McDowell CE, Wang J-Z () Analyzing traces with anonymous synchronization. In: Proceedings of the  international conference on parallel processing, August , pp II–II
. Steele GL Jr () Making asynchronous parallelism safe for the world. In: Proceedings of the seventeenth annual ACM symposium on principles of programming languages, ACM Press, pp –
. Luk C-K, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi VJ, Hazelwood K () Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of the  ACM SIGPLAN conference on programming language design and implementation, ACM Press, New York, pp –
. Mellor-Crummey J () On-the-fly detection of data races for programs with nested fork-join parallelism. In: Proceedings of supercomputing, pp –
. Mellor-Crummey J () Compile-time support for efficient data race detection in shared-memory parallel programs. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, San Diego, California, May , ACM Press, pp –
. Miller BP, Choi J-D () A mechanism for efficient debugging of parallel programs. In: Proceedings of the  ACM SIGPLAN conference on programming language design and implementation, Atlanta, Georgia, June , pp –
. Nethercote N, Seward J () Valgrind: a framework for heavyweight dynamic binary instrumentation. In: Proceedings of the ACM SIGPLAN  conference on programming language design and implementation, ACM, San Diego, June , pp –
. Netzer RHB, Ghosh S () Efficient race condition detection for shared-memory programs with post/wait synchronization. In: Proceedings of the  international conference on parallel processing, St. Charles, Illinois, August 
. Netzer RHB, Miller BP () On the complexity of event ordering for shared-memory parallel program executions. In: Proceedings of the  international conference on parallel processing, August , pp II:–
. Netzer RHB, Miller BP () Improving the accuracy of data race detection. In: Proceedings of the third ACM SIGPLAN symposium on principles and practice of parallel programming, New York, NY, USA, ACM Press, pp –
. Netzer RHB, Miller BP () What are race conditions? ACM Lett Program Lang Syst ():–
. Nudler I, Rudolph L () Tools for the efficient development of efficient parallel programs. In: Proceedings of the first Israeli conference on computer systems engineering, May 
. Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T () Eraser: a dynamic race detector for multi-threaded programs. In: Proceedings of the sixteenth ACM symposium on operating systems principles (SOSP), ACM Press, New York, pp –
. Tarjan RE () Efficiency of a good but not linear set union algorithm. J ACM ():–
. Tarjan RE () Data structures and network algorithms. Society for Industrial and Applied Mathematics, Philadelphia
. Taylor RN () A general-purpose algorithm for analyzing concurrent programs. Commun ACM ():–
. Tsakalidis AK () Maintaining order in a generalized linked list. Acta Inform ():–

Race Hazard
Race Conditions

Radix Sort
Sorting

Rapid Elliptic Solvers
Efstratios Gallopoulos
University of Patras, Patras, Greece

Synonyms
Fast Poisson solvers

Definition
Direct numerical methods for the solution of linear systems obtained from the discretization of certain partial
differential equations, typically elliptic and separable, defined on rectangular domains in d dimensions, with sequential computational complexity O(N log N) or less, where N is the number of unknowns.

Discussion
Mathematical models in many areas of science and engineering are often described, in part, by elliptic partial differential equations (PDEs), so their fast and reliable numerical solution becomes an essential task. For some frequently occurring elliptic PDEs, it is possible to develop solution methods, termed Rapid Elliptic Solvers (RES), for the linear systems obtained from their discretization that exploit the problem characteristics to achieve (almost linear) complexity O(N log N) or less (all logarithms are base 2) for systems of N unknowns. RES are direct methods, in the sense that in the absence of roundoff they give an exact solution, and they require about the same storage as iterative methods. When well implemented, RES can solve the problems they are designed for faster than other direct or iterative methods [, ]. The downside is their limited applicability, since RES impose restrictions on the PDE and the domain of definition, and their performance and ease of implementation may also depend on the boundary conditions and the size of the problem.
Major historical milestones were the paper [] by Hyman, where Fourier analysis and marching were proposed to solve Poisson's equation as a precursor of methods that were analyzed more than a decade later; and the paper by Bickley and McNamee [, Sect. ], where the inverse of the special block tridiagonal matrices occurring when solving Poisson's equation and a first version of the Matrix Decomposition algorithm were described. Reference [] by Hockney, with its description of the FACR() algorithm and the solution of tridiagonal systems with cyclic reduction (in collaboration with Golub), is widely considered to mark the beginning of the modern era of RES. That and the paper by Buzbee, Golub, and Nielson [], with its detailed analysis of the major RES, were extremely influential in the development of the field. The evolution to methods of low sequential complexity was enabled by two key developments: the advent of the FFT and fast methods for the manipulation of Toeplitz matrices.
Interest in the design and implementation of parallel algorithms for RES started in the early 1970s, with the papers of Buzbee [] and Sameh et al. [], and has been attracting the attention of computational scientists ever since. These parallel RES can solve the linear systems under consideration in O(log N) parallel operations on O(N) processors, instead of the fastest but impractical algorithm for general linear systems that requires O(log² N) parallel operations on a number of processors that grows polynomially (superlinearly) in N.
Parallel implementations were discussed as early as  for the Illiac IV (see Illiac IV) [] and subsequently for most important high-performance computing platforms, including vector processors, vector multiprocessors, shared memory symmetric multiprocessors, distributed memory multiprocessors, SIMD and MIMD processor arrays, clusters of heterogeneous processors, and Grid environments. References for specific systems can be found for the Cray-1 [, ]; the ICL DAP []; Alliant FX/8 [, ]; Caltech and Intel hypercubes [, , , ]; Thinking Machines CM- []; Denelcor HEP []; the University of Illinois Cedar machine [, ]; Cray X-MP []; Cray Y-MP []; Cray T3E [, ]; Grid environments []; Intel multicore processors and processor clusters []; GPUs []. Regarding the latter, it is worth noting the extensive studies conducted at Yale on an early GPU, the FPS-164 []. Proposals for special-purpose hardware are in []. RES were included in the PELLPACK parallel problem-solving environment for elliptic PDEs []. More information can be found in the annotated bibliography provided in []. As will become clear, RES are an important class of structured matrix computations whose design and parallelization depend on special mathematical transformations not usually applicable when dealing with direct matrix computations. This is a topic with many applications and the subject of intensive investigation in computational mathematics.

Problem Formulation
RES are primarily applicable for solving numerically separable elliptic PDEs whose discretization (cf. [, , ]) leads to block Toeplitz tridiagonal linear systems of order N = mn,

AU = F, where A = tridn[W, T, W] and W, T ∈ Rm×m, ()

where W, T are symmetric, simultaneously diagonalizable and thus WT = TW. Symbol tridn[C, A, B] denotes
block Toeplitz tridiagonal or Toeplitz tridiagonal matri- blocks of more general solvers, e.g., in domain decom-
ces, where C, A, B are the (matrix or scalar) elements position and preconditioning.
along the subdiagonals, diagonals, and superdiagonals, Sequential RES are much faster than direct solvers
respectively, and n is their number. such as sparse Cholesky elimination (see  Sparse Direct
A simple but very common problem is Poisson’s Methods). This advantage carries over to parallel RES.
equation on a rectangular region. After suitable dis- Other methods that can be used but are addressed
cretization, e.g., using an m × n rectangular grid, () elsewhere in this volume are multigrid methods (see
becomes Algebraic Multigrid) and preconditioned conjugate

⎛ T −I ⎞⎛ U ⎞ ⎛ F ⎞ gradients (see Iterative Methods) that are the basis for


⎜ ⎟⎜  ⎟ ⎜  ⎟ software PDE packages such as PETSc.
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ −I T −I ⎟⎜ U ⎟ ⎜ F ⎟
⎜ ⎟⎜  ⎟ ⎜  ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟ Mathematical Preliminaries and Notation
⎜ ⋱ ⋱ ⋱ ⎟⎜ ⋮ ⎟ = ⎜ ⋮ ⎟, ()
⎜ ⎟⎜ ⎟ ⎜ ⎟ The following concepts from matrix analysis are essen-
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ −I −I ⎟⎜ ⋮ ⎟ ⎜ ⋮⎟ tial to describe and analyze RES.
⎜ T ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟ . Kronecker products of matrices and their properties.
⎝ −I T ⎠ ⎝ Un ⎠ ⎝ Fn ⎠ . Chebyshev polynomials (first and second kind)
where T = trid m [−, , −] is Toeplitz symmetric and Cd , Sd , and modified Chebyshev polynomials (sec-
tridiagonal. Vectors Ui = [Ui, , . . . , Ui,m ]⊺ ∈ Rm for i = ond kind) C˜d .
, . . . , n contain the unknowns at the ith grid row, and . The analytic expression for the eigenvalues and
Fi = [Fi, , . . . , Fi,m ]⊺ the values of the right-hand side eigenvectors of the symmetric tridiagonal Toeplitz
F including the boundary terms and scalings related to matrix T = tridm [−, α, −]: Specifically
the discretization. This is the problem for which RES are √
eminently applicable and mostly discussed in the liter- πj 
λ j = α −  cos( ) , qj =
ature of RES, hence the term “fast Poisson solvers.” The m+ m+
homogeneous case, f ≡ , is Laplace’s equation, that πj πjm ⊺
[sin( ) , . . . , sin( )] . ()
lends itself to even more specialized fast methods (not m+ m+
discussed here). RES can also be designed to solve the
general separable problem If Q = [q , . . . , qm ] then
√ the product Qy for any
πij
∑j= sin( m+ ) for i =
m
− (a(x)uxx + b(x)ux ) − (d(y)uyy + e(y)uy ) + (c(x) y ∈ R has elements m+
m 

+ c̃(y))u = f (x, y) () , . . . , m and so is the discrete sine transform (DST)


of y times a scaling factor. Also Q⊺ = Q and Q⊺ Q = I.
whose discrete form is The DST can be computed in O(m log m) opera- R
⎛T + α  I −β  I ⎞ ⎛ U ⎞ ⎛ F ⎞ tions using FFT-type methods. Similar results for
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟ the eigenstructure also hold for slightly modified
⎜ −γ I T + α  I −β  I ⎟ ⎜ U ⎟ ⎜ F ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟ matrices that occur when the boundary conditions
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⋱ ⋱ ⋱ ⎟ ⎜ ⋮⎟ = ⎜ ⋮⎟ , ()
⎜ ⎟⎜ ⎟ ⎜ ⎟ for the continuous problem are not strictly Dirichlet.
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ −γ I T + α I −β I ⎟ ⎜ ⋮⎟ ⎜ ⋮⎟
⎜ n− n− n− ⎟⎜ ⎟ ⎜ ⎟ . The fact (see [, , ]) that for any nonsingular
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎝ −γn I T + α n− I ⎠ ⎝Un ⎠ ⎝Fn ⎠ T ∈ Rm×m , the matrix A = tridn [−I, T, −I] is non-
singular and A− can be written as a block matrix
for scalars α j , β j , γ j . This is no longer block Toeplitz, with general term
but the diagonal and off diagonal blocks have special ⎧

⎪ −
structure. − ⎪Sn (T)Si− (T)Sn−j (T), j ≥ i,

(A )ij = ⎨ ()
Once the equations become nonseparable and/or ⎪
⎪ −

⎪ S (T)S (T)S (T), i ≥ j.
the domain irregular, RES are not directly applicable ⎩ n j− n−i

and other methods are preferred. All is not lost, how- . The vec operator and its inverse, unvec n . Also the
ever, because frequently RES could be used as building vec-permutation matrix Π m,n Rmn×mn , that is the
unique matrix such that vec(A) = Π m,n vec(A⊺ ), Chapter (see FFTW) documents very efficient algo-
where A ∈ Rm×n . The matrix is orthogonal and rithms for the single and multiple FFTs needed for
Π m,n (In ⊗ T̃m ) = (T̃m ⊗ In )Π m,n . kernels of type (). Parallel algorithms for the fast trans-
forms required in the context of the RES have O(log m)
complexity on m processors.
Algorithmic Infrastructure
Most RES make use of the finite difference analogue of a Matrix Decomposition
key idea in the separation of variables method for PDEs, MD refers to a large class of methods to solve systems
namely, that under certain conditions, some multidi- with matrices such as (). As Buzbee noted in []: “It
mensional problems can be solved by solving several seldom happens that the application of L processors
one-dimensional problems []. In line with this, it would yield an L-fold increase in efficiency relative to a
turns out that RES for () make extensive use of and single processor, but that is the case with the MD algo-
their performance greatly depends on two basic ker- rithm.” Sameh et al. in [] provided the first detailed
nels: () Tridiagonal linear system solvers and () fast study of parallel MD for Poisson’s equation. MD can be
Fourier and trigonometric transforms. These kernels succinctly described, as shown early on by Lynch et al.
are also used extensively in collective form, that is for and Egerváry, using the compact Kronecker product
given tridiagonal T ∈ Rm×m and any Y ∈ Rm×s , solve representation of A (see [] and extensive discussion
TX = Y for s ≥ . In many cases, shifted or multiply and references in []). Of greatest relevance here is the
shifted tridiagonal systems with one or multiple right- case that A = (In ⊗ T̃m + T̃n ⊗ Im ), where any of T̃m , T̃n
hand sides must be solved. Also if the m × m matrix is diagonalizable using matrices expressed by Fourier
Q represents the discrete sine, cosine or Fourier trans- or trigonometric transforms. Denoting by Qx ∈ Rm×m
form of length m, then RES call for the fast computation (resp. Qy ∈ Rn×n ) the matrix of eigenvectors of T̃m
of QY as well as the inverse transforms Q− Y. The best (resp. T̃n ), then AU = F can be rewritten as
known parallel techniques for solving tridiagonal sys-
(In ⊗ Q⊺x ) (In ⊗ T̃m + T̃n ⊗ Im )(In ⊗ Qx )
tems are recursive doubling, cyclic reduction, and parallel
cyclic reduction, that all need O(log m) parallel arith- (In ⊗ Q⊺x ) U = (In ⊗ Q⊺x ) F
metic operations on O(m) processors, e.g., [, , ]. and equivalently as
For more flexibility, it is possible to combine those or
(In ⊗ Λ̃ m + T̃n ⊗ Im )(In ⊗ Q⊺x ) U = (In ⊗ Q⊺x ) F.
Gaussian elimination with divide-and-conquer meth-
ods. Fourier-based RES also require fast implementa- Applying the similarity transformation with the vec-
tions of the discrete cosine transform and the discrete permutation matrix Π m,n , the system becomes
Fourier transform to handle Neumann and periodic ⊺ ⊺
(Λ̃m ⊗ In + Im ⊗ T̃n )(Πm,n (In ⊗ Qx ) U) = Πm,n (In ⊗ Qx ) F.
boundary conditions. Many publications on RES (e.g., 

B
[, , , ]) include extensive discussion of these
building blocks. It is common for the tridiagonal matri- Matrix B is block diagonal, so these are m independent
ces to have additional properties that can be used for tridiagonal systems of order n each. Thus
faster processing. They are usually symmetric positive
U = (In ⊗ Qx )Π⊺m,n B − Π m,n (In ⊗ Q⊺x ) F. ()
definite, Toeplitz, and have the form tridm [−, τ, −],
where τ ≥ . When τ > , they are diagonally dominant When either of T̃m of T̃n is diagonalizable with trigono-
in which case it is possible to terminate early Gaus- metric or Fourier transforms, as in the matrix of () for
sian elimination (as implemented by O’Donnell et al. the Poisson equation, where they both are of the form
in []) and cyclic reduction (as proposed by Hockney trid[−, , −], and matrix () when the coefficients are
in []), at negligible error. ScaLAPACK includes effi- constant in one direction, MD will also be called Fourier
cient tridiagonal and narrow-banded solvers, but the MD and can be applied at total cost O(N log N). For
Toeplitz structure is not exploited (see ScaLAPACK). example, if Qx can be applied in O(m log m) opera-
A detailed description of the fast Fourier and other dis- tions, the overall cost becomes O(nm log m) for Fourier
crete transforms used in RES can be found in []. MD. From () emerge the three major computational
phases of Fourier MD for (). In phase (I), the term where matrix (In ⊗ Λ̃ m + Λ̃ n ⊗Im ) is diagonal. Multiplica-
(In ⊗ Q⊺x ) F is computed, which amounts to the DST tion by Q⊺y ⊗ Q⊺x amounts to performing n independent
of n vectors Fi , i = , . . . , n to compute the F̂i = DSTs of length m and m independent DSTs of length n;
Q⊺x Fi . In phase (II), m independent tridiagonal sys- similarly for Qy ⊗ Qx . The middle phase is an element-
(m)
tems with coefficient matrices Bj = tridn [−, λ j , −] by-element division by the diagonal of In ⊗ Λ̃ m + Λ̃ n ⊗Im
(m) that contains the eigenvalues of A.
are solved, where λ j is the jth eigenvalue of T
The parallel computational cost is O(log m +
in ( ). The right-hand sides are the m columns of
log n) operations on O(mn) processors. Hockney and
[F̂ , . . . , F̂n ]⊺ . In terms of the vec- constructs, these
Jesshope [], Swarztrauber and Sweet [], among oth-
are the m contiguous, length-n subvectors of the mn
ers, studied CFT for various parallel systems. In the
length vector Π m,n vec[F̂ , . . . , F̂n ]. The respective solu-
latter, CFT was found to have lower computational
tions Ûj , j = , . . . , m are stacked and reordered into
and communication complexity than MD and BCR for
Ū = Π⊺m,n vec[Û , . . . , Ûm ] so that phase (III) consists
systems with O(mn) processors connected in a hyper-
of independent application of a scaled, length m DST
cube. Specific parallel implementations were described
of each column of [Ū , . . . , Ūn ] = unvec n Ū. Opportu-
by O’Donnel et al. in [] for the FPS- attached array
nities for parallelism abound, and if mn processors are
processor where CFT had lower performance than MD
available and each tridiagonal system is solved using a
and FACR(), by Cote in [] for Intel hypercubes; see
parallel algorithm of logarithmic complexity, then MD
also [].
can be accomplished in O(log mn) parallel operations.
There are several ways of organizing the computation.
When there are n = m processors, phases (I) and
Block Cyclic Reduction
BCR is a solver for () that generalizes the (scalar) cyclic
(III) can be performed independently in m log m steps.
reduction of Hockney and Golub. It is also more general
This will leave the data in the middle phase distributed
than Fourier MD because it does not require knowledge
across the processors so that the multiple tridiagonal
of the eigenstructure of T neither uses fast transforms.
systems will be solved with a parallel algorithm that
A detailed analysis is found in []. BCR was key to the
would require significant data movement. Instead, one
design of FISHPAK, the influential numerical library by
can transpose the output data from phase (I) and also
Swarztrauber, Sweet, and Adams.
at the end of phase (II) so that all data is readily accessi-
It is outlined next for n = k −  blocks, though, as
ble from any processor that needs it. This approach has
shown by Sweet, the method can be modified to han-
the advantage that it builds on mature numerical soft-
dle any n. For steps r = , . . . , k − , adjacent blocks of
ware for uniprocessors. It requires, however, the use of
equations are combined in groups of three to eliminate
efficient algorithms for matrix transposition.
two blocks of unknowns; in the first step, for instance,
MD algorithms have been designed and imple-
unknowns from even-numbered blocks are eliminated R
mented on vector and parallel architectures and are
and a reduced system. A() = tridk− − [−I, A − I, −I],
frequently used as the baseline in evaluating new Pois-
containing approximately half the blocks remains. Set-
son solvers; cf. [, –, , , , , , , , ]
ting T () = T, and T (r) = (T (r−) ) − I, at the rth step
and [] for more details. More applications of MD can
the reduced system is
be found in the survey [] by Bialecki et al.
tridk−r − [−I, T (r) , −I]U(r) = F(r) ,
Complete Fourier Transform where F(r) = vec[Fr . , . . . , Fr .(k−r −) ]. These are com-
The idea of CFT was discussed by Hyman in [] and puted by
described in the context of the tensor product methods
of Lynch et al. (cf. [, ]). For low cost, this requires Fr .j = Fr− (j−) + Fr− (j+) + T (r−) Fr− j . ()
that both T̃m and T̃n are diagonalizable with Fourier or Matrix T (r) can be written in terms of Chebyshev
trigonometric transforms, as is the case for (). Then the polynomials, T (r) = Cr ( T ) = C̃r (T). The
(r)
solution of () is roots are known analytically, ρ i =  cos( (i−)
 r+
π),
U = (Qy ⊗ Qx )(In ⊗ Λ̃ m + Λ̃ n ⊗ Im )− (Q⊺y ⊗ Q⊺x ) F, therefore the product form of T (r) is also available.
(r) (r)
After r = k −  steps, only the system T (k) Uk− = Fk− where the α i are the corresponding partial fraction
remains. After this is solved, a back substitution phase coefficients that can be easily computed from the val-
(r)
recovers the remaining unknowns. Each reduction step ues of the derivative of C̃r (z) at z = ρ i . This was
requires k−r −  independent matrix-vector multipli- described by Sweet (see []) and also by Gallopoulos
cations with T (r−) . All multiplications with T (r−) are and Saad in []. Moreover, when BCR is applied for
done using its product form so that the total cost is more general values of n or other boundary conditions
O(nm log n) operations without recourse to fast trans- that lead to more general matrix rational functions, par-
forms. Even with unlimited parallelism, the systems tial fractions eliminate the need for operations with the
composing T (r) must be solved in sequence, which cre- numerator. Expression () can be evaluated indepen-
ates a parallelization bottleneck. Also the algorithm as dently for each right-hand side. Therefore, solving with
stated (sometimes called CORF []) is not numerically coefficient matrix T (r) and k−r −  right-hand sides for
viable because the growing discrepancy in the magni- r = , . . . , k −  can be accomplished by solving r inde-
tude of the terms in () leads to excessive roundoff. pendent tridiagonal systems for each right-hand side
Both problems can be overcome. Stabilization is and then combining the partial solutions by multiply-
achieved by means of a clever modification due to Bune- ing k−r −  matrices, each of size m × r with the vector
man and analyzed in detail in [], in which the unstable of r partial fraction coefficients.
recurrence () is replaced with a coupled recurrence for The algorithm has parallel complexity O(log n log m).
a pair of new vectors, that can be computed stably, by Even though this is somewhat inferior to parallel MD
exchanging tridiagonal matrix-vector multiplications and CFT, BCR has wider applicability; cf. the survey
with tridiagonal linear system solves. The number of of Bini and Meini in [] for many additional uses and
sequential operations remains O(mn log n). parallelization. References [, , , ] discuss the
The parallel performance bottleneck can be resolved application of partial fraction based parallel BCR and
using partial fractions, a versatile tool for parallel pro- evaluate its performance on vector processors and mul-
cessing first used, ingeniously, by H.T. Kung in [] to tiprocessors, multicluster vector multiprocessors, and
enable the evaluation of arithmetic expressions such hypercubes. As shown by Calvetti et al. in [], the use of
as xn , ∏ni= (x + ρ i ) and Chebyshev polynomials in few partial fractions above is numerically safe for the poly-
parallel divisions and O(log n) parallel additions. For nomials that occur when solving Poisson’s equation, but
scalars, these findings are primarily of theoretical inter- caution is required in the general case.
est since the expressions can also be computed in The above techniques for parallelizing MD and BCR
O(log n) multiplications. The idea becomes much more were implemented in CRAYFISHPAK, a package that
attractive for matrices (the problem of computing pow- contains most of the functionality of FISHPAK (results
ers of a matrix is mentioned in [] but not pursued any for the Cray Y-MP listed in []). Beyond BCR, par-
further) as is the case of BCR in the CORF or the sta- tial fractions can also be used to parallelize the com-
ble variants of Buneman. Specifically, the parallelization putation important matrix functions, e.g., the matrix
bottleneck that occurs when solving exponential.

r
(r) (r) (r) FACR
T Xr = Yr , where T = ∏(T − ρ i I) and
i= This method, proposed in [], is based on the fact that
m×( k−r −) at any step of the reduction phase of BCR, the coeffi-
Xr , Y r ∈ R
cient matrix is A(r) = tridk−r − [−I, T (r) , −I], and that
for large r is resolved because, for each r, the roots because T (r) = Cr ( T ) has the same eigenvectors
(m)
(r) λ
ρ i , i = , . . . , r are distinct and so for any right-hand as T and eigenvalues Cr ( j ), the reduced system
side f ∈ Rm , can be solved using Fourier MD. FACR(l) (acronym
for Fourier Analysis Cyclic Reduction) is a hybrid algo-
r − rithm consisting of the following phases: () Perform l
(r) (r)
(T (r) )− f = ∑ α i (T − ρ i I) f , ()
i=
steps of BCR to obtain A(l) and F (l) . () Use MD to
solve A(l) U (l) = F (l) . () Use l steps of back substitu- be computed from this and the boundary values Un+
tion to obtain the remaining subvectors of U. Research by the recurrence Uj− = Fj + Uj+ − TUj in O(mn)
of Hockney, Swarztrauber, and Temperton has shown operations. Vector Un can also be computed in O(mn)
that a value l ≈ log log m reduces the number of sequen- operations in the course of block LU of a reordering
tial operations down to O(mnl). Therefore, properly of () by solving a linear system with a matrix rational
designed FACR is faster than MD and BCR. In practice, function of Chebyshev polynomials like Sn .
the best choice for l depends on the relative perfor- The process is not numerically viable for large n,
mance of the underlying kernels and other character- thus in GM the domain is partitioned into k strips that
istics of the computer platform. The selection for vector are small enough so that sufficient accuracy is main-
and parallel architectures has been studied at length by tained when marching is used to compute the solu-
Hockney (cf. [] and references therein) and Jesshope tion in each. This is a form of domain decomposition,
and by Briggs and Turnbull []. A general conclusion whose merits for parallel processing are discussed in
is that l is typically very small (so that BCR can be the next section. The interface values of U necessary
applied in its simplest CORF form), especially for high to start the independent marches can be computed
levels of parallelism, and that it is worth determining it independently extending the process described above.
empirically. Additional parallelism is available from using partial
The parallel implementation of all steps can pro- fractions.
ceed using the techniques deployed for BCR and MD; When the right-hand-side F of ( ) is sparse and
in fact, FACR(l) can be viewed as an alternative to par- one is interested in only few values of U, then the
tial fractions to avoid the bottleneck to parallelization cost of computing such a solution with MD is reduced.
that was observed after a few steps of reduction in BCR. Based on an idea of Banegas, Vassilevski and Kuznetsov
For example, one can monitor the number of systems designed variants of cyclic reduction that generate sys-
that can be solved in parallel in BCR, and before the tems where a partial solution method is deployed to
number of independent computations is small and no build the final solution. Parallel algorithms using these
longer acceptable, switch to MD. This approach was ideas were implemented by Petrova (on a workstation
analyzed by Briggs and Turnbull in [] on a shared- cluster) [] and by Rossi and Toivanen (algorithm
memory multiprocessor. Corroborating the complexity PSCR on a Cray TE) who also showed a radix-q ver-
estimates above, FACR is frequently found to be faster sion of cyclic reduction that eliminates a factor of q ≥ 
than MD and CFT; see, e.g., [] for results on an blocks at a time. As with other RES (e.g., []), to solve
attached array processor. the general separable system (), these methods require
additional preprocessing to compute eigenvalues and
Marching and Other Methods eigenvectors that are not available analytically.
Marching methods are RES that can be used to solve The special form of the inverse blocks of ( ) in R
Poisson’s equation as well as general separable PDEs. ( ) shows that both the matrix and its inverse are
The idea was already mentioned in [] and by , “data sparse,” a property that is used to design effective
as recounted by Hockney [], a marching algorithm approximate algorithms to solve elliptic equations using
by Lorenz was cautiously cited as the fastest perform- an integral equation formulation; cf. [, ]. The explicit
ing RES for Poisson’s equation on uniprocessors, pro- formula for the inverse and partial fractions can be com-
vided that it is “used sensibly.” The caution is due to bined to solve () for any rows of the grid in O(log N)
numerical instability that can have deleterious effects parallel operations, as proposed in []. Marching can
on accuracy. Bank and Rose in [] proposed Generalized also be parallelized by diagonalizing T. Description of
Marching (GM) methods that are stable with operation this and other methods are found in the monograph
count O(n log nk ) when m = n, where k is determined [] by Vajteršic.
by the sought accuracy, instead of the theoretical O(n )
for simple but unstable marching. Domain Decomposition
The key idea in marching methods is that if one Domain decomposition (DD) is a methodology for
knows Un in (), then the remaining Un− , . . . , U could solving elliptic PDEs in which the solution is constructed
by partitioning and solving the problem on subdomains described in [, ] and in [] for the Cray TE. The
(when these do not overlap, DD is called substructur- partial solution method was also extended to three-
ing) and synthesizing the solution by combining the dimensional problems, e.g., [, cf. comments on solver
partial solutions, usually iteratively, until a satisfactory PDCD].
solution is obtained. Whenever the equations and sub- The parallel RES methods above can be extended to
domains are suitable, RES can be used as building blocks handle other types of boundary conditions (Dirichlet,
for DD methods; the algorithm outlined by Buzbee et Neumann and periodic) and operators (e.g., bihar-
al. in [, Sect. ] is a case in point and confirms that monic) and provide a blueprint for the design of
RES were an early motivation behind the development parallel RES of higher order of accuracy, e.g., the
of DD. In some cases, one can dispense of the iterative FACR-like method FFT of Houstis and Papatheodorou
process altogether and compute the solution to the PDE (see [, ]).
at sequential cost O(N log N), thus extending RES to
irregular domains. DD naturally lends itself for parallel
Related Entries
processing (see Domain Decomposition), introduc-
Algebraic Multigrid
ing another layer of parallelism in the RES described so
Domain Decomposition
far and thus providing greater flexibility in their map-
FFTW
ping on parallel architectures, e.g., when processors are
Illiac IV
organized in loosely coupled clusters, for better load
Preconditioners for Sparse Iterative Methods
balancing on heterogeneous processors, etc. One such
ScaLAPACK
method based on Fourier MD was proposed in []
Sparse Direct Methods
and can be viewed as a special case of the Spike banded
Spike
solver method developed by Sameh and collaborators
(cf. []) (Spike).
A slightly different formulation of this DD solver is Bibliography
based on reordering the equations in () combined with . Bank RE, Rose DJ () Marching algorithms for elliptic bound-
block LU. This was investigated by Chan and Resasco ary value problems. Part I: the constant coefficient case. Part II:
and then at length with hypercube implementations (cf. the variable coefficient case. SIAM J Numer Anal ():–
(Part I); – (Part II)
[] and references therein) and by Chan and Fatoohi . Bialecki B, Fairweather G, Karageorghis A () Matrix decom-
for vector multiprocessors in []. The interface val- position algorithms for elliptic boundary value problems: a sur-
ues in domain decomposition can also be computed vey. Numer Algorithms. http://www.springerlink.com/content/
using the partial solution methodology, e.g., in combi- gp/fulltext.pdf 
nation with GM as suggested by Vassilevski and studied . Bickley WG, McNamee J () Matrix and other direct methods
for the solution of systems of linear difference equations. Philos
by Bencheva in the context of parallel processing (cf.
Trans R Soc Lond A: Math Phys Sci ():–
[]). DD was also used by Barberou to increase the per- . Bini DA, Meini B () The cyclic reduction algorithm: from
formance of the partial solution method of Rossi and Poisson equation to stochastic processes and beyond. Numer
Toivanen on a computational Grid of high-performance Algorithms ():–
systems []. . Birkhoff G, Lynch RE () Numerical solution of elliptic prob-
lems. SIAM, Philadelphia
. Börm S, Grasedyck L, Hackbusch W () Lecture notes
Extensions /: hierarchical matrices. Technical report, Max Planck
The above methods can be extended to solve Pois- Institut fuer Mathematik in den Naturwissenschaften, Leipzig
son’s equation in three dimensions. The corresponding . Botta EFF et al () How fast the Laplace equation was solved
matrix will be multilevel Toeplitz: It is block Toeplitz in . Appl Numer Math ():–
tridiagonal with each block being of the form (). Sameh . Briggs WL, Turnbull T () Fast Poisson solvers for MIMD
computers. Parallel Comput :–
in [] proposed a six phase parallel MD-like algorithm
. Buzbee B, Golub G, Nielson C () On direct methods for solv-
that combines independent FFTs in two dimensions ing Poisson’s equation. SIAM J Numer Anal ():– (see
and tridiagonal system solutions along the third dimen- also the comments by Buzbee in Current Contents, , September
sion. Similar parallel RES for the Intel hypercube were )
. Buzbee BL () A fast Poisson solver amenable to parallel . Intel Cluster Poisson Solver Library – Intel Software
computation. IEEE Trans Comput C-():– Network. http://software.intel.com/en-us/articles/intel-cluster-
. Calvetti D, Gallopoulos E, Reichel L () Incomplete partial poisson-solver-library/
fractions for parallel evaluation of rational matrix functions. . Iserles A () Introduction to numerical methods for differen-
J Comput Appl Math :– tial equations. Cambridge University Press, Cambridge
. Chan TF, Fatoohi R () Multitasking domain decomposi- . Johnsson L () Solving tridiagonal systems on ensemble archi-
tion fast Poisson solvers on the Cray Y-MP. In: Proceedings of tectures. SIAM J Sci Statist Comput :–
the fourth SIAM conference on parallel processing for scientific . Jwo J-S, Lakshmivarahan S, Dhall SK, Lewis JM () Compar-
computing. SIAM, Philadelphia, pp – ison of performance of three parallel versions of the block cyclic
. Cote SJ () Solving partial differential equations on a MIMD reduction algorithm for solving linear elliptic partial differential
hypercube: fast Poisson solvers and the alternating direction equations. Comput Math Appl (–):–
method. Technical report UIUCDCS-R--, University of . Knightley JR, Thompson CP () On the performance of some
Illinois at Urbana-Champaign, Urbana rapid elliptic solvers on a vector processor. SIAM J Sci Statist
. Ericksen JH () Iterative and direct methods for solving Pois- Comput ():–
son’s equation and their adaptability to Illiac IV. Technical report . Kung HT () New algorithms and lower bounds for the par-
UIUCDCS-R--, Department of Computer Science, Univer- allel evaluation of certain rational expressions and recurrences.
sity of Illinois at Urbana-Champaign J Assoc Comput Mach ():–
. Ethridge F, Greengard L () A new fast-multipole accelerated . Lynch RE, Rice JR, Thomas DH () Tensor product analysis of
Poisson solver in two dimensions. SIAM J Sci Comput (): partial differential equations. Bull Am Math Soc :–
– . McBryan OA () Connection machine application perfor-
. Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW, Plemmons mance. Technical report CH-CS--, Department of Com-
RJ, Romine CH, Sameh AH, Voigt RG () Parallel algorithms puter Science, University of Colorado, Boulder
for matrix computations. SIAM, Philadelphia . McBryan OA, Van De Velde EF () Hypercube algo-
. Gallopoulos E. An annotated bibliography for rapid ellip- rithms and implementations. SIAM J Sci Stat Comput ():
tic solvers. http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/ s–s
Papers/myresbib . Meurant G () A review on the inverse of symmetric tridiag-
. Gallopoulos E, Saad Y () Parallel block cyclic reduction algo- onal and block tridiagonal matrices. SIAM J Matrix Anal Appl
rithm for the fast solution of elliptic equations. Parallel Comput ():–
():– . O’Donnell ST, Geiger P, Schultz MH () Solving the Poisson
. Gallopoulos E, Saad Y () Some fast elliptic solvers for paral- equation on the FPS-. Technical report YALE/DCS/TR,
lel architectures and their complexities. Int J High Speed Comput Yale University
():– . Petrova S () Parallel implementation of fast elliptic solver.
. Gallopoulos E, Sameh AH () Solving elliptic equations on Parallel Comput ():–
the Cedar multiprocessor. In: Wright MH (ed) Aspects of com- . Polizzi E, Sameh AH () SPIKE: A parallel environment for
putation on asynchronous parallel processors. North-Holland, solving banded linear systems. Comput Fluids ():–
Amsterdam, pp – . Resasco DC () Domain decomposition algorithms for ellip-
. Giraud L () Parallel distributed FFT-based solvers for -D tic partial differential equations. Ph.D. thesis, Yale University,
Poisson problems in meso-scale atmospheric simulations. Int J Department of Computer Science
High Perform Comput Appl ():– . Rossi T, Toivanen J () A parallel fast direct solver for block
R
. Hockney R () A fast direct solution of Poisson’s equation tridiagonal systems with separable matrices of arbitrary dimen-
using Fourier analysis. J Assoc Comput Mach :– sion. SIAM J Sci Stat Comput ():–
. Hockney R, Jesshope C () Parallel computers. Adam Hilger, . Rossinelli D, Bergdorf M, Cottet G-H, Koumoutsakos P ()
Bristol GPU accelerated simulations of bluff body flows using vortex
. Hockney RW () Rapid elliptic solvers. In: Hunt B (ed) particle methods. J Comput Phys ():–
Numerical methods in applied fluid dynamics. Academic, . Sameh AH () A fast Poisson solver for multiprocessors. In:
London, pp – Birkhoff G, Schoenstadt A (eds) Elliptic problem solvers II. Aca-
. Hockney RW () Characterizing computers and optimizing demic, New York, pp –
the FACR(l) Poisson solver on parallel unicomputers. IEEE Trans . Sameh AH, Chen SC, Kuck DJ () Parallel Poisson and bihar-
Comput C-():– monic solvers. Computing :–
. Houstis EN, Rice JR, Weerawarana S, Catlin AC, Papachiou P, . Swarztrauber PN () A direct method for the discrete
Wang K-Y, Gaitatzes M () PELLPACK: a problem-solving solution of separable elliptic equations. SIAM J Numer Anal
environment for PDE-based applications on multicomputer plat- ():–
forms. ACM Trans Math Softw ():– . Swarztrauber PN, Sweet RA () Vector and parallel methods
. Hyman MA (–) Non-iterative numerical solution of for the direct solution of Poisson’s equation. J Comput Appl Math
boundary-value problems. Appl Sci Res B :– :–
. Sweet RA () A parallel and vector cyclic reduction algorithm. SIAM J Sci Statist Comput ():–
. Sweet RA () Vectorization and parallelization of FISHPAK. In: Dongarra J, Kennedy K, Messina P, Sorensen DC, Voigt RG (eds) Proceedings of the fifth SIAM conference on parallel processing for scientific computing. SIAM, Philadelphia, pp – (see also http://www.cisl.ucar.edu/softlib/CRAYFISH.html)
. Sweet RA, Briggs WL, Oliveira S, Porsche JL, Turnbull T () FFTs and three-dimensional Poisson solvers for hypercubes. Parallel Comput :–
. Temperton C () Fast Fourier transforms and Poisson solvers on Cray-1. In: Hockney RW, Jesshope CR (eds) Infotech state of the art report: supercomputers, vol . Infotech, Maidenhead, pp –
. Tromeur-Dervout D, Toivanen J, Garbey M, Hess M, Resch MM, Barberou N, Rossi T () Efficient metacomputing of elliptic linear and non-linear problems. J Parallel Distrib Comput ():–
. Vajteršic M () Algorithms for elliptic problems: efficient sequential and parallel solvers. Kluwer, Dordrecht
. Van Loan C () Computational frameworks for the fast Fourier transform. SIAM, Philadelphia
. Widlund O () A Lanczos method for a class of nonsymmetric systems of linear equations. SIAM J Numer Anal ():–
. Zhang Y, Cohen J, Owens JD () Fast tridiagonal solvers on the GPU. In: Proceedings of the th ACM SIGPLAN PPoPP, Bangalore, India, pp –

Reconfigurable Computer
Blue CHiP

Reconfigurable Computers
Reconfigurable computers utilize reconfigurable logic hardware – either standalone, as a part of, or in combination with conventional microprocessors – for processing. Computing using reconfigurable logic hardware that can adapt to the computation can attain only some level of the performance and power efficiency gain of custom hardware but has the advantage of being retargetable for many different applications.

Bibliography
. Hauck S, DeHon A () Reconfigurable computing: the theory and practice of FPGA-based computing. Morgan Kaufmann, Burlington

Reconstruction of Evolutionary Trees
Phylogenetics

Reduce and Scan
Marc Snir
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Synonyms
Parallel Prefix Sums; Prefix; Prefix Reduction

Definition
Given n inputs x1, . . . , xn, and an associative operation ⊗, a reduction algorithm computes the output x1 ⊗ ⋯ ⊗ xn. A prefix or scan algorithm computes the n outputs x1, x1 ⊗ x2, . . . , x1 ⊗ ⋯ ⊗ xn.

Discussion
The reduction problem is a subset of the prefix problem, but as we shall show, the two are closely related. The reduction and scan operators are available in APL []: thus +/ 1 2 3 computes the value 6 and +\ 1 2 3 computes the vector 1 3 6. The two primitives appear in other programming languages and libraries, such as MPI []. There are two interesting cases to consider:
Random access: The inputs x1, . . . , xn are stored in consecutive locations in an array.
Linked list: The inputs x1, . . . , xn are stored in consecutive locations in a linked list; the links connect each element to its predecessor.
We analyze these two cases using a circuit or PRAM model, and ignoring communication costs.

Random-Access Reduce and Scan
The sequential circuit for a reduction computation is shown in Fig. 1; the corresponding sequential code is shown in Algorithm 1. This circuit has W = n − 1 operations and depth D = n − 1. It also computes a scan.
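As noted above, the MPI library exposes both primitives directly. The following minimal C program, given purely as an illustrative usage example, reduces one integer per process with MPI_Reduce and computes the inclusive prefix sums with MPI_Scan.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, x, sum, prefix;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        x = rank + 1;                                                    /* input x_i = i + 1       */
        MPI_Reduce(&x, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);    /* x_1 + ... + x_n on rank 0 */
        MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);      /* inclusive prefix sum     */
        printf("rank %d: prefix = %d\n", rank, prefix);
        if (rank == 0) printf("total = %d\n", sum);
        MPI_Finalize();
        return 0;
    }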
Reduce and Scan. Fig. 1 Sequential prefix circuit
Reduce and Scan. Fig. 2 Binary tree reduction

One can take advantage of associativity in order to change the order of evaluation, so as to reduce the parallel computation time. An optimal parallel reduction algorithm is achieved by using a balanced binary tree for the computation, as shown in Figs. 2 and 3, for n = 8. The corresponding algorithm is shown in Algorithm 2.

Algorithm 1 Sequential reduction and scan
for (i = 2; i ≤ n; i++)
    x_i = x_{i−1} ⊗ x_i;

Algorithm 2 Binary tree reduction
for (i = 1; i ≤ k; i++)
    forall (j = 2^i; j ≤ 2^k; j += 2^i)
        x_j = x_{j−2^{i−1}} ⊗ x_j;

The algorithm has W = n − 1 operations and depth D = ⌈log n⌉, which is optimal; it requires ⌊n/2⌋ processes. If the number p of processes is smaller, then the inputs are divided into segments of ⌈n/p⌉ consecutive inputs; a sequential reduction is applied to each group, followed by a parallel reduction of p values; we obtain a running time of ⌈n/p⌉ − 1 + log p. In particular, the reduction of n values can be computed in O(log n) time, using n/log n processes.
Given a tree that computes the reduction of n inputs, one can build from it a parallel prefix circuit, as shown in Figs. 4 and 5, for n = 8: The reduction is computed when values move up the tree, while the remaining prefix values are computed when values move down the tree.
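A possible concrete rendering of this up-sweep/down-sweep scheme (compare Algorithm 3 below) is given by the following C sketch. It uses 0-based array indices, + as the associative operation, and OpenMP parallel-for pragmas in place of the forall loops; it assumes n is a power of two and is meant only as an illustrative example, not as the entry's reference implementation.

    #include <stdio.h>

    /* In-place inclusive prefix sums of x[0..n-1]; n must be a power of two. */
    void tree_prefix(double x[], int n)
    {
        for (int s = 2; s <= n; s *= 2) {              /* up-sweep: binary tree reduction     */
            #pragma omp parallel for
            for (int j = s - 1; j < n; j += s)
                x[j] = x[j - s / 2] + x[j];
        }
        for (int s = n / 2; s >= 2; s /= 2) {          /* down-sweep: fill remaining prefixes */
            #pragma omp parallel for
            for (int j = s + s / 2 - 1; j < n; j += s)
                x[j] = x[j - s / 2] + x[j];
        }
    }

    int main(void)
    {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        tree_prefix(x, 8);
        for (int j = 0; j < 8; j++) printf("%g ", x[j]);   /* prints 1 3 6 10 15 21 28 36 */
        printf("\n");
        return 0;
    }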
Reduce and Scan. Fig. 3 Binary tree reduction
Reduce and Scan. Fig. 4 Tree prefix computation
Reduce and Scan. Fig. 5 Parallel prefix circuit

The algorithm is shown in Algorithm 3, for n = 2^k. This algorithm performs 2n − ⌊lg n⌋ − 2 ≤ W ≤ 2n − ⌊lg n⌋ − 1 operations and requires 2⌊lg n⌋ − 1 ≤ D ≤ 2⌊lg n⌋ steps, with n/2 processors.
If we have p ≤ n/2 processes, then one can compute the parallel prefix using the algorithm shown in Algorithm 4: One computes prefix products sequentially for each sublist, combines the sublist products to compute a global parallel prefix, and then updates sequentially the elements in each sublist by the product of the previous sublists.
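The following C sketch spells out this blocked scheme concretely (compare Algorithm 4 below), with + as the operation and OpenMP pragmas standing in for the p processes. It is only an illustration under the stated assumptions: p must divide n, and it differs slightly from the pseudocode in that the block totals are gathered into an auxiliary array and their prefix is computed sequentially, which is harmless since p is small.

    #include <stdio.h>

    void blocked_prefix(double x[], int n, int p)      /* in-place prefix sums; p divides n */
    {
        double block_end[p];                           /* running total of each block       */
        #pragma omp parallel for num_threads(p)
        for (int i = 0; i < p; i++) {                  /* phase 1: scan inside block i      */
            for (int j = i * n / p + 1; j < (i + 1) * n / p; j++)
                x[j] = x[j - 1] + x[j];
            block_end[i] = x[(i + 1) * n / p - 1];
        }
        for (int i = 1; i < p; i++)                    /* phase 2: prefix over block totals */
            block_end[i] += block_end[i - 1];
        #pragma omp parallel for num_threads(p)
        for (int i = 1; i < p; i++)                    /* phase 3: add preceding blocks' sum */
            for (int j = i * n / p; j < (i + 1) * n / p; j++)
                x[j] += block_end[i - 1];
    }

    int main(void)
    {
        double x[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        blocked_prefix(x, 8, 2);
        for (int j = 0; j < 8; j++) printf("%g ", x[j]);   /* prints 1 2 3 4 5 6 7 8 */
        printf("\n");
        return 0;
    }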
Algorithm 3 Parallel prefix algorithm
for (i = 1; i ≤ k; i++)
    forall (j = 2^i; j ≤ 2^k; j += 2^i)
        x_j = x_{j−2^{i−1}} ⊗ x_j;
for (i = k − 1; i ≥ 1; i−−)
    forall (j = 2^i + 2^{i−1}; j ≤ 2^k; j += 2^i)
        x_j = x_{j−2^{i−1}} ⊗ x_j;

Algorithm 4 Parallel prefix with limited number of processes
// We assume that p divides n
forall (i = 0; i < p; i++)
    for (j = i × n/p + 2; j ≤ (i + 1) × n/p; j++)
        x_j = x_{j−1} ⊗ x_j;
parallel_prefix(x_{n/p}, x_{2n/p}, . . . , x_n);
forall (i = 1; i < p; i++)
    for (j = i × n/p + 1; j ≤ (i + 1) × n/p − 1; j++)
        x_j = x_{i×n/p} ⊗ x_j;

The running time is D = 2n/p + lg p + O(1) and the number of operations is W = (2 − 1/p)n + O(p).
It follows that the parallel prefix computation can be done in time O(log n) using n/log n processes.
The last circuit can be improved: Snir shows in [] that any parallel prefix circuit must satisfy W + D ≥ 2n − 2 and that this bound can be achieved for any 2 log n − 2 ≤ D ≤ n − 1. Constructions for D in the range log n ≤ D ≤ 2 log n − 2 are given by Ladner and Fischer in []; they achieve D = ⌈log n⌉ + k and W = (2 + 2^{1−k})n − o(n). Thus, it is possible to have parallel prefix circuits of optimal depth and linear size.
The upper and lower bounds hold for circuits that only use the operation ⊗. Bilardi and Preparata study prefix computations in a more general Boolean network model []. In this model, prefix can be computed in time T by a network of size S = Θ((n/T) lg(n/T)), in general, or size S = Θ(n/T) if the operation ⊗ satisfies some additional properties, for any T in a range [c lg n, n].

Applications
Parallel reduction is a frequent operation in parallel algorithms: It is required for parallel dot products, par-
The parallel prefix algorithm has many applications, too. Blelloch shows in [] how to use parallel prefix for sorting (quicksort and radix sort) and merging, graph algorithms (minimum spanning trees, connected components, maximum flow, maximal independent set, biconnected components), computational geometry (convex hull, K-D tree building, closest pair in plane, line of sight), and matrix operations. These algorithms use parallel prefix with the operations + and max. Other usages are listed in []: addition in arbitrary precision arithmetic, polynomial evaluation, recurrence solution, solution of tridiagonal linear systems, lexical analysis, tree operations, etc. Greenberg et al. show how the use of parallel prefix can speed up discrete event simulations [].
A few other examples are shown below.
Enumeration and packing: One has n processes and wants to enumerate all processes that satisfy some condition. Process i sets xi to 1 if it satisfies the condition, 0 otherwise. A scan computation will enumerate the processes with a 1, giving each a distinct ordinal number. The same approach can be used to pack all nonzero entries of a vector into contiguous locations. Enumeration is used in radix-sort and quicksort to bin keys.
Segmented scan: A segmented parallel prefix operation computes parallel prefix within segments of the full list, rather than on the entire list. Thus +\ [1 2 | 3 4 5 | 6] yields [1 3 | 3 7 12 | 6].
Let ⟨S, ⊗⟩ be a semi-ring (S is a set closed under the associative binary operator ⊗). Define a new semi-ring ⟨Ŝ, ⊗̂⟩ whose elements are S ∪ {∣s : s ∈ S} and the operation ⊗̂ is defined as follows:
    a ⊗̂ b = a ⊗ b if a, b ∈ S
    ∣a ⊗̂ b = ∣c, where c = a ⊗ b;
    a ⊗̂ ∣b = ∣a ⊗̂ ∣b = ∣b.
The new operation is associative; a parallel prefix computation with the new operation computes segmented parallel prefixes for the old operation.
Carry-lookahead adder: Consider the addition of two binary numbers an, . . . , a1 and bn, . . . , b1. Let ci be the
allel matrix multiplications, counting, etc. carry bit at position i. Then the addition result is
 R Reduce and Scan

sn+ , . . . , , s , where sn+ = cn , and si = ai ⊕ bi ⊕ ci− , In this formulation, one has a set F of functions that
for i = , . . . , n. is closed under composition. This general formulation
Define pi = ai ∨ bi (carry propagate) and si = ai ∧ bi leads to a practical algorithm if functions in F have a
(carry set). The carry bits for the adder are defined by representation such that function composition is easily
the recursion computed in that representation. The linear recurrence
example is such a case: The affine functions x → ax + b
c = ;
is represented by the matrix
ci = pi ci− ∨ si .
a b
This can be written, in matrix form, as ( )
 
c   
( )=( ) and function composition corresponds, in this repre-
   
sentation, to matrix product.
ci  pi s i ci−  Another example is provided by deterministic finite
( )=( )( )
      state transducers: The sequence of outputs generated by
with addition being ∨ (or) and multiplication being ∧ the automaton can be computed in logarithmic time
(and). Thus, if from the sequence of inputs.
Finally, consider a sequence of accesses to a
  p s memory location that are either loads, stores, or
M = ( ) , Mi = ( i i ) , for i = , . . . n,
    fetch&adds. Then the values returned by the loads and
we can compute the carry bits by computing the prefixes fetch&adds can be computed in logarithmic time [].
Mi ⋯M , for i = , . . . , n; any parallel prefix circuit can (Fetch&add(address, increment) is defined as the atomic
be used to build a carry-lookahead adder. execution of the following code: {temp = &address;
&address += increment; return temp}; the opera-
Linear recurrences: Consider an order one linear recur- tion increments a shared variable and returns its old
rence, of the form value.)
xi = ai xi− + bi .
Linked List Reduce and Scan
The can be written as
The previous algorithms are not easily applied in the
xi  a i bi xi−  case where elements are in a linked list, as the rank of the
( )=( )( ),
      elements is not known (indeed, computing the ranks is
with the usual interpretation of addition and multiplica- a parallel prefix computation!): It is not obvious which
tion. This has the same formal structure as the previous elements should be involved in the computation at each
recursion. stage.
The same approach extends to higher-order linear One approach is the recursive-doubling algorithm
recurrences: A recurrence of the form xi = ∑kj= aij xi−j + shown in Algorithm , The algorithm has ⌈lg n⌉ stages
bi can be solved in parallel by computing the parallel and uses n processes, one per node in the linked list; at
prefix of the product of (k + )×(k + ) matrices. stage i, each process computes the product of the (up

Function composition: Given n functions f , . . . , fn con-


sider the problem of computing f , f  f , . . . , fn  ⋯  f . Algorithm  Recursive doubling algorithm
Since function composition is associative, then a paral- for (i = ; i ≤ ⌈log n⌉; i ++)
lel prefix algorithm can be used. This is a generalization forall v in linked list
of the usual formulation for prefix: Define if v.next ≠ NULL {
v.x = (v.next → x) ⊗ v.x;
f : x → a  ,
v.next = v.next → next;
fi : x → x ⊗ ai , for i = , . . . , n
}
then fi  ⋯  f = a ⊗ ⋯ ⊗ ai .
Reduce and Scan R 

to) i values at consecutive nodes in the list ending with D = ⌈log n⌉ steps. The ⌈log n⌉ depth is optimal, but this
the process’ node, and links that node to the node that algorithm is not work-efficient, as it uses Θ(n log n)
is i positions down the list. This algorithm performs operations, compared to Θ(n) for the sequential
W = n⌈log n⌉ − ⌈log n⌉ +  operations and requires algorithm.

Algorithm  Generic linked list reduce algorithm

Reduce(list L) {
if (L.head == &v)&&(v.next == NULL) return v.x; // list has only one element
else {
pick an indepedendent subset V of active nodes from L;
forall v ∈ V {
// delete predecessor of v from the list
ptr = v.next;
(v.x = ptr → x) ⊗ v.x;
v.next = ptr → next;
}
}
barrier;
Reduce(L);
}

Algorithm  Generic linked list scan algorithm

Scan(list L) {
if ((L.head == &v)&&(v.next == NULL)) return v.x; // list has only one element
else {
pick an indepedendent subset V of active nodes from L;
forall v ∈ V {
// delete predecessor of v from the list
ptr = v.next;
v.x = (ptr → x) ⊗ v.x; R
v.next = ptr → next;
}
}
barrier;
Scan(L);
barrier;
forall v ∈ V {
// reinsert predecessor of v in list
if (ptr.next ≠ NULL) ptr.x = ptr.next → x ⊗ ptr.x;
v.next = ptr;
}
barrier;
}
 R Reduce and Scan

A set of nodes of the linked list is independent if The problem, then, is to find a large independent
each node in the set has a predecessor outside the subset and schedule the processes to handle the nodes
set. A linked list reduction algorithm that performs no in the subset at each round.
superfluous products has the generic form shown in One possible approach is to use randomization for
Algorithm . At each step of this parallel algorithm symmetry breaking: At each round, active processes
one replaces nonoverlapping pairs of adjacent nodes choose randomly, with probability /, whether to par-
with one node that contains the product of these two ticipate in this round; if a process decided to participate
nodes, until we are left with a single node. Given such and its successor also decided to participate, it drops
reduction algorithm, we can design a parallel prefix from the current round. The remaining processes form
algorithm, by “climbing back” the reduction tree, as an independent set. On average, / of the processes that
shown in Algorithm . The reduction is illustrated, are active at the beginning of a round participate in the
for a possible schedule, on the top of Fig. , while round. With overwhelming probability, the algorithm
the unwinding of the recursion tree that computes will require a logarithmic number of rounds.
the remaining prefix values is shown on the bottom Deterministic “coin-tossing” algorithms have been
of Fig. . proposed by various authors. Cole and Vishkin are

a b c d e f

a ab c cd e ef

a ab c abcd e ef

a ab c abcd e abcdef

a ab c abcd e abcdef

a ab c abcd e abcdef

a ab abc abcd abcde abcdef

Reduce and Scan. Fig.  Linked list parallel prefix


Reduce and Scan R 

1 –1
0

1 –1 1 –1

0 0

1 –1 1 –1
1 –1
0 0
0

1 –1 1 –1
0 0

Reduce and Scan. Fig.  Tree depth computation

showing in [] how to solve deterministically the linked Connection Machine


list parallel prefix problem in time O(log n log∗ n) using MapReduce
n/ log n log∗ n processes. NESL
Ultracomputer, NYU
Applications
Linked list prefix computations can be used to solve a
variety of graph problems, such as computing spanning Bibliography
forests, connected components, biconnected compo- . Bilardi G, Preparata FP () Size-time complexity of Boolean
nents, and minimum spanning trees []. networks for prefix computations. J ACM :–
A simple example is shown below. . Blelloch GE () Scans as primitive parallel operations. IEEE R
Tram Comp :–
. Chatterjee B, Blelloch GE, Zagha M () Scan primitives for
Tree depth computation: Given a tree, one wishes to
vector computers. In: Proceedings of supercomputing conference.
compute for each node the distance of that node from IEEE Computer Society Press, New York
the root. One builds an Euler tour of the tree, as shown . Cole R, Vishkin U () Deterministic coin tossing with applica-
in Fig. . Each tree node is replaced by three nodes in the tions to optimal parallel list ranking. Inform Control :–
tour, with value  (moving down),  (moving right), and . Greenberg AG, Lubachevsky BD, Mitrani I () Superfast par-
allel discrete event simulations. ACM Trans Model Comput Simul
− (moving up). One can easily see that a prefix sum on
:–
the tour will compute the depth of each node. A variant . Iverson KE () A Programming language. Wiley, New York
of this algorithm (with − replaced by ) will number . Kruskal CP, Rudolph L, Snir M () Efficient synchronization
the nodes in preorder []. on multiprocessors with shared memory. ACM TOPLAS :–

. Kruskal CP, Rudolph L, Snir M () Efficient parallel algorithms
Related Entries for graph problems. Algorithmica :–
Array Languages . Ladner RE, Fischer MJ () Parallel prefix computation. J Assoc
Collective Communication Comput Mach :–
 R Relaxed Memory Consistency Models

. Message Passing Interface Forum () MPI: a message-passing counterproductive, to the purpose of the list. In mak-
interface standard. Int J Supercomput Appl High Perform Com- ing a shopping list, for example, we usually list items in
put :– the order we happen to think of them, but rarely will
. Snir M () Depth-size trade-offs for parallel prefix computa-
such an order coincide with the most efficient path for
tion. J Algorithms :–
. Tarjan RE, Vishkin U () An efficient parallel biconnectivity locating the items in a store. Even for a structure that
algorithm. Siam J Comput :– may have no natural linear ordering, such as a graph, we
nevertheless typically number its nodes in some order,
which may or may not facilitate efficient processing. In
general, finding an optimal ordering is often a difficult
combinatorial problem, as famously exemplified by the
Relaxed Memory Consistency traveling salesperson problem.
Models Similarly, in programming when we write a for loop,
we specify an ordering that may or may not have any sig-
Memory Models
nificance for the successive tasks to be done inside the
loop, yet the semantics of most programming languages
usually require the compiler to schedule the tasks in the
specified sequential order, even if the tasks are totally
Reliable Networks independent and could validly be done in any order.
This constraint often goes unnoticed in serial compu-
Networks, Fault-Tolerant tation, but it can seriously inhibit parallel processing
of sequential loops. If each successive task depends on
the result of the immediately preceding one, then the
given sequential order must be honored in order for
the program to execute correctly. But if the tasks have
Rendezvous no such interdependence, for example, in a loop ini-
Synchronization tializing all the elements of an array to zero, then the
element assignments could be done in any order, or
simultaneously in parallel.
Unfortunately, conventional programming lan-
guages offer no mechanism for indicating when tasks
Reordering are independent, which has motivated the development
of compiler techniques for discovering and exploiting
Michael Heath
loop-based parallelism, as well as compiler directives,
University of Illinois at Urbana-Champaign, Urbana,
such as those provided by OpenMP, that enable pro-
IL, USA
grammers to specify parallel loops. It has also motivated
the development of higher-level languages in which
structures such as arrays are first-class objects that can
Definition be referenced as a whole without having to enumerate
Reording for parallelism is a technique for enabling or
array elements sequentially. Although smart compilers,
enhancing concurrent processing of a list of items by
compiler directives, and higher-level objects can help
determining an ordering that removes or reduces serial
identify and preserve potential parallelism, it may still
dependencies among the items.
be difficult to determine how best to exploit the paral-
lelism, especially if the ordering affects the total amount
Discussion of work required. In such cases, the necessary reorder-
Ordered lists are ubiquitous and seem entirely natu- ing may be considerably deeper than it is reasonable to
ral in many aspects of daily life. But the specific order expect any compiler or automated system to identify
in which items are listed may be irrelevant, or even or exploit.
Reordering R 

These issues are well illustrated by computational concurrency and can yield a poor load balance, whereas
linear algebra. When we write a vector in coordinate assigning rows or columns to processors in a cyclic
notation, we implicitly specify an ordering of the coor- manner (like dealing cards) has the opposite effects.
dinate basis for the vector space that is generally arbi- Optimizing the tradeoff between these extremes may
trary. Similarly, when we write a system of n linear suggest a hybrid block-cyclic assignment, with smaller
equations in n unknowns, Ax = b, where A is an n × n blocks of tunable size assigned cyclically.
matrix, b is a given n-vector, and x is an n-vector to These issues are further complicated for matrices
be determined, the ordering of the rows of A is arbi- that are sparse, meaning that the entries are mostly
trary, as each equation in the system must be satisfied zeros, so that it is advantageous to employ data
individually, irrespective of the order in which they hap- structures that store only the nonzero entries (along
pen to be listed. Consequently, the rows of the matrix A with information indicating their locations within the
can be permuted arbitrarily without affecting the solu- matrix) and algorithms that operate on only nonzero
tion x. The column ordering of A is also arbitrary in the entries. For a sparse matrix, the ordering of the rows
sense that it corresponds to the arbitrary ordering cho- and columns strongly affects the total amount of work
sen for the entries of x, so that permuting the columns and storage required to compute a factorization of the
of A affects the solution only in that it correspondingly matrix. Sparsity can also introduce additional paral-
permutes the entries of x. To summarize, the solution lelism into the factorization process, beyond the sub-
to the system Ax = b is given by x = Qy, where y stantial parallelism already available in the dense case.
is the solution to the permuted system PAQ y = Pb, To illustrate the effects of ordering for sparse sys-
and P and Q are any permutation matrices, so in this tems, we focus on systems for which the sparse matrix
sense the original and permuted systems are mathemat- A is symmetric and positive definite, so that we will not
ically equivalent, although the computational resources have the additional complication of ordering to preserve
required to solve them may be quite different. This numerical stability as well. In this case, for a direct solu-
freedom in choosing the ordering can be exploited to tion we compute the Cholesky factorization A = LLT ,
enhance numerical stability (e.g., partial or complete where L is lower triangular, so that the solution x to the
pivoting) or computational efficiency, or a combination system Ax = b can be computed by solving the lower
of these. triangular system Ly = b for y by forward substitution
The order in which the entries of a matrix are and then the upper triangular system LT x = y for x by
accessed can have a dramatic effect on the performance back substitution.
of matrix algorithms, in part because of the mem- To see how the ordering of the sparse matrix affects
ory hierarchy (registers, multiple levels of cache, main the work, storage, and parallelism, we need to take a
memory, etc.) typical of modern microprocessors. A closer look at the Cholesky algorithm, shown in Fig. ,
dense matrix is usually stored in a two-dimensional in which only the lower triangle of the input matrix A R
array, whose layout in memory (typically row-wise or is accessed, and is overwritten by the Cholesky factor
column-wise) strongly affects the cache behavior of L. The outer loop processes successive columns of the
programs that access the matrix entries. Many matrix matrix. The first inner loop, labeled cdiv(k), scales col-
algorithms, such as matrix multiplication and matrix umn k by dividing it by the square root of its diagonal
factorization, systematically access the entries of the entry, while the second inner loop, labeled cmod( j, k),
matrix using nested loops whose indices can validly modifies each subsequent column j by subtracting from
be arranged in any order, so the order can be cho- it a scalar multiple, ajk , of column k.
sen to make optimal use of data residing in the most The first important observation to make about the
rapidly accessible level of the memory hierarchy. In Cholesky algorithm is that a given cmod( j, k) operation
addition, the order in which matrix rows or columns may introduce new nonzero entries, called fill, into col-
(or both, in the case of a two-dimensional decompo- umn j in locations that were previously zero. Such fill
sition) are assigned to processors may strongly affect entries require additional storage and work to process,
parallel efficiency. Assignment by contiguous blocks, for so we would like to limit the amount of fill, which is
example, tends to reduce communication but inhibits dramatically affected by the ordering of the matrix, as
 R Reordering

for k = 1 to n Even if the cdiv operations must be performed serially,



akk = akk
for i = k + 1 to n
however, multiple cmod operations can still be per-
aik = aik=akk {cdiv(k)} formed in parallel, and this is the principal source of
end parallelism for a dense matrix.
for j = k + 1 to n
The interdependence among the columns of the
for i = j to n
aij = aij − aik ajk {cmod(j;k)} Cholesky factor L is precisely characterized by the elim-
end ination tree, which has one node per column of A, with
end the parent of node j being the row index of the first
end
subdiagonal nonzero in column j of L. In particular,
Reordering. Fig.  Cholesky factorization algorithm each column of L depends only on its descendants in
the elimination tree. For a dense matrix, this poten-
tial source of additional parallelism is of no help, as the
elmination tree is simply a linear chain. But for a sparse
matrix, more advantageous tree structures are possible,
depending on the ordering chosen.
A small example is shown in Fig. , in which a
two-dimensional mesh G(A), whose edges correspond
to nonzero entries of matrix A, is ordered row-wise,
resulting in a banded matrix A and Cholesky factor
Reordering. Fig.  “Arrow” matrix example illustrating L. With this ordering, the elimination tree T(A) is
dependence of fill on ordering. Nonzero a linear chain, implying sequential execution of the
entries indicated by × corresponding cdiv operations, as each column of L
depends on all preceding columns.
An alternative ordering is shown in Fig. , in which
illustrated for an extreme case by the “arrow” matrix the same two-dimensional mesh is ordered by nested
whose nonzero pattern is shown for a small example in dissection, where the graph is recursively split into pieces
Fig.  with two different orderings. The ordering on the with the nodes in the separators at each level num-
left results in a dense Cholesky factor (complete fill), bered last. In this example, numbering the “middle”
whereas the ordering on the right results in no new nodes last, that is, , , and , leaves two disjoint pieces
nonzero entries (no fill). In general, we would like to that are in turn split by their “middle” nodes, num-
choose an ordering for the matrix A that minimizes fill bered  and , leaving four isolated nodes, numbered
in the Cholesky factor L, but this problem is known to , , , and . The absence of edges between the sepa-
be NP-complete, so instead we use heuristic ordering rated pieces of the graph induces blocks of zeros in the
strategies such as minimum degree or nested dissection, matrix A that are preserved in the Cholesky factor L,
which effectively limit fill at reasonable cost. which accordingly suffers fewer fill entries than with
A second important observation is that the cdiv the banded ordering, and hence requires less storage
operation that completes the computation of a given and work to compute. Equally important, the elimina-
column of L cannot take place until the modifica- tion tree now has a hierarchical structure, with multiple
tions (cmods) by all previous columns have been done. leaf nodes corresponding to cdiv operations that can
Thus, the successive cdiv operations appear to require be executed simultaneously in parallel at each level of
serial execution, and this is indeed the case for a dense the tree up to the final separator. The lessons from
matrix. But a given cmod( j, k) operation has no effect, this small example apply much more broadly: with
and hence need not be done, if ajk = , so a sparse an appropriate ordering, sparsity can enable additional
matrix may therefore enable additional parallelism that parallelism that is unavailable for dense matrices. A
is not available in the dense case. Once again, however, divide-and-conquer approach, exemplified by nested
this potential benefit depends critically on the ordering dissection, is an excellent way to identify and exploit
chosen for the rows and columns of the sparse matrix. such parallelism, and it typically also reduces the overall
Reordering R 

7
7 8 9
6

4 5 6 5

4
1 2 3
3

2
G(A) A L
1

T(A)

Reordering. Fig.  Two-dimensional mesh G(A) ordered row-wise, with corresponding matrix A, Cholesky factor L, and
elimination tree T(A). Nonzero entries indicated by × and fill entries by +

9
4 6 5
8
7 8 9 7
3 6
1 3 2
1 2 4 5

G(A) A L T(A)

Reordering. Fig.  Two-dimensional mesh G(A) ordered by nested dissection, with corresponding matrix A, Cholesky
factor L, and elimination tree T(A). Nonzero entries indicated by × and fill entries by +

work and storage required for sparse factorization by a natural row-wise ordering, with the corresponding
reducing fill. matrix shown, and indeed only one row can be pro-
In addition to direct methods based on matrix fac- cessed at a time.
torization, reordering also plays a major role in parallel By reordering the mesh and corresponding matrix
implementation of iterative methods for solving sparse using a red-black ordering, in which the nodes of the
linear systems. For example, reordering by nested dis- mesh alternate colors in a checkerboard pattern as R
section can be used to reduce communication in parallel shown in Fig. , with the red nodes numbered before
sparse matrix-vector multiplication, which is the pri- the black ones, solution components having the same
mary computational kernel in Krylov subspace iterative color no longer depend directly on each other, as can be
methods such as conjugate gradients. Another exam- seen both in the graph and by the form of the reordered
ple is in parallel implementation of the Gauss-Seidel matrix. Thus, all of the components corresponding to
or Successive Overrelaxation (SOR) iterative methods. red nodes can be updated simultaneously in parallel,
These methods repeatedly sweep through the successive and then all of the black nodes can similarly be done in
rows of the matrix, updating each corresponding com- parallel, so the algorithm proceeds in alternating sweeps
ponent of the solution vector x by solving for it based between the two colors with ample parallelism.For an
on the most recent values of all the other components. arbitrary sparse matrix, the same idea still works, but
This dependence of each component update on the more colors may be required to color its graph, and
immediately preceding one seems to require strictly hence the resulting parallel implementation will have as
serial processing, as illustrated for a small example in many successive parallel phases as the number of colors.
Fig. , where a two-dimensional mesh is numbered in One cautionary note, however, is that the convergence
 R Reordering

13 14 15 16

9 10 11 12

5 6 7 8

1 2 3 4

G(A)

Reordering. Fig.  Two-dimensional mesh G(A) ordered row-wise, with corresponding matrix A. Nonzero entries
indicated by ×

15 7 16 8

5 13 6 14

11 3 12 4

1 9 2 10

G(A)
A

Reordering. Fig.  Two-dimensional mesh G(A) with red-black ordering and corresponding reordered matrix A. Nonzero
entries indicated by ×

rate of the underlying iterative method may depend the Fast Fourier Transform (FFT), where reordering can
on the particular ordering, so the gain in performance be used to optimize the tradeoff between latency and
per iteration may be offset somewhat if an increased bandwidth for a given parallel architecture.
number of iterations is required to meet the same con-
vergence tolerance.
In this brief article we have barely scratched the Related Entries
surface of the many ways in which reordering can be Dense Linear System Solvers
used to enable or enhance parallelism. We focused Graph Partitioning
mainly on computational linear algebra, but additional Layout, Array
examples abound in many other areas, such as the Loop Nest Parallelization
use of multicolor reordering to enhance parallelism in Linear Algebra, Numerical
solving ordinary differential equations using waveform Loops, Parallel
relaxation methods or in solving partial differential Scheduling Algorithms
equations using domain decomposition methods. Yet Sparse Direct Methods
another example is in the parallel implementation of Task Graph Scheduling
Roadrunner Project, Los Alamos R 

Bibliographic Notes and Further


Reading Roadrunner Project, Los Alamos
For a broad overview of parallelism in computational
linear algebra, see [–]. For a general discussion of Don Grice , Andrew B. White, Jr.

sparse factorization methods, see [, ]. For specific IBM Corporation, Poughkeepsie, NY, USA

discussion of reordering for parallel sparse elimination, Los Alamos National Laboratory, Los Alamos, NM, USA
see [–].
Synonyms
Petaflop barrier
Bibliography
. Davis TA () Direct methods for sparse linear systems. SIAM,
Philadelphia, PA
Definition
. Demmel JW, Heath MT, van der Vorst HA () Parallel numerical The Los Alamos Roadrunner Project was a joint Los
linear algebra. Acta Numerica :– Alamos and IBM project that developed the first super-
. Dongarra JJ, Duff IS, Sorenson DC, van der Vorst HA () computer to break the Petaflop barrier on a Linpack run
Numerical linear algebra for high-performance computers. SIAM, for the Top list.
Philadelphia
. Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW,
Plemmons RJ, Romine CH, Sameh AH, Voigt RG () Parallel
Discussion
algorithms for matrix computations. SIAM, Philadelphia Introduction
. George A, Liu JW-H () Computer solution of large sparse In , Los Alamos and IBM began to explore
positive definite systems. Prentice Hall, Englewood Cliffs
the potential of a new class of power-efficient high-
. Liu JW-H () Computational models and task scheduling for
parallel sparse Cholesky factorization. Parallel Comput :– performance computing systems based on a radical
. Liu JW-H () Reordering sparse matrices for parallel elimina- view of the future of supercomputing. The resulting
tion. Parallel Comput :– LANL Roadrunner project had three primary roles: ()
. Liu JW-H () The role of elimination trees in sparse factoriza- an advanced architecture which presaged the primary
tion. SIAM J Matrix Anal Appl :–
components of next generation supercomputing sys-
tems; () a uniquely capable software and hardware
environment for investigation and solution of science
and engineering problems of importance to the US
Resource Affinity Scheduling
Department of Energy (DOE). Section . describes
Affinity Scheduling some of the key LANL applications that were studied;
and () achievement of the first sustained petaflops.
A significant point in the thinking surrounding
petascale and now exascale computing, is that the chal- R
Resource Management for lenge is not just “solving the same problem faster” but
Parallel Computers rather “solving a better problem” by harnessing the leap
in computing performance; that is, the key to better
Operating System Strategies
predictive simulations in national nuclear security, in
climate and energy simulations, and in basic science
is increasing the fidelity of our computational science
Rewriting Logic by using better models, by increasing model resolution,
and imbedding uncertainty quantification within the
Maude solution itself.
The computing capability of the supercomputer on
the top of the Top list continues to grow at a
Ring staggering rate. In , Roadrunner broke the sus-
tained petaflops boundary on the list for the first time.
Networks, Direct This represented a performance improvement of over a
 R Roadrunner Project, Los Alamos

factor of , in just  years. As a result of the availabil- (LANL), the DOE/LANL Roadrunner project was a
ity of such large machines, and the accompanying price joint venture between LANL and IBM to address com-
and performance improvements for smaller machines, puting challenges with a range of technical solutions
HPC solutions are now moving into markets that were that are opening up new opportunities for business, sci-
previously considered beyond the reach of computing entific, and social advances. The Roadrunner project is
technology. named after the state bird of New Mexico where the
More comprehensive drug simulations, car crash Roadrunner system has been installed at LANL.
simulations, and climate models are advances one might To address the power and performance issues, a fun-
have easily predicted, but the impacts of HPC on movie damental design decision was made to use a hybrid
making, games, and even remote surgery are things that programming approach that combines standard x-
might not have been so easy to see on the horizon. Today architecture processors and Cell BE–based accelera-
even financial institutions are using real-time modeling tors. The Cell BE accelerators provide a much denser
to guide investment choices. The growth of HPC oppor- and power-efficient peak performance than the stan-
tunities in markets that did not involve traditional HPC dard processors, but introduce a different dimension in
users is what is driving the business excitement around programming complexity because of the hybrid archi-
the improvements in computing capability. tecture and the necessity to explicitly manage memory
However, these opportunities do not come with- on the Cell processor. When the project first started,
out their own technical challenges. The total amount of this approach was considered radical, but now these
power and cooling required to drive some large systems, attributes are widely accepted as necessary components
the programming complexity as hundreds of thousands of next generation systems leading to exascale. The
of threads need to be harnessed into a single solution, Roadrunner program was targeted at getting a head
and overall system reliability, availability, and service- start on the software issues involved in programming
ability (RAS) are the biggest hurdles. It was important to these architectures.
address the fundamental technology challenges facing
further advances in application performance. Roadrunner Hardware Architecture
Designed and developed for the US Department of As shown in Fig.  and described in the Sect.
Energy (DOE) and Los Alamos National Laboratory “Management for Scalability,” the Roadrunner system

Roadrunner is a hybrid petascale system


delivered in 2008

Connected Unit cluster 6,528 dual-core Opterons ⇒ 47 TF


180 compute nodes w/Cells 12,240 Cell eDP chips ⇒ 1.2 PF
12I/O nodes

17 clusters
3,264 nodes
288-port IB 4x DDR 288-port IB 4x DDR

12 links per CU to each of 8 switches

Eight 2nd-stage 288-port IB 4X DDR switches

Roadrunner Project, Los Alamos. Fig.  Overall roadrunner structure


Roadrunner Project, Los Alamos R 

is a cluster of over , triblade nodes (described in Roadrunner design was the use of heterogeneous core
detail below). The nodes are organized in smaller clus- types to maximize the energy efficiency of the system.
ters called connected units (CUs). Each CU is composed Having cores designed for specific purposes allows for
of  computational triblade nodes,  I/O nodes, and energy efficiency by eliminating circuits in the cores that
a service node. Each CU has its own -port Voltaire will not be used. The three core types in the Roadrunner
Infiniband (IB) switch. [] These CUs are then con- system are all programmable and not hardwired algo-
nected to each other with a standard second level switch rithmic cores, such as a cryptography engine, but the
topology where each of the first level switches has con- circuits are optimized for the type of code that will run
nections to each of the second level switches. There are on each core type.
 of these CUs in the overall system. There are two computational chips in the system:
The triblade node architecture and implementa- the PowerXCell i .GHz chip which is packaged
tion described in the Sect. “Triblade Architecture” on the QS [] blade and the AMD Opteron dual
were chosen in order to use existing hardware-building core .GHz chip that is packaged on the LS []
blocks and to capitalize on the RAS characteristics of blade.
the IBM Blade Center [] infrastructure. Ease of con- The PowerXCell i has eight specialized SPEs (Syn-
struction and service were critical requirements because ergistic Processing Elements) that run the bulk of the
of the short schedule and the large number of parts in application floating point operations but do not need
the final system. Since the compute nodes make up the the circuits nor the state to run the OS. It also has a
bulk of the system hardware, they were the focus of the PowerPC core that is neither as space nor energy effi-
hardware and software design efforts. cient as the SPEs but it does run the OS on the QS
and controls the program flow for the SPEs.
Management for Scalability The AMD chip runs the MPI stack and controls
The management configuration of the Roadrunner sys- the communications between the triblades in the sys-
tem was designed for scalability from the beginning. tem. It also controls the overall application flow and
The primary concern was the ability to use and control does application work for sections of the code that
individual CUs without impacting the operation of the are not floating point intensive. The LS blade has
other CUs. That was the fundamental reason for hav- its own address space, separate from the QS, which
ing a first level IB switch dedicated to each CU. One introduces another level of hybridization into the node
can reboot the nodes in a CU, and even replace nodes structure.
in a CU, without impacting the operation of the other The triblades, as shown in Fig. , are the compute
CUs, or applications running in those CUs. It is possi- nodes that contain these computational chips and the
ble to isolate the first level switches from the second level heart of the system. They consist of three computational
switches with software commands so that other CUs do blades: one LSand two QSs and a fourth expansion R
not route data to a CU that is under debug or repair. blade that handles the connectivity among the three
The I/O nodes are dedicated to use by nodes within computational blades as well as the connection to the
the CU which allows file system traffic to stay within a CU IB switch.
CU and not utilize any second level switch bandwidth. The LS is a dual socket, dual core AMD based
This also allows for file system connectivity and perfor- blade, so there are four AMD cores per triblade. The
mance measurements to be done in isolation, including QS is a dual socket PowerXCell i (Cell BE chip)
rebooting the CU, at the CU level. Actually, there are two blade, so there are four Cell BE chips in the triblade to
sets of six I/O nodes for each CU which each serve half pair with the four AMD cores. The software model most
of the compute nodes, thus providing further perfor- commonly used is one MPI (message passing interface)
mance isolation and greater performance consistency task per AMD core-Cell BE chip pair. There is  GB of
for applications. main storage on the LS and a total of  GB of main
storage on the pair of QSs. Having matched storage
Triblade Architecture of  GB on each AMD core and each Cell BE chip sim-
As described in the Sect. “Heterogeneous Cores and plifies designing and coding the applications that share
Hybrid Architecture,” a fundamental innovation in the data structures on each processing unit in the pair, since
 R Roadrunner Project, Los Alamos

there can be duplicate copies kept on each side and between the process running on the AMD core and the
only updates to the structures need to be communicated process running on the partner Cell BE chip.
Since the triblade structure presents dual address
spaces, (one on the LS and one on the QS) to each of
the MPI tasks, as shown in Fig. , having bandwidth and
latency efficient methods for transferring data between
these address spaces is critical to the scaling success of
the applications.
A significant architectural feature of the triblade is
the set of four dedicated peripheral component inter-
connect express (PCIe)-x links that connect the four
AMD cores on the LS to their partner Cell BE chips on
the QSs. All four links can transmit data in both direc-
tions simultaneously for moving Cell BE data structures
up to the Opteron processors and Opteron structures
down to the Cell BE chips.
A common application goal, not surprisingly, was to
maximize the percentage of the work done by the Cell
BE synergistic processing elements (SPEs) []. When
the SPEs need to communicate boundary information
with their logical neighbor SPEs in the problem space,
that are located in physically different triblades, they
need to send the data up to their partner AMD core,
which in turn will send it, usually using MPI, to the
AMD core that is the partner of the actual target SPE.
The receiving AMD will then send the data down to
Roadrunner Project, Los Alamos. Fig.  Triblade
the actual target SPEs. Having the four independent
architecture

Roadrunner Project, Los Alamos. Fig.  System configuration


Roadrunner Project, Los Alamos R 

PCIe links allows all four Cell BE chips to send the Technology Drivers
data to and from their partner AMD cores at the The ongoing reduction in the rate of processor fre-
same time. quency improvement has led to three fundamental
Figure  shows a logical picture of two triblades con- issues in development to achieve dramatically bet-
nected together with their IB link to show the different ter application performance: () Software complexity
address spaces that exist and need to be managed by which is caused by the growing number of cores needed
the application data flow. What is shown is one of the for ever greater performance, () Energy requirements
four pairs of address spaces on each of the triblades. driven by the increased number of cores and chips
Each AMD core and each Cell BE chip has its own needed for performance improvements, and () RAS
logical address space. Since each AMD core is paired requirements driven by the ever increasing number of
with its own Cell BE chip, there are four pairs of logi- circuits and components in the larger systems. All three
cal address spaces on a triblade. Address space  is the of these issues were important features of the Roadrun-
AMD address space for Core  in the AMD chip on ner project overall design and implementation.
the left of triblade . That core is partnered with Cell
BE- on QS- which owns address space . There are Software Complexity
corresponding address spaces  and  on triblade . For the last several decades, lithographic technology
There is a set of library functions that can be called improvements guided by Dennard’s CMOS Scaling The-
from the application on the AMD or the Cell BE to ory [] in accordance with Moore’s law [] have not
transfer data between address spaces  and , and  only provided circuit density improvements, but the
and . OpenMPI is used to transfer data between the shrinkage in the vertical dimension also provided an
AMD address spaces  and  using typical MPI and accompanying frequency increase. This meant that for
IB methods. The difference between the standard MPI applications to scale in performance, one did not need
model and the triblade MPI model is that in the tri- to modify the software to utilize additional cores or
blade model, an MPI task controls both a standard threads, since system performance improved simply
Opteron core and the heterogeneous Cell BE chip. This because of the frequency scaling.
is similar in some ways to having multiple threads run- The elimination of future frequency scaling puts
ning under a single MPI task except that the address the burden of application performance improvement
spaces need to be explicitly controlled and the cores are on the application software and programming environ-
heterogeneous. ment, and requires a doubling of the number of cores
The ability to control the data flow both between the applied to a problem in order to achieve a doubling of
hybrid address spaces and within the Cell BE address the peak performance available to the application. Since
space allows the application to optimize the utiliza- sustained application performance for strong scaling,
tion of the memory bandwidths in the system. As chip and to a lesser extent weak scaling, is likely to be less R
densities increase, but circuit and pin frequencies do than linear in the number of cores available, more than
not, utilizing pin bandwidths (both memory bandwidth double the number of cores will usually be required to
and communication bandwidth) efficiently will become achieve double the sustained application performance.
increasingly important. The Cell BE chip local stores, At a nominal  GHz, effectively a million functional
which act like software controlled cache structures, units are required for petascale sustained performance,
allow for some very efficient computational models. and the number of cores will continue to grow as greater
As an example, the DGEMM (double-precision gen- sustained performance is desired. Finding ways to cope
eral matrix multiply) routine that ran in the hybrid with this fundamental issue, in a way that is tenable
Linpack application [], was over % efficient, and for both algorithm development and code maintenance,
was the heart of the computational efficiency that led while still achieving reasonable fractions of nominal
to the sustained petaflops achievement. The same soft- peak performance for the sustained performance on
ware cache model applies to the separate address spaces real (as opposed to test or benchmark) applications, was
that exist in the AMD and Cell BE chips within the a major goal of the project. Since application scaling
MPI task. going to petascale and beyond was such a major focus,
 R Roadrunner Project, Los Alamos

considerable time and energy were spent on making The different applications employed multiple meth-
sure that real applications would scale on the machine. ods to exploit the capabilities of Roadrunner. Milagro
While high-level language constructs and advances, used a host-centric accelerator offload model where
in and of themselves were not a major focus of the the majority of the application was unchanged, (except
project, rethinking the applications themselves was. The for the control code which was modified due to the
LANL application and performance modeling teams, as new hybrid structure of the triblade,) and selected
well as the IBM application and performance model- computationally intensive portions were modified to
ing teams, analyzed how the construction of the base move their most performance critical sections to the
algorithm flow and the interaction with the projected accelerator (Cell BE). VPIC employed an accelerator-
hardware would work. centric approach where the majority of the program
The IBM Linpack team modified Linpack to oper- logic was migrated to the accelerator and the host
ate in the hybrid environment and modeling was done systems and the InfiniBand interconnect served pri-
to optimize the structure of the code around the antic- marily to provide accelerator to accelerator message
ipated performance characteristics of the system. The communications.
workhorse of the sustained petaflops performance was Early performance results on the applications
the DGEMM algorithm on the Cell BE chip since it showed speed-ups over an unaccelerated cluster in the
provided over % of the computation required by the range of – times depending on how much work could
overall solution of the Linpack problem. The AMD be moved to the Cell. Between  KLOC (thousand lines
core provided the overall control flow for the applica- of code) and  KLOC needed to be modified to achieve
tion, performed the OpenMPI communication, and did these speed-ups. Some of the applications had obvious
some of the algorithmic steps that could be overlapped compute kernels that could be transferred to the Cell
with the Cell BE DGEMM calculations. The final struc- BE and those applications required fewer changes to the
ture of the solutions was both hybrid (between the x code than the cases where the compute parts were more
and Cell BE) units and heterogeneous on the Cell BE distributed or where the control code itself had to be
itself []. restructured, as was the case with Milagro. But, even in
Linpack was only one of the codes that were studied the case of Milagro, it was a small percentage of the code
in great detail, and had the machine only been useful that was changed.
for a Linpack run, and not for critical science and DOE Understanding of how to construct applications and
codes, it would not have been built. A full year was spent lay out data structures to achieve reasonable sustained
analyzing and testing pieces of several applications prior scaled performance will be important to future machine
to making the decision to go forward with the actual designs as well as provide input and examples for the
machine manufacture and construction. translation of high-level language constructs into oper-
Los Alamos National Laboratory revamped sev- ational code. One example of a hardware feature in the
eral applications to the Roadrunner architecture. These base architecture of the QS is the ability for software
applications represent a wide variety of workload types. to control the Local Store memory traffic [, ]. Since
Among the applications ported are []: memory bandwidth will be an important commodity
to optimize around in the future processor designs,
● VPIC – full relativistic, charge-conserving, D understanding how best to do data pipelining from
explicit particle-in-cell code. an application point of view is important base data to
● SPaSM – Scalable Parallel Short-range Molecular have. Hardware controlled flow of memory pre-fetching
Dynamics code, originally developed for the Think- (cache) is a good feature for processors to have in gen-
ing Machines Connection Machine model CM-. eral, but optimal efficiency will require explicit appli-
● Milagro – Parallel, multidimensional, object-oriented cation tuning of bandwidth utilization. Automating the
code for thermal x-ray transport via implicit Monte software control based on algorithms, hints, or language
Carlo methods on a variety of meshes. constructs is something that can be added later to take
● SweepD – Simplified -group D Cartesian dis- some of the burden off of the software developers but for
crete ordinates kernel representative of a neutron now, understanding how best to use features like these
transport code. is more important.
Roadrunner Project, Los Alamos R 

Energy it []. Software tools such as OpenMP and OpenCL are


As was stated earlier, energy consumption and cool- emerging which will help hide the heterogeneity from
ing of IT facilities are an increasing critical problem. the application user, but the complexity will still exist in
The business-as-usual technology projections in  the system itself.
indicate that by  large-scale systems will be more The Roadrunner solution was not only the first to
expensive to operate than to acquire []. Due to the break the sustained petaflops Linpack barrier and there-
increase in power requirements as ultra-scale comput- fore first on the Top list in June  but it was third
ing systems grow in computing capability, it is often the on the June  Green as well. The first and second
case that new facilities need to be built to house the places on that Green list went to QS only based
anticipated supercomputing machines. systems, so the energy efficiency of using the heteroge-
Another innovation in the Roadrunner project was neous core approach is obvious. It is a trend that will
minimizing the overall power required by using hetero- continue into the future as the industry works to fig-
geneous core types that optimize the power utilization, ure out the best utilization of the increasing number of
not by lowering the thread performance (frequency) circuits available on chips while also making the com-
but by eliminating unnecessary circuits in the SPEs putational efficiency of the circuits at the application
themselves. As was discussed previously, this presents a level improve.
different set of issues to the programmer. The Roadrun-
ner system, using heterogeneous core types, uses fewer Reliability, Availability, and Serviceability
overall threads for the same performance than a homo- As machines grow in size and complexity, and applica-
geneous core machine using the same amount of power, tions scale to use increasing amounts of hardware for
but the core types need separate compiled code, and long run times, the reliability and availability character-
more explicit thought about where each piece of the istics become increasingly important. For an application
algorithm should run. to make forward progress, it needs to either check-
point its progress often enough or have a mechanism
Heterogeneous Cores and Hybrid Architecture for migrating processes running on failing parts of the
Using heterogeneous core types on the PowerXCelli system onto other working hardware. The Roadrunner
was a way of dealing with the computational density system chose to use the checkpoint/restart approach
and power efficiency. Using a hybrid node architecture since it seemed the most generally practical for a
that included a combination of the QSs for compu- machine of its size and for the types of applications that
tational efficiency and the LS for the general pur- it would be running. High availability (HA) through
pose computations and communications control was failover system design, at the node or even cluster
a way of overcoming the limitations of the PowerPC level, was not a cost-effective approach for the tech-
core on the Cell BE chip. One can easily imagine future nologies that are employed, and checkpointing could be R
supercomputers based on SoC designs that do not tolerated.
required the additional level of heterogeneity utilized in Of course, there are certainly parts of the sys-
Roadrunner. tem, like the memory DIMMs (dual in-line mem-
The PowerXCelli has two core types, a general pur- ory modules), that have error-correcting facilities as
pose embedded PowerPC core (the power processing well as memory sparing capabilities, and these increase
element, PPE) which handles all of the operating system the inherent overall node availability. But making the
and hybrid communication tasks and eight specialized nodes themselves redundant or failover capable out-
compute cores (the synergistic processing elements, side of the checkpoint/restart process was not an
SPEs). The SPEs take up significantly less chip area objective.
than the PowerPC core since they only need the facil- Three technology developments whose impact on
ities and instruction set to handle the intense computa- service were very critical to the overall success, and in
tional tasks. This adds to the software complexity of the particular the ability to build the entire system in a short
application algorithms because tasks must be assigned time, are IBM BladeCenter technology, the Roadrunner
appropriately to the cores, and additional memory space triblade mechanical design, and the integrated optical
management is required, but the results are well worth cable that was used for the IB subsystem [].
 R Router Architecture

The IBM BladeCenter was designed from the outset to maximize serviceability and ease of upgradability. The modular nature of the BladeCenter chassis, with the ability to just slide new blades into an existing and pretested infrastructure that has the system management and power management built in, and I/O connections that can stay in place as the blades are slid in and out, provides a very robust base for building large-scale systems. This was proven in practice when the petaflop barrier was broken only four days after the triblades for the final CU were delivered to the lab from manufacturing. They were plugged in, verified to be good, integrated in with the other CUs, and then the entire system was turned over to the applications group in less than  h.

Conclusion
Roadrunner has achieved each and every one of its original goals, and more. First and foremost, fully capable simulations from fluid mechanics to radiation transport, from molecular to cosmological scale, are running routinely and efficiently on this system. The development of these new capabilities provides a roadmap for developing exascale applications over the next decade.

While there will always be challenges to getting to the next orders of magnitude in computing performance, the Roadrunner project collaboration between LANL and IBM was the first to recognize that a new paradigm was necessary for high-performance computing and to address the energy efficiency and computational performance of a new generation of these systems.

There was concentrated effort on the software algorithm structures needed to implement heterogeneous and hybrid computing models at this scale. It has shown that Open Source components can be used to advantage when the technical challenges attract interested and motivated participants.

The use of a modular building block approach to hardware design, including the BladeCenter management and control infrastructure, provided a very stable base for doing the software experimentation. With some of the breakthroughs that this program produced, the HPC community is now well positioned to solve, or help solve, the scaling issues that will produce major breakthroughs in business, science, energy, and social arenas. The HPC community needs to be thinking "Ultra Scale" computing in order to harness the emerging technological advances and head into the exascale arena with the momentum that IBM and LANL built crossing the petaflops boundary.

Bibliography
1. Los Alamos National Laboratory, LANL Roadrunner. http://www.lanl.gov/roadrunner/
2. Voltaire Inc. Infiniband Products. http://www.voltaire.com/Products/Infiniband
3. xCAT: Extreme Cloud Administration Tool. http://xcat.sourceforge.net/
4. IBM Corporation, IBM BladeCenter LS. http://www-.ibm.com/systems/bladecenter/hardware/servers/ls/index.html
5. Vogt J-S, Land R, Boettiger H, Krnjajic Z, Baier H () IBM BladeCenter QS: design, performance and utilization in hybrid computing systems. IBM J Res Dev () Paper :–
6. Gschwind M () Integrated execution: a programming model for accelerators. IBM J Res Dev () Paper :–
7. Kistler M, Gunnels J, Brokenshire D, Benton B () Programming the Linpack benchmark for Roadrunner. IBM J Res Dev () Paper :–
8. IBM Corporation, IBM BladeCenter. http://www-.ibm.com/systems/bladecenter/
9. EMCORE Corporation, Infiniband Optics Cable. http://www.emcore.com/fiber_optics/emcoreconnects
10. Moore G () Cramming more components onto integrated circuits. Electronics ():–
11. Los Alamos National Laboratory, LANL Applications. http://lanl.gov/roadrunner/rrseminars.shtml
12. Humphreys J, Scaramella J () IDC: The Impact of Power and Cooling on Data Center Infrastructure, Document #, May. http://www--ibm.com/systems/resources/systems_z_pdf_IDC_ImpactofPowerandCooling.pdf
13. Green List. http://www.green.org/
14. Dennard R, Gaensslen F, Rideout V, Bassous E, LeBlanc A () Design of ion-implanted MOSFETs with very small physical dimensions. IEEE J Solid State Circuits SC-():–
15. IBM Corporation, IBM Deep Computing. http://www-.ibm.com/systems/deepcomputing/

Router Architecture
Switch Architecture
Router-Based Networks
Networks, Direct
Networks, Multistage

Routing (Including Deadlock Avoidance)

Pedro López
Universidad Politécnica de Valencia, Valencia, Spain

Synonyms
Forwarding

Definition
The routing method used in an interconnection network defines the path followed by packets or messages to communicate two nodes, the source or origin node, which generates the packet, and the destination node, which consumes the packet. This method is often referred to as the routing algorithm.

Discussion
Introduction
Processing nodes in a parallel computer communicate by means of an interconnection network. The interconnection network is composed of switches interconnected by point-to-point links. Topology is the representation of how these switches are connected. Direct or indirect regular topologies are often used. In direct topologies every network switch has an associated processing node. In indirect topologies, only some switches have processing nodes attached to them. The k-ary n-cube (which includes tori and meshes) and the k-ary n-tree (an implementation of fat-trees) are the most frequently used direct and indirect topologies, respectively. In the absence of a direct connection among all network nodes, a routing algorithm is required to select the path packets must follow to communicate every pair of nodes.

Taxonomy of Routing Algorithms
Routing algorithms can be classified according to several criteria. The first one considers the place where decisions are made. In source routing, the full path is computed at the source node, before injecting the packet into the network. The path is stored as a list of switch output ports in the packet header, and intermediate switches are configured according to this information. On the contrary, in distributed routing, each switch computes the next link that will be used by the packet while the packet travels across the network. By repeating this process at each switch, the packet reaches its destination. The packet header only contains the destination node identifier. The main advantage of source routing is that switches are simpler. They only have to select the output port for a packet according to the information stored in its header. However, since the packet header itself must be transmitted through the network, it consumes network bandwidth. On the other hand, distributed routing has been used in most hardware routers for efficiency reasons, since packet headers are more compact. Also, it allows more flexibility, as it may dynamically change the path followed by packets according to network conditions or in case of faults.

Routing algorithms can also be classified as deterministic or adaptive. In deterministic routing, the path followed by packets exclusively depends on their source and destination nodes. In other words, for a given source-destination pair, they always choose the same path.

On the contrary, in adaptive routing, the path followed by packets also depends on network status, such as node or link occupancy or other network load information. Adaptive routing increases routing flexibility, which allows for a better traffic balance on the network, thus improving network performance.

Deterministic routing algorithms are simpler to implement than adaptive ones. On the other hand, adaptive routing introduces the problem of out-of-order delivery of packets, as two consecutive packets sent from the same source to the same destination may follow different paths. An easy way of increasing the number of routing options is virtual channel multiplexing []. In this case, each physical link is decomposed into several virtual channels, each one with its own buffer, which can be separately reserved. The physical link is shared among all the virtual channels using time-division multiplexing. Adaptive routing algorithms may provide all the feasible routing options
to forward the packet towards its destination (fully adaptive routing) or only a subset (partially adaptive routing).

If adaptive routing is used, the selection function chooses one of the feasible routing options according to some criteria. For instance, in case of distributed adaptive routing, the selection function may select the link with the lowest buffer occupation or with the lowest number of busy virtual channels.

A routing algorithm is minimal if the paths it provides are included in the set of shortest or minimal paths from source to destination. In other words, with minimal routing, every link crossed by a packet reduces its distance to destination. On the contrary, a non-minimal routing algorithm allows packets to use paths longer than the shortest ones. Although these routing algorithms are useful to circumvent congested or faulty areas, they use more resources than strictly needed, which may negatively impact on performance. Additionally, paths having an unbounded number of allowed non-minimal hops might result in packets never reaching their destination, a situation that is referred to as livelock.

Routing algorithms can be implemented in several ways. Some routers use routing tables with a number of entries equal to the number of destinations. With source routing, these tables are associated with each source node, and each entry contains the whole path towards the corresponding destination. With distributed routing, these tables (known as forwarding tables) are associated to each network switch, and contain the next link a packet destined to the corresponding entry must follow. With a single entry per destination, only deterministic routing is allowed. The main advantage of table-based routing is that any topology and any routing algorithm can be used in the network. However, routing based on tables suffers from a lack of scalability. The size of the table grows linearly with network size (O(N) storage requirements), and, most important, the time required to access the table also depends on network size.
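As a concrete illustration (not part of the original entry), the following minimal C sketch shows the distributed, table-based lookup just described: each switch keeps a forwarding table with one entry per destination, and the packet header carries only the destination identifier. The type and field names are hypothetical.

/* Illustrative sketch: distributed, table-based routing (O(N) storage per switch). */
typedef struct { int dest; /* payload omitted */ } packet_t;

int forward(const int *fwd_table, const packet_t *p) {
    return fwd_table[p->dest];   /* next output link for this destination */
}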
An alternative to tables is to place specialized hardware at each node that implements a logic circuit and computes the output port to be used as a function of the current and destination nodes and the status of the output ports. The implementation is very efficient in terms of both area and speed, but the algorithm is specific to the topology and to the routing strategy used on that topology. This is the approach used for the fixed regular topologies and routing algorithms used in large parallel computers.

There are hybrid implementations, such as the Flexible Interval Routing (FIR) [] approach, which defines the routing algorithm by programming a set of registers associated to each output port. The strategy is able to implement the most commonly used deterministic and adaptive routing algorithms in the most widely used regular topologies, with O(log(N)) storage requirements.

Deadlock Handling
A deadlock in an interconnection network occurs when some packets cannot advance toward their destination because all of them are waiting on one another to release resources (buffers or links). In other words, the involved packets request and hold resources in a cyclic fashion. There are three strategies for deadlock handling: deadlock prevention, deadlock avoidance, and deadlock recovery.

Deadlock prevention incurs a significant overhead and is only used in circuit switching. It consists of reserving all the required resources before starting transmission. In deadlock avoidance, resources are requested as packets advance but routing is restricted in such a way that there are no cyclic dependencies between channels. Another approach consists of allowing the existence of cyclic dependencies between channels while providing some escape paths to avoid deadlock, therefore increasing routing flexibility. Deadlock recovery strategies allow the use of unrestricted fully adaptive routing, potentially outperforming deadlock avoidance techniques. However, these strategies require a deadlock detection mechanism [] and a deadlock recovery mechanism [, ] that is able to recover from deadlocks. If a deadlock is detected, the recovery mechanism is triggered, resolving the deadlock. Progressive recovery techniques deallocate buffer resources from other "normal" packets and reassign them to deadlocked packets for quick delivery. Disha [] and software-based recovery [] are examples of this approach. Regressive techniques deallocate resources from deadlocked packets by killing and later re-injecting them at the original source router (i.e., abort-and-retry).
In deadlock avoidance, the routing algorithm restricts the paths allowed by packets to only those ones that keep the global network state deadlock-free. Dally [] proposed the necessary and sufficient condition for a deterministic routing algorithm to be deadlock-free. Based on this condition, he defined a channel dependency graph (CDG) and established a total order among channels. Routing is restricted to visit channels in order, to eliminate cycles in the CDG. The CDG associated to a routing algorithm on a given topology is a directed graph, where the vertices are the network links and the edges join those links with direct dependencies between them, i.e., those adjacent links that could be consecutively used by the routing algorithm. The routing algorithm is deadlock-free if and only if the CDG is acyclic. To design deadlock-free routing algorithms based on this idea while keeping network connectivity, it may be required that physical channels are split into several virtual channels, and the ordering is performed among the set of virtual channels. Although this idea was initially proposed for deterministic routing, it has also been applied to adaptive routing. For instance, in the turn model [], a packet produces a turn when it changes direction, usually changing from one dimension to another in a mesh. Turns are combined into cycles. Prohibiting just enough turns to break all the cycles prevents deadlock.

Although imposing an acyclic CDG avoids deadlock, it results in poor network performance, as routing flexibility is strongly constrained. Duato [] demonstrates that a routing algorithm can have cycles in the CDG while remaining deadlock-free. This allows the design of deadlock-free routing algorithms without significant constraints imposed on routing flexibility. The key idea that supports deadlock freedom despite the existence of cyclic dependencies is the concept of escape path. Packets can be routed without restrictions, provided that there exists a subset of network links, the escape paths, without cyclic dependencies, which allow every packet to escape from a potential cycle. An important characteristic of the escape paths is that it must be possible to deliver packets from any point in the network to their corresponding destination using the escape paths alone. However, this does not imply that a packet that used an escape path at some router must continue using escape paths until delivered. Indeed, packets are free to leave the escape paths at any point in the network and continue using full routing flexibility until delivered.

These ideas can be stated more formally as follows. A routing subfunction R1 associated to the routing function R supplies a subset of the links provided by R (the set of escape channels), therefore restricting its routing flexibility, while still being able to deliver packets from any valid location in the network to any destination. The concept of indirect dependency between two escape channels is defined as follows. A packet may reserve a set of adjacent channels ci, ci+1, ..., ck−1, ck, where only ci and ck are provided by R1, but ci+1, ..., ck−1 are not. The extended CDG for R1 has as vertices the escape channels, and as edges both the direct and indirect dependences between escape channels.

The theory states that a routing function R is deadlock-free if there exists a routing subfunction R1 without cycles in its extended CDG. A simple way to design adaptive routing algorithms based on the theory is to depart from a well-known deterministic deadlock-free routing function, adding virtual channels in a regular way. The new additional channels can be used for fully adaptive routing. As mentioned above, when a packet uses an escape channel at a given node, it can freely use any of the available channels supplied by the routing function at the next node.

Duato also developed necessary and sufficient conditions for deadlock-free adaptive routing under different switching techniques, and for different sets of input data for the routing function. All of those theoretical extensions are based on the principle described above.

Routing in k-ary n-cubes
A k-ary n-cube is an n-dimensional network, with k nodes in each dimension, connected in a ring fashion. Every node is connected to 2n neighbors. The number of nodes is N = k^n. This topology is often referred to as an n-dimensional torus, and it has been used in some of the largest parallel computers, such as the Cray T3E or the IBM BlueGene (both of them use a 3D torus). The n-dimensional mesh is a particular case where the nodes of each dimension are connected as a linear array (i.e., there are no wraparound connections). The symmetry and regularity of the k-ary n-cube simplify network implementation and packet routing, as the movement of a packet along a given network dimension does not
modify the number of remaining hops in any other dimension toward its destination.

In a mesh, dimension-order routing (DOR) avoids deadlocks (and livelocks) by allowing only the use of minimal paths that cross the network dimensions in some total order. That is, the routing algorithm does not allow the use of links of a given dimension until no other links are needed by the packet in all of the preceding dimensions to reach its destination. In a 2D mesh, this is easy to achieve by crossing dimensions in XY order.
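As an illustration (not part of the original entry), the following C sketch makes the XY decision explicit for a 2D mesh; the port names and coordinate arguments are hypothetical.

/* Illustrative sketch: dimension-order (XY) routing decision in a 2D mesh. */
typedef enum { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_LOCAL } port_t;

port_t dor_xy(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x != cur_x)                      /* correct the X offset first ...   */
        return (dst_x > cur_x) ? PORT_XPLUS : PORT_XMINUS;
    if (dst_y != cur_y)                      /* ... then Y, never returning to X */
        return (dst_y > cur_y) ? PORT_YPLUS : PORT_YMINUS;
    return PORT_LOCAL;                       /* arrived: deliver to the node     */
}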
This simple routing algorithm is not deadlock-free in a torus. The wraparound links create a cycle among the nodes of each dimension. This is easily solved by multiplexing each physical channel into two virtual channels (H, high, and L, low), and restricting routing in such a way that H channels can only be used if the coordinate of the destination node in the corresponding dimension is higher than the current one, and vice versa. Consider a 4-node unidirectional ring (a 4-ary 1-cube) with nodes ni, i = 0, 1, 2, 3, and two channels connecting each pair of adjacent nodes. Let cHi and cLi, i = 0, 1, 2, 3, be output channels of node ni. Figure 1 shows the CDG corresponding to the routing algorithm presented above. As there are no cycles in the channel dependency graph, it is deadlock-free.

Routing (Including Deadlock Avoidance). Fig. 1 Channel dependency graph for a 4-ary 1-cube (nodes n0 to n3 with high and low virtual channels cH0 to cH3 and cL0 to cL3)

An adaptive routing algorithm for tori can be easily defined by merely adding a new virtual channel to each physical channel (the A, adaptive, channel). This new channel can be freely used, even to cross network dimensions in any order, thus providing fully adaptive routing on the k-ary n-cube. In this routing algorithm, the H and L channels belong to the set provided by the routing subfunction. Figure 2 shows the extended CDG for this routing algorithm in a 4-node unidirectional ring, which is acyclic, and thus the routing algorithm is deadlock-free. This routing algorithm will provide several routing options, depending on the relative locations of the node that currently stores the packet and the destination node. For instance, in a 2D torus, if both X and Y channels remain to be crossed, two adaptive virtual channels in both dimensions and one escape channel (the H or L in the X dimension) will be provided. The selection function is in charge of making the final choice, selecting one of the links also considering network status. A simple selection function may be based on assigning static priorities to virtual channels, returning the first free channel according to this order. Priorities could be assigned to give more preference to adaptive channels. More sophisticated selection functions could also be used, such as selecting a channel of the least multiplexed physical channel to minimize the negative effects of virtual channel multiplexing.
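The following minimal C sketch (not from the original entry) illustrates the two pieces just described for one dimension of the adaptive torus algorithm: a routing function that offers the adaptive channel plus the H or L escape channel, and a static-priority selection function. The channel names and the is_free() status query are assumptions made for the example.

/* Illustrative sketch: candidate channels and a static-priority selection function. */
typedef enum { VC_A, VC_H, VC_L } vc_t;
extern int is_free(int dim, vc_t vc);        /* assumed per-channel status query */

int route_torus_dim(int cur, int dst, vc_t cand[2]) {
    cand[0] = VC_A;                          /* adaptive channel: always offered   */
    cand[1] = (dst > cur) ? VC_H : VC_L;     /* escape channel per the H/L rule    */
    return 2;
}

int select_vc(int dim, const vc_t *cand, int n, vc_t *chosen) {
    for (int i = 0; i < n; i++)              /* first free channel in priority order */
        if (is_free(dim, cand[i])) { *chosen = cand[i]; return 1; }
    return 0;                                /* no candidate free: the packet waits  */
}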
Routing in k-ary n-trees
Clusters of PCs are being considered as a cost-effective alternative to small and medium scale parallel computing systems. These machines use any of the commercial high-performance switch-based point-to-point interconnects. Multistage interconnection networks (MINs) are the most usual choice. In particular, fat-trees have risen in popularity in the past few years (for instance, Myrinet, InfiniBand, Quadrics).

A fat-tree is based on a complete tree that gets thicker near the root. Processing nodes are located at the leaves. However, assuming constant link bandwidth, the number of ports of the switches increases as we approach the root, which makes the physical implementation unfeasible. For this reason, some alternative implementations have been proposed in order to use switches with fixed arity. In k-ary n-trees, bandwidth is increased as we approach the root by replicating switches. A k-ary n-tree (see Fig. 3) has n stages, with k (arity) links connecting each switch to the previous or to the next stage. A k-ary n-tree is able to connect N = k^n processing nodes using n·k^(n−1) switches.
Routing (Including Deadlock Avoidance). Fig. 2 Extended channel dependency graph for a 4-ary 1-cube (adaptive channels cA0 to cA3 added; the legend distinguishes direct and indirect dependencies)

Routing (Including Deadlock Avoidance). Fig. 3 Load-balanced deterministic routing in a 2-ary n-tree (processing nodes 000 to 111, switch stages 0 to 2, with the destinations reachable through each ascending link and the number of paths annotated per stage)

In a k-ary n-tree, minimal routing from a source to a destination can be accomplished by sending packets upwards to one of the nearest common ancestors of the source and destination nodes and then, from there, downwards to destination. When crossing stages in the upward direction, several paths are possible, thus providing adaptive routing. In fact, each switch can select any of its up output ports. Once a nearest common ancestor has been reached, the packet is turned around and sent downwards to its destination, and just a single path is available.

A deterministic routing algorithm for k-ary n-trees can be obtained by reducing the multiple ascending paths in a fat-tree to a single one for each source-
destination pair. The path reduction should be done trying to balance network link utilization. That is, all the links of a given stage should be used by a similar number of paths. A simple idea is to shuffle, at each switch, consecutive destinations in the ascending phase []. In other words, consecutive destinations in a given switch are distributed among the different ascending links, reaching different switches in the next stage. Figure 3 shows the destination node distribution in the ascending and descending links of a 2-ary 3-tree following this idea. In the figure, each ascending link has been labeled with the destinations it can forward to. As can be seen, packets destined to the same node reach the same switch at the last stage, independently of their source node. Each switch of the last stage receives packets addressed only to two destinations, and packets destined to each one are forwarded through a different descending link (as there are two descending links per switch). Therefore, this mechanism efficiently distributes the traffic destined to different nodes, and balances load across the links.
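One possible digit-based realization of this shuffling idea is sketched below in C (not from the original entry). Going up at a given stage, consecutive destinations are spread over the k ascending links by looking at one k-ary digit of the destination; which digit is used at which stage is a design choice that differs between concrete proposals, so the code is only a simplified illustration.

/* Illustrative sketch: spread consecutive destinations over the k up links. */
int up_link(int dest, int stage, int k) {
    int d = dest;
    for (int s = 0; s < stage; s++) d /= k;  /* drop digits consumed at earlier stages */
    return d % k;                            /* remaining k-ary digit selects the link */
}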
Routing in Irregular Networks: Agnostic Routing
As stated above, direct topologies and multistage networks are often used when performance is the primary concern. However, in the presence of some switch or link failures, a regular network will become an irregular one. In fact, most of the commercially available interconnects for clusters support irregular topologies. An alternative for tolerating faults in regular networks without requiring additional logic is the use of generic or topology-agnostic routing algorithms. These routing algorithms can be applied to any topology and provide a valid and deadlock-free path for every source-destination pair of nodes in the network (if they are physically connected). Therefore, in the presence of any possible combination of faults, a topology-agnostic routing algorithm provides a valid solution for routing.

An extensive number of topology-agnostic routing strategies have been proposed. They differ in their goals and approaches (e.g., obtaining minimal paths, fast computation time) and the resources they need (typically virtual channels). The oldest topology-agnostic routing algorithm is up*/down* [], which is also the most popular topology-agnostic routing scheme used in cluster networks. Routing is based on an assignment of direction labels (up or down) to the links in the network by building a breadth-first spanning tree (BFS). Cyclic channel dependencies are avoided by prohibiting messages to traverse a link in the up direction after having traversed one in the down direction. Although simple, this approach generates a network imbalance because a high percentage of traffic is forced to cross the root node. Derived from up*/down*, some improvements were proposed, such as assigning directions to links by computing a depth-first spanning tree (DFS), the Flexible Routing scheme, which introduced unidirectional routing restrictions to break cycles in each direction at different positions, or the Segment-based Routing algorithm, which splits a topology into subnets, and subnets into segments. This allows placing bidirectional turn restrictions locally within a segment, which results in a larger degree of freedom.

Most of these routing algorithms do not guarantee that all packets will be routed through a minimal path. This causes an increment in the packet latency and an inefficient use of network resources, affecting the overall network performance. The Minimal Adaptive routing algorithm, which is based on Duato's theory, adds a new virtual channel which provides minimal routing, keeping the paths provided by up*/down* as escape paths. The In-Transit Buffers mechanism [] also provides minimal routing for all the packets. To avoid deadlocks, packets are ejected from the network, temporarily stored in the network interface card at some intermediate hosts, and later re-injected into the network. However, this mechanism requires some support at NICs and at least one host to be attached to every switch in the network, which cannot always be guaranteed. Other approaches make use of virtual channels to route all the packets through minimal paths while still guaranteeing deadlock freedom. For instance, Layered Shortest Path (LASH) guarantees deadlock freedom by dividing the physical network into a set of virtual networks using separate virtual channels. Minimal paths between every source-destination pair of hosts are spread onto these layers, such that each layer becomes deadlock free. In Transition Oriented Routing (TOR), unlike the LASH routing, up*/down* is used as a baseline routing algorithm in order to decide when to change to a new virtual net
work (when a forbidden transition down→up appears). To remove cyclic channel dependencies, virtual networks are crossed in increasing order, thus guaranteeing deadlock freedom. However, due to the large number of routing restrictions imposed by up*/down*, providing minimal paths among every pair of hosts may require a high number of virtual channels.

Related Entries
Clusters
Deadlocks
Flow Control
Infiniband
Interconnection Networks
Myrinet
Network of Workstations
Networks, Direct
Networks, Fault-Tolerant
Networks, Multistage
Switch Architecture
Switching Techniques

Bibliographic Notes and Further Reading
Dally's [] and Duato's [] books are excellent textbooks on interconnection networks and contain abundant material about routing. The sources of the theories about deadlock-free routing can be found in [] and []. Although virtual channels were introduced in [], they were formally proposed in [] as "virtual channel flow control." The turn model, which allows adaptive routing in tori without virtual channels, was proposed in []. The selection function has some impact on network performance. Several selection functions have been proposed. For instance, [] proposed a set of possible selection functions in the context of fat-trees. Deadlock-recovery strategies are based on the assumption that deadlocks are rare. This assumption was evaluated in [], where the frequency of deadlocks was measured. Disha and software-based recovery, two progressive deadlock-recovery mechanisms, were proposed in [] and [], respectively. To make deadlock recovery feasible, an efficient deadlock detection mechanism that matches the low deadlock occurrence must be used, such as the FCD deadlock-detection mechanism proposed in []. FIR [] is an alternative implementation to algorithmic and table-based routing that could be used in commercial switches. The load-balanced deterministic routing algorithm described for fat-trees can be found in []. Up*/down* routing was initially proposed for Autonet []. The ITB mechanism [] was proposed to improve performance of source routing in the context of Myrinet networks.

Bibliography
1. Anjan KV, Pinkston TM () DISHA: a deadlock recovery scheme for fully adaptive routing. In: Proceedings of the th international parallel processing symposium, Santa Barbara, CA, pp –
2. Dally WJ () Virtual-channel flow control. IEEE Trans Parallel Distributed Syst ():–
3. Duato J () A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distributed Syst ():–
4. Dally WJ, Seitz CL () Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans Comput C-():–
5. Dally WJ, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA
6. Duato J, Yalamanchili S, Ni LM () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco, CA
7. Flich J, López P, Malumbres MP, Duato J () Boosting the performance of Myrinet networks. IEEE Trans Parallel Distributed Syst ():–
8. Gómez C, Gilabert F, Gómez ME, López P, Duato J () Deterministic vs. adaptive routing in fat-trees. In: Proceedings of the international parallel and distributed processing symposium, IEEE Computer Society Press, Long Beach, CA
9. Gilabert F, Gómez ME, López P, Duato J () On the influence of the selection function on the performance of fat-trees. Lecture Notes in Computer Science, Springer, Berlin
10. Glass CJ, Ni LM () The turn model for adaptive routing. In: Proceedings of the th international symposium on computer architecture, ACM, New York, pp –
11. Gómez ME, López P, Duato J () FIR: an efficient routing strategy for tori and meshes. J Parallel Distributed Comput, Elsevier ():–
12. Martínez JM, López P, Duato J () A cost-effective approach to deadlock handling in wormhole networks. IEEE Trans Parallel Distributed Syst ():–
13. Martínez JM, López P, Duato J () FCD: flow control based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks. IEEE Trans Parallel Distributed Syst ():–
14. Schroeder MD et al () Autonet: a high-speed, self-configuring local area network using point-to-point links. Technical report, SRC research report, DEC, April
15. Warnakulasuriya S, Pinkston TM () Characterization of deadlocks in interconnection networks. In: Proceedings of the th international parallel processing symposium, IEEE Computer Society, Washington, DC, pp –

R-Stream Compiler

Benoit Meister, Nicolas Vasilache, David Wohlford, Muthu Manikandan Baskaran, Allen Leung, Richard Lethin
Reservoir Labs, Inc., New York, NY, USA

Definition
R-Stream (R-Stream is a registered trademark of Reservoir Labs, Inc.) is a source-to-source, auto-parallelizing compiler developed by Reservoir Labs, Inc. R-Stream compiles programs in the domains of high-performance scientific (HPC) and high-performance embedded computing (HPEC), where loop nests, dense matrices, and arrays are the common idioms. It generates mapped programs, i.e., optimized programs for parallel execution on the target architecture. R-Stream targets modern, heterogeneous, multi-core architectures, including multiprocessors with caches, systems with accelerators, and distributed memory architectures that require explicit memory management and data movement. The compiler accepts the C language as input. Depending on the target platform, the compiled output program can be in C with the appropriate target APIs or annotations for parallel execution, other data parallel languages such as CUDA for GPUs, or dataflow assembly for FPGA targets.

History
DARPA funded R-Stream development in the Polymorphous Computing Architectures (PCA) research program. The PCA program developed new programmable computer architectures for approaching the energy efficiency of application-specific fixed function processors, particularly those for advanced signal processing in radars. These new PCA chips included RAW [], TRIPS [], Smart Memories [], and Monarch []; these chips innovated architecture features for energy efficiency, such as manycores, heterogeneity, mixed execution models, and explicit memory communications management. DARPA wanted R-Stream for programming the architectural space spanned by these chips from a single high-level and target-independent source. Since the PCA architectures uniformly presented a very high ratio of on-chip computational rates to off-chip bandwidth, R-Stream research prioritized transformations that increased arithmetic intensity (the ratio of computation to communication) in concert with parallelization, and on creating mappings that executed the program in a streaming manner, hence the name of the compiler, R-Stream (the "R" stands for Reservoir).

As the R-Stream compiler team refined the mapping algorithms, the conception of what the high-level, machine-independent programming language should be also evolved. Streaming as an execution model had been well established for many decades, as the dominant execution paradigm for programming digital signal processors (DSPs): software-pipelining of computation with direct memory access (DMA). The PCA research program investigated whether streaming should be considered as the programming model. Initially, two streaming-based research programming languages, Stanford Brook [] and MIT StreamIt [], were supported in the PCA program. These languages provide means for expressing computations as aggregate operations (Brook calls them kernels; StreamIt calls them filters) on linear streams of data. However, as more understanding of the compilation process was acquired, the R-Stream compiler team started to move away from streaming as being the right expression form for algorithms for several reasons:

1. The expressiveness of the streaming languages was limited. They were biased toward one-dimensional streams, but radar signal processing involves multidimensional data "cubes" processed along various dimensions [].
2. Transformations for compiling streaming languages are isomorphic to common loop transformations, e.g., parallelization, tiling, loop fusion, index set splitting, array expansion/contraction, and so forth.
3. While streaming notations were found to sometimes be a good shorthand for expression of a particular mapping of an algorithm to a particular architecture, that particular mapping would not run well
on another architecture without significant transformation. The notation and idioms of the streaming languages actually frustrate translation, because a compiler would have to undo them to extract semantics for re-mapping.

Consequently, the R-Stream team moved away from streaming as a programming paradigm to instead supporting input programs (mainly loops) written in the C language: C can more succinctly express the multidimensional radar algorithms (compared to the streaming languages); the basic theory of loop optimization for C was well understood; and C can be written in a textbook manner with the semantics plain. However, the compilation challenges coming from the architectural features in the PCA chips were open problems. To attack these problems, the R-Stream team adopted and extended the powerful polyhedral framework for loop optimization.

PCA features are now in commercial computing chips. Programming such chips is complex, requiring more than just parallelization. Thus, the R-Stream compiler is being used by programmers and researchers to map C to chips such as Cell, GPGPUs, Tilera, and many-core SMP. It is also being extended in research on compilation to ExaScale supercomputers. R-Stream is available commercially and also licensed in source form to research collaborators.

Design Principles
R-Stream performs high-level automatic parallelization. The high-level mapping tasks that R-Stream performs include parallelism extraction, locality improvement, processor assignment, managing the data layout, and generating explicit data movements. R-Stream takes as input sequential programs written in C, with the kernels to be optimized marked with a simple pragma, but with all mapping decisions for the target automated.

Low-level or machine-level optimizations, such as instruction selection, instruction scheduling, and register allocation, are left to an external low-level compiler that takes the mapped source produced by R-Stream.

R-Stream currently performs mostly static mapping for all its target architectures. Resource and scheduling decisions are computed statically at compile time, with very little dynamic decision-making done at runtime. For all architectures, the mapping is also "bare-metal," with only a small dedicated runtime layer that is specialized to each architecture.

Using R-Stream
Programmers write succinct loop nests in sequential ANSI C. The loop nest can be imperfect but must fit within an extended static control program form. That is, the loops have affine parametric extents and static integer strides. Multidimensional array references must be affine functions of loop index variables and parameters. Certain dynamic control features are allowed, such as conditionals that are affine functions of index variables and parameters. Programs that are not directly within the affine static control program form can be handled by wrapping the code in an abstraction layer and providing a compliant image function in C that models the effects of the wrapped function.

The programmer indicates the region to be mapped with a simple single pragma. Again, R-Stream does not require the programmer to indicate how this mapping should proceed: not through any particular idiomatic correspondence of the loop nest to particular hardware features, nor through detailed specifications encoded in pragmas. R-Stream automatically determines the mapping, based on the target machine, and emits transformed code. For example, a loop nest expressed in ANSI C can be rendered into optimized CUDA [].
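As an illustration (not taken from the entry), the sketch below shows the kind of affine static control loop nest that fits this input form, marked for mapping with a single pragma. The pragma spelling "rstream map" is hypothetical; only the idea of one marking pragma is described in the text.

/* Illustrative sketch: an affine static control loop nest marked for mapping. */
#define N 1024

void matmul(float A[N][N], float B[N][N], float C[N][N])
{
#pragma rstream map
    for (int i = 0; i < N; i++)           /* affine parametric extents */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)   /* affine array references   */
                C[i][j] += A[i][k] * B[k][j];
}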
Architecture
R-Stream is built on a set of independent and reusable optimizations on two intermediate representations (IRs). Unlike many source-to-source transformation tools, R-Stream shuns ad hoc pattern-matching-based transformations in favor of mathematically sound transformations, in particular dependence- and semantics-based transformations in the polyhedral model. The two intermediate representations are designed to support this view. The first intermediate representation, called Sprig, is used in the scalar optimizer. It is based on static single assignment (SSA) form [], similar to Click's graph-based IR [], augmented with operators from the value dependence graph [] for expressing state dependencies. The second intermediate representation is the generalized dependence graph (GDG), the representation used in the polyhedral mapper. Conversions from the SSA representation to the GDG representation are performed by the raising phase,
R-Stream Compiler. Fig. 1 R-Stream's architecture (figure labels: serial C code, parsing, scalar optimizations, static single assignment form, raising, generalized dependence graph (polyhedral representation), polyhedral mapper, machine model, multiple mappings, lowering, code generators, and target outputs OpenMP, Tilera, STI Cell, ClearSpeed, CUDA)

and the converse by the lowering phase. These components are shown as different parts of the R-Stream compiler flow in Fig. 1.

Scalar Optimizer
The main task of the scalar optimizer is to simplify and normalize the input source so that subsequent analysis of the program in the raising phase can proceed. The scalar optimizer also runs after mapping and lowering, to remove redundancies and simplify expressions that result from the polyhedral scanning phases. The scalar optimizer provides traditional optimizations.

Syntax Recovery
The output of R-Stream has to be processed by low-level compilers corresponding to the target architectures. These low-level compilers often have trouble compiling automatically generated programs, even if the code is semantically legal, because they pattern-match optimizations to the code idioms that humans typically write. For example, many low-level compilers cannot perform loop optimizations on loops that are expressed using gotos and labels. Thus, to ensure that the output code can be effectively optimized by the low-level compiler, R-Stream's code generation back-end performs syntax recovery to convert a program in the scalar representation into idioms that look human-like. The engineering challenge is that R-Stream is designed to represent and transform programs in semantics-based IRs, far from the original syntax and language. Source reconstruction is more complex than simple unparsing. R-Stream uses an algorithm using early coalescing of ϕ-functions to exit SSA, extensions to Cifuentes [, ] to detect high-level statement forms, and then localized pattern matching to generate syntax. Sprig also represents types in terms of the input source language, to allow for faithful reproduction of those types at output.

Machine Model
R-Stream supports Cell, Tilera, GPUs, ClearSpeed, symmetric multiprocessors, and FPGAs as targets. These are rendered and detailed in machine models (MM), expressed in an XML-based description language. The target description is an entity graph connecting physical components that implicitly encodes the system capabilities and execution models that can be exploited by the mapper.
The nodes in the entity graph are processors, memories, data links, and command links. Processor entities are computation engines, including scalar processors, SIMD processors, and multiprocessors. These are structured as a multidimensional grid geometry. Memory entities are data caches, instruction caches, combined data and instruction caches, and main and scratchpad memories. There are two types of edges in the graph. Data-link edges stand for explicit communication protocols that require software control, such as DMA. Command-link edges stand for instructions that one processor can send to another entity, such as thread spawning or DMA initiation.

The final ingredient in an MM is the morph. A morph is a parallel computer, or a part of one, on which the mapping algorithms are focused. The term morph derives from the polymorphous aspect of the DARPA PCA program for efficient chips that could be reconfigured; a morph would describe one among many configurations of a polymorphous architecture. R-Stream also uses the term "morph" to describe the single configuration for a non-reconfigurable portion of a target.

Each morph in a MM contains a host processor and a set of processing elements (PEs), plus its entity graph. The host processor and the PEs are allowed to be identical, e.g., in an SMP environment. Associated with the hosts and PEs is its topology and a list of submorphs.

A machine model is subdivided into a set of morphs. This mechanism allows a mapping problem for a complex machine to be decomposed into a set of smaller problems, each concentrating on one morph of the machine. In hierarchical mapping, the set of morphs form a tree, with sub-morphs representing submapping problems after the high-level mapping problems have been completed.

The Generalized Dependence Graph
The GDG is the IR used by the mapper. All phases of the mapper take a set of GDGs as an input and produce a set of modified GDGs as output. Thus, mapping proceeds strictly in this IR.

Initially, the input to the mapper is a single GDG that represents the part of the program that the programmer has indicated should be mapped. Mapping phases annotate and rewrite this graph, essentially choosing a projection of the GDG into the target machine space (across processors) and time (schedule), and then find a good description of that projection in terms of the target machine execution model. That is, the mapper finds a schedule of execution of computation, memory use, and communication.

As the mapping process proceeds, the GDG may be split into multiple GDGs, for submapping problems to portions of a hierarchical or heterogeneous machine.

The GDG (defined below) uses polyhedra to model the computations, dependences, and memory references of the part of the program being mapped. The polyhedral description provides compactness and precision, for a broad set of useful loop nest structures, and enables tractable analytical formulations of concepts like "dataflow analysis" and "coarse-grained parallelization."

A GDG is a multigraph, where the nodes represent statements (called polyops), and the edges represent the dependences between the statements. The basic information attached to each polyop node includes:
● Its iteration domain, represented as a polyhedral set
● A list of array references, each represented as an affine function from the iteration space to the data space of the corresponding array
● A predicate function indicating under what conditions the statement should be executed
Data-dependent conditionals are supported through the predicates.

Attached to each edge of the GDG is its dependence polyhedron, a polyhedral set that encodes the dependence between the iterations of the edge's nodes.
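To make the node-level information concrete, the following C sketch (not from the entry) writes out, as comments, the iteration domain and access functions a GDG node would carry for a simple statement.

/* Illustrative sketch: a statement and its polyhedral description in comments. */
void smooth(int N, double A[N], double B[N])
{
    for (int i = 1; i < N - 1; i++)
        B[i] = 0.5 * (A[i - 1] + A[i + 1]);   /* statement S */
    /* Iteration domain of S: { i | 1 <= i <= N - 2 }, parametric in N.
     * Access functions of S: writes B at (i); reads A at (i - 1) and (i + 1),
     * all affine functions of the loop index i and the parameter N.          */
}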
The polyhedral sets used in the GDG are unions of intersections of integer lattices with parametric polyhedra, called Z-Domains. Z-Domains are the essence of the mathematical language of polyhedral compilation. The use of unions allows for describing a broad range of "shapes" of programs, and the integer lattices provide precision through complex mappings. Using parametric polyhedra also provides a richness of description; it enables parametrically specified families of programs to be optimized at once, and it enables describing loop nests and array references that depend on iteration variables from enclosing loop nests. R-Stream includes a ground-up implementation of a library for manipulating Z-Domains that provides performance
and functionality that we could not obtain had we used Loechner's Polylib [].

Mapping Flow
The basic mapping flow in R-Stream is through the following mapper phases:
● Perform raising to convert mappable regions into the polyhedral form.
● Perform dependence analysis and/or dependence removal optimizations such as array expansion.
● Perform affine scheduling to extract parallelism.
● Perform task formation (a generalization of tiling) and processor placement. Also perform thread partitioning to split the mapped region among the host processor and the co-processing elements.
● Perform memory promotion to improve locality. If necessary, also perform data layout transformations and insert communications.
● Perform lowering to convert the mapped program back into the scalar IR, Sprig.

These phases are mixed and matched and can be recursively called for different target architectures. For example, for SMP targets, it is not necessary to promote memory and insert explicit communications, since such machines have caches. (However, in some cases explicit copies also help on cache-based machines by compacting the footprint of the data accessed by a task; we call such efforts "virtual scratchpad.") For a target machine with a host processor and multiple GPUs, R-Stream will first map coarsely across GPUs (inserting thread controls and explicit communications) and then recursively perform mapping for parts of the code that will be executed within the GPUs.

Raising
The raising phase is responsible for converting loop nests inside mappable regions into the polyhedral form. R-Stream's raising process is as follows:
1. Mappable regions are identified by the programmer via pragmas. Mappable regions can also be expanded by user-directed inlining pragmas.
2. Perform loop detection and if/then/else region detection.
3. Perform if-conversion [] to convert data-dependent predicates into predicated statements.
4. Perform induction variable detection using the algorithm of Pop []. The algorithm has been extended to detect inductions based on address arithmetic. Affine loop bounds and affine address expressions can be converted into the iteration domains and access functions in the GDG.
5. Within each basic block, partition the operators into maximal subsets of operators connected via value flow such that the partitions can be sequentially ordered. These maximal subsets are made into individual polyops.
6. Finally, convert the loop iteration domains and access functions into constraints form in the GDG.

Dependence Analysis and Array Expansion
R-Stream provides a more advanced covering dependence analysis from Vasilache [] to remove transitively implied dependencies, which improves upon the algorithms from Feautrier [].

R-Stream provides a novel corrective array expansion based on Vasilache's violated dependence analysis [, ]. Corrective array expansion essentially fuses array expansion phases into scheduling phases.

A "violated" dependence is a relationship between a source and a target instruction of the program that exhibit a memory-based dependence in the original program that is not fulfilled under a chosen schedule. A violation arises when the target statement is scheduled before the source statement in the transformed program.

Such a violation can be corrected by either recomputing a new schedule or by changing the memory locations read and written. This allows for better schedules with as small a memory footprint as possible by performing the following steps:
1. Provide a subset of the original dependencies in the program to the affine scheduler phase so that it determines an aggressive schedule with maximum parallelism.
2. When the schedule turns out to be incorrect with regards to the whole set of dependencies, perform a correction of the resulting schedule by means of renaming and array expansion.
3. Perform lazy corrections targeting only the very causes of the semantics violations in the aggressively scheduled program.
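The following C sketch (not from the entry) shows the simplest case of the kind of memory-based dependence that expansion removes: a single temporary serializes the loop in the original code, and renaming it per iteration (scalar expansion, the one-dimensional case of array expansion) leaves only true dependences so the iterations become independent.

/* Illustrative sketch: removing a memory-based dependence by expansion. */
void before(int N, double A[N], double B[N]) {
    double t;
    for (int i = 0; i < N; i++) {
        t = A[i] * A[i];      /* every iteration writes the same t ...        */
        B[i] = t + 1.0;       /* ... creating output/anti dependences on t    */
    }
}
void after(int N, double A[N], double B[N], double t_exp[N]) {
    for (int i = 0; i < N; i++) {     /* iterations are now independent       */
        t_exp[i] = A[i] * A[i];
        B[i] = t_exp[i] + 1.0;
    }
}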
Unlike previous array expansion algorithms in the literature by Feautrier [], Barthou et al. [], and Offner et al. [], which tend to expand all arrays fully and render the program in dynamic single assignment form, the above algorithm only expands arrays by need, i.e., only if the parallelism exposed by the scheduling algorithm requires the array to be expanded.

Affine Scheduling
Under the exact affine representation of dependences in the GDG, it was known that useful scheduling properties of programs can be optimized, such as maximal fine-grained parallelism using Feautrier's algorithm [], maximal coarse-grained parallelism using Lim and Lam's algorithm [], or maximal parallelism given a (maximal) fusion/distribution structure using Bondhugula et al.'s algorithm []. R-Stream improved on these algorithms by introducing the concept of affine fusion, which enables a single seamless formulation of the joint optimization of cost functions representing trade-offs between amount of parallelism and amount of locality at various depths in the loop nest hierarchy. Scheduling algorithms in R-Stream search an optimization space that is either constructed on a depth-by-depth basis as solutions are found, or based on the convex space of all legal multidimensional schedules as had been illustrated by Vasilache []. R-Stream allows direct optimization of the tradeoff function using Integer Linear Programming (ILP) solvers as well as iterative exploration of the search space. Redundant solutions in the search space are implicitly pruned out by the combined tradeoff function as they exhibit the same overall cost. In practice, a solution to the optimization problem represents a whole class of equivalent scheduling functions with the same cost.

Building upon the multidimensional affine fusion formulation, R-Stream also introduced the concept of joint optimization of parallelism, locality, and amount of contiguous memory accesses. This additional metric is targeted at memory hierarchies where accessing a contiguous set of memory references is crucial to obtain high performance. Such hardware features include hardware and software prefetchers, explicit vector load and store into and from registers, and coalescing hardware in GPUs. R-Stream defines the cost of contiguous memory accesses up to an affine data layout transformation.
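As a small, hedged illustration (not an actual R-Stream output), the C sketch below shows the kind of parallelism/locality trade-off the affine scheduler weighs: fusing two loops improves locality on the intermediate array while the fused loop remains parallel, indicated here with a generic OpenMP-style annotation.

/* Illustrative sketch: fusion improves locality while preserving parallelism. */
void separate(int N, double *A, double *B, double *C) {
    for (int i = 0; i < N; i++) B[i] = 2.0 * A[i];
    for (int i = 0; i < N; i++) C[i] = B[i] + 1.0;   /* B re-read after the first loop */
}
void fused(int N, double *A, double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        B[i] = 2.0 * A[i];        /* B[i] is produced and consumed ...  */
        C[i] = B[i] + 1.0;        /* ... in the same parallel iteration */
    }
}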
To support hierarchical/heterogeneous architectures, the scheduling algorithm can be used hierarchically. Different levels of the hardware hierarchy are optimized using very different cost functions. At the outermost levels of the hierarchy, weight is put on obtaining coarse-grained parallelism with minimal communications that keep as many compute cores busy as possible. At the innermost levels, after communications have been introduced, focus is set on fine-grained parallelism, contiguity, and alignment to enable hardware features such as vectorization. Tiling also separates mapping concerns between different levels of the hierarchy.

Task Formation and Placement
The streaming execution model that R-Stream targets is as follows:
● A bulk set of input data Di is loaded into a memory M that is close to processing element P.
● A set of operations T, called "task," works on Di.
● A live-out data set Do produced by T is unloaded from M to make room for the next task's input data.

The role of task formation is to partition the program's operations into such tasks in a way that minimizes the program execution time. Since computing is much faster than transferring data, one of the goals of task formation is to obtain a high computation-per-communication ratio for the tasks. The memory M may be a local or scratchpad memory, which holds a fixed number of bytes. This imposes a capacity constraint: the working data set of a task should not overflow M's capacity. Finally, the considered processing element P could be composite, in the sense that it may itself be formed of several processing elements. Task formation also ensures that enough parallelism is provided to such processing elements.

R-Stream's task formation algorithm improves on prior algorithms for tiling by Ancourt et al. [], Xue [], and Ahmed et al. []. The algorithm is enabled by using Ehrhart polynomial techniques developed by Clauss et al. [], Verdoolaege et al. [], and Meister et al. []. The Ehrhart polynomials are used to quickly count the volume (number of integer points) in Z-Domains that model the data footprint implied by prospective iteration tilings. The R-Stream task formation algorithm directly tiles imperfect loop nests independently. The prior work requires perfect loop nests or
expanded imperfect loop nests into a perfect form that forced all sections of code, even vastly unrelated ones, into the same tiling structure.

The R-Stream task formation algorithm uses evaluators and manipulators, which can be associated with task groups, loops, or the whole function. Evaluators are a function of the elements of the entity they are associated with. They can be turned into a constraint (implicitly: evaluator must be nonnegative) or an objective function. Manipulators enforce constraints on the entity they are associated with. For instance, the tile size of a loop can be made to be a multiple of a certain size, or a power of two, and a set of loops can be set to have the same tile size. Evaluators and manipulators can be combined (affine combination, sequence, conditional sequence) to produce a custom tiling problem.

The task formation algorithm addresses whether the task can be executed in parallel or sequentially; it uses heuristics that walk the set of operations of the GDG and group the operations based on fusion decisions made by the scheduler, data locality considerations, and the possibility of executing the operations on the same processor.

The task formation algorithm also can detect a class of reductions and mark such loops. It then uses associativity of the scanned operations to relax polyhedral dependence constraints when computing the loop's tiling constraints.
namely “shared memory” and “registers”

Memory Promotion R-Stream exploits the affine scheduling algorithm men-


R-Stream performs memory promotion for architec- tioned earlier, to have, whenever possible, coarse-
tures with fast but limited scratch-pad memories on dis- grained parallelism at outer level and fine-grained
tributed memory architectures. When data are migrated parallelism at inner level with contiguously aligned
from one memory to another, the compiler also per- memory accesses. The affine scheduler tries to find a
forms a reorganization of the data layout to improve right mix of parallelism, data locality, and data conti-
storage utilization, or locality of reference, or to enable guity. As a result, not all global memory array accesses
other optimizations. Such data layout reorganization can be made contiguous by the schedule. In such cases,
often comes “for free,” especially if it can be overlapped and also in cases where global memory accesses are used
with computation via hardware support, such as DMA. multiple times, R-Stream generates data transfers from
This technique is a generalization of Schreiber et al. []. global DRAM to shared memory (which resides on chip
and is shared by all threads in a multiprocessor) as a
Communication Generation and DMA element-wise copy loop nest such that the global mem-
Optimizations ory accesses are contiguous in the transfer. The global
When the data set is promoted from a source memory memory accesses are then appropriately replaced by the
to a target memory, R-Stream generates data transfer shared memory accesses. As a result, R-Stream achieves
from the source array to the target array and back. Those data contiguity in global memory and also improves
transfers are represented as a element-wise copy loop data reuse across threads.
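The shape of this staging can be visualized with a simplified, single-threaded C sketch; it is only a schematic, not generated code. In the CUDA that R-Stream actually emits (see the output excerpt shown below), the copy loop is distributed over the thread indices so that consecutive threads touch consecutive global addresses and the loads coalesce, and a barrier separates the copy phase from the compute phase.

/* Illustrative sketch only (plain C, one thread): global-to-"shared"
 * staging for a 1-D 3-point stencil.  TILE and the buffer are
 * hypothetical; the caller must ensure base >= 1 and base + TILE + 1
 * stays within the array. */
#define TILE 128

void stencil_tile(float *out, const float *in, int base)
{
    float in_s[TILE + 2];             /* stand-in for on-chip shared memory */

    /* Copy loop: the global index (base - 1 + j) is a unit-stride,
     * contiguous walk over global memory. */
    for (int j = 0; j < TILE + 2; j++)
        in_s[j] = in[base - 1 + j];

    /* Compute phase: the original global reads are replaced by reads of
     * the staged copy, so each global element is loaded once even though
     * it is reused by neighboring output points. */
    for (int j = 0; j < TILE; j++)
        out[base + j] = 0.5f * (in_s[j] + in_s[j + 2]);
}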
#pragma rstream map
void gaussseidel2D_9points(real_t (*A)[N], int pT, int pN) {
  int i, j, t;
  for (t=0; t<pT; t++) {
    for (i=1; i<pN-1; i++) {
      for (j=1; j<pN-1; j++) {
        A[i][j] = C0*A[i][j] +
                  C1*(A[i-1][j-1] + A[i-1][j] + A[i-1][j+1] +
                      A[i  ][j-1] +             A[i  ][j+1] +
                      A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]);
      }
    }
  }
}

R-Stream Compiler. Fig.  R-Stream’s input: Gauss-Seidel nine-point stencil

#pragma unroll
for (i1 = (__maxs_32(__maxs_32(-480 * j + -64 * k + -16 * (int)blockIdx.x + -
    (int)threadIdx.y + 7 >> 3, - (int)threadIdx.y + 7 >> 3), __maxs_32(-64 *
    i + -1440 * j + -32 * k + -48 * (int)blockIdx.x + - (int)threadIdx.y + 31 +
    7 >> 3, 480 * j + 64 * k + 16 * (int)blockIdx.x + - (int)threadIdx.y +
    -2076 + 7 >> 3))); i1 <= _t3; i1++) {
  int _t6; int j1;
  _t6 = (__mins_32(__mins_32(480 * j + 64 * k + -8 * i1 + 16 * (int)blockIdx.x +
      - (int)threadIdx.x + - (int)threadIdx.y + 66 >> 4, -64 * i + -960 * j + 32
      * k + -32 * (int)blockIdx.x + - (int)threadIdx.x + 2078 >> 4), __mins_32(-
      (int)threadIdx.x + 2047 >> 4, __mins_32(480 * j + 64 * k + 8 * i1 + 16 *
      (int)blockIdx.x + - (int)threadIdx.x + (int)threadIdx.y + 2 >> 4, 480 * j +
      64 * k + 16 * (int)blockIdx.x + - (int)threadIdx.x + 33 >> 4))));
  #pragma unroll
  for (j1 = (__maxs_32(__maxs_32(480 * j + 64 * k + -8 * i1 + 16 * (int)blockIdx.x
      + - (int)threadIdx.x + - (int)threadIdx.y + -31 + 15 >> 4, __maxs_32(-64 *
      i + -960 * j + 32 * k + -32 * (int)blockIdx.x + - (int)threadIdx.x + -30 +
      15 >> 4, 64 * i + 1920 * j + 96 * k + 64 * (int)blockIdx.x + -
      (int)threadIdx.x + -2107 + 15 >> 4)), __maxs_32(- (int)threadIdx.x + 15 >>
      4, __maxs_32(480 * j + 64 * k + 8 * i1 + 16 * (int)blockIdx.x + -
      (int)threadIdx.x + (int)threadIdx.y + -63 + 15 >> 4, 480 * j + 64 * k + 16
      * (int)blockIdx.x + - (int)threadIdx.x + -46 + 15 >> 4)))); j1 <= _t6; j1++)
  {
    A_l_l[8 * i1 + (int)threadIdx.y][46 + (-480 * j + -64 * k) + (16 * j1 + (-16 *
        (int)blockIdx.x + (int)threadIdx.x))] = A_l[-31 + (64 * i + 1440 * j) + (32
        * k + 8 * i1 + (48 * (int)blockIdx.x + (int)threadIdx.y))][16 * j1 +
        (int)threadIdx.x];
  }
}
__syncthreads();

R-Stream Compiler. Fig.  Excerpt of R-Stream’s CUDA output: Gauss-Seidel nine-point stencil
Example
Figure  shows input code to R-Stream. The input code is ANSI C, for a simple nine-point Gauss-Seidel stencil operated on a 2D array. The input style that can be raised is succinct, and the user does not give directives describing the mapping, only the indication that the compiler should map the function. The compiler determines the mapping structure automatically. The total output code for this important kernel must be hundreds of lines long to achieve performance in CUDA and on GPUs; a small excerpt of the output from R-Stream in CUDA from this example is in Fig. . One can see that the compiler has formed parallelism that is implicit in the CUDA thread and block structure. The chosen schedule balances parallelism, locality, and in particular contiguity, to enable coalescing of loads. Also illustrated is the creation of local copies of the array, the generation of explicit copies into the CUDA scratchpad “shared memories.”

Related Entries
Parallelization, Automatic

Bibliographic Notes and Further Reading
The final report on R-Stream research for DARPA is Lethin et al. []. More information on R-Stream is available from various subsequent workshop papers by Leung et al. [], Meister et al. [], and Bastoul et al. []. Several US and international patent applications by this encyclopedia entry’s authors are pending and will publish with even more detail on the optimizations in R-Stream.

Feautrier is credited with defining the polyhedral model in the late s and early s [, , ]. Darte et al. provided a detailed comparison of polyhedral techniques to classical techniques, in terms of the benefits of the exact dependence relations, in  []. Much theoretical work accumulated on the polyhedral model in the decades after its invention, but it took the work of Quilleré, Rajopadhye, and Wilde to unlock the technique for application by showing effective code generation algorithms []; these were later improved by Bastoul [] and Vasilache [] (the latter thesis also providing a good bibliography on the polyhedral model).

Bibliography
. nVidia CUDA Compute Unified Device Architecture Programming Guide (Version .), June 
. Ahmed N, Mateev N, Pingali K () Tiling imperfectly-nested loop nests. In: Supercomputing ’: proceedings of the  ACM/IEEE conference on supercomputing (CDROM), IEEE Computer Society, Washington, DC, pp –
. Allen JR, Kennedy K, Porterfield C, Warren J () Conversion of control dependence to data dependence. In: Proceedings of the th ACM SIGACT-SIGPLAN symposium on principles of programming languages, New York, pp –
. Ancourt C, Irigoin F () Scanning polyhedra with DO loops. In: Proceedings of the rd ACM SIGPLAN symposium on principles and practice of parallel programming, Williamsburg, VA, pp –, Apr 
. Barthou D, Cohen A, Collard JF () Maximal static expansion. In: Proceedings of the th ACM SIGPLAN-SIGACT symposium on principles of programming languages, New York, pp –
. Bastoul C () Efficient code generation for automatic parallelization and optimization. In: Proceedings of the international symposium on parallel and distributed computing, Ljubljana, pp –, Oct 
. Bastoul C, Vasilache N, Leung A, Meister B, Wohlford D, Lethin R () Extended static control programs as a programming model for accelerators: a case study: targetting Clearspeed CSX with the R-Stream compiler. In: First workshop on Programming Models for Emerging Architectures (PMEA)
. Bondhugula U, Hartono A, Ramanujan J, Sadayappan P () A practical automatic polyhedral parallelizer and locality optimizer. In: ACM SIGPLAN Programming Languages Design and Implementation (PLDI ’), Tucson, Arizona, June 
. Buck I () Brook v. specification. Technical report, Stanford University, Oct 
. Cifuentes C () A structuring algorithm for decompilation. In: Proceedings of the XIX Conferencia Latinoamericana de Informatica, Buenos Aires, Argentina, pp –
. Cifuentes C () Structuring decompiled graphs. Technical Report FIT-TR--, Department of Computer Science, University of Tasmania, Australia, . Also in Proceedings of the th international conference on compiler construction, , pp –
. Clauss P, Loechner V () Parametric analysis of polyhedral iteration spaces. In: IEEE international conference on application specific array processors, ASAP’. IEEE Computer Society, Los Alamitos, Calif, Aug 
. Click C, Paleczny M () A simple graph-based intermediate representation. ACM SIGPLAN Notices, San Francisco, CA
. Cytron R, Ferrante J, Rosen BK, Zadeck FK () Efficiently computing static single assignment form and the control dependence graph. ACM Trans Program Lang Syst ():–
. Darte A, Schreiber R, Villard G () Lattice-based memory allocation. IEEE Trans Comput ():–
. Feautrier P () Array expansion. In: Proceedings of the nd international conference on supercomputing, St. Malo, France
. Feautrier P () Parametric integer programming. RAIRO- . Sankaralingam K, Nagarajan R, Gratz P, Desikan R, Gulati
Recherche Opérationnelle, ():– D, Hanson H, Kim C, Liu H, Ranganathan N, Sethumadha-
. Feautrier P () Dataflow analysis of array and scalar references. van S, Sharif S, Shivakumar P, Yoder W, McDonald R, Keckler
Int J Parallel Prog ():– SW, Burger DC () The distributed microarchitecture of the
. Feautrier P () Some efficient solutions to the affine schedul- TRIPS prototype processor. In: th international symposium on
ing problem. Part I. One-dimensional time. Int J Parallel Prog microarchitecture (MICRO), Los Alamitos, Calif, Dec 
():– . Schreiber R, Cronquist DC () Near-optimal allocation of
. Feautrier P () Some efficient solutions to the affine schedul- local memory arrays. Technical Report HPL--, Hewlett-
ing problem. Part II. Multidimensional time. Int J Parallel Prog Packard Laboratories, Feb 
():– . Taylor MB, Kim J, Miller J, Wentzlaff D, Ghodrat F, Greenwald
. StreamIt Group () Streamit language specification, version B, Hoffmann H, Johnson P, Lee JW, Lee W, Ma A, Saraf A,
.. Technical report, Massachusetts Institue of Technology, Oct Seneski M, Shnidman N, Strumpen V, Frank M, Amarasinghe S,
 Agarwal A () The raw microprocessor: a computational fabric
. Lethin R, Leung A, Meister B, Szilagyi P, Vasilache N, Wohlford D for software circuits and general purpose programs. Micro, Mar
() Final report on the R-Stream . compiler DARPA/AFRL 
Contract # F--C-, DTIC AFRL-RI-RS-TR--. . Vasilache N, Bastoul C, Cohen A, Girbal S () Violated depen-
Technical report, Reservoir Labs, Inc., May  dence analysis. In: Proceedings of the th international confer-
. Leung A, Meister B, Vasilache N, Baskaran M, Wohlford D, ence on supercomputing (ICS’), Cairns, Queensland, Australia.
Bastoul C, Lethin R () A mapping path for multi-GPGPU ACM, New York, NY, USA, pp –
accelerated computers from a portable high level programming . Vasilache N, Cohen A, Pouchet LN () Automatic correction
abstraction. In: Third Workshop on General-Purpose Computa- of loop transformations. In: th international conference on par-
tion on Graphics Processing Units, GPGPU-, Mar  allel architecture and compilation techniques (PACT’), IEEE
. Lim AW, Lam MS () Maximizing parallelism and minimiz- Computer Society Press, Brasov, Romania, pp –, Sept 
ing synchronization with affine transforms. In: Proceedings of the . Vasilache NT () Scalable program optimization techniques in
th annual ACM SIGPLAN-SIGACT symposium on principles the polyhedral model. PhD thesis, Université Paris Sud XI, Orsay,
of programming languages, Paris, France, pp – Sept 
. Loechner V () Polylib: a library for manipulating . Verdoolaege S, Seghir R, Beyls K, Loechner V, Bruynooghe M
parametrized polyhedra. Technical report, University of Louis () Analytical computation of Ehrhart polynomials: enabling
Pasteur, Strasbourg, France, Mar  more compiler analyses and optimizations. In: Proceedings of
. Mai K, Paaske T, Jayasena N, Ho R, Dally W, Horowitz M () the  international conference on compilers, architecture,
Smart memories: a modular reconfigurable architecture. In: Pro- and synthesis for embedded systems, ACM Press, New York,
ceedings of the international symposium on Comuter architec- pp –
ture, pp –, June  . Weise D, Crew R, Ernst M, Steensgaard B () Value depen-
. Meister B, Leung A, Vasilache N, Wohlford D, Bastoul C, Lethin R dence graph: representation without taxation. In: ACM sym-
() Productivity via automatic code generation for PGAS plat- posium on principles of programming languages, New York,
forms with the R-Stream compiler. In: Workshop on asynchrony pp –
in the PGAS programming model, June  . Xue J () On tiling as a loop transformation. Parallel Process
. Meister B, Verdoolaege S () Polynomial approximations in Lett ():–
the polytope model: bringing the power of quasi-polynomials to
R
the masses. In: ODES-: th workshop on optimizations for DSP
and embedded systems, Apr 
. Offner C, Knobe K () Weak dynamic single assignment form. Run Time Parallelization
Technical Report HPL--, HP Labs
. Pop S, Cohen A, Silber G () Induction variable analysis with Joel H. Saltz , Raja Das
delayed abstractions. In: Proceedings of the  international 
Emory University, Atlanta, GA, USA
conference on high performance embedded architectures and 
IBM Corporation, Armonk, NY, USA
compilers, Barcelona, Spain
. Quilleré F, Rajopadhye S, Wilde D () Generation of effi-
cient nested loops from polyhedra. Int J Parallel Prog (): Synonyms
– Parallelization
. Rettberg RD, Crowther WR, Carvey PP, Tomlinson RS () The
Monarch parallel processor hardware design. Computer :– Irregular array accesses arise in many scientific
. Richards MA () Fundamentals of radar signal processing. applications including sparse matrix solvers, unstruc-
McGraw-Hill, New York tured mesh partial differential equation (PDE) solvers,
Traditional compilation techniques require that indices to data arrays be symbolically analyzable at compile time. A common characteristic of irregular applications is the use of indirect indexing to represent relationships among array elements. This means that data arrays are indexed through values of other arrays, called indirection arrays. Figure  depicts a simple example of a loop with indirection arrays. The use of indirection arrays prevents compilers from identifying array data access patterns. Inability to characterize array access patterns symbolically can prevent compilers from generating efficient code for irregular applications.

The inspector/executor strategy involves using compilers to generate code to examine and analyze data references during program execution. The results of this execution-time analysis may be used () to determine which off-processor data needs to be fetched and where the data will be stored once it is received and () to reorder and coordinate execution of loop iterations in problems with irregular loop-carried dependencies, as seen in the example in Fig. . The initial examination and analysis phase is called the inspector, while the phase that uses results of analysis to optimize program execution is called the executor.

Irregular applications can be divided into two subclasses: static and adaptive. Static irregular applications are those in which each object in the system interacts with a predetermined fixed set of objects. The indirection arrays, which capture the object interactions, do not change during the course of the computation (e.g., unstructured mesh PDE solver). In adaptive irregular applications, each object interacts with an evolving list of objects (e.g., molecular dynamics codes). This causes the indirection array, which is the object interaction list, to slowly change over the life of the computation. Figure  shows an example of an adaptive code. In this figure, ia, ib, and ic are the indirection arrays. Arrays x and y are the data arrays containing properties associated with the objects of the system.

for i = 1 to n do
  x(ia(i)) += y(ib(i))
end do

Run Time Parallelization. Fig.  Example of a loop involving indirection arrays

for i = 1 to n do
  y(i) += y(ia(i)) + y(ib(i)) + y(ic(i))
end do

Run Time Parallelization. Fig.  Example of loop-carried dependencies determined by a subscript array

The inspector/executor strategy can be used to parallelize static irregular applications targeted for distributed parallel machines and can achieve high performance. Additional optimizations are required to achieve high performance when adaptive irregular applications are parallelized using the inspector/executor approach. The example adaptive code in Fig.  shows that the indirection array ic does not change its value for every iteration of the outer loop. It typically has to be regenerated every few iterations of the outer loop, providing the opportunity for optimization.

Inspector/executor strategies have also been adopted to optimize I/O performance when applications need to carry out large numbers of small non-contiguous I/O requests. The inspector/executor strategy was initially articulated in the late s and early s, as described in Mirchandaney [], Saltz [], Koelbel [], Krothapalli [], and Walker []. This article addresses use of inspector/executor methods to optimize retrieval and management of off-processor data and in parallelization of loops with loop-carried dependencies created by indirection arrays. Use of inspector/executor methods and related compiler frameworks to optimize I/O has also been described.

Inspector/Executor Methods and Distributed Memory
The inspector/executor strategy works as follows: During program execution, the inspector examines data references made by a processor and determines () which off-processor data elements need to be obtained and () where the data will be stored once it is received. The executor uses the information from the inspector to gather off-processor data, carry out the actual computations, and then scatter results back to their home processors after the computational phase is completed. A central strategy of inspector/executor optimization is to identify and exploit situations that permit reuse of information obtained by inspectors. As discussed below, it is often possible to relocate preprocessing outside a set of loops or to carry out interprocedural analysis and transformation so that results from an inspector can be reused many times.
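The strategy can be illustrated for the irregular reduction x(ia(i)) += y(ib(i)) shown above with a schematic, single-process C sketch. It is not the interface of any particular runtime system: owned() and the direct array reads and writes stand in for the distribution test and for the gather and scatter messages, and deduplication of repeated indices is omitted.

#include <stdlib.h>

static int owned(int g, int lo, int hi) { return g >= lo && g < hi; }

void irregular_reduction_step(double *x, const double *y,
                              const int *ia, const int *ib,
                              int n, int lo, int hi)   /* my block is [lo,hi) */
{
    /* Inspector: list the off-processor y elements to gather and the
       off-processor x elements whose contributions must be scattered. */
    int *gath = malloc(n * sizeof *gath);
    int *scat = malloc(n * sizeof *scat);
    int ngath = 0, nscat = 0;
    for (int i = 0; i < n; i++) {
        if (!owned(ib[i], lo, hi)) gath[ngath++] = ib[i];
        if (!owned(ia[i], lo, hi)) scat[nscat++] = ia[i];
    }

    /* Executor, phase 1: gather remote y values into a receive buffer
       (the direct reads stand in for messages from the owners). */
    double *ybuf = malloc((ngath ? ngath : 1) * sizeof *ybuf);
    for (int k = 0; k < ngath; k++) ybuf[k] = y[gath[k]];

    /* Executor, phase 2: perform the computation; contributions to
       remote x entries are accumulated into a send buffer. */
    double *xbuf = malloc((nscat ? nscat : 1) * sizeof *xbuf);
    int g = 0, s = 0;
    for (int i = 0; i < n; i++) {
        double yv = owned(ib[i], lo, hi) ? y[ib[i]] : ybuf[g++];
        if (owned(ia[i], lo, hi)) x[ia[i]] += yv;
        else                      xbuf[s++]  = yv;
    }

    /* Executor, phase 3: scatter the buffered contributions back to the
       owners, which add them into their portions of x (again modeled
       here by direct writes). */
    for (int k = 0; k < nscat; k++) x[scat[k]] += xbuf[k];

    free(gath); free(scat); free(ybuf); free(xbuf);
}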
for t = 1, step                        //outer step loop
  for i = 1, n                         //inner loop
    x(ia(i)) = x(ia(i)) + y(ib(i))
  endfor                               //end inner loop
  if (required) then
    regenerate ic(:)
  endif
  for i = 1, n                         //inner loop
    x(ic(i)) = x(ic(i)) + y(ic(i))
  endfor                               //end inner loop
endfor                                 //end outer step loop

Run Time Parallelization. Fig.  Example loop from adaptive irregular applications
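The reuse opportunity noted above — regenerating the inspector result for the ic-indexed loop only in the iterations in which ic itself is rebuilt — can be sketched in C as follows. The schedule type and the build_schedule helper are hypothetical stand-ins rather than an actual runtime interface.

#include <stdlib.h>
#include <string.h>

typedef struct { int *idx; int n; } schedule;           /* inspector result */

/* Inspector for an indirection array: record the references it implies
 * (a real system would build gather/scatter lists here). */
static schedule build_schedule(const int *ind, int n)
{
    schedule s;
    s.idx = malloc(n * sizeof *s.idx);
    memcpy(s.idx, ind, n * sizeof *s.idx);
    s.n = n;
    return s;
}

void adaptive_step_loop(double *x, const double *y, int *ic, int n, int steps)
{
    schedule sched_ic = build_schedule(ic, n);    /* inspector run once up front */
    int ic_changed = 0;

    for (int t = 0; t < steps; t++) {
        /* ... first inner loop of the adaptive example, and, if required,
               regeneration of ic(:) -- regeneration sets ic_changed = 1 ... */

        if (ic_changed) {                         /* re-run the inspector only  */
            free(sched_ic.idx);                   /* in the few iterations in   */
            sched_ic = build_schedule(ic, n);     /* which ic was rebuilt       */
            ic_changed = 0;
        }

        for (int i = 0; i < sched_ic.n; i++)      /* executor for the second    */
            x[sched_ic.idx[i]] += y[sched_ic.idx[i]];   /* inner loop           */
    }
    free(sched_ic.idx);
}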

for i = 1, n do
  x(ia(ib(i))) = ....
end do

for i = 1, n do
  if (ic(i)) then
    x(ia(i)) = ...
  end if
end do

for i = 1, n do
  do j = ia(i), ia(i + 1)
    x(ia(j)) = ...
  end do
end do

Run Time Parallelization. Fig.  Examples of loops with complex access functions

A variety of program analysis methods and transformations have been developed to generate programs that can make use of inspector/executor strategies. In loops with simple irregular access patterns such as those in Fig. , a single inspector/executor pair can be generated by the straightforward method described in []. Many application codes contain access patterns with more complex access functions such as those depicted in Fig. . Subscripted subscripts and subscripted guards can make indexing of one distributed array dependent on values in another. To handle this kind of situation, an inspector must itself be split into an inspector-executor pair as described in []. A dataflow framework was developed to help place executor communication calls, determine when it is safe to combine communications statements, move them into less frequently executed code regions, or avoid them altogether in favor of reusing data that is already buffered locally.

Communication Schedules
A communication schedule is used to fetch off-processor elements into a local buffer before the computation phase and to scatter computed data back to home processors. Communication schedules determine the volume of communication and number of communication startups. A variety of approaches to schedule generation have been taken. In the CHAOS system [], a hash table data structure is used to support a two-phase schedule generation process. The index analysis phase examines data access patterns to determine which references are off-processor, remove duplicate off-processor references, assign local buffers for off-processor references, and translate global indices to local indices. The results of index analyses are managed in a hash table data structure. The schedule generation phase then produces communication schedules based on hash table information. The CHAOS hash table contains a bit array used to identify which indirection arrays entered each element into the hash table. This approach facilitates efficient management and coordination of analysis information derived from programs with multiple indirection arrays and provides support for building incremental and merged communication schedules.
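The index analysis phase can be illustrated with a small, self-contained C sketch. It is not the CHAOS interface: the hash table is replaced by a simple marker array, only the gather side is shown, and the schedule generation phase proper (building per-owner message lists from the unique indices) is omitted.

#include <stdlib.h>

typedef struct {
    int *uniq;      /* unique off-processor global indices (gather list)   */
    int  nuniq;     /* number of unique indices = size of receive buffer   */
    int *slot;      /* for reference i: local buffer slot, or -1 if local  */
} index_analysis;

index_analysis analyze(const int *ind, int n, int lo, int hi, int global_n)
{
    index_analysis a;
    a.uniq  = malloc(n * sizeof *a.uniq);
    a.slot  = malloc(n * sizeof *a.slot);
    a.nuniq = 0;

    /* seen[] plays the role of the hash table: it remembers which global
       indices have already been assigned a local buffer slot. */
    int *seen = malloc(global_n * sizeof *seen);
    for (int g = 0; g < global_n; g++) seen[g] = -1;

    for (int i = 0; i < n; i++) {
        int g = ind[i];
        if (g >= lo && g < hi) {            /* on-processor reference      */
            a.slot[i] = -1;
        } else if (seen[g] >= 0) {          /* duplicate: reuse its slot   */
            a.slot[i] = seen[g];
        } else {                            /* new off-processor reference */
            seen[g] = a.nuniq;
            a.uniq[a.nuniq] = g;            /* global -> local translation */
            a.slot[i] = a.nuniq++;
        }
    }
    free(seen);
    return a;
}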
In some cases it can be advantageous to employ communication schedules that carry out bounding box or entire-array bulk read or write operations rather than using schedules that only communicate accessed array elements. These approaches differ in costs associated with both the inspector and the executor phases. The bounding box approach requires only that the inspector compute a bounding box containing all the necessary elements, and no inspector is required to retrieve the entire array. A compiler and runtime system using performance models to choose between element-wise gather, bounding box bulk read, and entire-array retrieval were implemented by the Titanium group discussed in [].

A variety of techniques have been incorporated into CHAOS and into efficiently supported applications such as particle codes, which manifest access patterns that change from iteration to iteration. The crucial observation is that in many such codes, the communication-intensive loops are actually carrying out a generalized reduction in which it is not necessary to control the order of elements stored. Hwang [] demonstrates that order-independence can be used to dramatically reduce schedule generation overhead.

Runtime Parallelization
A variety of runtime parallelization techniques have been developed for indirectly indexed loops with data dependencies among iterations, as shown in Figs.  and . Most of this work has targeted shared memory architectures due to the fine-grained concurrency that typically results when such codes are parallelized. One of the earliest techniques for runtime parallelization of loops involves the use of a key field associated with each indirectly indexed array element to order accesses. The algorithm repeatedly sweeps over all loop iterations in alternating analysis and computation phases. An iteration is allowed to proceed only if all accesses to array elements y(ia(i)) and y(ib(i)) by iterations j < i have completed. These sweeps have the effect of partitioning the computation into wavefronts. Key fields are used to ensure that dependences are taken into account. This is discussed further in []. This strategy is extended by allowing concurrent reads to the same entry, as Midkiff explains, and through development of a systematic dependence analysis and program transformation framework for singly nested loops. Intra-procedural and interprocedural analysis techniques are presented in an article by Lin [] to analyze the common and important cases of irregular single-indexed accesses and simple indirect array accesses.

for i = 1 to n do
  y(ia(i)) = ...
  ...
  ... = y(ib(i))
end do

Run Time Parallelization. Fig.  Subscript array-determined loop carried dependences with output dependence

An inspector/executor approach, applicable to loops without output dependencies, can be carried out through generation of an inspector that identifies wavefronts of concurrently executable loop iterations followed by an executor that transforms the original loop L into two loops, L1 and L2. The new outer loop L1 is sequential and the new inner loop L2 involves all loop indexes assigned to each wavefront, as described in Saltz []. This is analogous to loop skewing, as described by Wolfe [], except that the identification of parallelism is carried out during program execution.

A doacross loop construct can also be used as an executor. The wavefronts generated by the inspector are used to sort indices assigned to each processor into ascending wavefront order. Full/empty bit or busy-waiting synchronization enforces true dependencies; computation does not proceed until the array element required for the computation is available. A doacross executor construct was introduced in [], and this concept has been greatly refined through a variety of schemes that support operation level synchronization in the work of Chen and Singh [, ]. A very fine-grained data flow inspector/executor scheme that also employed operation level synchronization was proposed and tested on the CM-, as described in Chong’s work [].
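A compact C sketch of the wavefront inspector and of the two-loop executor described above, applied to the loop y(i) += y(ia(i)) + y(ib(i)) + y(ic(i)) (0-based indexing), is shown below. It is illustrative rather than the formulation of the cited papers: the inspector conservatively orders any two iterations that touch a common element, and the executor simply filters iterations by wavefront instead of bucketing them.

#include <stdlib.h>

/* Inspector: assign each iteration a wavefront number so that two
 * iterations in the same wavefront never touch a common element. */
void wavefront_inspector(const int *ia, const int *ib, const int *ic,
                         int n, int *wf, int *nwaves)
{
    int *last = calloc(n, sizeof *last);    /* last[e]: wavefront of the   */
    *nwaves = 0;                            /* latest access to element e  */

    for (int i = 0; i < n; i++) {
        int w = last[i];
        if (last[ia[i]] > w) w = last[ia[i]];
        if (last[ib[i]] > w) w = last[ib[i]];
        if (last[ic[i]] > w) w = last[ic[i]];
        wf[i] = w + 1;                      /* run after every earlier     */
                                            /* access to these elements    */
        last[i] = last[ia[i]] = last[ib[i]] = last[ic[i]] = wf[i];
        if (wf[i] > *nwaves) *nwaves = wf[i];
    }
    free(last);
}

/* Executor: outer loop L1 is sequential over wavefronts; inner loop L2
 * covers the iterations of one wavefront, which are mutually independent
 * and could be executed in parallel. */
void wavefront_executor(double *y, const int *ia, const int *ib, const int *ic,
                        int n, const int *wf, int nwaves)
{
    for (int w = 1; w <= nwaves; w++)           /* L1: sequential           */
        for (int i = 0; i < n; i++)             /* L2: wavefront w          */
            if (wf[i] == w)
                y[i] += y[ia[i]] + y[ib[i]] + y[ic[i]];
}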
Many situations occur in which runtime information might demonstrate that: () all loop iterations are independent and could execute in parallel or () that cross-iteration dependences are reductions. A kind of inspector/executor technique has been developed that involves () carrying out compiler transformations that assume parallelism or reduction parallelism, () running the generated speculative loop, and then () invoking tests to assess correctness of the speculative parallel execution, as discussed in Rauchwerger []. The compilation and runtime support framework generates tests and recovery code that need to be invoked if the tests demonstrate that the speculation was unsuccessful.

Structural Abstraction
Over the course of a decade Baden explored three successive runtime systems to treat data-dependent communication patterns and workload distributions: LPAR, LPAR-X, and KeLP [, ]. The libraries were used in various applications including structured adaptive mesh refinement for first principles simulation of real materials, genetic algorithms, cluster identification for spin models in statistical mechanics, turbulent flow, and cell microphysiology. This effort contributed the notion of structural abstraction, which supports user-level geometric meta-data that admit data-dependent decompositions used in irregular applications. These also enable communication to be expressed in a high-level geometric form which may be manipulated at runtime to optimize communication. KeLP was later extended to handle more general notions of sparse sets of data that logically communicate within rectangular subspaces of Zd, even if the data does not exist as rectangular sets of data, e.g., particles, as noted by Baden. A multi-tier variant of KeLP, KeLP, subsequently appeared to treat hierarchical parallelism in the then-emerging SMP cluster technologies, with support to mask communication delays [].

for i = 1 to n do
  x(i) = y(ia(i)) + z(i)
end do

Run Time Parallelization. Fig.  Simple irregular loop

Loop Transformation
The compiler parallelizes the loop shown in Fig.  by generating the corresponding inspector/executor pair. Assume that all arrays are aligned and distributed in blocks among the processors, and the iterations of the i-loop are likewise block-partitioned. The resulting computation mapping is equivalent to that produced by the “owner computes” rule, a heuristic that maps the computation of an assignment statement to the processor that owns the left-hand-side reference. Data array y is indexed using array ia, causing a single level of indirection.

The compiler generates a single executable: a copy of this executable runs on each processor. The program running on each processor determines what processor it is running on, and uses that information to figure out where its data and iteration fit in the global computation. Let my_elements represent the number of iterations assigned to a processor and also the number of elements of data arrays x, y, z and indirection array ia assigned to the processor. The code generated by the compiler and executed by each processor is shown in Fig. .

Compiler Implementations
Inspector/executor schemes have been implemented in a variety of compilers. Initial inspector/executor implementations in the early s were carried out in the ARF [], [] and Kali [] compilers as discussed by Saltz and Koelbel, respectively. Inspector/executor schemes were subsequently implemented in Fortran D; Fortran D [, ]; the Polaris compiler discussed by Lin []; the SUPERB [] compiler; the Vienna Fortran compiler []; the Titanium Java compiler []; compilation systems designed to support efficient implementation of OpenMP on platforms that support MPI, discussed by Basumallik [], and on platforms that support Global Arrays, discussed by Liu []; and in HPF compilers described in examples by Benkner and Merlin [, ]. Many of these compiler frameworks support user-specified data and iteration partitioning methods. A framework that supports compiler-based analysis about runtime data and iteration reordering transformations is described in [].

Speculative parallelization techniques have been proposed to effectively parallelize loops with complex data access patterns. Various software [, , ] and hardware [, ] techniques have been proposed. The hardware techniques involve increased hardware complexity but can catch cross-iteration data dependencies immediately with minimal overheads. On the other hand, full software solutions do not require new hardware but in certain cases can lead to significant performance penalties.
//inspector code
for i = 1, my_elements
  index_y(i) = ia(i)
endfor
sched_ia = generate_schedule(index_y)   //call the inspector schedule generation codes
call gather(y(begin_buffer), y, sched_ia)

//executor code
for i = 1, my_elements
  x(i) = y(index_y(i)) + z(i)
endfor

Run Time Parallelization. Fig.  Transformed irregular loop

The first known work on speculative parallelization techniques to parallelize loops with complex data access pattern was proposed by Rauchwerger et al. []. Instead of breaking loops into inspector/executor computations, they do a speculative execution of the loop as a doall with a runtime check (the LRPD test) to assess whether there are any cross-iteration dependencies. The scheme supports back-tracking and serial re-execution of the loop if the runtime check fails. The biggest advantage of this technique is that the access pattern of the data arrays does not have to be analyzed separately and the runtime tests are performed during the actual computation of the loop. They used this technique on loops, which cannot be parallelized statically, from the PERFECT Benchmark and showed that in many cases, it leads to code that performs better than the inspector/executor strategy. The scheme can be used for automatic parallelization of complex loops, but its main limitation is that dependency violations are detected after the computation of the loop, and a severe penalty has to be paid in terms of rolling back the computation and executing it serially. Gupta et al. [] proposed a number of new runtime tests that improve the techniques presented in []; thus significantly reducing the penalties associated with misspeculation. Dang et al. [] proposed a new technique that transforms a partially parallel loop into a sequence of fully parallel loops. It is a recursive technique where in each step all remaining iterations of the loop are executed in parallel. After the execution, the LRPD test is done to detect dependencies. The iterations that were correct are committed, and the recursive process starts with the remaining iterations. The limitation of the technique is that the loop iterations have to be statically block-scheduled in increasing order of iterations between the processors, possibly causing work imbalance.

Bibliography
. Baden SB, Fink SJ () A programming methodology for dual-tier multicomputers. IEEE Trans Softw Eng ():–
. Baden SB, Kohn SR () Portable parallel programming of numerical problems under the LPAR system. J Parallel Distrib Comput ():–
. Basumallik A, Eigenmann R () Optimizing irregular shared-memory applications for distributed-memory systems. In: Proceedings of the eleventh ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP’), New York, – Mar . ACM, New York, pp –
. Benkner S, Mehrotra P, Van Rosendale J, Zima H () High-level management of communication schedules in HPF-like languages. In: Proceedings of the th international conference on supercomputing (ICS’), Melbourne. ACM, New York, pp –
. Brezany P, Gerndt M, Sipkova V, Zima HP () SUPERB support for irregular scientific computations. In: Proceedings of the scalable high performance computing conference (SHPCC-), Williamsburg, – Apr , pp –
. Chen DK, Torrellas J, Yew PC () An efficient algorithm for the run-time parallelization of DOACROSS loops. In: Proceedings of the  ACM/IEEE conference on supercomputing, Washington, DC, pp –
. Chong FT, Sharma SD, Brewer EA, Saltz JH () Multiprocessor runtime support for fine-grained, irregular DAGs. Parallel Process Lett :–
. Cintra M, Martinez J, Torrellas J () Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In: Proceedings of th annual international symposium on computer architecture, Vancouver, pp –
. Dang F, Yu H, Rauchwerger L () The R-LRPD test: speculative parallelization of partially parallel loops. In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Ft. Lauderdale
. Das R, Saltz J, von Hanxleden R () Slicing analysis and indirect accesses to distributed arrays. Lecture notes in computer science, vol . Springer, Berlin/Heidelberg
. Gupta M, Nim R () Techniques for speculative run-time parallelization of loops. In: Proceedings of the  ACM/IEEE conference on supercomputing (SC’), pp –
. Hwang YS, Moon B, Sharma SD, Ponnusamy R, Das R, Saltz JH () Runtime and language support for compiling adaptive irregular programs. Softw Pract Exp ():–
. Koelbel C, Mehrotra P, Van Rosendale J () Supporting shared data structures on distributed memory machines. In: Symposium on principles and practice of parallel programming. ACM, New York, pp –
. Krothapalli VP, Sadayappan P () Dynamic scheduling of DOACROSS loops for multiprocessors. In: International conference on databases, parallel architectures and their applications (PARBASE-), Miami, pp –
. Lin Y, Padua DA () Compiler analysis of irregular memory accesses. In: Proceedings of the ACM SIGPLAN  conference on programming language design and implementation. Vancouver, pp –
. Liu Z, Huang L, Chapman BM, Weng TH () Efficient implementation of OpenMP for clusters with implicit data distribution. WOMPAT, Houston, pp –
. Lusk EL, Overbeek RA () A minimalist approach to portable, parallel programming. In: Jamieson L, Gannon D, Douglass R (eds) The characteristics of parallel algorithms. MIT Press, Cambridge, MA, pp –
. Merlin JH, Baden SB, Fink S, Chapman BM () Multiple data parallelism with HPF and KeLP. Future Gener Comput Syst ():–
. Midkiff SP, Padua DA () Compiler algorithms for synchronization. IEEE Trans Comput ():–
. Mirchandaney R, Saltz JH, Smith RM, Nicol DM, Crowley K () Principles of runtime support for parallel processors. In: Proceedings of the second international conference on supercomputing (ICS’), St. Malo, pp –
. Ponnusamy R, Hwang YS, Das R, Saltz JH, Choudhary A, Fox G () Supporting irregular distributions using data-parallel languages. Parallel Distrib Technol Syst Appl ():–
. Prvulovic M, Garzaran MJ, Rauchwerger L, Torrellas J () Removing architectural bottlenecks to the scalability of speculative parallelization. In: Proceedings, th annual international symposium on Computer architecture, pp –
. Rauchwerger L, Padua D () The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. In: Proceedings of the ACM SIGPLAN  conference on programming language design and implementation (PLDI’), La Jolla, – June . ACM, New York, pp –
. Saltz JH, Berryman H, Wu J () Multiprocessors and run-time compilation. Concurr Pract Exp ():–
. Saltz JH, Mirchandaney R, Crowley K () Run-time parallelization and scheduling of loops. IEEE Trans Comput ():–
. Singh DE, Martín MJ, Rivera FF () Increasing the parallelism of irregular loops with dependences. In: Euro-Par, Klagenfurt, pp –
. Strout M, Carter L, Ferrante J () Compile-time composition of run-time data and iteration reordering. Program Lang Des Implement ():–
. Su J, Yelick K () Automatic support for irregular computations in a high-level language. In: Proceedings, th IEEE International Parallel and distributed processing symposium, Atlanta, pp b, – Apr 
. Ujaldon M, Zapata EL, Chapman BM, Zima HP () Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation. IEEE Trans Parallel Distrib Syst ():–
. von Hanxleden R, Kennedy K, Koelbel C, Das R, Saltz J () Compiler analysis for irregular problems in Fortran D. In: Proceedings of the fifth international workshop on languages and compilers for parallel computing. Springer, London, pp –
. Wu J, Das R, Saltz J, Berryman H, Hiranandani S () Distributed memory compiler design for sparse problems. IEEE Trans Comput ():–
. Walker D () The implementation of a three-dimensional PIC Code on a hypercube concurrent processor. In: Conference on hypercubes, concurrent computers, and application, Pasadena
. Wolfe M () More iteration space tiling. In: Proceedings of the  ACM/IEEE conference on supercomputing (Supercomputing’), pp –

Runtime System
Operating System Strategies
