In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis.

This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each transformation, explain how to determine if it is legal, and give an example of its application.

Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimizations that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of optimizing compiler technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as a bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.
INTRODUCTION
Optimizing compilers have become an essential component of modern high-performance computer systems. In addition to translating the input program into machine language, they analyze it and apply various transformations to reduce its running time or its size.

As optimizing compilers become more effective, programmers can become less concerned about the details of the underlying machine architecture and can employ higher-level, more succinct, and more intuitive programming constructs and program organizations. Simultaneously, hardware designers are able to
This research has been sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT63-92-C-0026, by NSF grant CDA-8722788, and by an IBM Resident Study Program Fellowship to David Bacon.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0360-0300/94/1200-0345 $03.50
to Fortran 90, with minor variations and extensions; most examples use only features found in Fortran 77. We have chosen to use Fortran because it is the de facto standard of the high-performance engineering and scientific computing community. Fortran has also been the input language for a number of research projects studying parallelization [Allen et al. 1988a; Balasundaram et al. 1989; Polychronopoulos et al. 1989]. It was chosen by these projects not only because of its ubiquity among the user community, but also because its lack of pointers and its static memory model make it more amenable to analysis. It is not yet clear what effect the recent introduction of pointers into Fortran 90 will have on optimization.

The optimizations we have presented are not specific to Fortran—in fact, many commercial compilers use the same intermediate language and optimizer for both C and Fortran. The presence of unrestricted pointers in C can reduce opportunities for optimization because it is impossible to determine which variables may be referenced by a pointer. The process of determining which references may point to the same storage locations is called alias analysis [Banning 1979; Choi et al. 1993; Cooper and Kennedy 1989; Landi et al. 1993].

The only changes to Fortran in our examples are that array subscripting is denoted by square brackets to distinguish it from function calls, and we use do all loops to indicate textually that all the iterations of a loop may be executed concurrently. To make the structure of the computation more explicit, we will generally express loops as iterations rather than in Fortran 90 array notation.

We follow the Fortran convention that arrays are stored contiguously in memory in column-major form. If the array is traversed so as to visit consecutive locations in the linear memory, the first (that is, leftmost) subscript varies the fastest. In traversing the two-dimensional array declared as a[n, m], the following array locations are contiguous: a[n-1, 3], a[n, 3], a[1, 4], a[2, 4]. Programmers unfamiliar with Fortran should also note that all arguments are passed by reference.

When describing compilation for vector machines, we sometimes use array notation when the mapping to hardware registers is clear. For instance, the loop

    do all i = 1, 64
        a[i] = a[i] + c
    end do all

could be implemented with a scalar-vector add instruction, assuming a machine with vector registers of length 64. This would be written in Fortran 90 array notation as

    a[1:64] = a[1:64] + c

or in vector machine assembly language as

    LF    F2, c(R30)    ;load c into register F2
    ADDI  R8, R30, #a   ;load addr. of a into R8
    LV    V1, R8        ;load vector a[1:64] into V1
    ADDSV V1, F2, V1    ;add scalar to vector
    SV    V1, R8        ;store vector in a[1:64]

Array notation will not be used when loop bounds are unknown, because there is no longer an obvious correspondence between the source code and the fixed-length vector operations that perform the computation. To use vector operations, the compiler must perform the transformation called strip-mining, which is discussed in Section 6.2.4.

2. TRANSFORMATION ISSUES

For a compiler to apply an optimization to a program, it must do three things:

(1) Decide upon a part of the program to optimize and a particular transformation to apply to it.

(2) Verify that the transformation either does not change the meaning of the program or changes it in a restricted way that is acceptable to the user.

(3) Transform the program.

In this survey we concentrate on the last step: transformation of the program.
However, in Section 5 we introduce dependence analysis techniques that are used for deciding upon and verifying many loop transformations.

Step (1) is the most difficult and poorly understood; consequently compiler design is in many ways a black art. Because analysis is expensive, engineering issues often constrain the optimization strategies that the compiler can perform. Even for relatively straightforward uniprocessor target architectures, it is possible for a sequence of optimizations to slow a program down. For example, an attempt to reduce the number of instructions executed may actually degrade performance by making less efficient use of the cache.

As processor architectures become more complex, (1) the number of dimensions in which optimization is possible increases and (2) the decision process is greatly complicated. Some progress has been made in systematizing the application of certain families of transformations, as discussed in Section 8. However, all optimizing compilers embody a set of heuristic decisions as to the transformation orderings likely to work best for the target machine(s).

…tions in the two executions produces the same result.

Nondeterminism can be introduced by language constructs (such as Ada's select statement), or by calls to system or library routines that return information about the state external to the program (such as Unix time() or read()).

In some cases it is straightforward to cause executions to be identical, for instance by entering the same inputs. In other cases there may be support for determinism in the programming environment—for instance, a compilation option that forces select statements to evaluate their guards in a deterministic order. As a last resort, the programmer may have to replace nondeterministic system calls temporarily with deterministic operations in order to compare two versions.

To illustrate many of the common problems encountered when trying to optimize a program and yet maintain correctness, Figure 1(a) shows a subroutine, and Figure 1(b) is a transformed version of it. The new version may seem to have the same semantics as the original, but it violates Definition 2.1.1 in the following ways:
    subroutine tricky(a, b, n, m, k)
    integer n, m, k
    real a[m], b[m]

    do i = 1, n
        a[i] = b[k] + a[i] + 100000.0
    end do
    return

    (a) original program

    subroutine tricky(a, b, n, m, k)
    integer n, m, k
    real a[m], b[m], C

    C = b[k] + 100000.0
    do i = n, 1, -1
        a[i] = a[i] + C
    end do
    return

    (b) transformed program

    k = m+1
    n = 0
    call tricky(a, b, n, m, k)

    (c) possible call to tricky

● Memory fault. If k > m but n < 1, the reference to b[k] is illegal. The reference would not be evaluated in the original program because the loop body is never executed, but it would occur in the call shown in Figure 1(c) to the transformed code in Figure 1(b).

● Different results. a and b may be completely or partially aliased to one another, changing the values assigned to a in the transformed program. Figure 1(d) shows how this might occur: If the call is to the original subroutine, b[k] is changed when i = 1, since k = n and b[n] is aliased to a[1]. In the transformed version, the old value of b[k] is used for all i, since it is read before the loop. Even if the reference to b[k] were moved back inside the loop, the transformed version would still give different results because the loop traversal order has been reversed.

As a result of these problems, slightly different definitions are used in practice. When bitwise identical results are desired, the following definition is used:
For languages and standards that define a specific semantics for exceptions, meeting Definition 2.1.2 tightly constrains the permissible transformations. If the compiler may assume that exceptions are only generated when the program is semantically incorrect, a transformed program can produce different results when an exception occurs and still be legal. When exceptions have a well-defined semantics, as in the IEEE floating-point standard [American National Standards Institute 1987], many transformations will not be legal under this definition.

However, demanding bitwise identical results may not be necessary. An alternative correctness criterion is:

Definition 2.1.3 A transformation is legal if, for all semantically correct executions of the original program, the original and the transformed programs perform equivalent operations for identical executions. All permutations of semicommutative operations are considered equivalent.

Since we cannot predict the degree to which transformations of semicommutative operations change the output, we must use an operational rather than an observational definition of equivalence. Generally, in practice, programmers observe whether the numeric results differ by more than a certain tolerance, and if they do, force the compiler to employ Definition 2.1.2.

2.2 Scope

Optimizations can be applied to a program at different levels of granularity. As the scope of the transformation is enlarged, the cost of analysis generally increases. Some useful gradations of complexity are:

● Statement. Arithmetic expressions are the main source of potential optimization within a statement.

● Basic block (straight-line code). The advantage for analysis is that there is only one entry point, so control transfer need not be considered in tracking the behavior of the code.

● Innermost loop. To target high-performance architectures effectively, compilers need to focus on loops. Most of the transformations discussed in this survey are based on loop manipulation. Many of them have been studied or widely applied only in the context of the innermost loop.

● Perfect loop nest. A loop nest is a set of loops one inside the next. The nest is called a perfect nest if the body of every loop other than the innermost consists of only the next loop in the nest. Because a perfect nest is more easily summarized and reorganized, several transformations apply only to perfect nests.

● General loop nest. Any loop nesting, perfect or not.

● Procedure. Some optimizations, memory access transformations in particular, yield better improvements if they are applied to an entire procedure at once. The compiler must be able to manage the interactions of all the basic blocks and control transfers within the procedure. The standard and rather confusing term for procedure-level optimization in the literature is global optimization.

● Interprocedural. Considering several procedures together often exposes more opportunities for optimization; in particular, procedure call overhead is often significant, and can sometimes be reduced or eliminated with interprocedural analysis.

We will generally confine our attention to optimizations beyond the basic-block level.
tectures, which for our purposes are superscalar, vector, SIMD, shared-memory multiprocessor, and distributed-memory multiprocessor machines. These architectures have in common that they all use parallelism in some form to improve performance.

The structure of an architecture dictates how the compiler must optimize along a number of different (and sometimes competing) axes. The compiler must attempt to:

● maximize use of computational resources (processors, functional units, vector units),
● minimize the number of operations performed,
● minimize use of memory bandwidth (register, cache, network), and
● minimize the size of total memory required.

While optimization for scalar CPUs has concentrated on minimizing the dynamic instruction count (or more precisely, the number of machine cycles required), CPU-intensive applications are often at least as dependent upon the performance of the memory system as they are on the performance of the functional units.

In particular, the distance in memory between consecutively accessed elements of an array can have a major performance impact. This distance is called the stride. If a loop is accessing every fourth element of an array, it is a stride-4 loop. If every element is accessed in order, it is a stride-1 loop. Stride-1 access is desirable because it maximizes memory locality and therefore the efficiency of the cache, translation lookaside buffer (TLB), and paging systems; it also eliminates bank conflicts on vector machines.

Another key to achieving peak performance is the paired use of multiply and add operations in which the result from the multiplier can be fed into the adder. For instance, the IBM RS/6000 has a multiply-add instruction that uses the multiplier and adder in a pipelined fashion; one multiply-add can be issued each cycle. The Cray Y-MP C90 does not have a single instruction; instead the hardware detects that the result of a vector multiply is used by a vector add, and uses a strategy called chaining in which the results from the multiplier's pipeline are fed directly into the adder. Other compound operations are sometimes implemented as well.

Use of such compound operations may allow an application to run up to twice as fast. It is therefore important to organize the code so as to use multiply-adds wherever possible.

3.1 Characterizing Performance

There are a number of variables that we have used to help describe the performance of the compiled code quantitatively:

● S is the hardware speed of a single processor in operations per second. Typically, speed is measured either in millions of instructions per second (MIPS) or in millions of floating-point operations per second (megaflops).

● P is the number of processors.

● F is the number of operations executed by a program.

● T is the time in seconds to run a program.

● U = F/ST is the utilization of the machine by a program; a utilization of 1 is ideal, but real programs typically have significantly lower utilizations. A superscalar machine can issue several instructions per cycle, but if the program does not contain a mixture of operations that matches the mixture of functional units, utilization will be lower. Similarly, a vector machine cannot be utilized fully unless the sizes of all of the arrays in the program are an exact multiple of the vector length.

● Q measures reuse of operands that are stored in memory. It is the ratio of the number of times the operand is referenced during computation to the number of times it is loaded into a register. As the value of Q rises, more reuse is being achieved, and less memory bandwidth is consumed. Values of Q below
1 indicate that redundant loads are occurring.

● Qc is an analogous quantity that measures reuse of a cache line in memory. It is the ratio of the number of words that are read out of the cache from that particular line to the number of times the cache fetches that line from memory.

3.2 Model Architectures

In the Appendix we present a series of model architectures that we will use throughout this survey to demonstrate the effect of various compiler transformations. The architectures include a superscalar CPU (S-DLX), a vector CPU (V-DLX), a shared-memory multiprocessor (sMX), and a distributed-memory multiprocessor (dMX). We assume that the reader is familiar with basic principles of modern computer architecture, including RISC design, pipelining, caching, and instruction-level parallelism. Our generic architectures are based on DLX, an idealized RISC architecture introduced by Hennessy and Patterson [1990].

4. COMPILER ORGANIZATION

Figure 2 shows the design of a hypothetical compiler for a superscalar or vector machine. It includes most of the general-purpose transformations covered in this survey. Because compilers for parallel machines are still a very active area of research, we have excluded these machines from the design.

The purpose of this compiler design is to give the reader (1) an idea of how the various types of transformations fit together and (2) some of the ordering issues that arise. This organization is by no means definitive: different architectures dictate different designs, and opinions differ on how best to order the transformations.

Optimization takes place in three distinct phases, corresponding to three different representations of the program: high-level intermediate language, low-level intermediate language, and object code. First, optimizations are applied to a high-level intermediate language (HIL) that is semantically very close to the original source program. Because all the semantic information of the source program is available, higher-level transformations are easier to apply. For instance, array references are clearly distinguishable, instead of being a sequence of low-level address calculations.

Next, the program is translated to low-level intermediate form (LIL), essentially an abstract machine language, and optimized at the LIL level. Address computations provide a major source of potential optimization at this level. For instance, two references to a[5, 3] and a[7, 3] both require the computation of the address of column 3 of array a.

Finally, the program is translated to object code, and machine-specific optimizations are performed. Some optimizations, like code collocation, rely on profile information obtained by running the program. These optimizations are often implemented as binary-to-binary translations.

The high-level optimization phase begins by doing all of the large-scale restructuring that will significantly change the organization of the program. This includes various procedure-restructuring optimizations, as well as scalarization, which converts data-parallel constructs into loops. Doing this restructuring at the beginning allows the rest of the compiler to work on individual procedures in relative isolation.

Next, high-level data-flow optimization, partial evaluation, and redundancy elimination are performed. These optimizations simplify the program as much as possible, and remove extraneous code that could reduce the effectiveness of subsequent analysis.

The main focus in compilers for high-performance architectures is loop optimization. A sequence of transformations is performed to convert the loops into a form that is more amenable to optimization: where possible, perfect loop nests are created; subscript expressions are
Figure 2. Compiler organization (figure). The original diagram traces the program from Source Program to Final Optimized Code. The high-level phase comprises high-level dataflow optimization (constant propagation and folding, copy propagation, common-subexpression elimination, loop-invariant code motion, loop unswitching, strength reduction), partial evaluation (algebraic simplification, short-circuiting), redundancy elimination (unreachable-code elimination, useless-code elimination, dead-variable elimination), loop preparation (loop distribution, loop normalization, loop peeling, scalar expansion, forward substitution, array padding, array alignment, idiom and reduction recognition, loop collapsing), and loop reordering (loop interchange, loop skewing, loop reversal, loop tiling, loop coalescing, strip mining, loop unrolling, cycle shrinking), together with tail-recursion elimination; perfect loop nests, induction variables, and dependence vectors appear as supporting analyses. Low-level IL generation into the low-level intermediate language (LIL) is followed by low-level dataflow optimization (constant propagation and folding, copy propagation, common-subexpression elimination, loop-invariant code motion, strength reduction of induction-variable expressions, reassociation, induction-variable elimination), low-level redundancy elimination, cross-call register allocation, assembly-language generation, and micro-optimization (peephole optimization, superoptimization), yielding the executable program.
    do i1 = l1, u1
        ...
        do id = ld, ud
            ...
        end do
        ...
    end do

For a dependence I → J, we define the direction vector W = (w1, ..., wd) where

    wp = "<"  if ip < jp
         "="  if ip = jp
         ">"  if ip > jp.

…are found, the compiler tries to describe them with a direction or distance vector. If the subscript expressions are too complex to analyze, the compiler assumes the statements are fully dependent on one another such that no change in execution order is permitted.
…a loop, but frees the register used by the induction variable. Figure 8(d) shows the result of applying induction variable elimination to the strength-reduced code from the previous section.

    do i = 1, n
        a[i] = a[i] + c
    end do

    (a) original code

    LF    F4, c(R30)     ;load c into F4
    LW    R8, n(R30)     ;load n into R8
    LI    R9, #1         ;set i (R9) to 1
    ADDI  R12, R30, #a   ;R12=address(a[1])
Loop: MULTI R10, R9, #4  ;R10=i*4
    ADDI  R10, R12, R10  ;R10=address(a[i+1])
    LF    F5, -4(R10)    ;load a[i] into F5
    ADDF  F5, F5, F4     ;a[i]:=a[i]+c
    SF    -4(R10), F5    ;store new a[i]
    SLT   R11, R9, R8    ;R11 = i<n?
    ADDI  R9, R9, #1     ;i=i+1
    BNEZ  R11, Loop      ;if i<n, goto Loop

    (b) initial compiled loop

    LF    F4, c(R30)     ;load c into F4
    LW    R8, n(R30)     ;load n into R8
    LI    R9, #1         ;set i (R9) to 1
    ADDI  R10, R30, #a   ;R10=address(a[1])
Loop: LF    F5, (R10)    ;load a[i] into F5
    ADDF  F5, F5, F4     ;a[i]:=a[i]+c
    SF    (R10), F5      ;store new a[i]
    ADDI  R9, R9, #1     ;i=i+1
    ADDI  R10, R10, #4   ;R10=address(a[i+1])
    SLT   R11, R9, R8    ;R11 = i<n?
    BNEZ  R11, Loop      ;if i<n, goto Loop

    (c) after strength reduction—R10 is a running sum instead of being recomputed from R9 and R12

    LF    F4, c(R30)     ;load c into F4
    LW    R8, n(R30)     ;load n into R8
    ADDI  R10, R30, #a   ;R10=address(a[1])
    MULTI R8, R8, #4     ;R8=n*4
    ADDI  R8, R10, R8    ;R8=address(a[n+1])
Loop: LF    F5, (R10)    ;load a[i] into F5
    ADDF  F5, F5, F4     ;a[i]:=a[i]+c
    SF    (R10), F5      ;store new a[i]
    ADDI  R10, R10, #4   ;R10=address(a[i+1])
    SLT   R11, R10, R8   ;R11 = R10<R8?
    BNEZ  R11, Loop      ;if R11, goto Loop

    (d) after elimination of induction variable (R9)

Figure 8. Induction variable optimizations.

6.1.3 Loop-Invariant Code Motion

When a computation appears inside a loop, but its result does not change between iterations, the compiler can move that computation outside the loop [Aho et al. 1986; Cocke and Schwartz 1970].

Code motion can be applied at a high level to expressions in the source code, or at a low level to address computations. The latter is particularly relevant when indexing multidimensional arrays or dereferencing pointers, as when the inner loop of a C program contains an expression like a.b + c.d[i].

    do i = 1, n
        a[i] = a[i] + sqrt(x)
    end do

    (a) original loop

    if (n > 0) C = sqrt(x)
    do i = 1, n
        a[i] = a[i] + C
    end do

    (b) after code motion

Figure 9. Loop-invariant code motion.

Figure 9(a) shows an example in which an expensive transcendental function call is moved outside of the inner loop. The test in the transformed code in Figure 9(b) ensures that if the loop is never executed, the moved code is not executed either, lest it raise an exception.

The precomputed value is generally assigned to a register. If registers are scarce, and the expression moved is inexpensive to compute, code motion may actually deoptimize the code, since register spills will be introduced in the loop.

Although code motion is sometimes referred to as code hoisting, hoisting is a
Loop interchange exchanges the position of two loops in a perfect loop nest, generally moving one of the outer loops to the innermost position [Allen and Kennedy 1984; 1987; Wolfe 1989b]. Interchange is one of the most powerful transformations and can improve performance in many ways.

Loop interchange may be performed to:

● enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
● improve vectorization by moving the independent loop with the largest range into the innermost position;
● improve parallel performance by moving an independent loop outward in a

Care must be taken that these benefits do not cancel each other out. For instance, an interchange that improves register reuse may change a stride-1 access pattern to a stride-n access pattern with much lower overall performance due to increased cache misses. Figure 12 demonstrates the dramatic effect of different strides on an IBM RS/6000.

In Figure 13(a), the inner loop accesses array a with stride n (recall that we are assuming the Fortran convention of column-major storage order). By interchanging the loops, we convert the inner loop to stride-1 access, as shown in Figure 13(b).

For a large array in which less than one column fits in the cache, this opti-
    do i = 1, n
        do j = 1, n
            total[i] = total[i] + a[i, j]
        end do
    end do

    (a) original loop nest

    do j = 1, n
        do i = 1, n
            total[i] = total[i] + a[i, j]
        end do
    end do

    (b) interchanged loop nest

    do i = 2, n
        do j = 1, n-1
            a[i, j] = a[i-1, j+1]
        end do
    end do

    (a) (shown in the original alongside iteration-space diagrams)

    do i = 1, n
        do j = 1, n
            a[i, j] = b[j, i]
        end do
    end do

    (a) original loop

    do TI = 1, n, 64
        do TJ = 1, n, 64
            do i = TI, min(TI+63, n)
                do j = TJ, min(TJ+63, n)
                    a[i, j] = b[j, i]
                end do
            end do
        end do
    end do

    (b) tiled loop

    do i = 1, n
        a[i] = a[i] + c
        x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do

    (a) original loop

    do all i = 1, n
        a[i] = a[i] + c
    end do all
    do i = 1, n
        x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do

    (b) after loop distribution

Figure 20. Loop distribution.
    do i = 1, n
        a[i] = a[i] + c
    end do

    (a) the initial loop

    1  L: LF   F5, (R10)     ADDI R10, R10, #4   ;load a[i] into F5     ;R10=address(a[i+1])
    3     ADDF F5, F5, F4    SLT  R11, R10, R8   ;a[i]:=a[i]+c          ;R11 = R10<R8?
    6     SF   -4(R10), F5   BNEZ R11, L         ;store new a[i]        ;if R11, goto L

    (b) the compiled loop body for S-DLX. This is the loop from Figure 8(d) after instruction scheduling

    1  L: SF   (R10), F6     ADDF F6, F5, F4     ;store new a[i]        ;a[i+2]:=a[i+2]+c
    2     SF   4(R10), F8    ADDF F8, F7, F4     ;store new a[i+1]      ;a[i+3]:=a[i+3]+c
    3     LF   F5, 16(R10)   ADDI R10, R10, #8   ;load a[i+4] into F5   ;R10=address(a[i+2])
    4     LF   F7, 12(R10)   SLT  R11, R10, R8   ;load a[i+5] into F7   ;R11 = R10<R8?
    5     BNEZ R11, L                            ;if R11, goto L

    (e) after unrolling 2 times and then software pipelining (loop prologue and epilogue omitted)
Figure 24. Loop unrolling vs. software pipelining (diagram: panel (a) depicts loop unrolling and panel (b) software pipelining, each plotting overlapped iterations against time).

    do all i = 1, n
        do all j = 1, m
            a[i, j] = a[i, j] + c
        end do all
    end do all

    (a) original loop

    do all T = 1, n*m
        i = ((T-1) / m) + 1
        j = MOD(T-1, m) + 1
        a[i, j] = a[i, j] + c
    end do all

    (b) coalesced loop

    real TA[n*m]
    equivalence (TA, a)
    do all T = 1, n*m
        TA[T] = TA[T] + c
    end do all

    (c) collapsed loop

Figure 25. Loop coalescing vs. collapsing.

Finally, Figure 23(e) shows the result of combining software pipelining (s = 3) with unrolling (u = 2). The loop takes 5 cycles per iteration, or 2 1/2 cycles per result. Unrolling alone achieves 2 2/3 cycles per result. If software pipelining is combined with unrolling by u = 3, the resulting loop would take 6 cycles per iteration or two cycles per result, which is optimal because only one memory operation can be initiated per cycle.

Figure 24 illustrates the difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.

Perfect pipelining combines unrolling by loop quantization with software pipelining [Aiken and Nicolau 1988a; 1988b]. A special case of software pipelining is predictive commoning, which is applied to memory operations. If an array element written in iteration i - 1 is read in iteration i, then the first element is loaded outside of the loop, and each iteration contains one load and one store. The RS/6000 XL C/Fortran compiler [O'Brien et al. 1990] performs this optimization.

If there are no loop-carried dependences, the length of the pipeline is the length of the dependence chain. If there are multiple, independent dependence chains, they can be scheduled together subject to resource availability, or loop distribution can be applied to put each chain into its own loop. The scheduling constraints in the presence of loop-carried dependences and conditionals are more complex; the details are discussed in Aiken and Nicolau [1988a] and Lam [1988].

6.3.3 Loop Coalescing

Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable [Polychronopoulos 1987b; 1988]. Coalescing can improve the scheduling of the loop on a parallel machine and may also reduce loop overhead.

In Figure 25(a), for example, if n and m are slightly larger than the number of processors P, then neither of the loops schedules as well as the outer parallel loop, since executing the last n - P iterations will take the same time as the first P. Coalescing the two loops ensures that P iterations can be executed every time except during the last (nm mod P) iterations, as shown in Figure 25(b).

Coalescing itself is always legal since it does not change the iteration order of the loop. The iterations of the coalesced loop may be parallelized if all the original loops were parallelizable. A criterion for
doi=l, n do i = 1, n/2
a[i] = a[i] + c 1 a[i+l] = a[i+l] + a[i]
end do end do
do i = 1, n-3
do i = 2, n+l 2 b[i+l] = b[i+l] + b[i] + a[i+3]
b[i] = a[i-i] * b[i] end do
end do
(a) original loops
(a) original loops
do i = 1, n/2
doi=l, n COBEGIN
a[i] = a[i] + c a[i+l] = a[i+i] + a[i]
end do if (i J 3) then
b[i-2] = b[i-2]+b[i-3]+a[i]
doi=l, n end if
b[i+l] = a[i] * b[i+l] COEND
end do end do
(b) after normalization, the two loops can be do i = n/2-3, n-3
fused b[i+l] = b[i+l] + b[i] + a[i+3]
end do
Figure 27. Loop normalization.
(b) after spreading
dence from S2 to S1 in the fused loop, due to the write to a in S1 and the read of a in S2. By executing the statement S2 three iterations later within the new loop, it is possible to execute the two statements in parallel, which we have indicated textually with the COBEGIN/COEND compound statement in Figure 28(b).
The number of iterations by which the body of the second loop must be delayed is the maximum dependence distance between any statement in the second loop and any statement in the first loop, plus 1. Adding one ensures that there are no dependences within an iteration. For this reason, there must not be any scalar dependences between the two loop bodies.
Unless the loop bodies are large, spreading is primarily beneficial for exposing instruction-level parallelism. Depending on the amount of instruction parallelism achieved, the introduction of a conditional may negate the gain due to spreading. The conditional can be removed by peeling the first few (in this case 3) iterations from the first loop, at the cost of additional loop overhead.

This section describes loop transformations that operate on whole loops and completely alter their structure.

6.4.1 Reduction Recognition

A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array. In Figure 29(a), the sum of the elements of a is accumulated in the scalar s. The dependence vector for the loop is (1), or (<). While a loop with direction vector (<) must normally be executed serially, reductions can be parallelized if the operation performed is associative. Commutativity provides additional opportunities for reordering.
In Figure 29(b), the reduction has been vectorized by using vector adds (the inner doall loop) to compute TS; the final result is computed from TS using a scalar loop. For semicommutative and semiassociative operators like floating-point multiplication, the validity of the transformation depends on the language se-
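The associative-reduction idea can be sketched outside Fortran as well. The following Python fragment (illustrative names, not from the survey) splits a sum reduction into per-strip partial sums, which are mutually independent and could run in parallel, followed by a short serial combine in the style of Figure 29(b):

```python
# Sketch: parallelizing a sum reduction with partial sums.
# Each strip's partial sum is independent of the others, so the strips
# could be computed on different processors; only the final combining
# loop is serial. Correctness relies on + being associative.

def strip_sums(a, strip):
    # Partial sums over contiguous strips of the array (parallelizable).
    return [sum(a[lo:lo + strip]) for lo in range(0, len(a), strip)]

def reduce_sum(a, strip=4):
    ts = strip_sums(a, strip)
    total = 0
    for t in ts:              # short serial combine
        total += t
    return total

a = list(range(1, 17))
print(reduce_sum(a), sum(a))
```

For floating-point operands, this reordering may change the rounded result, which is exactly the language-semantics caveat the text raises.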
6.5.1 Array Padding

Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a particular

    real a[9,512]
    do i = 1, 512
      a[1,i] = a[1,i] + c
    end do

(b) padding a eliminates memory bank conflicts

Figure 31. Array padding.

cache miss jamming in detail and present a unified framework for inter- and intraarray padding to handle set conflicts and jamming.
The disadvantages of padding are that it increases memory consumption and makes the subscript calculations for operations over the whole array more complex, since the array has "holes." In particular, padding reduces the benefits of loop collapsing (see Section 6.3.4).

6.5.2 Scalar Expansion

Loops often contain variables that are used as temporaries within the loop body. Such variables will create an antidependence from S2 to S1 from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization [Padua et al. 1980; Wolfe 1989b], as shown in Figure 32. If the final value of c is used after the loop, c must be assigned the value of T[n].

(b) after scalar expansion

Figure 32. Scalar expansion.

Scalar expansion is a fundamental technique for vectorizing compilers, and was performed by the Burroughs Scientific Processor [Kuck and Stokes 1982] and Cray-1 [Russell 1978] compilers. An alternative for parallel machines is to use private variables, where each processor has its own instance of the variable; these may be introduced by the compiler (see Section 7.1.3) or, if the language supports private variables, by the programmer.
If the compiler vectorizes or parallelizes a loop, scalar expansion must be performed for any compiler-generated temporaries in a loop. To avoid creating unnecessarily large temporary arrays, a vectorizing compiler can perform scalar expansion after strip mining, expanding the temporary to the size of the vector strip.
Scalar expansion can also increase instruction-level parallelism by removing dependences.

6.5.3 Array Contraction

After transformation of a loop nest, it may be possible to contract scalars or arrays that have previously been expanded. It may also be possible to contract other arrays due to interchange or the use of redundant storage allocation by the programmer [Wolfe 1989b].
If the iteration variable of the pth loop in a loop nest is being used to index the kth dimension of an array x, then dimension k may be removed from x if (1) loop p is not parallel, (2) all distance vectors V involving x have Vp = 0, and (3) x is not used subsequently (that is, x is dead after the loop). The latter two conditions are true for compiler-expanded variables unless the loop structure of the program was changed after expansion. In particular, loop distribution can inhibit array contraction by causing the second condition to be violated.

    doall j = 1, n
      T[i, j] = a[i, j]*3
    end do
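As a cross-language sketch of scalar expansion (hypothetical helper names, not from the survey), the loop temporary below is replaced by a per-iteration array, which removes the loop-carried antidependence and leaves every iteration independent:

```python
# Sketch: scalar expansion. The temporary c creates an antidependence
# between iterations (each iteration must read c before the next
# iteration overwrites it); expanding c into an array T removes it.

def original(a, b):
    out = [0] * len(a)
    for i in range(len(a)):
        c = a[i] + 1          # scalar temporary reused every iteration
        out[i] = c * b[i]
    return out

def expanded(a, b):
    n = len(a)
    T = [0] * n               # one temporary per iteration
    out = [0] * n
    for i in range(n):        # iterations are now independent
        T[i] = a[i] + 1
        out[i] = T[i] * b[i]
    # if c's final value were used after the loop, set c = T[n-1] here
    return out

a, b = [1, 2, 3], [4, 5, 6]
print(original(a, b) == expanded(a, b))
```

Array contraction would later shrink T back to a scalar if the parallel structure no longer needs it, per the three conditions above.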
cedures (or procedure groups) with the largest number of calls between them. Within a procedure, basic blocks can be grouped in the same way (although the direction of the control flow must be taken into account), or a top-down algorithm can be used that starts from the procedure entry node. Basic blocks with a frequency estimate of zero can be moved to a separate page to increase locality further. However, accessing that page may require long displacement jumps to be introduced (see the next subsection), creating the potential for performance loss if the basic blocks in question are actually executed.
Procedure inlining (see Section 6.8.5) can also affect code locality, and has been studied both in conjunction with [Hwu and Chang 1989] and independent of [McFarling 1991] code positioning. Inlining often improves performance by reducing overhead and increasing locality, but if a procedure is called more than once in a loop, inlining will often increase the number of cache misses because the procedure body will be loaded more than once.

6.5.6 Displacement Minimization

The target of a branch or a jump is usually specified relative to the current value of the program counter (PC). The largest offset that can be specified varies among architectures; it can be as few as 4 bits. If control is transferred to a location outside of the range of the offset, a multi-instruction sequence or long-format instruction is required to perform the jump. For instance, the S-DLX instruction BEQZ R4, error is only legal if error is within 2^15 bytes. Otherwise, the instruction must be replaced with:

    BNEZ R4, cont       ; reversed test
    LI   R8, error      ; get low bits
    LUI  R8, error>>16  ; get high bits
    JR   R8             ; jump to target
    cont:

This sequence requires three extra instructions. Given the cost of long displacement jumps, the code should be organized to keep related sections close together in memory, in particular those sections executed most frequently [Szymanski 1978].
Displacement minimization can also be applied to data. For instance, a base register may be allocated for a Fortran common block or group of blocks:

    common /big/ q, r, x[20000], y, z

If the array x contains word-sized elements, the common block is larger than the amount of memory indexable by the offset field in the load instruction (2^16 bytes on S-DLX). To address y and z, multiple-instruction sequences must be used in a manner analogous to the long-jump sequences above. The problem is avoided if the layout of big is:

    common /big/ q, r, y, z, x[20000]

6.6 Partial Evaluation

Partial evaluation refers to the general technique of performing part of a computation at compile time. Most of the classical optimizations based on data-flow analysis are either a form of partial evaluation or of redundancy elimination (described in Section 6.7). Loop-based data-flow optimizations are described in Section 6.1.

6.6.1 Constant Propagation

Constant propagation [Callahan et al. 1986; Kildall 1973; Wegman and Zadeck 1991] is one of the most important optimizations that a compiler can perform, and a good optimizing compiler will apply it aggressively. Typically programs contain many constants; by propagating them through the program, the compiler can do a significant amount of precomputation. More important, the propagation reveals many opportunities for other optimizations. In addition to obvious possibilities such as dead-code elimination, loop optimizations are affected because constants often appear in their induction ranges. Knowing the range of the loop, the compiler can be much more accurate

    n = 64
    c = 3
    do i = 1, n
      a[i] = a[i] + c
    end do

(a) original code

    do i = 1, 64
      a[i] = a[i] + 3
    end do

(b) after constant propagation

    t = i*4
    s = t
    print *, a[s]
    r = t
    a[r] = a[r] + c

(a) original code

    t = i*4
    print *, a[t]
    a[t] = a[t] + c

(b) after copy propagation
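A toy version of this propagation step (illustrative code, not the survey's algorithm) can be written as a single forward pass over straight-line statements, substituting known constant values into expressions and folding the results:

```python
# Sketch: constant propagation over straight-line code.
# Statements are (target, expr) pairs; an expression is a variable
# name, an integer literal, or an (op, lhs, rhs) tuple.

def fold(expr, env):
    # Substitute known constants and fold constant subexpressions.
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env.get(expr, expr)
    op, l, r = expr
    l, r = fold(l, env), fold(r, env)
    if isinstance(l, int) and isinstance(r, int):
        return l + r if op == "+" else l * r
    return (op, l, r)

def propagate(stmts):
    env, out = {}, []
    for target, expr in stmts:
        val = fold(expr, env)
        if isinstance(val, int):
            env[target] = val        # target is now a known constant
        else:
            env.pop(target, None)    # target is no longer constant
        out.append((target, val))
    return out

prog = [("n", 64), ("c", 3), ("x", ("+", "n", "c"))]
print(propagate(prog))
```

A real implementation works over a control-flow graph and must meet values at join points, but the substitute-and-fold core is the same.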
6.6.5 Reassociation

of polynomials, using the identity

    a_n x^n + a_{n-1} x^{n-1} + ... + a_1 x + a_0
      = (a_n x^{n-1} + a_{n-1} x^{n-2} + ... + a_1) x + a_0

6.6.8 I/O Format Compilation

Most languages provide fairly elaborate facilities for formatting input and output. The formatting specifications are in effect a formatting sublanguage that is generally "interpreted" at run time, with a correspondingly high cost for character input and output.
Formatted writes can be converted almost directly into calls to the run-time routines that implement the various format styles. These calls are then likely candidates for inline substitution. Figure 39 shows two I/O statements and their compiled equivalents. Idiom recognition has been performed to convert the implied do loop into an aggregate operation.

    call putchar(c[i], 6)
    call fgets(d, 100, 7)

(b) after format compilation

Figure 39. Format compilation.

Note that in Fortran, a format statement is analogous to a procedure definition, which may be invoked by any number of read or write statements. The same trade-off as with procedure inlining applies: the formatted I/O can be expanded inline for higher efficiency, or encapsulated as a procedure for code compactness (inlining is described in Section 6.8.5).
Format compilation is done by the VAX Fortran compiler [Harris and Hobbs 1994] and by the Gnu C compiler [Free Software Foundation 1992].

6.6.9 Superoptimizing

A superoptimizer [Massalin 1987] represents the extreme of optimization, seeking to replace a sequence of instructions with the optimal alternative. It does an exhaustive search, beginning with a single instruction. If all single-instruction sequences fail, two-instruction sequences are searched, and so on.
A randomly generated instruction sequence is checked by executing it with a small number of test inputs that were run through the original sequence. If it passes these tests, a thorough verification procedure is applied.
Superoptimization at compile time is practical only for short sequences (on the order of a dozen instructions). It can also be used by the compiler designer to find optimal sequences for commonly occurring idioms. It is particularly useful for eliminating conditional branches in short instruction sequences, where the pipeline stall penalty may be larger than the cost of executing the operations themselves [Granlund and Kenner 1992].

6.7 Redundancy Elimination

There are many optimizations that improve performance by identifying redundant computations and removing them [Morel and Renvoise 1979]. We have already covered one such transformation, loop-invariant code motion, in Section 6.1.3. There a computation was being performed repeatedly when it could be done once.
Redundancy-eliminating transformations remove two other kinds of computations: those that are unreachable and those that are useless. A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed. Unreachable code is created by programmers (most frequently with conditional debugging code), or by transformations.
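The shortest-first exhaustive search is small enough to sketch directly. The following toy superoptimizer (illustrative, far simpler than Massalin's, over an invented three-instruction set) enumerates length-1, then length-2 sequences until one matches a goal function on a set of test inputs:

```python
# Sketch: a toy superoptimizer. Enumerate instruction sequences in
# order of increasing length; return the first sequence that matches
# the goal function on all test inputs. A real superoptimizer would
# then run a thorough verification pass on the candidate.
from itertools import product

OPS = {
    "neg": lambda x: -x,
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
}

def run(seq, x):
    for op in seq:
        x = OPS[op](x)
    return x

def superopt(goal, tests, max_len=3):
    for length in range(1, max_len + 1):       # shortest sequences first
        for seq in product(OPS, repeat=length):
            if all(run(seq, t) == goal(t) for t in tests):
                return seq                     # candidate found
    return None

# search for a short sequence computing 2x + 2
print(superopt(lambda x: 2 * x + 2, [0, 1, 5, -3]))
```

The cost of the search grows exponentially with sequence length, which is why the technique is practical only for roughly a dozen instructions.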
Both unreachable and useless code are often created by constant propagation, described in Section 6.6.1. In Figure 40(a), the variable debug is a constant. When its value is propagated, the conditional expression becomes if (0 > 1). This expression is always false, so the body of the conditional is never executed and can be eliminated, as shown in Figure 40(b). Similarly, the body of the do loop is never

    end do

(b) after constant propagation and unreachable code elimination

    integer c, n, debug
    debug = 0
    n = 0
    a = b+7
    call foo(a)

(c) after redundant control elimination
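A minimal sketch of the Figure 40 pipeline (over an invented toy IR, not the survey's representation) shows how a propagated constant folds a branch condition and lets the unreachable branch body be deleted:

```python
# Sketch: unreachable-code elimination after constant propagation.
# Statements: ("assign", var, const) or ("if_gt", var, k, body).
# Once a guard variable's value is known, the branch either folds
# away entirely (condition false) or is replaced by its body.

def eliminate_unreachable(stmts):
    env, out = {}, []
    for s in stmts:
        if s[0] == "assign":
            env[s[1]] = s[2]
            out.append(s)
        elif s[0] == "if_gt":
            _, var, k, body = s
            if var in env:               # condition is a known constant
                if env[var] > k:
                    out.extend(body)     # always taken: keep only body
                # always false: drop the branch and its body
            else:
                out.append(s)            # condition unknown: keep as-is
    return out

prog = [("assign", "debug", 0),
        ("if_gt", "debug", 1, [("assign", "c", 99)])]
print(eliminate_unreachable(prog))
```

As in the text, the `if (0 > 1)` guard is decided at compile time, so its body never reaches the output program.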
ment is not necessary, it can remove the code. This can be done for any nonglobal variable that is not live immediately after the defining statement. Live-variable analysis is a well-known data-flow problem [Aho et al. 1986]. In Figure 40(c), the values computed by the assignment statements are no longer used; they have been eliminated in (d).

6.7.3 Dead-Variable Elimination

After a series of transformations, particularly loop optimizations, there are often variables whose value is never used. The unnecessary variables are called dead variables; eliminating them is a common optimization [Aho et al. 1986].
In Figure 40(d), the variables c, n, and debug are no longer used and can be removed; (e) shows the code after the variables are pruned.

the first operand [Arden et al. 1962]. For example, the control expression in

    if ((a = 1) and (b = 2)) then
      c = 5
    end if

is known to be false if a does not equal 1, regardless of the value of b. Short circuiting would convert this expression to:

    if (not (a = 1)) goto 10
    if (not (b = 2)) goto 10
    c = 5
 10 continue

Note that if any of the operands in the boolean expression have side-effects, short circuiting can change the results of the evaluation. The alteration may or may not be legal, depending on the language semantics. Some language definitions, including C and many dialects of Lisp, address this problem by requiring short circuiting of boolean expressions.
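The side-effect caveat can be observed directly in a language whose `and` already short-circuits. In this Python sketch (illustrative), the second operand records a side effect; it runs under eager evaluation but is skipped when the first operand already decides the result:

```python
# Sketch: short circuiting vs. eager evaluation of a boolean AND.
# If the second operand has a side effect, short circuiting changes
# observable behavior whenever the first operand is false.

calls = []

def b_equals_2(b):
    calls.append("evaluated b")     # side effect: record the call
    return b == 2

def eager_and(x, y):
    return bool(x) and bool(y)      # both operands already evaluated

a, b = 0, 2                         # first operand (a == 1) is false

# eager: both operands are computed before the AND is applied
eager = eager_and(a == 1, b_equals_2(b))

# short circuit: the second operand is never evaluated
short = (a == 1) and b_equals_2(b)

print(eager, short, calls)          # same result, different side effects
```

Both forms yield the same boolean here, but the side-effect log differs, which is exactly why the legality of introducing short circuiting depends on the language semantics.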
6.8 Procedure Call Transformations
Number      Usage
R0          Always zero; writes are ignored
R1          Return value when returning from a procedure call
R2..R7      The first six words of the arguments to the procedure call
R8..R15     8 caller-save registers, used as temporary registers by callee
R16..R25    10 callee-save registers; these must be preserved across a call
R26..R29    Reserved for use by the operating system
R30         Stack pointer
R31         Return address during a procedure call
F0..F3      The first four floating-point arguments to the procedure call
F4..F17     14 caller-save floating-point registers
F18..F31    14 callee-save floating-point registers
The value R30 + s serves as a virtual frame pointer that points to the base of the stack frame, avoiding the use of a second dedicated register. For languages that cannot predict the amount of stack space used during execution of a procedure, an additional general-purpose register can be used as a frame pointer, or the size of the frame can be stored at the top of the stack (R30 + 0).
On entering a procedure, the return address is in R31. The first six words of the procedure arguments appear in registers R2-R7, and the rest of the argument data is on the stack. Figure 41 shows the layout of the stack frame for a procedure invocation. A similar convention is followed for floating-point registers, except that only four are reserved for arguments.
Execution of a procedure consists of six steps:

(1) Space is allocated on the stack for the procedure invocation.
(2) The values of registers that will be modified during procedure execution (and that must be preserved across the call) are saved on the stack. If the procedure makes any procedure calls itself, the saved registers should include the return address, R31.
(3) The procedure body is executed.
(4) The return value (if any) is stored in R1, and the registers that were saved in step 2 are restored.
(5) The frame is removed from the stack.
(6) Control is transferred to the return address.

Calling a procedure is a four-step process:

(1) The values of any of the registers R1-R15 that contain live values are saved. If the values of any global variables that might be used by the callee are in a register and have been modified, the copy of those variables in memory is updated.
(2) The arguments are stored in the designated registers and, if necessary, on the stack.
(3) A linked jump is made to the target procedure; the CPU leaves the address of the next instruction in R31.
(4) Upon return, the saved registers are restored, and the registers holding global variables are reloaded.

To demonstrate the structure of a procedure and the calling convention, Figure 42 shows a simple function and its compiled code. The function (foo) and the function that it calls (max) each take two integer arguments, so they do not need to pass arguments on the stack. The stack frame for foo is three words, which are used to save the return address (R31) and register R16 during the call to max, and to hold the local variable d. R31
Figure 41. Stack frame layout: high addresses hold the incoming arguments (arg n ... arg 1), followed by locals and temporaries and then the saved registers; the frame size spans down to low addresses, where R30 points.
    integer c, b
    d = c+b
    e = max(b, d)
    return
    end

(a) source code for function foo

    integer x, y
    max = x
    else
    end if
    return
    end

(a) source code for function max

similar effect, and does not require inter-

    recursive logical function Inarray(a, x, i, n)
Serial Decomposition

Serial decomposition is the degenerate case, where an entire dimension is allocated to a single processor. Figure 52(a) shows a simple loop nest that adds c to each element of a two-dimensional array, the decomposition of the array onto processors (b), and the loop transformed into code for sMX (c). Each column is mapped serially, meaning that all the data in that column will be placed on a single processor. As a result, the loop that iterates over the columns (the inner loop) is unchanged.

Block Decomposition

A block decomposition divides the elements into one group of adjacent elements per processor. Using the previous example, the rows of array a are divided between the four processors, as shown in Figure 52(b). Because the rows have been divided across the processors, the outer loop in Figure 52(c) has been modified so that each processor iterates only over the local portion of each row. Figure 52(d) shows the decomposition if block scheduling is applied across both the horizontal and vertical dimensions.

(b) (block, serial) decomposition of a on four processors

    call FORK(Pnum)
    do j = (n/Pnum)*Pid+1, min((n/Pnum)*(Pid+1), n)
      do i = 1, n
        a[i, j] = a[i, j] + c
      end do
    end do
    call JOIN()

(c) corresponding (block, serial) scheduling of the loop, using block size n/Pnum

(d) (block, block) decomposition of a on four processors

Figure 52. Serial and block decompositions of an array on the shared-memory multiprocessor sMX.

A major difference between compilation for shared- and distributed-memory machines is that shared-memory machines provide a global name space, so that any particular array element can be named by any processor using the same subscripts as in the sequential code. On the other hand, on a distributed-memory machine, arrays must usually be divided across the processors, each of which has its own private address space.
As a result, naming a nonlocal array element requires a mapping function from the original, global name space to a processor number and local array indices within that processor. The examples shown here are all for shared memory, and therefore sidestep these complexities, which are discussed in Section 7.3. However, the same decomposition principles are used for distributed memory.
    do j = 1, n
      do i = 1, j
        total = total + a[i, j]
      end do
    end do

(a) loop

(b) elements of a read by the loop

    call FORK(Pnum)
    do j = Pid+1, n, Pnum
      do i = 1, n
        a[i, j] = a[i, j] + c
      end do
    end do
    call JOIN()
7.2 Exposing Coarse-Grained Parallelism

Transferring data can be an expensive operation on a machine with asynchronously executing processors. One way to avoid the inefficiency of frequent communication is to divide the program into subcomputations that will execute for an extended period of time without generating message traffic. The following optimizations are designed to identify such computations or to help the compiler create them.

7.2.1 Procedure Call Parallelization

One potential source of coarse-grained parallelism in imperative languages is the procedure call. In order to determine whether a call can be performed as an independent task, the compiler must use interprocedural analysis to examine its behavior.
Because of its importance to all forms of optimization, interprocedural analysis has received a great deal of attention from compiler researchers. One approach is to propagate data-flow and dependence information through call sites [Banning 1979; Burke and Cytron 1986; Callahan et al. 1986; Myers 1981]. Another is to summarize the array access behavior of a procedure and use that summary to evaluate whether a call to it can be executed in parallel [Balasundaram 1990; Callahan and Kennedy 1988a; Li and Yew 1988; Triolet et al. 1986].

7.2.2 Split

A more comprehensive approach to program decomposition summarizes the data usage behavior of an arbitrary block of code in a symbolic descriptor [Graham et al. 1993]. As in the previous section, the summary identifies independence between computations: between procedure calls or between different loop nests. Additionally, the descriptor is used in applying the split transformation.
Split is unusual in that it is not applied in isolation to a computation like a loop nest. Instead, split considers the relationship of pairs of computations. Two loop nests, for example, may have dependences between them that prevent the compiler from executing both simultaneously on different processors. However, it is often the case that only some of the iterations conflict. Split uses memory summarization to identify the iterations in one loop nest that are independent of the other one, placing those iterations in a separate loop. Applying split to an application exposes additional sources of concurrency and pipelining.

7.2.3 Graph Partitioning

Data-flow languages [Ackerman 1982; McGraw 1985; Nikhil 1988] expose parallelism explicitly. The program is converted into a graph that represents basic computations as nodes and the movement of data as arcs between nodes. Data-flow graphs can also be used as an internal representation to express the parallel structure of programs written in imperative languages.
One of the major difficulties with data-flow languages is that they expose parallelism at the level of individual arithmetic operations. As it is impractical to use software to schedule such a small amount of work, early projects focused on developing architectures that embed data-flow execution policies into hardware [Arvind and Culler 1986; Arvind et al. 1980; Dennis 1980]. Such machines have not proven to be successful commercially, so researchers began to develop techniques for compiling data-flow languages on conventional architectures. The most common approach is to interpret the data-flow graph dynamically, executing a node representing a computation when all of its operands are available. To reduce scheduling overhead, the data-flow graph is generally transformed by gathering many simple computations into larger blocks that are executed atomically [Anderson and Hudak 1990; Hudak and Goldberg 1985; Mirchandaney et al. 1988; Sarkar 1989].

7.3 Computation Partitioning

In addition to decomposing data as discussed in Section 7.1, parallel compilers
ber of processors, and the Pid of the processor.
We have made a number of simplifying assumptions: the value of n is an even multiple of Pnum, and the arrays are already distributed across the processors with a block decomposition. We have also chosen an example that does not require interprocessor communication. To parallelize the loops encountered in practice,

(c) after guard combination

    LB = (n/Pnum)*Pid + 1
    UB = (n/Pnum)*(Pid + 1)
    do i = LB, UB
      a[i] = a[i] + c
      b[i] = b[i] + c
    end do

(d) after bounds reduction
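The effect of replacing a per-iteration guard with reduced loop bounds can be sketched as follows (illustrative Python, under the same block-decomposition assumptions as the text):

```python
# Sketch: bounds reduction. The guarded loop tests every iteration;
# because the iterations a processor owns form one contiguous range,
# the guard can be folded into the loop bounds instead.

def guarded(a, c, pid, pnum):
    size = len(a) // pnum
    for i in range(len(a)):                        # full range, guard inside
        if size * pid <= i < size * (pid + 1):
            a[i] += c

def bounds_reduced(a, c, pid, pnum):
    size = len(a) // pnum
    for i in range(size * pid, size * (pid + 1)):  # guard removed
        a[i] += c

x = [0] * 8
y = [0] * 8
for p in range(4):
    guarded(x, 1, p, 4)
    bounds_reduced(y, 1, p, 4)
print(x == y)
```

The bounds-reduced version does the same updates while executing only the iterations it owns, eliminating both the wasted iterations and the per-iteration test.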
by hoisting them to the earliest point where they can be correctly computed. Hoisting often reveals that identical guards have been introduced and that all but one can be removed. In Figure 58(b), the two guard expressions are identical because the arrays have the same decomposition. When the guards are hoisted to the beginning of the loop, the compiler can eliminate the redundant guard. The second pair of bound computations becomes useless and is also removed. The final result is shown in Figure 58(c).

7.3.3 Bounds Reduction

In Figure 58(c), the guards control which iterations of the loop perform computation. Since the desired set of iterations is a contiguous range, the compiler can achieve the same effect by changing the induction expressions to reduce the loop bounds [Koelbel 1990]. Figure 58(d) shows the result of the transformation.

7.4 Communication Optimization

An important part of compilation for distributed-memory machines is the analysis of an application's communication needs and introduction of explicit message-passing operations into it. Before a statement is executed on a given processor, any data it relies on that is not available locally must be sent. A simple-minded approach is to generate a message for each nonlocal data item referenced by the statement, but the resulting overhead will usually be unacceptably high. Compiler designers have developed a set of optimizations that reduce the time required for communication.
The same fundamental issues arise in communication optimization as in optimization for memory access (described in Section 6.5): maximizing reuse, minimizing the working set, and making use of available parallelism in the communication system. However, the problems are magnified because while the ratio of one memory access to one computation on S-DLX is 16:1, the ratio of the number of cycles required to send one word to the number of cycles required for a single operation on dMX is about 500:1.
Like vectors on vector processors, communication operations on a distributed-memory machine are characterized by a startup time, ts, and a per-element cost, tb, the time required to send one byte once the message has been initiated. On dMX, ts = 10 μs and tb = 100 ns, so sending one 10-byte message costs 11 μs while sending two 5-byte messages costs 21 μs. To avoid paying the startup cost unnecessarily, the optimizations in this section combine data from multiple messages and send them in a single operation.

7.4.1 Message Vectorization

Analysis can often determine the set of data items transferred in a loop. Rather than sending each element of an array in an individual message, the compiler can group many of them together and send them in a single block transfer. Because this is analogous to the way a vector processor interacts with memory, the optimization is called message vectorization [Balasundaram et al. 1990; Gerndt 1990; Zima et al. 1988].
Figure 59(a) shows a sample loop that multiplies each element of array a by the mirror element of array b. Figure 59(b) is an inefficient parallel version for processors 0 and 1 on dMX. To simplify the code, we assume that the arrays have already been allocated to the processors using a block decomposition: the lower half of each array is on processor 0 and the upper half on processor 1.
Each processor begins by computing the lower and upper bounds of the range that it is responsible for. During each iteration, it sends the element of b that the other processor will need and waits to receive the corresponding message. Note that Fortran's call-by-reference semantics convert the array reference implicitly into the address of the corresponding element, which is then used by the low-level communication routine to extract the bytes to be sent or received.
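The startup-versus-bandwidth trade-off quoted for dMX follows from a two-parameter linear cost model (constants from the text; the helper name is illustrative):

```python
# Sketch: linear communication cost model t = ts + n*tb per message.
# With dMX's ts = 10 us startup and tb = 0.1 us/byte, one 10-byte
# message is far cheaper than two 5-byte messages, which is the
# arithmetic motivating message vectorization (batching elements
# into a single block transfer).

TS = 10.0    # startup cost per message, microseconds
TB = 0.1     # per-byte cost, microseconds (100 ns)

def cost(num_messages, bytes_each):
    return num_messages * (TS + bytes_each * TB)

one_big = cost(1, 10)     # one vectorized 10-byte transfer
two_small = cost(2, 5)    # two element-by-element transfers
print(one_big, two_small)
```

Because startup dominates for small messages, the savings from vectorizing grow with the number of elements batched per transfer.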
Wholey [1992] begins by computing a detailed cost function for the computation. The function is the input to a hill-climbing algorithm that searches the space of possible allocations of data to processors. It begins with two processors, then four, and so forth; each time the number of processors is doubled, the algorithm uses the most efficient allocation from the previous stage as a starting point.
In addition to general-purpose mapping algorithms, the compiler can use specific transformations like loop flattening [von Hanxleden and Kennedy 1992] to avoid idle time due to nonuniform computations.

7.6 VLIW Transformations

The Very Long Instruction Word (VLIW) architecture is another strategy for executing instructions in parallel. A VLIW processor [Colwell et al. 1988; Fisher et al. 1984; Floating Point Systems 1979; Rau et al. 1989] exposes its multiple functional units explicitly; each machine instruction is very large (on the order of 512 bits) and controls many independent functional units. Typically the instruction is divided into 32-bit pieces, each of which is routed to one of the functional units.
With traditional processors, there is a clear distinction between low-level scheduling done via instruction selection and high-level scheduling between processors. This distinction is blurred in VLIW architectures. They rely on the compiler to expose the parallel operations explicitly and schedule them at compile time. Thus the task of generating object code incorporates high-level resource allocation and scheduling decisions that occur only between processors on a more conventional parallel architecture.
Trace scheduling [Ellis 1986; Fisher 1981; Fisher et al. 1984] is an analysis strategy that was developed for VLIW architectures. Because VLIW machines require more parallelism than is typically available within a basic block, compilers for these machines must look through branches to find additional instructions to execute. The multiple paths of control require the introduction of speculative execution: computing some results that may turn out to be unused (ideally at no additional cost, by functional units that would otherwise have gone unused).
The speculation can be handled dynamically by the hardware [Sohi and Vajapeyam 1990], but that complicates the design significantly. The trace-scheduling alternative is to have the compiler do it by identifying paths (or traces) through the control-flow graph that might be taken. The compiler guesses which way the branches are likely to go and hoists code above them accordingly. Superblock scheduling [Hwu et al. 1993] is an extension of trace scheduling that relies on program profile information to choose the traces.

8. TRANSFORMATION FRAMEWORKS

Given the many transformations that compiler writers have available, they face a daunting task in determining which ones to apply and in what order. There is no single best order of application; one transformation can permit or prevent a second from being applied, or it can change the effectiveness of subsequent changes. Current compilers generally use a combination of heuristics and a partially fixed order of applying transformations.
There are two basic approaches to this problem: unifying the transformations in a single mechanism, and applying search techniques to the transformation space. In fact there is often a degree of overlap between these two approaches.

8.1 Unified Transformation

A promising strategy is to encode both the characteristics of the code being transformed and the effect of each transformation; then the compiler can quickly search the space of possible sets of transformations to find an efficient solution.
One unified transformation framework is based on unimodular matrix theory [Banerjee 1991; Wolf and Lam 1991]. It is applicable to any loop nest whose dependences can be described with a distance vector; a subset of the loops that require a direction vector can also be handled. The transformations that a unimodular matrix can describe are interchange, reversal, and skew.

The basic principle is to encode each transformation of the loop in a matrix and apply it to the dependence vectors of the loop. The effect on the dependence pattern of applying the transformation can be determined by multiplying the matrix and the vector. The form of the product vector reveals whether the transformation is valid. To model the effect of applying a sequence of transformations, the corresponding matrices are simply multiplied.

Figure 60 shows a loop nest that can be transformed with unimodular matrices. The distance vector that describes the loop is D = (1, 0), representing the dependence of iteration i on i-1 in the outer loop.

do i = 1, n
   do j = 1, n
      a[i, j] = a[i-1, j] + a[i, j]
   end do
end do

Figure 60. Unimodular transformation example.

Because of the dependence, it is not legal to reverse the outer loop of the nest. The reversal transformation is

    R = [ -1  0 ]
        [  0  1 ]

The product is

    RD = [ -1 ]
         [  0 ]

Because the product vector is no longer lexicographically positive, the transformation is not legal.

Any loop nest whose dependences are all representable by a distance vector can be transformed via skewing into a canonical form called a fully permutable loop nest. In this form, any two loops in the nest can be interchanged without changing the loop semantics. Once in this canonical form, the compiler can decompose the loop into the granularity that matches the target architecture [Wolf and Lam 1991].

Sarkar and Thekkath [1992] describe a framework for transforming perfect loop nests that includes unimodular transformations, tiling, coalescing, and parallel loop execution. The transformations are encoded in an ordered sequence. Rules are provided for mapping the dependence vectors and loop bounds, and the transformation to the loop body is described by a template.

Pugh [1991] describes a more ambitious (and time-consuming) technique that can transform imperfectly nested loops and can do most of the transformations possible through a combination of statement reordering, interchange, fusion, skewing, reversal, distribution, and parallelization. It views the transformation problem as that of finding the best schedule for a set of operations in a loop nest. A method is given for generating
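The mechanics of the unimodular legality test described above can be sketched in a few lines. The following is an illustration added for this survey, not code from any of the frameworks being discussed; the matrices and the distance vector follow the text, and a transformation is legal only if every transformed distance vector remains lexicographically positive.

```python
# Sketch of the unimodular legality test: apply a transformation
# matrix T to each loop-carried distance vector; the transformed
# nest is legal only if every product vector is still
# lexicographically positive.

def transform(T, d):
    """Multiply transformation matrix T by distance vector d."""
    return tuple(sum(row[k] * d[k] for k in range(len(d))) for row in T)

def lex_positive(v):
    """True if the first nonzero entry of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def legal(T, distances):
    return all(lex_positive(transform(T, d)) for d in distances)

# The loop nest of Figure 60 has the single distance vector D = (1, 0).
D = [(1, 0)]

REVERSAL    = [[-1, 0], [0, 1]]   # reverse the outer loop
INTERCHANGE = [[0, 1], [1, 0]]    # swap the two loops
SKEW        = [[1, 0], [1, 1]]    # skew the inner loop by the outer index

print(legal(REVERSAL, D))      # prints False: reversing the outer loop is illegal
print(legal(INTERCHANGE, D))   # prints True: interchange preserves the dependence
```

Composing a sequence of transformations corresponds, as the text notes, to multiplying their matrices before applying the test.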
terconnect. The heuristics are organized hierarchically. The main functions are: description of parallelism in the program and in the machine; matching of program parallelism to machine parallelism; and control of restructuring.

9. COMPILER EVALUATION

Researchers are still trying to find a good way to evaluate the effectiveness of compilers. There is no generally agreed upon way to determine the best possible performance of a particular program on a particular machine, so it is difficult to determine how well a compiler is doing. Since some applications are better structured than others for a given architecture or a given compiler, measurements for a particular application or group of applications will not necessarily predict how well another application will fare.

Nevertheless, a wide variety of measurement studies do exist that seek to evaluate how applications behave, how well they are being compiled, and how well they could be compiled. We have divided these studies into several groups.

9.1 Benchmarks

Benchmarks have received by far the most attention since they measure delivered performance, yielding results that are used to market machines. They were originally developed to measure machine speed, not compiler effectiveness. The Livermore Loops [McMahon 1986] is one of the early benchmark suites; it sought to compare the performance of supercomputers. The suite consists of a set of small loops based on the most time-consuming inner loops of scientific codes.

The SPEC benchmark suite [Dixit 1992] includes both scientific and general-purpose applications intended to be representative of an engineering/scientific workload. The SPEC benchmarks are widely used as indicators of machine performance, but are essentially uniprocessor benchmarks.

As architectures have become more complex, it has become obvious that the benchmarks measure the combined effectiveness of the compiler and the target machine. Thus two compilers that target the same machine can be compared by using the SPEC ratings of the generated code.

For parallel machines, the contribution of the compiler is even more important, since the difference between naive and optimized code can be many orders of magnitude. Parallel benchmarks include SPLASH (Stanford Parallel Applications for Shared Memory) [Singh et al. 1992] and the NASA Numerical Aerodynamic Simulation (NAS) benchmarks [Bailey et al. 1991]. The Perfect Club [Berry et al. 1989] is a benchmark suite of computationally intensive programs that is intended to help evaluate serial and parallel machines.

9.2 Code Characteristics

A number of studies have focused on the applications themselves, examining their source code for programmer idioms or profiling the behavior of the compiled executable.

Knuth [1971] carried out an early and influential study of Fortran programs. He studied 440 programs comprising 250,000 lines (punched cards). The most important effect of this study was to dramatize the fact that the majority of the execution time of a program is usually spent in a very small proportion of the code. Other interesting statistics are that 95% of all the do loops incremented their index variable by 1, and 40% of all do loops contained only one statement.

Shen et al. [1990] examined the subscript expressions that appeared in a set of Fortran mathematical libraries and scientific applications. They applied various dependence tests to these expressions. The results demonstrate how the tests compare to one another, showing that one of the biggest problems was unknown variables. These variables were caused by procedure calls and by the use of an array element as an index value into another array. Coupled subscripts also caused problems for tests that examine a single array dimension at a time (coupled subscripts are discussed in Section 5.4).

A study by Petersen and Padua [1993] evaluates approximate dependence tests against the omega test [Pugh 1992], finding that the latter does not expose substantially more parallelism in the Perfect Club benchmarks.

9.3 Compiler Effectiveness

As we mentioned above, researchers find it difficult to evaluate how well a compiler is doing. They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;
● measure the performance of one compiler against another;
● compare the performance of executables compiled with full optimization against little or no optimization; and
● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization. Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code. The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging. One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2-4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three—a result that was called the Flynn bottleneck.
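The limit studies discussed in this section share a common measurement: with unlimited functional units, the available parallelism of an instruction trace is the number of operations divided by the length of the longest dependence chain (the depth of an ideal schedule). The following sketch is illustrative only and is not drawn from any of the studies cited; the trace format and operation names are hypothetical.

```python
# Hedged sketch of how ILP limit studies estimate available
# parallelism: with unlimited functional units and perfect branch
# knowledge, every operation issues as soon as its inputs are ready,
# so the ideal schedule depth equals the longest dependence chain.

def available_parallelism(trace):
    """trace: list of (op_id, [ids of operations this op depends on]),
    given in program order."""
    level = {}  # earliest cycle in which each operation can issue
    for op, deps in trace:
        level[op] = 1 + max((level[d] for d in deps), default=0)
    depth = max(level.values())        # critical-path length
    return len(trace) / depth

# A tiny hypothetical trace: four independent multiplies feeding
# a two-level sum tree.
trace = [
    ("m0", []), ("m1", []), ("m2", []), ("m3", []),
    ("a0", ["m0", "m1"]), ("a1", ["m2", "m3"]),
    ("a2", ["a0", "a1"]),
]
print(available_parallelism(trace))  # 7 operations over a critical path of 3
```

Restricting the window to a single basic block amounts to cutting the trace at branches before taking this ratio, which is why the early studies reported such low numbers.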
The main reason for the low numbers was that these studies did not look beyond basic blocks.

Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded.

When the basic block restriction is relaxed, there is much more parallelism available. Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and apply a scheduling algorithm retroactively that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism—a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models. His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction, as well as restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find similar results to other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach from the others, focusing strictly on the parallelism available in loops. He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65-250. The potential speedup of the other codes was 1.7-3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to im-
[Figure: The S-DLX processor architecture: integer, branch, load/store, and floating-point units; 32 32-bit integer registers and 32 floating-point registers; and a 64KB, 4-way set-associative cache (1024 lines of 64 bytes each) with address translation, connected to RAM memory.]

field is 16 bits. The address for load and store instructions is computed by adding the 16-bit immediate operand to the register. To create a full 32-bit constant, the low 16 bits must first be set with a load immediate (LI) instruction that clears the high 16 bits; then the high 16 bits must be set with a load upper immediate (LUI) instruction.

The program counter is the special register PC. Jumps and branches are relative to PC + 4; jumps have a 26-bit signed
offset, and branches have a 16-bit signed offset. Integer branches test the GPRs for zero; floating-point branches test a special floating-point condition register (FPCR).

There is no branch delay slot. Because the number of instructions executed per cycle varies on a superscalar machine, it does not make sense to have a fixed number of delay slots. The number of delay slots is also heavily dependent on the pipeline depth, which may vary from one chip generation to the next.

Instead, static branch prediction is used: forward branches are always predicted as not taken, and backward branches are predicted as taken. If the prediction is wrong, there is a one-cycle delay.

Floating-point divides take 14 cycles and all other floating-point operations take four cycles, except when the result is used by a store operation, in which case they take three cycles. A load that hits in the cache takes two cycles; if it misses in the cache it takes 16 cycles. Integer multiplication takes two cycles. All other instructions take one cycle. When a load or store is followed by an integer instruction that modifies the address register, the instructions may be issued in the same cycle. If a floating-point store is followed by an operation that modifies the register being stored, the instructions may also be issued in the same cycle.

A.2 Vector DLX

For vectorizing transformations, we will use a version of DLX extended to include vector support. This new architecture, V-DLX, has eight vector registers, each of which holds a vector consisting of up to 64 floating-point numbers. The vector functional units perform all their operations on data in the vector and scalar registers.

We discuss only the functional units that perform floating-point addition, multiplication, and division, though vector machines typically have units to perform integer and logical operations as well. V-DLX issues only one scalar or one vector instruction per cycle, but nonconflicting scalar and vector instructions can overlap each other.

A special register, the vector length register (VLR), controls the number of quantities that are loaded, operated upon, or stored in any vector instruction. By software convention, the VLR is normally set to 64, except when handling the last few iterations of a loop.

The vector operations are described in Table A.2. They include arithmetic operations on vector registers and load/store operations between vector registers and memory. Vector operations take place either between two vector registers or between a vector register and a scalar register. In the latter case, the scalar value is extended across the entire vector. All vector computations have a vector register as the destination.

The speed of a vector operation depends on the depth of the pipeline in its implementation. The first result appears after some number of cycles (called the startup time). After the pipeline is full, one result is computed per clock cycle. In
the meantime, the processor can continue to execute other instructions. Table A.3 gives the startup times for the vector operations in V-DLX. These times should not be compared directly to the cycle times for operations on S-DLX because vector machines typically have higher clock speeds than microprocessors, although this gap is closing.

Table A.3. Startup Times in Cycles on V-DLX

    Operation      Start-Up Time
    Vector load         12
    [remaining rows illegible in the source]

A large factor in the speed of vector architectures is their memory system. V-DLX has eight memory banks. After the load latency, data is supplied at the rate of one word per clock cycle, provided that the stride with which the data is accessed does not cause bank conflicts (see Section 6.5.1).

Current examples of vector machines are the Cray C-90 [Oed 1992] and IBM ES 9000 Model 900 VF [Gibson and Rao 1992].

A.3 Multiprocessors

When multiple processors are employed to execute a program, many additional issues arise. The most obvious issue is how much of the underlying machine architecture to expose to the programmer. At one extreme, the programmer can make explicit use of hardware-supported operations by programming with locks, fork/join primitives, barrier synchronizations, and message send and receive. These operations are typically provided as system calls or library routines. System call semantics are usually not defined by the language, making it difficult for the compiler to optimize them automatically.

High-level languages for large-scale parallel processing provide primitives for expressing parallelism in one of two ways: control parallel or data parallel. Fortran 90 array section expressions are examples of explicitly data parallel operations. APL [Iverson 1962] also contains a wide variety of data-parallel array operators. Examples of control-parallel operations are cobegin/coend blocks and doacross loops [Cytron 1986].

For both shared- and distributed-memory multiprocessors, we assume that the programming environment initializes the variables Pnum to be the total number of processors and Pid to be the number of the local processor. Processors are numbered starting with 0.

A.3.1 Shared-Memory DLX Multiprocessor

sMX is our prototypical shared-memory parallel architecture. It consists of 16 S-DLX processors and 512MB of shared main memory. The processors and memory system are connected together by a bus. Each processor has an intelligent cache controller that monitors the bus (a snoopy cache). The caches are the same as on S-DLX, except that they contain 256KB of data. The bandwidth of the bus is 128 megabytes/second.

The processors share the memory units; a program running on a processor can access any memory element, and the system ensures that the values are maintained consistently across the machine. Without caching, consistency is easy to maintain; every memory reference is handled by the main memory controller. However, performance would be very poor because memory latencies are already too high on sequential machines to run well without a cache; having many processors share a common bus would make the problem much worse by increasing memory access latency and introducing bandwidth restrictions. The solution is to give each processor a cache that is smart enough to resolve reference conflicts.

A snoopy cache implements a sharing protocol that maintains consistency while still minimizing bus traffic. A processor that modifies a cache line invalidates all other copies and is said to own that cache line. There are a variety of cache coherency protocols [Stenstrom 1990;
Eggers and Katz 1989], but the details are not relevant to this survey. From the compiler writer's perspective, the key issue is the time it takes to make a memory reference. Table A.4 summarizes the latency of each kind of memory reference in sMX.

Table A.4. Memory Reference Latency in sMX

    Type of Memory Reference                 Cycles
    Read value available in local cache         1
    Read value owned by other processor     [illegible]
    Read value nobody owns                     20
    Write value owned locally                   1
    Write value owned by other processor       18
    Write value nobody owns                    22

Typically, shared-memory multiprocessors implement a number of synchronization operations. FORK(n) starts identical copies of the current program running on n processors; because it must copy the stack of the forking processor to a private memory for each of the n - 1 other processors, it is a very expensive operation. JOIN() desynchronizes with the previous FORK, and makes the processor available for other FORK operations (except for the original forking processor, which proceeds serially).

BARRIER() performs a barrier synchronization, which is a synchronization point in the program where each processor waits until all processors have arrived at that point.

A.3.2 Distributed-Memory DLX Multiprocessor

dMX is our hypothetical distributed-memory multiprocessor. The machine consists of 64 S-DLX processors (numbered 0 through 63) connected in an 8 x 8 mesh. The network bandwidth of each link in the mesh is 10MB per second. Each processor has an associated network processor that manages communication; the network processor has its own pool of memory and can communicate independently of its CPU. Messages that pass through a processor en route to some other destination in the mesh are handled by the network processor without interrupting the CPU.

The latency of a message transfer is ten microseconds plus 1 microsecond per 10 bytes of message (we have simplified the machine by declaring that all remote processors have the same latency). Communication is supported by a message library that provides the following calls:

● SEND(buffer, nbytes, target)
  Send nbytes from buffer to the processor target. If the message fits in the network processor's memory, the call returns after the copy (1 microsecond/10 bytes). If not, the call blocks until the message is sent. Note that this raises the potential for deadlock, but we will not concern ourselves with that in this simplified model.

● BROADCAST(buffer, nbytes)
  Send the message to every other processor in the machine. The communication library uses a fast broadcasting algorithm, so the maximum latency is roughly twice that of sending a message to the furthest edge of the mesh.

● RECEIVE(buffer, nbytes)
  Wait for a message to arrive; when it does, up to nbytes of it will be put in the buffer. The call returns the total number of bytes in the incoming message; if that value is greater than nbytes, RECEIVE guarantees that subsequent calls will return the rest of that message first.

ACKNOWLEDGMENTS

We thank Michael Burke, Manish Gupta, Monica Lam, and Vivek Sarkar for their assistance with various parts of this survey. We thank John Boyland, James Demmel, Alex Dupuy, John Hauser, James Larus, Dick Muntz, Ken Stanley, the members of the IBM Advanced Development Technology Institute, and the anonymous referees for their valuable comments on earlier drafts.
BALASUNDARAM, V. 1990. A mechanism for keeping BLELLOCH, G, 1989, Scans as primitive parallel op-
useful internal information in parallel pro- erations. IEEE Trans. Comput. C-38, 11 (Nov.),
gramming tools: The Data Access Descriptor, 1526-1538.
J. Parall. DtstrLb. Comput. 9, 2 (June), BROMLEY, M., HELLER, S., MCNERNEY, T., AND
154-170. STEELE, G. L., JR. 1991. Fortran at ten gi
BALASUNDARAM, V., Fox, G., KENNEDY, K., AND gaflops: The Connection Machine convolution
KREMER, U. 1990. An interactive environment compiler. In Proceedings of the SIGPLAN Con-
for data partitioning and distribution. In Pro- ference on Programmmg Language Destgn and
ceedings of the 5th Dlstnbuted Memory Corn- Implementation (Toronto, Ontano, June). SIG-
puter Conference (Charleston, South Carolina, PLAN Not. 26, 6, 145-156.
Apr.). IEEE Computer Society Press, Los BURKE, M. AND CYTRON, R. 1986. Interprocedural
Alamitos, Calif. dependence analysis and parallelization. In
BALASUNDARAM, V., KENNEDY, K., KREMER, U., Proceedings of the SIGPLAN Symposium on
MCKINLEY, K., AND SUBHLOK, J. 1989. The Compiler Construction (Palo Alto, Calif., June).
ParaScope editor: An interactive parallel pro- SZGPLANNot. 21, 7 (July), 162-175. Extended
gramming tool. In Proceedings of Superconz- version available as IBM Thomas J. Watson
putmg ’89 (Reno, Nev., Nov.). ACM Press, New Research Center Tech. Rep. RC 11794.
York, 540-550. BURNETT, G. J. AND COFFMAN, E. G., JR. 1970. A
BALL, J. E. 1979. Predicting the effects of optimiza- study of interleaved memory systems. In Pro-
tion on a procedure body. In Proceedings of the ceedings of the Spring Joint AFIPS Computer
SIGPLAN Symposium on Compiler Construc- Conference. vol. 36. AFIPS, Montvale, N.J.,
tion (Denver, Color., Aug.). SIGPLAN Not. 14, 467-474.
8, 214-220. BURSTALL, R. M. AND DARLINGTON, J. 1977. A trans-
formation system for developing recursive pro-
BANERJEE, U. 1991. Unimodular transformations of
grams. J. ACM 24, 1 (Jan.), 44-67.
double loops. In Aduances in Languages and
Compilers for Parallel Processing, A. Nicolau, BUTLER) M., YEH, T., PATT, Y., ALsuP, M., SCALES,
Ed. Research Monographs in Parallel and Dis- H., AND SHEBANOW, M. 1991, Single instruction
tributed Computing, MIT Press, Cambridge, stream parallelism is greater than two. In Pro-
Mass., Chap. 10. ceedings of the 18th Annual International Sym-
posium on Computer Architecture (Toronto,
BANERJEE, U. 1988a. Dependence Analysis for Su-
Ontario, May). SIGARCH Comput. Arch. Neuw
percomputing. Kluwer Academic Publishers,
19, 3, 276-286.
Boston, Mass.
CALLAHAN, D. AND KENNEDY, K. 1988a. Analysis of
BANEKJEE, U. 1988b. An introduction to a formal
interprocedural side-effects in a parallel pro-
theory of dependence analysis. J. Supercom-
gramming environment J. Parall. Distrib
put. 2, 2 (Oct.), 133-149.
Cornput. 5, 5 (Oct.), 517-550.
BANERJEE, U. 1979. Speedup of ordinary programs, CALLAHAN, D. AND KENNEDY, K. 1988b. Compihng
Ph.D. thesis, Tech. Rep. 79-989, Computer Sci- programs for distributed-memory multiproces-
ence Dept., Univ. of Illinois at Urbana- sors. J. Supercompzd. 2, 2 (Oct.), 151–169.
Champaign.
CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990.
BANERJEE, U., CHEN, S. C., KUCK, D. J., AND TOWLE, Improving register allocation for subscripted
R. A. 1979. Time and parallel processor bounds variables. In Proceedings of the SIGPLAN Con-
for Fortran-like loops. IEEE Trans. Comput. ference on Programming Language Design and
C-28, 9 (Sept.), 660-670. Implementation (White Plains, N.Y., June).
BANNING, J. P. 1979. An efficient way to find the SIGPLAN Not. 25, 6, 53-65.
side-effects of procedure calls and the aliases of CALLAHAN,
D , COCKE, J., AND KENNEDY, K. 1988.
variables. In Conference Record of the 6th ACM Estimating interlock and improving balance for
Symposium on Principles of Programming Lan - pipelined architectures J. Parall. Distrib.
guages (San Antonio, Tex., Jan.). ACM Press, Comput. 5, 4 (Aug.), 334-358.
29-41. CALLAHAN, D., COOPER, K., KENNEDY, K., AND
BERNSTEIN, R. 1986. Multiplication by integer con- TORCZON, L. 1986. Interprocedural constant
stants. Softw. Pratt. Exper. 16, 7 (July), propagation. In Proceedings of the SIGPLAN
641-652. Sympos~um on CompJer Construction (Palo
Alto, Cahf., June). SIGPLAN Not. 21, 7 (July),
BERRY, M., CHEN, D., Koss, P., KUCK, D., Lo, S.,
152-161.
PANG, Y., POINTER, L., ROLOFF, R., SAMEH, A.,
CLEMENTI, E , CHIN. S , SCHNEI~ER, D , Fox, G , CARR, S. 1993. Memory-hierarchy management.
ME SSINA, P., WALKER, D., HSIUNG, C., Ph.D. thesis, R]ce Umversity, Houston, Tex.
SCHWARZMEIER, J., LUE, K., ORSZAG, S., S~IDL, CHAITIN, G. J. 1982. Register allocation and spilling
F., JOHNSON, O., GOODRUM, R., AND MARTIN, J. via graph coloring. In Proceedings of the SIG-
1989. The Perfect Club benchmarks: Effective PLAN Symposium on Compiler Construction
performance evaluation of supercomputers. Znt. (Boston, Mass., June). SIGPLAN Not 17, 6,
J. Supercomput. Appl. 3, 3 (Fall), 5-40. 98-105.
CFIMTIN, G. J,, AUSLANDER, M. A., CHANDRA, A. K., 221-228. Also available as IBM Tech Rep. RC
COCKE, J., HOPKINS, M. E., AND MARKSTEIN, 8111, Feb. 1980.
P. W. 1981. Register allocation via coloring. COCKE, J. AND SCHWARTZ, J. T. 1970. Programming
Comput. Lang. 6, 1 (Jan.), 47-57. languages and their compders (preliminary
CHAMBERS, C. .MXD UNGAR, D. 1989. Customization: notes). 2nd ed. Courant Inst. of Mathemalzcal
Optimizing compiler technology for SELF, a Sciences, New York Umverslty, New York.
dynamically-typed object-oriented program- COLWELL, R. P,, NIX, R, P,, ODONNEL, J. J.,
ming language. In proceedings of the SIG- PAPWORTH, D. B., AND RODMAN, P, K. 1988. A
PLAN Conference on Programming Language VLIW architecture for a trace scheduling com-
Design and Implementation (Portland, Ore., piler. IEEE Trans. Comput. C-37, 8 (Aug.),
June). SIGPLAN Not. 24, 7 (July), 146-160. 967-979.
CEATTERJEE, S., GILBERT, J. R., AND SCHREIBER, R. COOPER, K, D, AND KENNEDY, K, 1989, Fast inter-
1993a. The alignment-distribution graph. In procedural ahas analysie, In Conference Record
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 768. 234-252.
CHATTERJEE, S., GILBERT, J. R., SCHREIBER, R., AND TENG, S.-H. 1993b. Automatic array alignment in data-parallel programs. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 16-28.
CHEN, S. C. AND KUCK, D. J. 1975. Time and parallel processor bounds for linear recurrence systems. IEEE Trans. Comput. C-24, 7 (July), 701-717.
CHOI, J.-D., BURKE, M., AND CARINI, P. 1993. Efficient flow-sensitive interprocedural computation of pointer-induced aliases and side effects. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 232-245.
CHOW, F. C. 1988. Minimizing register usage penalty at procedure calls. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 85-94.
CHOW, F. C. AND HENNESSY, J. L. 1990. The priority-based coloring approach to register allocation. ACM Trans. Program. Lang. Syst. 12, 4 (Oct.), 501-536.
CLACK, C. AND PEYTON-JONES, S. L. 1985. Strictness analysis—a practical approach. In Functional Programming Languages and Computer Architecture. Lecture Notes in Computer Science, vol. 201. Springer-Verlag, Berlin, 35-49.
CLINGER, W. D. 1990. How to read floating point numbers accurately. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 92-101.
COCKE, J. 1970. Global common subexpression elimination. In Proceedings of the ACM Symposium on Compiler Optimization (July). SIGPLAN Not. 5, 7, 20-24.
COCKE, J. AND MARKSTEIN, P. 1980. Measurement of program improvement algorithms. In Proceedings of the IFIP Congress (Tokyo, Japan, Oct.). North-Holland, Amsterdam, Netherlands.
COOPER, K. D. AND KENNEDY, K. 1989. Fast interprocedural alias analysis. In Conference Record of the 16th ACM Symposium on Principles of Programming Languages (Austin, Tex., Jan.). ACM Press, New York, 49-59.
COOPER, K. D., HALL, M. W., AND KENNEDY, K. 1993. A methodology for procedure cloning. Comput. Lang. 19, 2 (Apr.), 105-117.
COOPER, K. D., HALL, M. W., AND TORCZON, L. 1992. Unexpected side effects of inline substitution: A case study. ACM Lett. Program. Lang. Syst. 1, 1 (Mar.), 22-32.
COOPER, K. D., HALL, M. W., AND TORCZON, L. 1991. An experiment with inline substitution. Softw. Pract. Exper. 21, 6 (June), 581-601.
CRAY RESEARCH. 1988. CFT77 Reference Manual. Publication SR-0018-C. Cray Research, Inc., Eagan, Minn.
CYTRON, R. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on Parallel Processing (St. Charles, Ill., Aug.). IEEE Computer Society, Washington, D.C., 836-844.
CYTRON, R. AND FERRANTE, J. 1987. What's in a name? -or- the value of renaming for parallelism detection and storage allocation. In Proceedings of the International Conference on Parallel Processing (University Park, Penn., Aug.). Pennsylvania State University Press, University Park, Pa., 19-27.
DANTZIG, G. B. AND EAVES, B. C. 1974. Fourier-Motzkin elimination and its dual with application to integer programming. In Combinatorial Programming: Methods and Applications (Versailles, France, Sept.). B. Roy, Ed. D. Reidel, Boston, Mass., 93-102.
DENNIS, J. B. 1980. Data flow supercomputers. Computer 13, 11 (Nov.), 48-56.
DIXIT, K. M. 1992. New CPU benchmarks from SPEC. In Digest of Papers, Spring COMPCON 1992, 37th IEEE Computer Society International Conference (San Francisco, Calif., Feb.). IEEE Computer Society Press, Los Alamitos, Calif., 305-310.
DONGARRA, J. AND HINDS, A. R. 1979. Unrolling loops in Fortran. Softw. Pract. Exper. 9, 3 (Mar.), 219-226.
EGGERS, S. J. AND KATZ, R. H. 1989. Evaluating the performance of four snooping cache coherency protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture (Jerusalem, Israel, May).
FREE SOFTWARE FOUNDATION. 1992. gcc 2.x Reference Manual. Free Software Foundation, Cambridge, Mass.
FREUDENBERGER, S. M., SCHWARTZ, J. T., AND SHARIR, M. 1983. Experience with the SETL optimizer. ACM Trans. Program. Lang. Syst. 5, 1 (Jan.), 26-45.
GANNON, D., JALBY, W., AND GALLIVAN, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parall. Distrib. Comput. 5, 5 (Oct.), 587-616.
GERNDT, M. 1990. Updating distributed variables in local computations. Concurrency Pract. Exper. 2, 3 (Sept.), 171-193.
GIBSON, D. H. AND RAO, G. S. 1992. Design of the IBM System/390 computer family for numerically intensive applications: An overview for engineers and scientists. IBM J. Res. Dev. 36, 4 (July), 695-711.
GIRKAR, M. AND POLYCHRONOPOULOS, C. D. 1988. Compiling issues for supercomputers. In Proceedings of Supercomputing '88 (Orlando, Fla., Nov.).
HATFIELD, D. J. AND GERALD, J. 1971. Program restructuring for virtual memory. IBM Syst. J. 10, 3, 168-192.
HENNESSY, J. L. AND PATTERSON, D. A. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, Calif.
HEWLETT-PACKARD. 1992. PA-RISC 1.1 Architecture and Instruction Manual. 2nd ed. Part Number 09740-90039. Hewlett-Packard, Palo Alto, Calif.
HIGH PERFORMANCE FORTRAN FORUM. 1993. High Performance Fortran language specification, version 1.0. Tech. Rep. CRPC-TR92225, Rice University, Houston, Tex.
HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1993. Preliminary experiences with the Fortran D compiler. In Proceedings of Supercomputing '93 (Portland, Ore., Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 338-350.
HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1992. Compiling Fortran D for MIMD distributed-memory machines. Commun. ACM 35, 8 (Aug.), 66-80.
HUDAK, P. AND GOLDBERG, B. 1985. Distributed execution of functional programs using serial combinators. IEEE Trans. Comput. C-34, 10 (Oct.), 881-890.
HWU, W. W. AND CHANG, P. P. 1989. Achieving high instruction cache performance with an optimizing compiler. In Proceedings of the 16th Annual International Symposium on Computer Architecture (Jerusalem, Israel, May). SIGARCH Comput. Arch. News 17, 3 (June), 242-251.
HWU, W. W., MAHLKE, S. A., CHEN, W. Y., CHANG, P. P., WARTER, N. J., BRINGMANN, R. A., OUELLETTE, R. G., HANK, R. E., KIYOHARA, T., HAAB, G. E., HOLM, J. G., AND LAVERY, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 1/2 (May), 229-248.
IBM. 1992. Optimization and Tuning Guide for the XL Fortran and XL C Compilers. 1st ed. Publication SC09-1545-00. IBM, Armonk, N.Y.
IBM. 1991. IBM RISC System/6000 NIC Tuning Guide for Fortran and C. Publication GG24-3611-01. IBM, Armonk, N.Y.
IRIGOIN, F. AND TRIOLET, R. 1988. Supernode partitioning. In Conference Record of the 15th ACM Symposium on Principles of Programming Languages (San Diego, Calif., Jan.). ACM Press, New York, 319-329.
IVERSON, K. E. 1962. A Programming Language. John Wiley and Sons, New York.
KENNEDY, K. AND MCKINLEY, K. S. 1990. Loop distribution with arbitrary control flow. In Proceedings of Supercomputing '90 (New York, N.Y., Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 407-416.
KENNEDY, K., MCKINLEY, K. S., AND TSENG, C.-W. 1993. Analysis and transformation in an interactive parallel programming tool. Concurrency Pract. Exper. 5, 7 (Oct.), 575-602.
KILDALL, G. 1973. A unified approach to global program optimization. In Conference Record of the ACM Symposium on Principles of Programming Languages (Boston, Mass., Oct.). ACM Press, New York, 194-206.
KNOBE, K. AND NATARAJAN, V. 1993. Automatic data allocation to minimize communication on SIMD machines. J. Supercomput. 7, 4 (Dec.), 387-415.
KNOBE, K., LUKAS, J. D., AND STEELE, G. L., JR. 1990. Data optimization: Allocation of arrays to reduce communication on SIMD machines. J. Parall. Distrib. Comput. 8, 2 (Feb.), 102-118.
KNOBE, K., LUKAS, J. D., AND STEELE, G. L., JR. 1988. Massively parallel data optimization. In The 2nd Symposium on the Frontiers of Massively Parallel Computation (Fairfax, Va., Oct.). IEEE, Piscataway, N.J., 551-558.
KNUTH, D. E. 1971. An empirical study of Fortran programs. Softw. Pract. Exper. 1, 2 (Apr.-June), 105-133.
KOELBEL, C. 1990. Compiling programs for nonshared memory machines. Ph.D. thesis, Purdue University, West Lafayette, Ind.
KOELBEL, C. AND MEHROTRA, P. 1991. Compiling global name-space parallel loops for distributed execution. IEEE Trans. Parall. Distrib. Syst. 2, 4 (Oct.), 440-451.
KRANZ, D., KELSEY, R., REES, J., HUDAK, P., PHILBIN, J., AND ADAMS, N. 1986. ORBIT: An optimizing compiler for Scheme. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 219-233.
KUCK, D. J. 1978. The Structure of Computers and Computations. Vol. 1. John Wiley and Sons, New York.
KUCK, D. J. 1977. A survey of parallel machine organization and programming. ACM Comput. Surv. 9, 1 (Mar.), 29-59.
KUCK, D. J. AND STOKES, R. 1982. The Burroughs Scientific Processor (BSP). IEEE Trans. Comput. C-31, 5 (May), 363-376.
KUCK, D. J., KUHN, R. H., PADUA, D., LEASURE, B., AND WOLFE, M. 1981. Dependence graphs and compiler optimizations. In Conference Record of the 8th ACM Symposium on Principles of Programming Languages (Williamsburg, Va., Jan.). ACM Press, New York, 207-218.
LAM, M. S. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 318-328.
LAM, M. S. AND WILSON, R. P. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (Gold Coast, Australia, May). SIGARCH Comput. Arch. News 20, 2, 46-57.
LAM, M. S., ROTHBERG, E. E., AND WOLF, M. E. 1991. The cache performance and optimization of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.). SIGPLAN Not. 26, 4, 63-74.
LAMPORT, L. 1974. The parallel execution of DO loops. Commun. ACM 17, 2 (Feb.), 83-93.
LANDI, W., RYDER, B. G., AND ZHANG, S. 1993. Interprocedural modification side effect analysis with pointer aliasing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Albuquerque, New Mexico, June). SIGPLAN Not. 28, 6, 56-67.
LARUS, J. R. 1993. Loop-level parallelism in numeric and symbolic programs. IEEE Trans. Parall. Distrib. Syst. 4, 7 (July), 812-826.
LEE, G., KRUSKAL, C. P., AND KUCK, D. J. 1985. An empirical study of automatic restructuring of nonnumerical programs for parallel processors. IEEE Trans. Comput. C-34, 10 (Oct.), 927-933.
LI, Z. 1992. Array privatization for parallel execution of loops. In Proceedings of the ACM International Conference on Supercomputing (Washington, D.C., July). ACM Press, New York, 313-322.
LI, J. AND CHEN, M. 1991. Compiling communication-efficient programs for massively parallel machines. IEEE Trans. Parall. Distrib. Syst. 2, 3 (July), 361-376.
LI, J. AND CHEN, M. 1990. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In The 3rd Symposium on the Frontiers of Massively Parallel Computation, J. JaJa, Ed. IEEE Computer Society Press, Los Alamitos, Calif., 424-433.
LI, Z. AND YEW, P. 1988. Interprocedural analysis for parallel computing. In Proceedings of the International Conference on Parallel Processing. F. A. Briggs, Ed. Vol. 2. Pennsylvania State University Press, University Park, Pa., 221-228.
LI, Z., YEW, P., AND ZHU, C. 1990. Data dependence analysis on multi-dimensional array references. IEEE Trans. Parall. Distrib. Syst. 1, 1 (Jan.), 26-34.
LOVEMAN, D. B. 1977. Program improvement by source-to-source transformation. J. ACM 24, 1 (Jan.), 121-145.
LUCCO, S. 1992. A dynamic scheduling method for irregular parallel programs. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 200-211.
MACE, M. E. 1987. Memory Storage Patterns in Parallel Processing. Kluwer Academic Publishers, Norwell, Mass.
MACE, M. E. AND WAGNER, R. A. 1985. Globally optimum selection of memory storage patterns. In Proceedings of the International Conference on Parallel Processing, D. DeGroot, Ed. IEEE Computer Society, Washington, D.C., 264-271.
MARKSTEIN, P., MARKSTEIN, V., AND ZADECK, K. 1994. Strength reduction. In Optimization in Compilers. ACM Press, New York, Chap. 9. To be published.
MASSALIN, H. 1987. Superoptimizer: A look at the smallest program. In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.). SIGPLAN Not. 22, 10, 122-126.
MAYDAN, D. E., AMARASINGHE, S. P., AND LAM, M. S. 1993. Array data flow analysis and its use in array privatization. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 1-15.
MAYDAN, D. E., HENNESSY, J. L., AND LAM, M. S. 1991. Efficient and exact data dependence analysis. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 1-14.
MCFARLING, S. 1991. Procedure merging with instruction caches. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 71-79.
MCGRAW, J. R. 1985. SISAL: Streams and iteration in a single assignment language. Tech. Rep. M-146, Lawrence Livermore National Laboratory, Livermore, Calif.
MCMAHON, F. H. 1986. The Livermore Fortran kernels: A computer test of numerical performance range. Tech. Rep. UCRL-53745, Lawrence Livermore National Laboratory, Livermore, Calif.
MICHIE, D. 1968. "Memo" functions and machine learning. Nature 218, 19-22.
MILLSTEIN, R. E. AND MUNTZ, C. A. 1975. The ILLIAC-IV Fortran compiler. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 1-8.
MIRCHANDANEY, R., SALTZ, J. H., SMITH, R. M., NICOL, D. M., AND CROWLEY, K. 1988. Principles of runtime support for parallel processors. In Proceedings of the ACM International Conference on Supercomputing (St. Malo, France, July). ACM Press, New York, 140-152.
MOREL, E. AND RENVOISE, C. 1979. Global optimization by suppression of partial redundancies. Commun. ACM 22, 2 (Feb.), 96-103.
MUCHNICK, S. S. AND JONES, N., (EDS.) 1981. Program Flow Analysis. Prentice-Hall, Englewood Cliffs, N.J.
MURAOKA, Y. 1971. Parallelism exposure and exploitation in programs. Ph.D. thesis, Tech. Rep. 71-424, Univ. of Illinois at Urbana-Champaign.
MYERS, E. W. 1981. A precise interprocedural data flow algorithm. In Conference Record of the 8th ACM Symposium on Principles of Programming Languages (Williamsburg, Va., Jan.). ACM Press, New York, 219-230.
NICOLAU, A. 1988. Loop quantization: A generalized loop unwinding technique. J. Parall. Distrib. Comput. 5, 5 (Oct.), 568-586.
NICOLAU, A. AND FISHER, J. A. 1984. Measuring the parallelism available for very long instruction word architectures. IEEE Trans. Comput. C-33, 11 (Nov.), 968-976.
NIKHIL, R. S. 1988. ID reference manual, version 88.0. Tech. Rep. 284, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass.
NOBAYASHI, H. AND EOYANG, C. 1989. A comparison study of automatically vectorizing Fortran compilers. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 820-825.
O'BRIEN, K., HAY, B., MINISH, J., SCHAFFER, H., SCHLOSS, B., SHEPHERD, A., AND ZALESKI, M. 1990. Advanced compiler technology for the RISC System/6000 architecture. In IBM RISC System/6000 Technology. Publication SA23-2619. IBM Corporation, Mechanicsburg, Penn.
OED, W. 1992. Cray Y-MP C90: System features and early benchmark results. Parall. Comput. 18, 8 (Aug.), 947-954.
OEHLER, R. R. AND BLASGEN, M. W. 1991. IBM RISC System/6000: Architecture and performance. IEEE Micro 11, 3 (June), 14-24.
PADUA, D. A. AND PETERSEN, P. M. 1992. Evaluation of parallelizing compilers. In Parallel Computing and Transputer Applications (Barcelona, Spain, Sept.). CIMNE, 1505-1514. Also available as Center for Supercomputing Research and Development Tech. Rep. 1173.
PADUA, D. A. AND WOLFE, M. J. 1986. Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12 (Dec.), 1184-1201.
PADUA, D. A., KUCK, D. J., AND LAWRIE, D. 1980. High-speed multiprocessors and compilation techniques. IEEE Trans. Comput. C-29, 9 (Sept.), 763-776.
PETERSEN, P. M. AND PADUA, D. A. 1993. Static and dynamic evaluation of data dependence analysis. In Proceedings of the ACM International Conference on Supercomputing (Tokyo, Japan, July). ACM Press, New York, 107-116.
PETTIS, K. AND HANSEN, R. C. 1990. Profile guided code positioning. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 16-27.
POLYCHRONOPOULOS, C. D. 1988. Parallel Programming and Compilers. Kluwer Academic Publishers, Boston, Mass.
POLYCHRONOPOULOS, C. D. 1987a. Advanced loop optimizations for parallel computers. In Proceedings of the 1st International Conference on Supercomputing. Lecture Notes in Computer Science, vol. 297. Springer-Verlag, Berlin, 255-277.
POLYCHRONOPOULOS, C. D. 1987b. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the International Conference on Parallel Processing (University Park, Penn., Aug.). Pennsylvania State University Press, University Park, Pa., 235-242.
POLYCHRONOPOULOS, C. D. AND KUCK, D. J. 1987. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Trans. Comput. C-36, 12 (Dec.), 1425-1439.
POLYCHRONOPOULOS, C. D., GIRKAR, M., HAGHIGHAT, M. R., LEE, C. L., LEUNG, B., AND SCHOUTEN, D. 1989. Parafrase-2: An environment for parallelizing, partitioning, synchronizing, and scheduling programs on multiprocessors. In Proceedings of the International Conference on Parallel Processing. Vol. 2. Pennsylvania State University Press, University Park, Pa., 39-48.
PRESBERG, D. L. AND JOHNSON, N. W. 1975. The Paralyzer: IVTRAN's parallelism analyzer and synthesizer. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 9-16.
PUGH, W. 1992. A practical algorithm for exact array dependence analysis. Commun. ACM 35, 8 (Aug.), 102-115.
PUGH, W. 1991. Uniform techniques for loop optimization. In Proceedings of the ACM International Conference on Supercomputing (Cologne, Germany, June). ACM Press, New York.
RAU, B. AND FISHER, J. A. 1993. Instruction-level parallel processing: History, overview, and perspective. J. Supercomput. 7, 1/2 (May), 9-50.
RAU, B., YEN, D. W. L., YEN, W., AND TOWLE, R. A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs. Computer 22, 1 (Jan.), 12-34.
REES, J., CLINGER, W., ET AL. 1986. Revised3 report on the algorithmic language Scheme. SIGPLAN Not. 21, 12 (Dec.), 37-79.
RISEMAN, E. M. AND FOSTER, C. C. 1972. The inhibition of potential parallelism by conditional jumps. IEEE Trans. Comput. C-21, 12 (Dec.), 1405-1411.
ROGERS, A. M. 1991. Compiling for locality of reference. Ph.D. thesis, Tech. Rep. TR 91-1195, Dept. of Computer Science, Cornell University.
RUSSELL, R. M. 1978. The CRAY-1 computer system. Commun. ACM 21, 1 (Jan.), 63-72.
SABOT, G. AND WHOLEY, S. 1993. CMAX: A Fortran translator for the Connection Machine system. In Proceedings of the ACM International Conference on Supercomputing (Tokyo, Japan, July). ACM Press, New York.
SARKAR, V. 1989. Partitioning and Scheduling Parallel Programs for Multiprocessors. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass.
SARKAR, V. AND THEKKATH, R. 1992. A general framework for iteration-reordering transformations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 175-187.
SCHEIFLER, R. W. 1977. An analysis of inline substitution for a structured programming language. Commun. ACM 20, 9 (Sept.), 647-654.
SHEN, Z., LI, Z., AND YEW, P.-C. 1990. An empirical study of Fortran programs for parallelizing compilers. IEEE Trans. Parall. Distrib. Syst. 1, 3 (July), 356-364.
SINGH, J. P. AND HENNESSY, J. L. 1991. An empirical investigation of the effectiveness and limitations of automatic parallelization. In Proceedings of the International Symposium on Shared Memory Multiprocessing (Apr.), 213-240.
SINGH, J. P., WEBER, W.-D., AND GUPTA, A. 1992. SPLASH: Stanford parallel applications for shared memory. SIGARCH Comput. Arch. News 20, 1 (Mar.), 5-44. Also available as Stanford Univ. Tech. Rep. CSL-TR-92-526.
SITES, R. L., (ED.) 1992. Alpha Architecture Reference Manual. Digital Press, Bedford, Mass.
SOHI, G. S. AND VAJAPEYAM, S. 1990. Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers. IEEE Trans. Comput. C-39, 3 (Mar.), 349-359.
STEELE, G. L., JR. 1977. Arithmetic shifting considered harmful. SIGPLAN Not. 12, 11 (Nov.), 61-69.
STEELE, G. L., JR. AND WHITE, J. L. 1990. How to print floating-point numbers accurately. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 112-126.
STENSTROM, P. 1990. A survey of cache coherence schemes for multiprocessors. Computer 23, 6 (June), 12-24.
SUN MICROSYSTEMS. 1991. SPARC Architecture Manual, Version 8. Part No. 800-1399-08. Sun Microsystems, Mountain View, Calif.
SZYMANSKI, T. G. 1978. Assembling code for machines with span-dependent instructions. Commun. ACM 21, 4 (Apr.), 300-308.
TANG, P. AND YEW, P. 1990. Dynamic processor self-scheduling for general parallel nested loops. IEEE Trans. Comput. C-39, 7 (July), 919-929.
THINKING MACHINES. 1989. Connection Machine, Model CM-2 Technical Summary. Thinking Machines Corp., Cambridge, Mass.
TJADEN, G. S. AND FLYNN, M. J. 1970. Detection and parallel execution of independent instructions. IEEE Trans. Comput. C-19, 10 (Oct.), 889-895.
TORRELLAS, J., LAM, M. S., AND HENNESSY, J. L. 1994. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput. 43, 6 (June), 651-663.
TOWLE, R. A. 1976. Control and data dependence for program transformations. Ph.D. thesis, Tech. Rep. 76-788, Computer Science Dept., Univ. of Illinois at Urbana-Champaign.
TRIOLET, R., IRIGOIN, F., AND FEAUTRIER, P. 1986. Direct parallelization of call statements. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 176-185.
TSENG, C.-W. 1993. An optimizing Fortran D compiler for MIMD distributed-memory machines. Ph.D. thesis, Tech. Rep. Rice COMP TR93-199, Computer Science Dept., Rice University, Houston, Tex.
TU, P. AND PADUA, D. A. 1993. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 768. Springer-Verlag, Berlin, 500-521.
VON HANXLEDEN, R. AND KENNEDY, K. 1992. Relaxing SIMD control flow constraints using loop transformations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 188-199.
WALL, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.). SIGPLAN Not. 26, 4, 176-188.
WANG, K. 1991. Intelligent program optimization and parallelization for parallel computers. Ph.D. thesis, Tech. Rep. CSD-TR-91-030, Purdue University, West Lafayette, Ind.
WANG, K. AND GANNON, D. 1989. Applying AI techniques to program optimization for parallel computers. In Parallel Processing for Supercomputers and Artificial Intelligence, K. Hwang and D. DeGroot, Eds. McGraw-Hill, New York, Chap. 12.
WEDEL, D. 1975. Fortran for the Texas Instruments ASC system. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 119-132.
WEGMAN, M. N. AND ZADECK, F. K. 1991. Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst. 13, 2 (Apr.), 181-210.
WEISS, M. 1991. Strip mining on SIMD architectures. In Proceedings of the ACM International Conference on Supercomputing (Cologne, Germany, June). ACM Press, New York, 234-243.
WHOLEY, S. 1992. Automatic data mapping for distributed-memory parallel computers. In Proceedings of the ACM International Conference on Supercomputing (Washington, D.C., July). ACM Press, New York, 25-34.
WOLF, M. E. 1992. Improving locality and parallelism in nested loops. Ph.D. thesis, Computer Science Dept., Stanford University, Stanford, Calif.
WOLFE, M. J. 1989a. More iteration space tiling. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 655-664.
WOLFE, M. J. 1989b. Optimizing Supercompilers for Supercomputers. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass.
WOLF, M. E. AND LAM, M. S. 1991. A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parall. Distrib. Syst. 2, 4 (Oct.), 452-471.
WOLFE, M. J. AND TSENG, C. 1992. The Power Test for data dependence. IEEE Trans. Parall. Distrib. Syst. 3, 5 (Sept.), 591-601.
ZIMA, H. P., BAST, H.-J., AND GERNDT, M. 1988. SUPERB: A tool for semi-automatic SIMD/MIMD parallelization. Parall. Comput. 6, 1 (Jan.), 1-18.